infra/stacks/monitoring/modules/monitoring
Viktor Barzin 6cb823e431
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
  in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
  Slack for why" signal. (Until monitoring is applied, a block still surfaces via
  the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
  apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
  core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
  Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
  downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
  matrix maintenance, and the detector minor-probe fix.

After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
2026-06-19 11:27:17 +00:00
..
dashboards fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
server-power-cycle fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
alert_digest.py monitoring: reduce Slack alert noise (alert-on-change + daily digest) 2026-06-12 20:35:56 +00:00
alert_digest.tf monitoring: reduce Slack alert noise (alert-on-change + daily digest) 2026-06-12 20:35:56 +00:00
alloy.yaml fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
authentik_walloff_probe.tf t3: differential drop-attribution probe + devvm metrics 2026-06-10 21:11:29 +00:00
Dockerfile fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
goflow2.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
grafana.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
grafana_chart_values.yaml fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
idrac.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
k8s-monitoring-values.yaml fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
loki.tf t3: session-auth detection for the gated nightly tracker (dispatch fallback logging + Loki alerts) 2026-06-16 09:56:55 +00:00
loki.yaml fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
loki_ingress.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
main.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
prometheus.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
prometheus_chart_values.tpl k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook 2026-06-19 11:27:17 +00:00
prometheus_snmp_chart_values.yaml fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
pve_exporter.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
snmp_exporter.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
ups_snmp_values.yaml fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00