infra/docs/architecture
Viktor Barzin 7270e2be3b
monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge, and it stays 1 until the next preflight resets it, so
the chain-job-failed alert is suppressed for the blocked period. A genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting the gauge, so the alert
still fires. That preserves the alert's original purpose: catching the
pre-in_flight preflight failure that hid the 5-day 1.34.9 wedge. Runbook +
automated-upgrades docs updated to match.
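
The gating described above can be sketched as a small truth-table model. This
is an illustration only, not the actual rule: the function name and its
arguments are hypothetical, and the real left-hand alert expression is not
shown in this commit message. It assumes standard PromQL semantics, where
`unless on()` drops every left-hand sample as soon as the right-hand side
returns any sample at all.

```python
from typing import Optional


def chain_job_failed_fires(job_failed: bool, blocked_gauge: Optional[float]) -> bool:
    """Model `<job-failed expr> unless on() k8s_upgrade_blocked == 1`.

    job_failed    -- whether the (unspecified) left-hand expression returns a sample
    blocked_gauge -- current value of k8s_upgrade_blocked, or None if the
                     series does not exist
    """
    # `k8s_upgrade_blocked == 1` yields a sample only when the gauge exists
    # and its value is exactly 1.
    right_side_present = blocked_gauge == 1
    # `unless on()` matches on no labels, so any right-hand sample suppresses
    # every left-hand sample.
    return job_failed and not right_side_present


if __name__ == "__main__":
    # Deliberate compat-gate block: Job failed, gauge set -> suppressed.
    print(chain_job_failed_fires(True, 1.0))
    # Genuine wedge or crash: Job failed, gauge not set -> still fires.
    print(chain_job_failed_fires(True, 0.0))
    print(chain_job_failed_fires(True, None))
```

Under this model, a missing gauge series behaves the same as a gauge of 0, so
a preflight that crashes before ever setting the gauge is still caught.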

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:35:35 +00:00
agent-task-tracking.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
authentication.md chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028) 2026-06-20 20:04:24 +00:00
automated-upgrades.md monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block 2026-06-21 16:35:35 +00:00
backup-dr.md monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup 2026-06-10 09:10:46 +00:00
chrome-service.md tripit: enable live flight-fare scrape via shared chrome-service CDP 2026-06-11 14:23:53 +00:00
ci-cd.md cleanup: fully remove orphaned council-complaints app 2026-06-21 13:32:10 +00:00
compute.md apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip] 2026-06-11 18:00:08 +00:00
databases.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
dns.md pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere 2026-06-10 18:41:07 +00:00
homepage.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
incident-response.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
llama-cpp.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
mailserver.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
monitoring.md monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot 2026-06-19 08:45:39 +00:00
multi-tenancy.md Add per-user Claude auth renewal 2026-06-20 20:10:40 +00:00
networking.md docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed) 2026-06-21 13:39:26 +00:00
overview.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
secrets.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
security.md docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed) 2026-06-21 13:39:26 +00:00
storage.md docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip] 2026-06-11 17:50:43 +00:00
vpn.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
wave1-egress-observation-2026-05-22.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00