infra/docs/architecture
Viktor Barzin fb638cd8ec
Some checks failed
ci/woodpecker/push/default Pipeline failed
k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
  - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
    "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
  - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
    retry-on-failure note on the deterministic-naming paragraph.
  - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
    entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00
..
agent-task-tracking.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
authentication.md feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020) 2026-06-15 21:48:04 +00:00
automated-upgrades.md k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs 2026-06-17 13:10:18 +00:00
backup-dr.md monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup 2026-06-10 09:10:46 +00:00
chrome-service.md tripit: enable live flight-fare scrape via shared chrome-service CDP 2026-06-11 14:23:53 +00:00
ci-cd.md docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias 2026-06-15 20:19:17 +00:00
compute.md apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip] 2026-06-11 18:00:08 +00:00
databases.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
dns.md pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere 2026-06-10 18:41:07 +00:00
homepage.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
incident-response.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
llama-cpp.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
mailserver.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
monitoring.md pve-host/dns: register loki.viktorbarzin.lan CNAME, drop the /etc/hosts pin 2026-06-10 22:55:20 +00:00
multi-tenancy.md workstation: per-user playwright browser MCP for all users, reproducible from git 2026-06-16 20:33:47 +00:00
networking.md cloudflared: disable in-place autoupdate (--no-autoupdate) 2026-06-10 21:00:05 +00:00
overview.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
secrets.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
security.md claude-breakglass: in-cluster warm break-glass UI for the devvm 2026-06-12 21:40:17 +00:00
storage.md docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip] 2026-06-11 17:50:43 +00:00
vpn.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
wave1-egress-observation-2026-05-22.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00