infra

Viktor Barzin fb638cd8ec	k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs	2026-06-17 13:10:18 +00:00

Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every firing alert also halts kured, so a bare-count false-positive would block all OS node reboots for the Job's 7-day TTL.

Verified against kube-state-metrics: the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0 for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
- docs/runbooks/k8s-version-upgrade.md: new alert in the gates list; the "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
- docs/architecture/automated-upgrades.md: fourth Upgrade Gates alert; retry-on-failure note on the deterministic-naming paragraph.
- .claude/skills/upgrade-state/SKILL.md: new "chain failed" status, legend entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
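The commit message contrasts a bare failed-pod count with the terminal job-condition reasons. That contrast can be sketched in Python. This is only an illustration of the decision logic, not the repository's actual alert rule: the function names and the reason-to-value mapping are hypothetical, and the mapping is assumed to hold the per-reason values that kube-state-metrics reports for a single Job.

```python
# Terminal job-condition reasons named in the commit message.
TERMINAL_REASONS = frozenset({"BackoffLimitExceeded", "DeadlineExceeded"})


def bare_count_alert(failed_pods):
    """Older behaviour per the commit message: fire on any failed pod."""
    return failed_pods > 0


def terminal_reason_alert(failed_by_reason):
    """Refined behaviour: fire only when a terminal reason is nonzero.

    failed_by_reason is a hypothetical mapping of reason label -> value
    for one Job (assumed shape, not taken from the repository).
    """
    return any(
        value > 0
        for reason, value in failed_by_reason.items()
        if reason in TERMINAL_REASONS
    )


# Stuck preflight: the Job ended with a terminal reason.
stuck = {"BackoffLimitExceeded": 1, "DeadlineExceeded": 0}

# First pod failed, retry succeeded: the Job is Complete, terminal reasons are 0.
retried_ok = {"BackoffLimitExceeded": 0, "DeadlineExceeded": 0}

print(terminal_reason_alert(stuck))       # True
print(terminal_reason_alert(retried_ok))  # False
print(bare_count_alert(1))                # True (the false positive being avoided)
```

Under these assumptions, the retried-and-succeeded Job would have tripped the bare-count check but does not trip the terminal-reason check, which is the case the commit says must not halt kured.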
..
agent-task-tracking.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
authentication.md	feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020)	2026-06-15 21:48:04 +00:00
automated-upgrades.md	k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs	2026-06-17 13:10:18 +00:00
backup-dr.md	monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup	2026-06-10 09:10:46 +00:00
chrome-service.md	tripit: enable live flight-fare scrape via shared chrome-service CDP	2026-06-11 14:23:53 +00:00
ci-cd.md	docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias	2026-06-15 20:19:17 +00:00
compute.md	apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]	2026-06-11 18:00:08 +00:00
databases.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
dns.md	pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere	2026-06-10 18:41:07 +00:00
homepage.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
incident-response.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
llama-cpp.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
mailserver.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
monitoring.md	pve-host/dns: register loki.viktorbarzin.lan CNAME, drop the /etc/hosts pin	2026-06-10 22:55:20 +00:00
multi-tenancy.md	workstation: per-user playwright browser MCP for all users, reproducible from git	2026-06-16 20:33:47 +00:00
networking.md	cloudflared: disable in-place autoupdate (--no-autoupdate)	2026-06-10 21:00:05 +00:00
overview.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
secrets.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
security.md	claude-breakglass: in-cluster warm break-glass UI for the devvm	2026-06-12 21:40:17 +00:00
storage.md	docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip]	2026-06-11 17:50:43 +00:00
vpn.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
wave1-egress-observation-2026-05-22.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00