infra/docs/plans
Viktor Barzin eebb6c8594 k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case
The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:08:20 +00:00
..
2026-02-22-anti-ai-scraping-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-22-anti-ai-scraping-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-22-node-drift-quick-wins-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-22-talos-linux-migration-evaluation.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-23-mailserver-hardening-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-23-mailserver-hardening-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-28-ci-build-caching-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-28-ci-build-caching-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-28-network-visualization-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-28-network-visualization-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-02-28-storage-reliability-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-01-nfs-csi-migration-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-01-nfs-csi-migration-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-01-traefik-resilience-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-01-traefik-resilience-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-02-security-observability-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-03-cluster-hardening-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-07-k8s-portal-onboarding-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-07-sops-migration-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-28-storage-migration-truenas-elimination.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-03-proxmox-csi-cleanup-todo.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-20-infra-audit-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-25-nfs-hostile-migration-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-25-nfs-hostile-migration-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-07-forgejo-registry-consolidation-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-07-forgejo-registry-consolidation-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-16-auto-upgrade-apps-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-16-auto-upgrade-apps-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-17-agent-presence-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-19-mysql-8.4.9-upgrade-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-19-mysql-8.4.9-upgrade-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-21-ha-control-plane-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-21-ha-control-plane-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-22-openclaw-devvm-access-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-26-talos-migration-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-28-wealth-projections-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-28-wealth-projections-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-30-breakglass-ssh-access-design.md break-glass SSH: drop port-knock for exposed key-only :52222; version host config 2026-06-11 18:23:39 +00:00
2026-05-30-breakglass-ssh-access-plan.md break-glass SSH: drop port-knock for exposed key-only :52222; version host config 2026-06-11 18:23:39 +00:00
2026-05-30-traefik-dedicated-ip-etp-local-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-30-traefik-dedicated-ip-etp-local-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-01-t3-auto-provision-design.md t3-dispatch: re-pair on present-but-invalid t3_session cookie 2026-06-09 21:21:39 +00:00
2026-06-01-t3-auto-provision-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-01-topolvm-evaluation.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-01-wealth-dashboard-consolidation-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-03-lb-ip-hygiene-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-04-f1-stream-extraction-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-04-f1-stream-extraction-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-04-k8s-dashboard-sso-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-04-k8s-dashboard-sso-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-04-pve-fan-control-design.md fan-control docs: sync runbook/env/service/design to the HA-actuator + anti-flap model 2026-06-16 08:11:48 +00:00
2026-06-05-block-storage-harden-nfs-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-05-idrac-snmp-migration-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-05-idrac-snmp-migration-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-07-multi-user-workstation-design.md Add per-user Claude auth renewal 2026-06-20 20:10:40 +00:00
2026-06-07-multi-user-workstation-plan.md workstation: per-user playwright browser MCP for all users, reproducible from git 2026-06-16 20:33:47 +00:00
2026-06-08-matrix-synapse-to-tuwunel-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-08-matrix-synapse-to-tuwunel-plan.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-09-workstation-authentik-membership-design.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-09-workstation-authentik-membership-plan.md workstation: v2 membership implementation plan [ci skip] 2026-06-09 21:21:39 +00:00
2026-06-11-breakglass-ssh-redesign-design.md break-glass SSH: drop port-knock for exposed key-only :52222; version host config 2026-06-11 18:23:39 +00:00
2026-06-21-eso-0.12-to-2.x-migration-design.md eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared 2026-06-23 09:55:51 +00:00
2026-06-21-t3-idle-migrate-design.md docs: t3-migrate-idle runbook section + service-catalog + design status 2026-06-21 12:40:46 +00:00
2026-06-21-t3-idle-migrate-plan.md t3-idle-migrate: implementation plan 2026-06-21 12:26:05 +00:00
2026-06-28-k8s-upgrade-gate-held-classification.md k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case 2026-06-28 10:08:20 +00:00