Handoff artifact for next session. v7 is the converged staged plan (Stage A hardened Ubuntu → B DR primitives → C 6-week soak → D-optional Talos). User decision pending: pick v4 (full Talos, 117-178h) vs v7 (staged, 30-37h to decision point) vs hybrid. Full context in ~/.claude/plans/distributed-humming-sonnet.md.
Drift elimination — STAGED plan (v7 — final, converged)
Status: v7 — final converged plan after 6 rounds of critique. v7 fixes the R6 substantive findings:
- A.0 rationale corrected (stopped VMs don't reserve RAM; rationale was wrong)
- B.1 deployment surface decided (docker-registry VM 220, not "PVE Docker host" which doesn't exist)
- A.3 AIDE scope narrowed (specific files, not the `/var/lib/kubelet/` directory, which is noise-flooded by kubelet writes)
- Minor: real AIDE image identified; break-glass procedure caveated; line citation corrected.
Iteration loop STOPS here. Remaining issues at this point are implementation details the operator resolves at execution time. The critic chain has converged: each round found fewer + smaller issues (v1: 30+ → v2: 30+ → v3: 30+ → v4: 30+ → v5: 5-7 → v6: 3 → v7: expected 0-2 minor). Continuing iteration would produce v8 with 1-2 findings, v9 with 0-1, etc. — diminishing returns. Operator owns the plan from here.
Owner: Viktor
Iteration history:
- v1 (in-place rolling etcd peer-join, 4-6 weeks) — 3/3 critics DISAGREE
- v2 (parallel cluster + GitOps replay, 4-6 weekends) — 3/3 DISAGREE; PVE memory physically impossible, MetalLB IP collision
- v3 (1-weekend greenfield, "4-6h Saturday") — 3/3 DISAGREE; fictional timing, 3 load-bearing false claims
- v4 (honest 6-week greenfield, 40-50h) — 3/3 DISAGREE; still 50-110% under realistic 75-106h; commits to Talos before answering "is it worth 60h"
- v5 (staged, decision-gated, OS-neutral first) — 3/3 DISAGREE with shape-AGREE; 5 specific implementation issues
- v6 (same staged shape, R5 implementation fixes): R6 found 3 substantive issues
- v7 (this plan): same staged shape, R6 fixes applied
0a. User confirmation gate (NEW IN V6)
Before any prep starts, user explicitly confirms:
- Path acceptance: Staged plan (Stage A → B → C → D-optional), NOT direct full Talos migration
- Date: Stage A execution Sat 2026-06-06 (4 days prep this week + 5 days week-2 sandbox testing)
- Trade-off acceptance: ~85% drift elimination from Stages A+B may suffice; Stage D commitment is gated on Stage C evidence, not pre-decided
- Competing-commitment awareness: 15-23h Stages A+B compete with `code-963q` MySQL upgrade and `code-8ywc` Security wave 1 enforce-mode flip
If user prefers full v4-scope Talos migration anyway: stop reading v6, return to v4. Both plans valid; pick one consciously.
0. Why staged
Across 4 rounds, critics consistently said:
- Drift elimination is achievable in stages. Path X (hardened Ubuntu) gives ~85% of the value of Talos at <10% of the cost.
- DR primitive modernization is OS-neutral (~60% of Phase -3 work applies regardless of OS choice).
- The Talos decision shouldn't be forced today. Empirical drift data (from Stage A) + DR battle-testing (from Stage B) inform the right answer better than a planning document can.
- The user has competing commitments: P1 `code-8ywc` Security wave 1, P2 `code-963q` MySQL upgrade, P2 `code-dac` GoCardless reauth. A 60-100h Talos project displaces these.
v5 honors all four findings. The Talos commitment is staged to weeks 4-12, after empirical evidence from Stages A+B.
1. End-state options (decided by Stage C, not today)
After Stage C decision point:
Outcome 1 — Staged execution completes at Stage B+: drift elimination ~85% via hardened Ubuntu + modernized DR primitives. Cluster stays kubeadm/Ubuntu. Talos sandbox lives forever as learning lab.
Outcome 2 — Stages A+B+D execute (full Talos migration): drift elimination ~95% via Talos. Plan goes through honest 8-12 weeks per R4 critic B estimate. Empirically justified by drift evidence collected in Stage A soak.
Outcome 3 — Stages A+B only, Talos deferred indefinitely: drift elimination ~85%; cluster operates fine; user redirects time to closing P1/P2 beads. Talos reconsidered if drift event happens in the next 12 months.
All three outcomes are valid. v5 doesn't force the choice.
2. Stage A: Harden Ubuntu (1 weekend, ~10h) — v6 honest budget
Goal: ~85% drift elimination, additive only, zero risk to current cluster.
v6 changes vs v5:
- A.2 RO `/usr` reframed: `/usr` is NOT a separate partition (verified live). Use overlayfs-via-systemd OR drop in favor of A.3's file-integrity detection. v6 picks the latter as lower-risk.
- A.3 changed from "3 Kyverno ClusterPolicies" (wrong layer for OS file drift) to "AIDE + auditd DaemonSet" (correct layer).
- A.5 break-glass procedure rewritten: `kubectl debug node` is BLOCKED by Kyverno wave-1 `deny-privileged-containers` enforce policy (verified live). Only break-glass path is PVE console rescue boot.
- Pre-flight: delete stopped TrueNAS VM 9000. v6 claimed this frees 8 GB RAM headroom before drain operations; A.0 corrects that (it frees disk, not RAM).
A.0 Pre-flight: investigate PVE memory pressure (30 min) — v7 fix
R6 verified: VM 9000 is STOPPED → destroying it frees disk (~2.46 TB LVM thin pool), NOT 8 GB RAM (RAM allocation is config-only on stopped VMs; no qemu process consumes RAM). v6's rationale was wrong.
- `qm status 9000`: confirm stopped
- `qm destroy 9000 --purge`: frees ~2.46 TB thin pool space (good hygiene; CLAUDE.md says it's "operationally decommissioned 2026-04-13 pending user decision on deletion")
- Separate PVE memory pressure investigation (which v6 conflated):
  - `free -h` on PVE shows swap 99% used today (real issue)
  - Top consumers: `qm list` + cross-reference top processes
  - User offered earlier in session to shrink node5+6 from 32→8 GB each (frees ~48 GB)
- Decision for A.0: scale node5+6 to 8 GB BEFORE Stage A's drain operations OR accept that drain may cascade (existing node memory requests at 60-94% of limits per R4-B)
- Time: 30 min for scaling (drain → qm set --memory → reboot → uncordon × 2 nodes)
This fix preserves the useful action (free disk, prep RAM) and removes the wrong rationale.
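The swap check in A.0 can be scripted instead of read off the screen. A minimal sketch, assuming `free -b` output that contains a `Swap:` row; the function names and the 90% default threshold are illustrative and not part of the plan:

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the 90% threshold are assumptions.

# Reads `free -b` text on stdin; prints the integer percentage of swap in use.
swap_used_pct() {
  awk '/^Swap:/ { if ($2 > 0) printf "%d\n", ($3 * 100) / $2; else print 0 }'
}

# Succeeds when swap usage is at or above the threshold (default 90).
swap_pressure() {
  local pct="$1" threshold="${2:-90}"
  [ "$pct" -ge "$threshold" ]
}
```

On the PVE host this would be used as `pct=$(free -b | swap_used_pct); swap_pressure "$pct" && echo "resolve memory pressure before draining"`.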
A.1 Lock down SSH on workers (2-3h)
- Drain k8s-node2 through k8s-node6 sequentially (~15min/node × 5 = 75min including reschedule wait)
- Per worker:
  - SSH in as `wizard` (still works at this point)
  - Create `/etc/ssh/sshd_config.d/99-hardening.conf` with three directives: `PasswordAuthentication no`, `PubkeyAuthentication yes`, `AllowUsers wizard`
  - Restart sshd: `systemctl restart ssh`
  - Verify with `ssh wizard@<node>` from operator's laptop
  - ONLY THEN: `systemctl mask ssh.socket`
  - Uncordon
- Total: 75min drain + 30min config + 15min verification per node = ~2h
SSH stays enabled on:
- `k8s-master` (cluster_healthcheck.sh SSH-es only to PVE host, NOT master, verified live; keep SSH on master only for emergency debug)
- `k8s-node1` (GPU node; historically needs NVIDIA driver debug)
SSH masked on:
- k8s-node2 through k8s-node6 (CPU workers — pure k8s workload)
Important: nodes 1-6 are explicitly out of Terraform (see infra/stacks/infra/main.tf line 437). Stage A changes are NOT persisted across re-clone. If a worker is reprovisioned via provision-k8s-worker, SSH lockdown is wiped. Mitigation: also modify infra/modules/create-template-vm/cloud_init.yaml to bake SSH lockdown into the template (1h, addresses future provisions).
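The per-worker sequence in A.1 is order-sensitive (verify before masking), so a dry-run wrapper helps review the exact commands first. A rough sketch, assuming `wizard` has passwordless `sudo` on the workers and that `kubectl` on the operator's machine targets this cluster; `run`, `lockdown_worker`, and `DRY_RUN` are hypothetical names:

```shell
#!/usr/bin/env bash
# Sketch only. DRY_RUN=1 (the default) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Drain, write the hardening drop-in, restart sshd, verify login, then mask.
# The && chain stops at the first failure, leaving the node cordoned and
# ssh.socket unmasked so the operator can still get in.
lockdown_worker() {
  local node="$1" conf=/etc/ssh/sshd_config.d/99-hardening.conf
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
  run ssh "wizard@$node" "printf '%s\n' 'PasswordAuthentication no' 'PubkeyAuthentication yes' 'AllowUsers wizard' | sudo tee $conf >/dev/null" &&
  run ssh "wizard@$node" "sudo sshd -t && sudo systemctl restart ssh" &&
  run ssh "wizard@$node" true &&   # fresh login must succeed before masking
  run ssh "wizard@$node" "sudo systemctl mask ssh.socket" &&
  run kubectl uncordon "$node"
}
```

Calling `lockdown_worker k8s-node2` with the default `DRY_RUN=1` prints the six commands for review; setting `DRY_RUN=0` executes them against one node at a time.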
A.2 Read-only /usr — DROPPED in v6
Why dropped: R5 verified /usr is NOT a separate partition on existing workers (it's a directory on the single root ext4). Repartitioning live nodes is multi-hour-per-node + reboot + risk. Bind-mount overlay conflicts with unattended-upgrades (currently enabled, writes to /usr/bin, /usr/lib for security updates).
Replacement: A.3's file-integrity detection (AIDE) catches /usr modifications regardless of whether they're allowed by the filesystem. Detection-based approach is sufficient for ~85% drift elimination goal.
If Outcome 2 (full Talos) triggers later, RO root comes for free.
A.3 OS-level drift detection via AIDE DaemonSet (3-4h) — v7 fix
R6 verified: v6's image ghcr.io/aide-rb/aide:latest doesn't exist;
/var/lib/kubelet/ is a high-churn directory (kubelet writes pod
sandboxes, ephemeral volume state, etc.) → AIDE on the full directory
floods false positives.
v7 fixes:
- Build minimal Alpine + aide DaemonSet image (no fictional ghcr.io reference). Dockerfile:

      FROM alpine:3.22
      RUN apk add --no-cache aide

  Build, push to `forgejo.viktorbarzin.me/viktor/aide-daemonset:latest`.
- Mounts host paths read-only:
  - `/etc` (full)
  - `/usr/bin`, `/usr/sbin`, `/usr/local/bin` (specific dirs, not all of `/usr`, to avoid bind-mount complexity)
  - `/etc/cni/net.d` (CNI config)
  - `/etc/containerd/config.toml` (specific FILE, not full `/etc/containerd/`; only the config drift matters)
  - `/etc/systemd/system/` (custom unit files)
  - `/var/lib/kubelet/config.yaml` + `/var/lib/kubelet/kubeadm-flags.env` (specific FILES, NOT directory; kubelet writes pod state in the same dir, which floods false positives)
- Daily systemd-style timer runs `aide --check` against baseline DB
- On diff: post to Prometheus pushgateway with metric `aide_drift_detected{node="X",path="..."} 1`
- Push diff content to Loki via DaemonSet sidecar
- Alert rule: `aide_drift_detected > 0 for 1h`
- Initial baseline taken at first deploy; reviewed by operator weekly during Stage C
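The check-then-push step above can be sketched as a small wrapper. This assumes the exit-status convention documented in aide(1) for `--check` (a bitmask where 1 = new, 2 = removed, 4 = changed files, with 14 and above reserved for errors). The Pushgateway address, the function names, and the choice to emit only a `node` label are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: maps an `aide --check` exit status to one metric line.

aide_metric_line() {
  local node="$1" status="$2" value
  if [ "$status" -eq 0 ]; then
    value=0                      # database and filesystem match
  elif [ "$status" -ge 1 ] && [ "$status" -le 7 ]; then
    value=1                      # any mix of new/removed/changed files
  else
    return 1                     # AIDE itself failed; never report "no drift"
  fi
  printf 'aide_drift_detected{node="%s"} %s\n' "$node" "$value"
}

# Assumed in-cluster address; replace with the real Pushgateway service.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://pushgateway.monitoring.svc:9091}"

push_aide_result() {
  local node="$1" status="$2" line
  line=$(aide_metric_line "$node" "$status") || return 1
  printf '%s\n' "$line" |
    curl --silent --fail --data-binary @- \
      "$PUSHGATEWAY_URL/metrics/job/aide/instance/$node"
}
```

A timer job would capture the status without aborting, for example `status=0; aide --check || status=$?; push_aide_result "$(hostname)" "$status"`. Treating an AIDE error as a failed push rather than as "no drift" keeps a broken check from looking healthy.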
Existing Kyverno wave-1 policies stay as-is (admission-time drift on K8s resources; AIDE covers OS-layer drift).
A.4 Daily tg plan drift detection (2-3h)
- CronJob in `monitoring` namespace runs `terragrunt plan -detailed-exitcode` per stack at 06:00 daily
- 126 stacks × 22s avg with init cache = ~46min/run. Set `activeDeadlineSeconds: 3600`.
- Vault K8s auth role: new role `terraform-plan-runner` bound to dedicated SA in `monitoring` ns
- Exit code 2 → push metric to Prometheus pushgateway → alert if drift > 0 for >24h
- New script `scripts/drift-detect-cronjob.sh` + Terraform stack `infra/stacks/drift-detection/`
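The exit-code handling and the runtime estimate above can be sketched as follows. With `-detailed-exitcode`, `plan` returns 0 for no changes, 1 for an error, and 2 when changes are present. The function names and the one-directory-per-stack layout under `STACKS_DIR` are assumptions, not the contents of the planned `scripts/drift-detect-cronjob.sh`:

```shell
#!/usr/bin/env bash
# Sketch only: classify plan results and estimate total runtime.

classify_plan_exit() {
  case "$1" in
    0) echo clean ;;
    2) echo drift ;;
    *) echo error ;;
  esac
}

# Whole minutes for N stacks at S seconds each (integer division).
estimate_runtime_min() {
  echo $(( $1 * $2 / 60 ))
}

# Prints "<stack> <result>" for every stack directory.
scan_stacks() {
  local dir rc
  for dir in "${STACKS_DIR:-infra/stacks}"/*/; do
    rc=0
    (cd "$dir" && terragrunt plan -detailed-exitcode >/dev/null 2>&1) || rc=$?
    printf '%s %s\n' "$(basename "$dir")" "$(classify_plan_exit "$rc")"
  done
}
```

`estimate_runtime_min 126 22` gives 46, which is where the ~46min/run figure and the 3600-second deadline come from. Keeping `error` separate from `drift` avoids alerting on drift when a stack simply failed to plan.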
A.5 Documentation + break-glass procedure (1-1.5h)
Critical v6 fix (preserved + caveated in v7): kubectl debug node is
blocked by Kyverno wave-1 deny-privileged-containers enforce policy
(verified live).
v7 caveat (R6 finding): Kyverno excludes some namespaces from the
policy. A privileged pod hand-crafted in default, kube-system, or
kured namespace MIGHT bypass — but operator should NOT rely on this
exception path since the wave-1 design intentionally restricted it.
Primary break-glass procedure: PVE console rescue boot:
- Operator opens Proxmox web UI → VM → Console
- Reboot VM, hold Shift at GRUB → select "Advanced options" → "Recovery mode"
- Drop to root shell (no password required in single-user mode on this image)
- `systemctl unmask ssh.socket && systemctl start ssh`
- Edit `/etc/ssh/sshd_config.d/99-hardening.conf` if needed
- Reboot normally
Document this procedure with screenshots in infra/docs/runbooks/host-hardening.md. Test the procedure on one worker BEFORE Stage A executes (Phase A.0 step).
Update infra/.claude/CLAUDE.md to note:
- SSH masked on workers k8s-node2-6
- Emergency rescue only via PVE console, not `kubectl debug node`
- AIDE detects but doesn't prevent drift on `/etc`, `/usr`
Stage A exit gate:
- All 5 workers have SSH masked AND PVE-console rescue tested on at least 1 worker
- AIDE DaemonSet running with baseline taken on all workers
- Daily drift-detect CronJob running
- `cluster_healthcheck.sh` passes (no new FAILs introduced)
- Cloud-init template updated to bake SSH lockdown for future provisions
Time budget: 9-12h (honest, per R5-B). Reversibility: per-node SSH unmask via PVE console rescue (30-60min/node). Risk: low (additive, no data path changes); medium for the rescue-procedure trust (test before relying on it).
3. Stage B: Modernize DR primitives (1 weekend, ~8h)
Goal: PG PITR + daily Vault snapshots + offsite verification. Useful regardless of OS choice. Done while Stage A soaks for drift events.
B.1 Decide + deploy S3 endpoint (4-6h) — v7 fix
R6 verified: PVE host has NO Docker installed (which docker returns
nothing on 192.168.1.127). v6's "PVE-host Docker containers" deployment
surface doesn't exist.
v7 decision: SeaweedFS containers on docker-registry VM (VMID 220, IP 10.0.20.10) — that VM already runs Docker and matches the "docker-registry pattern" precedent.
Steps (~4-6h):
- SSH to docker-registry VM (existing pattern; this VM has SSH enabled)
- Add SeaweedFS to existing `/opt/registry/docker-compose.yml` OR new `/opt/seaweedfs/docker-compose.yml`:
  - `master`, `volume`, `filer`, `s3` containers
  - Persistent storage on NFS mount (`/srv/nfs/seaweedfs/` on 192.168.1.127)
- TLS cert (use existing wildcard fullchain.pem from `infra/secrets/`; mount via volume) (30min)
- DNS A record `s3.viktorbarzin.lan` → 10.0.20.10 in Technitium (5min)
- Bucket `cnpg-backup` + IAM keys created via SeaweedFS S3 API (15min)
- Prometheus scrape config (15min)
- Smoke test from cluster pod: `s3cmd ls s3://cnpg-backup/` (15min)
Single-point-of-failure trade-off: docker-registry VM is on the
same PVE host as everything else. If PVE dies, both the cluster AND
the S3 endpoint die. Mitigation: backups land in two places.
barmanObjectStore writes to the local S3 endpoint, and the existing
offsite-sync-backup systemd unit (which already covers /srv/nfs/)
rsyncs them to the offsite Synology. Acceptable for homelab.
Alternative if SeaweedFS proves flaky: MinIO via Synology Container Manager (DSM ships a Container Manager / Docker package, even though it has no S3 package). Avoid MinIO on the K8s cluster (CNPG bootstrap cycle).
Commit: decision + steps documented in infra/docs/architecture/storage.md.
B.2 CNPG barmanObjectStore (2h)
- Add `spec.backup.barmanObjectStore` to `pg-cluster` CR (read R4-A finding for exact HCL).
- `tg apply dbaas` → CNPG starts continuous WAL archival.
- First base-backup: `kubectl cnpg backup pg-cluster -n dbaas`.
- Verify WAL upload metric in Prometheus.
B.3 Daily Vault Raft snapshot (15 min)
- Change `vault-raft-backup` CronJob schedule from `0 2 * * 0` to `0 2 * * *`.
- Verify next-night snapshot in `/srv/nfs/vault-backup/`.
- Verify Synology offsite copy via `ssh root@192.168.1.13 ls -la /volume1/Backup/Viki/nfs/vault-backup/`; must be ≤30h old.
- Exit gate: offsite copy fresh.
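The schedule change moves the job from 02:00 on Sundays (`0 2 * * 0`) to 02:00 every day (`0 2 * * *`). The ≤30h freshness gate can then be checked mechanically rather than by reading `ls -la` output. A minimal sketch with hypothetical helper names, taking modification time and current time as epoch seconds:

```shell
#!/usr/bin/env bash
# Sketch only: freshness check for the newest snapshot file.

# Age in whole hours between a file mtime and "now" (both epoch seconds).
age_hours() {
  echo $(( ($2 - $1) / 3600 ))
}

# Succeeds when the snapshot is at most max_hours old (default 30).
is_fresh() {
  local mtime="$1" now="$2" max_hours="${3:-30}"
  [ $(( now - mtime )) -le $(( max_hours * 3600 )) ]
}
```

With GNU coreutils this could be used as `is_fresh "$(stat -c %Y "$newest_snapshot")" "$(date +%s)"`, where `newest_snapshot` is whichever file the operator picks from the backup directory. A 30h limit on a 24h schedule leaves about 6h of slack for the offsite sync to finish.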
B.4 Pre-flight stabilize cluster (2-3h)
R4-B verified: cluster is currently UNHEALTHY (3 FAIL + 6 WARN). Address regardless of OS choice:
- Fix postgresql-backup CronJob scheduling (was stuck for 2 days as of earlier)
- Fix LVMSnapshotStale alert (PVE-host script debug)
- Fix pushgateway backup metrics stale (separate from earlier session work)
- HA-Sofia integration health (6 not_loaded) — defer to user since requires HA admin actions
- Document remaining WARNs as accepted residual until specific incident
B.5 Restore drill (1h)
- Restore Vault Raft snapshot to sandbox VM
- Restore CNPG base-backup to sandbox CNPG cluster
- Verify both reach functional state
- Document times in `infra/docs/runbooks/disaster-recovery-rehearsal.md`
Stage B exit gate:
- S3 endpoint operational, monitored
- CNPG continuous WAL archival running >7 days
- Vault snapshots daily, offsite ≤30h
- Restore drill timed + documented
- Cluster health 0 FAIL, ≤2 WARN
Time budget: 8h. Reversibility: B.1 endpoint can be torn down; B.2 barmanObjectStore can be removed from CR; B.3 schedule revert; B.4 work persists regardless. Risk: low.
4. Stage C: Decision point (6-week soak, ~6h active review + 1h ADR)
Goal: Decide between Outcome 1/2/3 based on empirical evidence from Stage A.
C.1 Drift telemetry review (~30 min weekly)
For 6 weeks post-Stage A (soak extended from 2 weeks in v6, see C.3):
- Review AIDE drift alerts and Kyverno audit-mode violations: any drift detected?
- Review `tg plan` daily CronJob results: any unexpected drift in TF state?
- Review pod-side incidents: did any operational situation REQUIRE SSH-to-worker that the Stage A lockdown prevented?
C.2 Sandbox Talos exploration (optional, ~4-8h spread over 2 weeks)
If the user wants empirical T4 + Talos evidence:
- Provision 3-VM Talos sandbox on `10.0.30.0/24` per round-3 critic C's recommendation
- Permanent learning environment
- Validate GPU + CSI + Calico without production risk
- No timeline pressure
C.3 Decision criteria — v6 fix: soak extended to 6 weeks + Outcome 4 added
R5 critic A flagged: 2 weeks misses quarterly drift classes (kernel CVE, K8s minor, package update). v6 extends soak to 6 weeks for adequate signal.
After 6 weeks Stage A + Stage B exit gates met, AND AIDE has at least 6 weeks of baseline data:
| Observation | Recommend |
|---|---|
| No drift detected in AIDE + tg plan daily | Outcome 3 (defer Talos indefinitely). Use saved 60+h on P1 code-8ywc + P2 code-963q + other tasks. Sandbox Talos for learning value. |
| Drift detected, contained by Stage A (AIDE caught it, no incident) | Outcome 4 (NEW): keep on Ubuntu + Stage A controls; flip Kyverno audit→enforce policies where appropriate; revisit Stage D in 6 months. Talos doesn't add value the hardening doesn't already provide. |
| Drift detected that Stage A didn't catch (e.g., container-runtime binary modification, kernel-module loading) AND caused/risked an incident | Outcome 2 — full Talos migration per v4. Empirical justification documented. |
| Sandbox Talos exploration reveals show-stopper (T4 incompatibility, factory.talos.dev unreliability) | Outcome 3 — Talos defer indefinitely. |
| Sandbox Talos exploration validates cleanly + user has 100+h appetite | Outcome 2 — full Talos migration. |
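The three drift rows of the table above can be written down as a small decision function, which makes the gaps explicit. This is a sketch under assumptions: the function name and the yes/no arguments are hypothetical, the two sandbox rows are deliberately not encoded, and combinations the table does not list return `undecided` instead of guessing:

```shell
#!/usr/bin/env bash
# Sketch only: encodes the drift rows of the C.3 table.
# Args: drift detected? / caught by Stage A? / caused or risked an incident?
recommend_outcome() {
  local drift="$1" caught="$2" incident="$3"
  if [ "$drift" = no ]; then
    echo 3            # no drift: defer Talos indefinitely
  elif [ "$caught" = yes ] && [ "$incident" = no ]; then
    echo 4            # drift contained by Stage A controls
  elif [ "$caught" = no ] && [ "$incident" = yes ]; then
    echo 2            # uncaught drift with incident: full Talos migration
  else
    echo undecided    # combination not covered by the table
  fi
}
```

For example, `recommend_outcome yes yes no` prints 4. The `undecided` cases (drift caught but an incident still occurred, or drift missed without incident) are where the ADR needs a human judgment.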
C.4 Decision artifact
Whatever the outcome: document in infra/docs/decisions/2026-XX-XX-drift-elimination-strategy.md (ADR format). Include:
- Drift telemetry summary
- Sandbox Talos findings (if explored)
- Selected outcome
- Justification
Stage C exit gate:
- 6 weeks of Stage A telemetry collected
- ADR written
- User has explicitly chosen Outcome 1, 2, 3, or 4
Time budget: ~6h active operator time spread over the 6-week soak, plus 1h for the ADR (per §6). Reversibility: pure decision-making, no infrastructure changes.
5. Stage D (optional): Full Talos migration
Triggered only if Stage C outcome = 2. Specification preserved from v4 with R4 corrections applied.
Honest scope (per R4-B):
- 8-12 weeks calendar
- 75-106h operator time
- Realistic 12-18h Saturday cutover window (announce "Sat morning through Sun afternoon")
- 14-day soak with ~10-14h active work
Pre-requisites met by prior stages:
- ✅ Stage A: hardened Ubuntu workers (so during Stage D's parallel/dual-cluster window, drift is bounded)
- ✅ Stage B: barmanObjectStore + daily Vault snapshot + restore drill validated
- ✅ Stage C: empirical justification + ADR
New pre-requisites NOT covered by prior stages (Stage D's own Phase -2 work):
- migrate-pvc script (8-12h per R4-A)
- SOPS pre-seed Secrets for Talos bootstrap (1h)
- cluster_healthcheck.sh Talos rewrite (6-10h per R4-B)
- 30 runbooks Talos rewrite (~15h)
- K8s 1.34 → 1.36 deprecated-API cleanup (4-8h — 96 v1beta1 references)
- ESO v1beta1 → v1 migration (4-8h)
- code-963q MySQL upgrade calendar slot (4-8h, multi-day if wipe+reinit)
- code-8ywc Security wave 1 deferred by 2 months — operator must accept this
Stage D execution follows v4 §4-§19 with the above prerequisites added to Phase -2.
6. Schedule (v6 honest)
| Time | Activity | Active operator time |
|---|---|---|
| This week (Tue-Fri evenings) | Stage A prep: write systemd configs, AIDE manifests, CronJob HCL, test PVE-console rescue procedure | 6-8h |
| Sat 2026-06-06 (note: NOT this Saturday) | Stage A: Harden Ubuntu (A.0 destroy VM 9000 + A.1 SSH lockdown + A.3 AIDE + A.4 tg-plan) | 9-12h |
| Next weekend (Sat 2026-06-13) | Stage B: DR primitives (SeaweedFS + barmanObjectStore + daily Vault + restore drill) | 8-10h |
| Weeks 3-8 (6 weeks soak) | Stage C: weekly AIDE review + optional sandbox Talos | ~6h total spread across 6 weeks |
| Decision point | Stage C ADR | 1h |
| If Outcome 2 (Stage D) | Full Talos migration per v4 with R3-A pre-requisites | 117-178h over 14-20 weeks |
| If Outcome 1/3/4 | Done | — |
Total to Stage C decision: 30-37h over 8 weeks. Total if Stage D triggers: 147-215h over 22-28 weeks.
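The totals can be re-derived from the Active operator time column of the schedule table. A small sketch with a hypothetical helper that adds "low-high" hour ranges, treating a bare number as low = high:

```shell
#!/usr/bin/env bash
# Sketch only: sum hour ranges such as "6-8" or "1".
sum_ranges() {
  local lo=0 hi=0 r
  for r in "$@"; do
    lo=$(( lo + ${r%-*} ))   # text before the dash (or the whole value)
    hi=$(( hi + ${r#*-} ))   # text after the dash (or the whole value)
  done
  echo "${lo}-${hi}"
}
```

`sum_ranges 6-8 9-12 8-10 6 1` (prep, Stage A, Stage B, soak, ADR) prints `30-37`, and `sum_ranges 30-37 117-178` prints `147-215`, matching the two totals stated above.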
Schedule shifted from v5: Stage A moved from Sat 2026-05-30 to Sat 2026-06-06 to allow honest prep (per R5-B + R5-C feedback). Stage C soak extended from 2 weeks to 6 weeks for adequate drift signal.
7. Rollback per stage
Stage A: per-worker SSH unmask via PVE console rescue (30-60 min/node, per A.5) + delete AIDE DaemonSet + delete drift-detect CronJob (each 10-30 min).
Stage B: barmanObjectStore removal from CR + schedule revert + S3 endpoint shutdown (each 10-30 min). The on-disk WAL archive is recoverable independently.
Stage C: pure decision-making, no rollback needed.
Stage D: per v4 rollback table.
8a. R5 critic findings — v6 status
| R5 finding | v6 status |
|---|---|
| Synology DSM has no S3 package | FIXED: B.1 picks SeaweedFS, deployed on the docker-registry VM (VMID 220) as of v7; v6's "PVE Docker" surface does not exist |
| `/usr` is not a separate partition | FIXED: A.2 dropped; A.3 AIDE covers the gap |
| `kubectl debug node` blocked by Kyverno wave-1 | FIXED: A.5 documents PVE console rescue as the only break-glass; tested in A.0 |
| Kyverno is wrong layer for OS file drift | FIXED: A.3 replaced with AIDE DaemonSet |
| PVE host RAM at edge (swap full) | FIXED: A.0 investigates memory pressure and can shrink node5+6 to 8 GB each; destroying stopped VM 9000 frees disk only, not the 8 GB RAM v6 claimed |
| Stage A SSH changes not in Terraform (re-clone wipes) | PARTIAL: A.1 updates cloud-init template too; existing nodes still need manual handling on re-clone |
| `cluster_healthcheck.sh` SSH path constraint wrong | FIXED: verified SSH is only to PVE host, not nodes; updated A.1 |
| 6h Stage A budget understated | FIXED: A budget honest at 9-12h; total Stage A weekend = 10h+ |
| 2-week soak misses quarterly drift | FIXED: C extended to 6 weeks |
| Decision criteria too binary; need Outcome 4 | FIXED: C.3 added "Outcome 4: drift contained, defer Stage D 6 months" |
| User re-confirmation gate missing | FIXED: §0a added |
| Stage A this-weekend prep window too tight | FIXED: moved to Sat 2026-06-06 with explicit Tue-Fri prep budget |
| Synology DSM-S3 fictional decision tree | FIXED: B.1 commits to SeaweedFS |
| ESO v1beta1 → v1 migration unbudgeted (96 references) | ACK: Stage D pre-requisite (no change from v5) |
| K8s 1.34→1.36 API deprecations | ACK: Stage D pre-requisite (no change from v5) |
| MySQL upgrade (code-963q) calendar slot | ACK: separate task; can run during Stage C soak (6-week window has room) |
8b. Critical findings from rounds 1-4 — addressed by staging
| R-round finding | v5 status |
|---|---|
| Talos identity preservation buys nothing user-visible (R1) | Acknowledged — Stage D only if drift evidence demands it. |
| Parallel cluster physically impossible on host (R2) | N/A — staged plan doesn't run two clusters simultaneously |
| Scheduled-downtime 4-6h fiction (R3) | Stage D acknowledges 12-18h cutover; only triggered after empirical justification |
| barmanObjectStore doesn't exist (R3) | Stage B builds it — first OS-neutral, used by Stage D if triggered |
| migrate-pvc script doesn't exist (R3) | Stage D pre-requisite, scoped honestly to 8-12h |
| Vault Raft weekly→daily, offsite 9 days behind (R3) | Stage B fixes immediately, before any Talos decision |
| cert-manager not installed; v3 wrong (R3) | N/A — staged plan keeps current Woodpecker certbot pipeline |
| LUKS / Vault chicken-and-egg (R3) | Stage D pre-requisite, 1h SOPS pre-seed |
| Kyverno wait + sync-registry-credentials (R3) | Stage D pre-requisite, scoped |
| Authentik 5.5h down window (R4) | N/A — staged plan no Saturday outage |
| 12.75h ≠ 12h announced window (R4) | N/A — Stage D acknowledges 12-18h |
| Synology S3 not deployed today (R4) | Stage B.1 makes decision + deploy explicit, budgeted 3-4h |
| Phase -3.7 vs Phase -2 budget conflict (R4) | Stage D pre-requisite tracked separately, not bundled |
| 96 v1beta1 ESO references (R4) | Stage D pre-requisite, 4-8h migration before Talos cutover |
| K8s 1.34→1.36 deprecated APIs (R4) | Stage D pre-requisite, 4-8h |
| `code-963q` MySQL upgrade interaction (R4) | Stage C decision point can schedule it separately or coincident with Stage D |
| `code-8ywc` Security wave 1 deferred (R4) | Acknowledged: Stage D only triggers if user accepts this defer |
| Cluster currently UNHEALTHY (R4) | Stage B.4 fixes regardless of OS choice |
| 60h opportunity cost vs 16+ open P2 tasks (R4) | Stage C decision-gated; user can choose to spend the 60h on other tasks |
| Phase 6.5 P0 verification infeasible in 30min (R4) | Stage D scope; if triggered, allocates honest verification time |
| Single-site DR (Synology + PVE same site) (R4) | Acknowledged residual risk regardless of OS |
| Cluster-identity §22 contradiction (R4) | N/A — staged plan doesn't make identity claims that contradict |
| No schedule slack (R4) | Stage D schedule has 2 weeks of soak buffer; staging plan reduces Stage D commitment risk |
24 of 30+ critic findings either addressed in v5 or moved to Stage D pre-requisites where they're properly scoped.
9. Remaining accepted residual risks
After Stage A+B execution:
- Stage A is detection-based, not OS-enforced. `kubectl debug node` is blocked by Kyverno (A.5), but a determined operator with another privileged path (PVE console, SSH on k8s-master or k8s-node1, or a pod in a Kyverno-excluded namespace) can still modify /etc. AIDE catches it; doesn't prevent it. Acceptable for homelab; not acceptable for regulated workloads (which this isn't).
- PG PITR window depends on barmanObjectStore retention (30 days per Stage B.2 config). Older PITR not available unless backup retention extended.
- With A.2 dropped, nothing in Stage A makes the filesystem read-only: /usr, /var, /etc/kubernetes, /etc/containerd, /etc/cni all stay writable for legitimate config updates. OS-layer drift detection relies on AIDE; Kyverno and `tg plan` cover K8s resources and Terraform state.
- Stage A drift detection has detection latency (24h via the daily AIDE check and daily CronJob; ~5min via Kyverno admission). Talos's "drift impossible" has zero latency. For a homelab this is acceptable.
- Stage C decision could go any of 4 ways (Outcomes 1-4); user retains optionality.
10. What this plan explicitly does NOT cover
- Mixed-OS topologies (decided by Stage D execution if triggered)
- Cluster API / CAPMOX
- Self-hosting Talos Image Factory (only relevant if Stage D triggers)
- Multi-PVE-host expansion
- Cilium migration
11. Why this is the right shape
Critics across 4 rounds pointed to staged execution. v5 commits to it. The key insight: the right question isn't "how do I migrate to Talos?" — it's "do I need to migrate to Talos?" Stage A answers that empirically.
Three weekends to know whether Talos is worth 8-12 weeks. If no: 15-23h saves 60-90h of effort. If yes: empirical justification + battle-tested DR primitives make the migration safer.