infra/docs/plans/2026-05-26-talos-migration-design.md
Viktor Barzin 3526089457 docs: Talos migration design v7 — staged plan after 6 rounds of critique [ci skip]
Handoff artifact for next session. v7 is the converged staged plan
(Stage A hardened Ubuntu → B DR primitives → C 6-week soak →
D-optional Talos). User decision pending: pick v4 (full Talos, 117-178h)
vs v7 (staged, 30-37h to decision point) vs hybrid.

Full context in ~/.claude/plans/distributed-humming-sonnet.md.
2026-05-26 19:45:48 +00:00

453 lines
26 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Drift elimination — STAGED plan (v7 — final, converged)
**Status:** v7 — final converged plan after 6 rounds of critique. v7
fixes the R6 substantive findings:
- A.0 rationale corrected (stopped VMs don't reserve RAM; rationale was wrong)
- B.1 deployment surface decided (docker-registry VM 220, not "PVE Docker host" which doesn't exist)
- A.3 AIDE scope narrowed (specific files, not `/var/lib/kubelet/` directory which is noise-flooded by kubelet writes)
- Minor: real AIDE image identified; break-glass procedure caveated; line citation corrected.
**Iteration loop STOPS here.** Remaining issues at this point are
implementation details the operator resolves at execution time. The
critic chain has converged: each round found fewer + smaller issues
(v1: 30+ → v2: 30+ → v3: 30+ → v4: 30+ → v5: 5-7 → v6: 3 → v7:
expected 0-2 minor). Continuing iteration would produce v8 with 1-2
findings, v9 with 0-1, etc. — diminishing returns. Operator owns the
plan from here.
**Owner:** Viktor
**Iteration history**:
- v1 (in-place rolling etcd peer-join, 4-6 weeks) — 3/3 critics DISAGREE
- v2 (parallel cluster + GitOps replay, 4-6 weekends) — 3/3 DISAGREE; PVE memory physically impossible, MetalLB IP collision
- v3 (1-weekend greenfield, "4-6h Saturday") — 3/3 DISAGREE; fictional timing, 3 load-bearing false claims
- v4 (honest 6-week greenfield, 40-50h) — 3/3 DISAGREE; still 50-110% under realistic 75-106h; commits to Talos before answering "is it worth 60h"
- v5 (staged, decision-gated, OS-neutral first) — 3/3 DISAGREE with shape-AGREE; 5 specific implementation issues
- **v6 (this plan) — same staged shape, R5 implementation fixes**
## 0a. User confirmation gate (NEW IN V6)
Before any prep starts, user explicitly confirms:
- [ ] **Path acceptance**: Staged plan (Stage A → B → C → D-optional), NOT direct full Talos migration
- [ ] **Date**: Stage A execution Sat 2026-06-06 (4 days prep this week + 5 days week-2 sandbox testing)
- [ ] **Trade-off acceptance**: ~85% drift elimination from Stages A+B may suffice; Stage D commitment is gated on Stage C evidence, not pre-decided
- [ ] **Competing-commitment awareness**: 15-23h Stages A+B compete with `code-963q` MySQL upgrade and `code-8ywc` Security wave 1 enforce-mode flip
If user prefers full v4-scope Talos migration anyway: stop reading v6, return to v4. Both plans valid; pick one consciously.
## 0. Why staged
Across 4 rounds, critics consistently said:
1. **Drift elimination is achievable in stages.** Path X (hardened Ubuntu) gives ~85% of the value of Talos at <10% of the cost.
2. **DR primitive modernization is OS-neutral** (~60% of Phase -3 work applies regardless of OS choice).
3. **The Talos decision shouldn't be forced today.** Empirical drift data (from Stage A) + DR battle-testing (from Stage B) inform the right answer better than a planning document can.
4. **The user has competing commitments** P1 `code-8ywc` Security wave 1, P2 `code-963q` MySQL upgrade, P2 `code-dac` GoCardless reauth. A 60-100h Talos project displaces these.
v5 honors all four findings. The Talos commitment is staged to weeks 4-12, **after** empirical evidence from Stages A+B.
## 1. End-state options (decided by Stage C, not today)
After Stage C decision point:
**Outcome 1 — Staged execution completes at Stage B+:** drift elimination ~85% via hardened Ubuntu + modernized DR primitives. Cluster stays kubeadm/Ubuntu. Talos sandbox lives forever as learning lab.
**Outcome 2 — Stages A+B+D execute (full Talos migration):** drift elimination ~95% via Talos. Plan goes through honest 8-12 weeks per R4 critic B estimate. Empirically justified by drift evidence collected in Stage A soak.
**Outcome 3 — Stages A+B only, Talos deferred indefinitely:** drift elimination ~85%; cluster operates fine; user redirects time to closing P1/P2 beads. Talos reconsidered if drift event happens in the next 12 months.
All three outcomes are valid. v5 doesn't force the choice.
## 2. Stage A: Harden Ubuntu (1 weekend, ~10h) — v6 honest budget
**Goal**: ~85% drift elimination, additive only, zero risk to current cluster.
**v6 changes vs v5**:
- A.2 RO `/usr` reframed: `/usr` is NOT a separate partition (verified live). Use overlayfs-via-systemd OR drop in favor of A.3's file-integrity detection. v6 picks the latter as lower-risk.
- A.3 changed from "3 Kyverno ClusterPolicies" (wrong layer for OS file drift) to "AIDE + auditd DaemonSet" (correct layer).
- A.5 break-glass procedure rewritten: `kubectl debug node` is BLOCKED by Kyverno wave-1 `deny-privileged-containers` enforce policy (verified live). Only break-glass path is PVE console rescue boot.
- Pre-flight: delete stopped TrueNAS VM 9000 (frees 8 GB RAM headroom before drain operations).
### A.0 Pre-flight: investigate PVE memory pressure (30 min) — v7 fix
R6 verified: VM 9000 is STOPPED destroying it frees disk (~2.46 TB
LVM thin pool), NOT 8 GB RAM (RAM allocation is config-only on
stopped VMs; no qemu process consumes RAM). v6's rationale was wrong.
- `qm status 9000` confirm stopped
- `qm destroy 9000 --purge` frees ~2.46 TB thin pool space (good
hygiene; CLAUDE.md says it's "operationally decommissioned 2026-04-13
pending user decision on deletion")
- **Separate PVE memory pressure investigation** (which v6 conflated):
- `free -h` on PVE shows swap 99% used today real issue
- Top consumers: `qm list` + cross-reference top processes
- User offered earlier in session to shrink node5+6 from 328 GB each (frees ~48 GB)
- Decision for A.0: scale node5+6 to 8 GB BEFORE Stage A's drain
operations OR accept that drain may cascade (existing node memory
requests at 60-94% of limits per R4-B)
- Time: 30 min for scaling (drain qm set --memory reboot uncordon × 2 nodes)
This fix preserves the useful action (free disk, prep RAM) and
removes the wrong rationale.
### A.1 Lock down SSH on workers (2-3h)
- Drain k8s-node2 through k8s-node6 sequentially (~15min/node × 5 = 75min including reschedule wait)
- Per worker:
1. SSH in as `wizard` (still works at this point)
2. Create `/etc/ssh/sshd_config.d/99-hardening.conf`:
```
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers wizard
```
3. Restart sshd: `systemctl restart ssh`
4. Verify with `ssh wizard@<node>` from operator's laptop
5. ONLY THEN: `systemctl mask ssh.socket`
6. Uncordon
- Total: 75min drain + 30min config + 15min verification per node = ~2h
**SSH stays enabled on**:
- `k8s-master` (cluster_healthcheck.sh SSH-es only to PVE host, NOT master — verified live; keep SSH on master only for emergency debug)
- `k8s-node1` (GPU node — historically needs NVIDIA driver debug)
**SSH masked on**:
- k8s-node2 through k8s-node6 (CPU workers — pure k8s workload)
**Important**: nodes 1-6 are explicitly **out of Terraform** (see `infra/stacks/infra/main.tf` line 437). Stage A changes are NOT persisted across re-clone. If a worker is reprovisioned via `provision-k8s-worker`, SSH lockdown is wiped. **Mitigation**: also modify `infra/modules/create-template-vm/cloud_init.yaml` to bake SSH lockdown into the template (1h, addresses future provisions).
### A.2 ~~Read-only /usr~~ — DROPPED in v6
**Why dropped**: R5 verified `/usr` is NOT a separate partition on existing workers (it's a directory on the single root ext4). Repartitioning live nodes is multi-hour-per-node + reboot + risk. Bind-mount overlay conflicts with `unattended-upgrades` (currently enabled, writes to `/usr/bin`, `/usr/lib` for security updates).
**Replacement**: A.3's file-integrity detection (AIDE) catches `/usr` modifications regardless of whether they're allowed by the filesystem. Detection-based approach is sufficient for ~85% drift elimination goal.
If Outcome 2 (full Talos) triggers later, RO root comes for free.
### A.3 OS-level drift detection via AIDE DaemonSet (3-4h) — v7 fix
R6 verified: v6's image `ghcr.io/aide-rb/aide:latest` doesn't exist;
`/var/lib/kubelet/` is a high-churn directory (kubelet writes pod
sandboxes, ephemeral volume state, etc.) → AIDE on the full directory
floods false positives.
v7 fixes:
- **Build minimal Alpine + aide DaemonSet image** (no fictional
ghcr.io reference). Dockerfile:
```
FROM alpine:3.22
RUN apk add --no-cache aide
```
Build, push to forgejo.viktorbarzin.me/viktor/aide-daemonset:latest.
- Mounts host paths read-only:
- `/etc` (full)
- `/usr/bin`, `/usr/sbin`, `/usr/local/bin` (specific dirs, not all of `/usr` to avoid bind-mount complexity)
- `/etc/cni/net.d` (CNI config)
- `/etc/containerd/config.toml` (specific FILE, not full `/etc/containerd/` — only the config drift matters)
- `/etc/systemd/system/` (custom unit files)
- `/var/lib/kubelet/config.yaml` + `/var/lib/kubelet/kubeadm-flags.env` (specific FILES, NOT directory — kubelet writes pod state in same dir which floods false positives)
- Daily systemd-style timer runs `aide --check` against baseline DB
- On diff: post to Prometheus pushgateway with metric
`aide_drift_detected{node="X",path="..."} 1`
- Push diff content to Loki via DaemonSet sidecar
- Alert rule: `aide_drift_detected > 0 for 1h`
- Initial baseline taken at first deploy; reviewed by operator weekly
during Stage C
**Existing Kyverno wave-1 policies stay as-is** (admission-time drift
on K8s resources; AIDE covers OS-layer drift).
### A.4 Daily `tg plan` drift detection (2-3h)
- CronJob in `monitoring` namespace runs `terragrunt plan -detailed-exitcode` per stack at 06:00 daily
- 126 stacks × 22s avg with init cache = ~46min/run. Set `activeDeadlineSeconds: 3600`.
- Vault K8s auth role: new role `terraform-plan-runner` bound to dedicated SA in `monitoring` ns
- Exit code 2 → push metric to Prometheus pushgateway → alert if drift > 0 for >24h
- New script `scripts/drift-detect-cronjob.sh` + Terraform stack `infra/stacks/drift-detection/`
### A.5 Documentation + break-glass procedure (1-1.5h)
**Critical v6 fix (preserved + caveated in v7)**: `kubectl debug node` is
blocked by Kyverno wave-1 `deny-privileged-containers` enforce policy
(verified live).
**v7 caveat (R6 finding)**: Kyverno excludes some namespaces from the
policy. A privileged pod hand-crafted in `default`, `kube-system`, or
`kured` namespace MIGHT bypass — but operator should NOT rely on this
exception path since the wave-1 design intentionally restricted it.
**Primary break-glass procedure: PVE console rescue boot**:
1. Operator opens Proxmox web UI → VM → Console
2. Reboot VM, hold Shift at GRUB → select "Advanced options" → "Recovery mode"
3. Drop to root shell (no password required in single-user mode on this image)
4. `systemctl unmask ssh.socket && systemctl start ssh`
5. Edit `/etc/ssh/sshd_config.d/99-hardening.conf` if needed
6. Reboot normally
**Document this procedure with screenshots** in `infra/docs/runbooks/host-hardening.md`. Test the procedure on one worker BEFORE Stage A executes (Phase A.0 step).
Update `infra/.claude/CLAUDE.md` to note:
- SSH masked on workers k8s-node2-6
- Emergency rescue only via PVE console, not `kubectl debug node`
- AIDE detects but doesn't prevent drift on `/etc`, `/usr`
**Stage A exit gate**:
- All 5 workers have SSH masked AND PVE-console rescue tested on at least 1 worker
- AIDE DaemonSet running with baseline taken on all workers
- Daily drift-detect CronJob running
- `cluster_healthcheck.sh` passes (no new FAILs introduced)
- Cloud-init template updated to bake SSH lockdown for future provisions
**Time budget**: 9-12h (honest, per R5-B). **Reversibility**: per-node SSH unmask via PVE console rescue (30-60min/node). **Risk**: low (additive, no data path changes); medium for the rescue-procedure trust (test before relying on it).
## 3. Stage B: Modernize DR primitives (1 weekend, ~8h)
**Goal**: PG PITR + daily Vault snapshots + offsite verification. Useful regardless of OS choice. Done while Stage A soaks for drift events.
### B.1 Decide + deploy S3 endpoint (4-6h) — v7 fix
R6 verified: PVE host has NO Docker installed (`which docker` returns
nothing on 192.168.1.127). v6's "PVE-host Docker containers" deployment
surface doesn't exist.
**v7 decision: SeaweedFS containers on docker-registry VM (VMID 220,
IP 10.0.20.10)** — that VM already runs Docker and matches the
"docker-registry pattern" precedent.
Steps (~4-6h):
1. SSH to docker-registry VM (existing pattern; this VM has SSH enabled)
2. Add SeaweedFS to existing `/opt/registry/docker-compose.yml` OR new
`/opt/seaweedfs/docker-compose.yml`:
- `master`, `volume`, `filer`, `s3` containers
- Persistent storage on NFS mount (`/srv/nfs/seaweedfs/` on
192.168.1.127)
3. TLS cert (use existing wildcard fullchain.pem from
`infra/secrets/`; mount via volume) (30min)
4. DNS A record `s3.viktorbarzin.lan` → 10.0.20.10 in Technitium (5min)
5. Bucket `cnpg-backup` + IAM keys created via SeaweedFS S3 API (15min)
6. Prometheus scrape config (15min)
7. Smoke test from cluster pod: `s3cmd ls s3://cnpg-backup/` (15min)
**Single-point-of-failure trade-off**: docker-registry VM is on the
same PVE host as everything else. If PVE dies, both the cluster AND
the S3 endpoint die. **Mitigation**: barmanObjectStore writes BOTH
to S3 (local) AND backups are rsynced to Synology offsite via the
existing `offsite-sync-backup` systemd unit (already covers `/srv/nfs/`).
Acceptable for homelab.
**Alternative if SeaweedFS proves flaky**: MinIO via Synology Container
Manager (Synology has Container Manager / Docker package, unlike S3
storage). Avoid MinIO on K8s cluster (CNPG bootstrap cycle).
**Commit**: decision + steps documented in `infra/docs/architecture/storage.md`.
### B.2 CNPG barmanObjectStore (2h)
- Add `spec.backup.barmanObjectStore` to `pg-cluster` CR (read R4-A finding for exact HCL).
- `tg apply dbaas` → CNPG starts continuous WAL archival.
- First base-backup: `kubectl cnpg backup pg-cluster -n dbaas`.
- Verify WAL upload metric in Prometheus.
### B.3 Daily Vault Raft snapshot (15 min)
- Change `vault-raft-backup` CronJob schedule from `0 2 * * 0` to `0 2 * * *`.
- Verify next-night snapshot in `/srv/nfs/vault-backup/`.
- Verify Synology offsite copy via `ssh root@192.168.1.13 ls -la /volume1/Backup/Viki/nfs/vault-backup/` — must be ≤30h old.
- **Exit gate**: offsite copy fresh.
### B.4 Pre-flight stabilize cluster (2-3h)
R4-B verified: cluster is currently UNHEALTHY (3 FAIL + 6 WARN). Address regardless of OS choice:
- Fix postgresql-backup CronJob scheduling (was stuck for 2 days as of earlier)
- Fix LVMSnapshotStale alert (PVE-host script debug)
- Fix pushgateway backup metrics stale (separate from earlier session work)
- HA-Sofia integration health (6 not_loaded) — defer to user since requires HA admin actions
- Document remaining WARNs as accepted residual until specific incident
### B.5 Restore drill (1h)
- Restore Vault Raft snapshot to sandbox VM
- Restore CNPG base-backup to sandbox CNPG cluster
- Verify both reach functional state
- Document times in `infra/docs/runbooks/disaster-recovery-rehearsal.md`
**Stage B exit gate**:
- S3 endpoint operational, monitored
- CNPG continuous WAL archival running >7 days
- Vault snapshots daily, offsite ≤30h
- Restore drill timed + documented
- Cluster health 0 FAIL, ≤2 WARN
**Time budget**: 8h. **Reversibility**: B.1 endpoint can be torn down; B.2 barmanObjectStore can be removed from CR; B.3 schedule revert; B.4 work persists regardless. **Risk**: low.
## 4. Stage C: Decision point (1-2 weeks soak, ~1h active)
**Goal**: Decide between Outcome 1/2/3 based on empirical evidence from Stage A.
### C.1 Drift telemetry review (~30 min weekly)
For 2 weeks post-Stage A:
- Review Kyverno audit-mode violations: any drift detected?
- Review `tg plan` daily CronJob results: any unexpected drift in TF state?
- Review pod-side incidents: did any operational situation REQUIRE SSH-to-worker that the Stage A lockdown prevented?
### C.2 Sandbox Talos exploration (optional, ~4-8h spread over 2 weeks)
If the user wants empirical T4 + Talos evidence:
- Provision 3-VM Talos sandbox on `10.0.30.0/24` per round-3 critic C's recommendation
- Permanent learning environment
- Validate GPU + CSI + Calico without production risk
- No timeline pressure
### C.3 Decision criteria — v6 fix: soak extended to 6 weeks + Outcome 4 added
R5 critic A flagged: 2 weeks misses quarterly drift classes (kernel CVE, K8s minor, package update). v6 extends soak to **6 weeks** for adequate signal.
After 6 weeks Stage A + Stage B exit gates met, AND AIDE has at least 6 weeks of baseline data:
| Observation | Recommend |
|---|---|
| **No drift detected** in AIDE + tg plan daily | **Outcome 3** (defer Talos indefinitely). Use saved 60+h on P1 `code-8ywc` + P2 `code-963q` + other tasks. Sandbox Talos for learning value. |
| **Drift detected, contained by Stage A** (AIDE caught it, no incident) | **Outcome 4** (NEW): keep on Ubuntu + Stage A controls; flip Kyverno audit→enforce policies where appropriate; revisit Stage D in 6 months. Talos doesn't add value the hardening doesn't already provide. |
| **Drift detected that Stage A didn't catch** (e.g., container-runtime binary modification, kernel-module loading) AND caused/risked an incident | **Outcome 2** — full Talos migration per v4. Empirical justification documented. |
| **Sandbox Talos exploration reveals show-stopper** (T4 incompatibility, factory.talos.dev unreliability) | **Outcome 3** — Talos defer indefinitely. |
| **Sandbox Talos exploration validates cleanly** + user has 100+h appetite | **Outcome 2** — full Talos migration. |
### C.4 Decision artifact
Whatever the outcome: document in `infra/docs/decisions/2026-XX-XX-drift-elimination-strategy.md` (ADR format). Include:
- Drift telemetry summary
- Sandbox Talos findings (if explored)
- Selected outcome
- Justification
**Stage C exit gate**:
- 2 weeks of Stage A telemetry collected
- ADR written
- User has explicitly chosen Outcome 1, 2, or 3
**Time budget**: ~1h active operator time spread over 2 weeks. **Reversibility**: pure decision-making, no infrastructure changes.
## 5. Stage D (optional): Full Talos migration
**Triggered only if Stage C outcome = 2.** Specification preserved from v4 with R4 corrections applied.
**Honest scope** (per R4-B):
- 8-12 weeks calendar
- 75-106h operator time
- Realistic 12-18h Saturday cutover window (announce "Sat morning through Sun afternoon")
- 14-day soak with ~10-14h active work
**Pre-requisites met by prior stages**:
- ✅ Stage A: hardened Ubuntu workers (so during Stage D's parallel/dual-cluster window, drift is bounded)
- ✅ Stage B: barmanObjectStore + daily Vault snapshot + restore drill validated
- ✅ Stage C: empirical justification + ADR
**New pre-requisites NOT covered by prior stages** (Stage D's own Phase -2 work):
- migrate-pvc script (8-12h per R4-A)
- SOPS pre-seed Secrets for Talos bootstrap (1h)
- cluster_healthcheck.sh Talos rewrite (6-10h per R4-B)
- 30 runbooks Talos rewrite (~15h)
- K8s 1.34 → 1.36 deprecated-API cleanup (4-8h — 96 v1beta1 references)
- ESO v1beta1 → v1 migration (4-8h)
- code-963q MySQL upgrade calendar slot (4-8h, multi-day if wipe+reinit)
- code-8ywc Security wave 1 deferred by 2 months — operator must accept this
**Stage D execution** follows v4 §4-§19 with the above prerequisites added to Phase -2.
## 6. Schedule (v6 honest)
| Time | Activity | Active operator time |
|---|---|---|
| **This week (Tue-Fri evenings)** | Stage A prep: write systemd configs, AIDE manifests, CronJob HCL, test PVE-console rescue procedure | 6-8h |
| **Sat 2026-06-06** (note: NOT this Saturday) | Stage A: Harden Ubuntu (A.0 destroy VM 9000 + A.1 SSH lockdown + A.3 AIDE + A.4 tg-plan) | 9-12h |
| **Next weekend (Sat 2026-06-13)** | Stage B: DR primitives (SeaweedFS + barmanObjectStore + daily Vault + restore drill) | 8-10h |
| **Weeks 3-8 (6 weeks soak)** | Stage C: weekly AIDE review + optional sandbox Talos | ~6h total spread across 6 weeks |
| **Decision point** | Stage C ADR | 1h |
| **If Outcome 2 (Stage D)** | Full Talos migration per v4 with R3-A pre-requisites | 117-178h over 14-20 weeks |
| **If Outcome 1/3/4** | Done | — |
**Total to Stage C decision**: 30-37h over 8 weeks. **Total if Stage D triggers**: 147-215h over 22-28 weeks.
**Schedule shifted from v5**: Stage A moved from Sat 2026-05-30 to Sat 2026-06-06 to allow honest prep (per R5-B + R5-C feedback). Stage C soak extended from 2 weeks to 6 weeks for adequate drift signal.
## 7. Rollback per stage
Stage A: per-worker SSH unmask + `/usr` rw remount + Kyverno policy delete (each 10-30 min).
Stage B: barmanObjectStore removal from CR + schedule revert + S3 endpoint shutdown (each 10-30 min). The on-disk WAL archive is recoverable independently.
Stage C: pure decision-making, no rollback needed.
Stage D: per v4 rollback table.
## 8a. R5 critic findings — v6 status
| R5 finding | v6 status |
|---|---|
| Synology DSM has no S3 package | FIXED — B.1 picks SeaweedFS on PVE Docker directly |
| `/usr` is not a separate partition | FIXED — A.2 dropped; A.3 AIDE covers the gap |
| `kubectl debug node` blocked by Kyverno wave-1 | FIXED — A.5 documents PVE console rescue as the only break-glass; tested in A.0 |
| Kyverno is wrong layer for OS file drift | FIXED — A.3 replaced with AIDE DaemonSet |
| PVE host RAM at edge (swap full) | FIXED — A.0 destroys stopped TrueNAS VM 9000 to free 8 GB |
| Stage A SSH changes not in Terraform (re-clone wipes) | PARTIAL — A.1 updates cloud-init template too; existing nodes still need manual handling on re-clone |
| `cluster_healthcheck.sh` SSH path constraint wrong | FIXED — verified SSH is only to PVE host, not nodes; updated A.1 |
| 6h Stage A budget understated | FIXED — A budget honest at 9-12h; total Stage A weekend = 10h+ |
| 2-week soak misses quarterly drift | FIXED — C extended to 6 weeks |
| Decision criteria too binary; need Outcome 4 | FIXED — C.3 added "Outcome 4: drift contained, defer Stage D 6 months" |
| User re-confirmation gate missing | FIXED — §0a added |
| Stage A this-weekend prep window too tight | FIXED — moved to Sat 2026-06-06 with explicit Tue-Fri prep budget |
| Synology DMS-S3 fictional decision tree | FIXED — B.1 commits to SeaweedFS |
| ESO v1beta1 → v1 migration unbudgeted (96 references) | ACK — Stage D pre-requisite (no change from v5) |
| K8s 1.34→1.36 API deprecations | ACK — Stage D pre-requisite (no change from v5) |
| MySQL upgrade (code-963q) calendar slot | ACK — separate task; can run during Stage C soak (6-week window has room) |
## 8b. Critical findings from rounds 1-4 — addressed by staging
| R-round finding | v5 status |
|---|---|
| Talos identity preservation buys nothing user-visible (R1) | Acknowledged — Stage D only if drift evidence demands it. |
| Parallel cluster physically impossible on host (R2) | N/A — staged plan doesn't run two clusters simultaneously |
| Scheduled-downtime 4-6h fiction (R3) | Stage D acknowledges 12-18h cutover; only triggered after empirical justification |
| barmanObjectStore doesn't exist (R3) | Stage B builds it — first OS-neutral, used by Stage D if triggered |
| migrate-pvc script doesn't exist (R3) | Stage D pre-requisite, scoped honestly to 8-12h |
| Vault Raft weekly→daily, offsite 9 days behind (R3) | Stage B fixes immediately, before any Talos decision |
| cert-manager not installed; v3 wrong (R3) | N/A — staged plan keeps current Woodpecker certbot pipeline |
| LUKS / Vault chicken-and-egg (R3) | Stage D pre-requisite, 1h SOPS pre-seed |
| Kyverno wait + sync-registry-credentials (R3) | Stage D pre-requisite, scoped |
| Authentik 5.5h down window (R4) | N/A — staged plan no Saturday outage |
| 12.75h ≠ 12h announced window (R4) | N/A — Stage D acknowledges 12-18h |
| Synology S3 not deployed today (R4) | Stage B.1 makes decision + deploy explicit, budgeted 3-4h |
| Phase -3.7 vs Phase -2 budget conflict (R4) | Stage D pre-requisite tracked separately, not bundled |
| 96 v1beta1 ESO references (R4) | Stage D pre-requisite, 4-8h migration before Talos cutover |
| K8s 1.34→1.36 deprecated APIs (R4) | Stage D pre-requisite, 4-8h |
| `code-963q` MySQL upgrade interaction (R4) | Stage C decision point can schedule it separately or coincident with Stage D |
| `code-8ywc` Security wave 1 deferred (R4) | Acknowledged — Stage D only triggers if user accepts this defer |
| Cluster currently UNHEALTHY (R4) | Stage B.4 fixes regardless of OS choice |
| 60h opportunity cost vs 16+ open P2 tasks (R4) | Stage C decision-gated; user can choose to spend the 60h on other tasks |
| Phase 6.5 P0 verification infeasible in 30min (R4) | Stage D scope; if triggered, allocates honest verification time |
| Single-site DR (Synology + PVE same site) (R4) | Acknowledged residual risk regardless of OS |
| Cluster-identity §22 contradiction (R4) | N/A — staged plan doesn't make identity claims that contradict |
| No schedule slack (R4) | Stage D schedule has 2 weeks of soak buffer; staging plan reduces Stage D commitment risk |
**24 of 30+ critic findings either addressed in v5 or moved to Stage D pre-requisites where they're properly scoped.**
## 9. Remaining accepted residual risks
After Stage A+B execution:
1. **Stage A is policy-enforced, not OS-enforced.** A determined operator can `kubectl debug node/X --target` and modify /etc. Audit policy catches it; doesn't prevent it. Acceptable for homelab; not acceptable for regulated workloads (which this isn't).
2. **PG PITR window depends on barmanObjectStore retention** (30 days per Stage B.2 config). Older PITR not available unless backup retention extended.
3. **Stage A /usr RO doesn't cover /var, /etc/kubernetes, /etc/containerd, /etc/cni** — these are writable for legitimate config updates. Drift detection still relies on Kyverno + `tg plan`.
4. **Stage A drift detection has detection latency** (24h via daily CronJob; ~5min via Kyverno admission). Talos's "drift impossible" has zero latency. For a homelab this is acceptable.
5. **Stage C decision could go all 3 ways**; user retains optionality.
## 10. What this plan explicitly does NOT cover
- Mixed-OS topologies (decided by Stage D execution if triggered)
- Cluster API / CAPMOX
- Self-hosting Talos Image Factory (only relevant if Stage D triggers)
- Multi-PVE-host expansion
- Cilium migration
## 11. Why this is the right shape
Critics across 4 rounds pointed to staged execution. v5 commits to it. The key insight: **the right question isn't "how do I migrate to Talos?" — it's "do I need to migrate to Talos?"** Stage A answers that empirically.
Three weekends to know whether Talos is worth 8-12 weeks. If no: 15-23h saves 60-90h of effort. If yes: empirical justification + battle-tested DR primitives make the migration safer.