cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip]

The Cloudflare tunnel routed *.viktorbarzin.me and the apex to https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200 onto its dedicated 10.0.20.203 on 2026-05-30 (commit 0c01adac). Nothing serves HTTPS on .200:443 anymore, so cloudflared could not reach its origin (no route to host / i/o timeout) and Cloudflare returned 502 for every externally-proxied service. Internal/LAN access (split-horizon -> .203) was unaffected, which masked the outage.

Repoint both ingress rules at the in-cluster Traefik Service DNS (https://traefik.traefik.svc.cluster.local:443) -- the design the docs already described but the code never implemented -- so the tunnel is decoupled from the Traefik LB IP and this cannot recur on a future move.

Applied live via targeted apply on the tunnel config resource only; [ci skip] because live already matches and a full stack apply would churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk).

Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs(kms): document reboot-after-uninstall / 1603 handling + real-hardware status
2026-06-01 21:22:05 +00:00
7 changed files with 275 additions and 12 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -38,7 +38,7 @@ Violations cause state drift, which causes future applies to break or silently r
  - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
 - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<backend>.<ns>.svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/<path>"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
 - **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.200` to avoid hairpin NAT. Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
+- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.203` (with `skip_verify = true`, since the node dials Traefik by IP but the cert is for `forgejo.viktorbarzin.me`) to avoid hairpin NAT. **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left this redirect pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull (cached images kept running, so it stayed hidden until a new image tag was pulled). Redirect source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
 - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
 - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
 - **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
--- a/docs/plans/2026-06-01-topolvm-evaluation.md
+++ b/docs/plans/2026-06-01-topolvm-evaluation.md
@@ -0,0 +1,156 @@
+# TopoLVM Migration Evaluation
+
+**Date**: 2026-06-01
+**Status**: Evaluation — not yet decided
+**Decision**: Pending. This evaluation informs whether to commit to the migration.
+
+## Problem statement
+
+The cluster's block storage hits a **hardcoded 29-PVC-per-VM ceiling** in `sergelogvinov/proxmox-csi-plugin` (`pkg/csi/utils.go:394`, `for lun = 1; lun < 30; lun++`). The plugin scans Proxmox SCSI indices `scsi1..scsi29`; when all are taken, `ControllerPublishVolume` returns `Internal desc = no free lun found`. We hit this on 2026-05-26 with 4 stuck PVCs on k8s-node1 and responded by scaling from 4 → 6 worker VMs.
+
+Path 1 (patch the plugin to `lun < 31`) buys +1 slot per VM. Path 2 (NFS-migrate non-DB workloads) buys 20-30 PVCs of headroom. Both are tactical. This doc evaluates **Path 3 — replace the CSI driver with TopoLVM**, which removes the cap permanently by changing the storage architecture from "PVE-host LVM-thin + SCSI hotplug" to "per-VM LVM-thin + local provisioning".
+
+## What TopoLVM is
+
+CSI driver from cybozu-go. Each K8s node runs an `lvmd` daemon managing one or more LVM volume groups. The CSI controller creates `LogicalVolume` CRDs; `topolvm-node` on the target node reconciles them by asking `lvmd` to `lvcreate` an LV in the chosen VG. The LV is mounted directly on the node (no virtio-scsi hotplug). PVCs are LV slices, not separate SCSI devices — there is no per-VM cap beyond kernel LV count limits (effectively thousands).
+
+Mature project, used in production by Cybozu and others. Supports:
+- Thin provisioning (`type: thin` device class with overprovision ratio)
+- Multiple device classes per node (e.g., one for SSD, one for HDD)
+- CSI VolumeSnapshot CRDs (thin-provisioned volumes only; restore pinned to source node)
+- Online volume expansion (ext4, xfs, btrfs)
+- Striping and RAID via `lvcreate-options`
+
+## The big architectural trade-off — read this first
+
+| Aspect | proxmox-csi (today) | TopoLVM |
+|---|---|---|
+| Storage location | PVE-host thin pool (sdc) | Per-VM thin pool on a dedicated disk |
+| Per-VM PVC cap | **29** (plugin source) | None (kernel LV limits, thousands) |
+| **PVC mobility** | **Migrates between VMs**: CSI re-attaches LV to wherever the pod schedules | **Pinned to one node** via the `topology.topolvm.io/node` topology key (`topology.topolvm.cybozu.com/node` on legacy releases) |
+| Failure recovery | Pod reschedules to another VM, PVC follows | Pod can only restart on the same node; if the node dies, data is on the dead node |
+| IO contention | All VMs share sdc thin pool | Each VM's pool is on its own disk (which may still share underlying physical media) |
+| Snapshot mechanism | PVE-host `lvm-pvc-snapshot` script (custom) | CSI VolumeSnapshot CRDs (standard) |
+| Encryption | LUKS via Proxmox CSI `extraParameters` + ESO-synced secret | LUKS via `csi.storage.k8s.io/{node-stage,node-expand}-secret` — same pattern, different secret target |
+| Backup pipeline | sda → Synology via `daily-backup` script that mounts LVM snapshots on PVE | Same idea but snapshots live inside K8s VMs; backup script would need to run on each VM (or use CSI snapshot → object store) |
+| Operational model | "Storage is a shared pool, VMs are cattle" | "Storage is per-node, like local-path with LVM features" |
+
+**Data mobility is the most important difference.** Today, when k8s-node1 is drained for maintenance, all its PVC pods reschedule to other nodes and the proxmox-csi controller detaches/re-attaches the LVs accordingly. With TopoLVM, draining a node means **the PVC data is still on that node's local disk** — pods cannot start elsewhere until either (a) the data is migrated, or (b) the node returns.
+
+For Viktor's setup specifically:
+- **Pro**: the underlying PVE host is a single point of failure anyway (192.168.1.127). If the host dies, all VMs and all storage die together. The "mobility" of proxmox-csi is partially illusory at the homelab scale — the data isn't actually mobile across physical machines.
+- **Con**: VM-level failures (kernel panic, OOM, manual qm shutdown for maintenance) DO happen routinely. Today, the pod just reschedules; with TopoLVM, you wait for the VM to recover or you accept downtime.
+- **Mitigation**: For services that already have replication built in (CNPG Postgres cluster has 3 replicas, Redis-v2 has 3, Vault has 3-node Raft), the data-locality penalty is minimal — one replica's local LV being unavailable triggers a re-replication elsewhere. The PAIN is concentrated in single-replica stateful services: MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, all the SQLite-backed services.
+
+## Disk layout — three options
+
+TopoLVM needs a dedicated LVM VG per node. Three ways to provision it:
+
+### Option A — Carve from sdc (HDD), one VG per VM
+
+Add a second virtual disk to each K8s VM, sized for its expected PVC load. The disk lives on the existing sdc thin pool. Format as LVM PV → its own VG → TopoLVM thin pool.
+
+- **Sizing**: rough math from session-1 audit: 1.2 TB total LV allocation across 76 PVCs. Add 30% headroom = 1.6 TB. Distribute by current node placement:
+  - node1: Prometheus (433G) + others ≈ 600-700 GiB → **768 GiB disk**
+  - node2: Loki (50G) + smaller DBs ≈ 200 GiB → **256 GiB disk**
+  - node3: MySQL standalone + Immich PG + several DBs ≈ 200 GiB → **256 GiB disk**
+  - node4: smaller → **256 GiB disk**
+  - node5: smaller → **256 GiB disk**
+  - node6: Nextcloud + Vaultwarden + mailserver + small DBs ≈ 200 GiB → **256 GiB disk**
+  - **Total: ~2 TiB** carved from sdc thin pool (currently 66% used, 3.5 TiB free)
+- **Pro**: simplest physical change, no hardware needed, just `qm set <vmid> --scsiN local-lvm:NNN`
+- **Con**: IO contention on sdc unchanged. The 6 thin pools all sit on the same HDD physical layer. Storms hit harder because there's no inter-pool isolation at the LVM level.
+
+### Option B — Move hot workloads to sdb (SSD), keep cold on sdc
+
+Use a hybrid layout:
+- Per-VM SSD disk (sdb, 931 GB total, ~675 GB free) for hot DBs
+- Per-VM HDD disk (sdc) for cold/bulk
+
+TopoLVM supports multiple device classes per node — each VM would have an `ssd-thin` and `hdd-thin` class.
+
+- **Pro**: separates hot/cold IO; SSD-backed DBs are dramatically faster; partial IO-contention relief on sdc
+- **Con**: the ~675 GB of free SSD has to host DBs across 6 VMs (~112 GB each, tight). Need to identify which PVCs are hot. The encrypted PVCs (45 currently) are mostly DBs and would be the SSD candidates.
+
+### Option C — Add a second physical disk for storage
+
+Add a real SSD (e.g., a 2 TB NVMe) to the PVE host. Carve per-VM disks from it for TopoLVM. Keep sdc for VM root + nfs-data only.
+
+- **Pro**: cleanest physical isolation. Solves both LUN cap AND IO contention (the underlying beads `code-oflt` task).
+- **Con**: hardware investment. ~£200 for a 2 TB NVMe. Requires PVE host downtime to install. Existing PVE has 2 SATA ports used (sda + sdb) + M.2 slot (might be in use, need to check). LVM/thin pool setup is straightforward.
+
+## Migration approach
+
+Same pattern as the 2026-05-26 Wave 1 NFS migration, multiplied across more PVCs:
+
+1. **Install TopoLVM alongside proxmox-csi** — both run in parallel; new StorageClass `topolvm-provisioner` and `topolvm-provisioner-encrypted` created without touching existing PVCs
+2. **Per-VM data disk provisioning** — `qm set <vmid> --scsi8 local-lvm:NNN`, add `vgcreate` + `lvcreate` per VM (one-time)
+3. **lvmd config per node** — Helm values point to the right VG per node
+4. **Pilot migration** — pick a small, low-criticality PVC (e.g., a single-replica config-only service). Run the same scale-to-0 → rsync helper → swap claim_name → apply pattern from Wave 1. Validate.
+5. **Phased rollout** — migrate PVCs in batches by criticality:
+   - Wave A: regenerable / cache (5-10 PVCs, low risk)
+   - Wave B: app config PVCs with SQLite (15-20 PVCs, blip per service)
+   - Wave C: medium DBs (Postgres, MySQL, Redis with replicas) (10-15 PVCs)
+   - Wave D: critical singletons (Vaultwarden, Nextcloud, mailserver, MySQL standalone) (5-10 PVCs)
+   - Wave E: huge ones (Prometheus, Loki, Forgejo) (3-5 PVCs)
+6. **Rewrite backup pipeline** — current `daily-backup` mounts LVM snapshots on PVE host; new flow needs to either (a) run snapshot logic inside each K8s VM via DaemonSet, or (b) use CSI VolumeSnapshot CRDs + an external-snapshotter → restic/borg backend
+7. **Deprecate proxmox-csi** — once all PVCs migrated, remove the Helm release and the `proxmox-lvm` / `proxmox-lvm-encrypted` StorageClasses
+8. **Update docs** — `docs/architecture/storage.md`, `CLAUDE.md`, ingress factory references, several runbooks
+
+## Effort estimate
+
+| Phase | Time | Notes |
+|-------|------|-------|
+| Decision + Option A/B/C pick | 1 day | Includes any hardware ordering for Option C |
+| TopoLVM install + lvmd config | 1 day | Helm chart, secrets, RBAC, test on one node first |
+| Per-VM data disk provisioning | 0.5 day | Six VMs; coordinate with kubelet restart |
+| Encrypted PVC LUKS plumbing | 1 day | Verify the ExternalSecret pattern works with TopoLVM's secret refs |
+| Pilot migration (1 PVC) | 0.5 day | Includes rollback rehearsal |
+| Waves A-D migrations (~45 PVCs) | 5-7 days | ~20 min per PVC like Wave 1, plus verification |
+| Wave E (huge PVCs) | 2-3 days | Prometheus 433 GiB will take hours to rsync; needs careful staging |
+| Backup pipeline rewrite | 2-3 days | Snapshot-driven backup is a different model; testing |
+| Deprecation + cleanup | 1 day | Remove proxmox-csi, update SCs, update docs |
+| Docs + runbook updates | 1 day | storage.md, scale runbook, CLAUDE.md, post-mortems for incidents during migration |
+
+**Total: ~2.5-3 weeks of focused infra time.** Could stretch over a quarter if done alongside other work.
+
+## Risks
+
+| Risk | Likelihood | Mitigation |
+|------|------------|------------|
+| Data loss during PVC migration | Low | Rsync with `--checksum`, verify before deleting source, keep proxmox-csi running until each migration validates |
+| Data-locality penalty during VM reboot | High | Reboot one VM at a time; multi-replica services handle it; single-replica = brief downtime (same as today for kured-driven reboots, but more frequent in TopoLVM model) |
+| LUKS encryption plumbing different from current | Medium | Pilot encrypted PVC migration before committing |
+| Backup pipeline regression | High | Keep old `daily-backup` running until new pipeline proven for ≥2 weeks |
+| Snapshot semantics change (restore pinned to source node) | Medium | Document; not a blocker for normal use but matters for cross-VM restore scenarios |
+| TopoLVM does not solve IO contention | Certain (unless Option C) | Beads `code-oflt` remains open as a separate task |
+| Migration window for huge PVCs (Prometheus 433G) | Medium | Stage during low-traffic period; use rsync with checkpoint resumption |
+| Surprise incompatibility (Kyverno policy, Authentik, etc.) | Low | Pilot catches most |
+| Reverse migration if we change our mind | Medium | Always possible via the same rsync pattern, but tedious |
+
+## Decision criteria
+
+Pick TopoLVM (any option) if:
+- We hit the LUN cap repeatedly (≥2 incidents in 6 months)
+- We want to fix IO contention at the same time (then Option C only)
+- We're comfortable with single-node data locality
+
+Stay on proxmox-csi if:
+- The Path 1 + 2 combo gives us enough headroom for the foreseeable future
+- We value data mobility (any-pod-can-run-anywhere) over architectural cleanliness
+- The migration cost (3 weeks) outweighs the LUN-cap risk over the next year
+
+## Recommended next steps if pursuing
+
+1. **Run a small pilot first** — install TopoLVM on one node (k8s-node5 or node6 since they're newest and have less critical workloads), provision a 50 GB data disk, create a test PVC, migrate one tiny non-critical PVC, verify the operational pattern works end-to-end before committing to full migration
+2. **Pick Option A or C** — Option B is too SSD-constrained for the encrypted PVC volume we have
+3. **Order hardware if Option C** — NVMe + a hot-swap caddy or M.2 adapter; verify PVE host has the slot
+4. **Schedule a 3-week window** — partition the migration waves around other infra commitments; flag in beads as a P1
+
+## Related
+
+- `docs/architecture/storage.md` — current storage architecture
+- `docs/runbooks/scale-k8s-cluster.md` — current scaling playbook (Path 1+2 alternative)
+- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md` — IO contention is the related-but-separate concern
+- Beads `code-oflt` — IO isolation long-term fix (Option C would close this)
+- Remote memory id=2788 — proxmox-csi-plugin LUN cap explanation
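Reviewer note on the TopoLVM evaluation above: the LUN-cap and sizing figures can be re-derived with a few lines of arithmetic. The sketch below is only a sanity check; every number is copied from the doc, and the variable names are illustrative rather than anything used in the repo.

```python
# Worked arithmetic for the TopoLVM evaluation. All inputs come from the doc;
# the variable names are hypothetical.

# The quoted plugin loop `for lun = 1; lun < 30; lun++` visits scsi1..scsi29.
lun_slots = len(range(1, 30))         # slots per VM today
lun_slots_path1 = len(range(1, 31))   # slots per VM after the `lun < 31` patch

# Option A: proposed per-VM data disk sizes, in GiB.
option_a_disks_gib = {
    "node1": 768,
    "node2": 256,
    "node3": 256,
    "node4": 256,
    "node5": 256,
    "node6": 256,
}
option_a_total_gib = sum(option_a_disks_gib.values())
option_a_total_tib = option_a_total_gib / 1024

# Sizing input: 1.2 TB of allocated LVs plus 30% headroom.
with_headroom_tb = 1.2 * 1.30

# Option B: ~675 GB of free SSD split evenly across six VMs.
ssd_share_gb = 675 / 6

print(f"LUN slots per VM: {lun_slots} today, {lun_slots_path1} with Path 1")
print(f"Option A total: {option_a_total_gib} GiB = {option_a_total_tib:.1f} TiB")
print(f"Allocation with headroom: {with_headroom_tb:.2f} TB")
print(f"Option B SSD share per VM: {ssd_share_gb:.1f} GB")
```

The totals agree with the doc: 29 slots today and 30 under Path 1, about 2 TiB for Option A, roughly 1.6 TB after headroom, and about 112 GB of SSD per VM for Option B.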
--- a/docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md
+++ b/docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md
@@ -0,0 +1,73 @@
+# Post-Mortem: Cloudflare Tunnel Pointed at Traefik's Old LB IP → Full External 502
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-06-01 |
+| **Duration** | Misconfiguration latent since 2026-05-30 08:09Z (Traefik LB-IP move). Confirmed external outage in cloudflared logs from ~20:58Z; root-caused and fixed at 21:15Z; all pods converged by 21:16Z. Detection→fix window ~17 min. |
+| **Severity** | SEV1 — *every* Cloudflare-proxied hostname (`viktorbarzin.me` + all `*.viktorbarzin.me`) returned HTTP 502 to external clients. Internal/LAN access was unaffected (split-horizon → Traefik direct), which is why it stayed hidden. |
+| **Affected Services** | All external ingress: viktorbarzin.me, nextcloud, vault, authentik, vaultwarden, immich, linkwarden, nas, technitium, terminal, speedtest, and every other proxied app. |
+| **Issue** | None filed (diagnosed and fixed in-session). |
+| **Status** | Resolved. |
+| **Recurrence count** | 1st of this kind. Same *class* as the 2026-06-01 forgejo-registry `.200→.203` redirect breakage (containerd mirror) — both are fallout from the 2026-05-30 Traefik LB-IP move leaving a hard-coded `10.0.20.200` reference behind. |
+
+## Summary
+
+On 2026-05-30 (commit `0c01adac`) Traefik was moved off the shared MetalLB IP `10.0.20.200` onto its own dedicated IP `10.0.20.203` (with `externalTrafficPolicy: Local`). The Cloudflare tunnel's ingress rules — Terraform-managed in `stacks/cloudflared/modules/cloudflared/cloudflare.tf` — still routed `*.viktorbarzin.me` and `viktorbarzin.me` to `https://10.0.20.200:443`. After the move, nothing serves HTTPS on `.200:443` (the shared IP keeps only the non-HTTP LB services: postgresql-lb, headscale, wireguard, coturn, xray). cloudflared therefore could not reach its origin (`connect: no route to host` / `i/o timeout`), and Cloudflare returned 502 for the entire public surface.
+
+The fix: repoint both ingress rules at the in-cluster Traefik **Service DNS** `https://traefik.traefik.svc.cluster.local:443` — the design the docs already *described* (CLAUDE.md "Networking" §) but which the code never actually implemented. Service DNS decouples the tunnel from the LB IP, so a future Traefik IP change cannot reproduce this.
+
+## Impact
+
+- **User-facing**: 100% of externally-reachable services returned 502 via Cloudflare. LAN/internal access (which resolves `*.viktorbarzin.me` → `10.0.20.203` via Technitium split-horizon, bypassing Cloudflare) kept working — this masked the outage.
+- **Blast radius**: every proxied hostname. Origin (Traefik) was healthy the entire time — purely a tunnel-origin routing fault.
+- **Data loss**: none.
+- **Collateral**: Vault's own public hostname (`vault.viktorbarzin.me`) was also 502, creating a bootstrap problem for the fix — `terragrunt apply` needs Vault for the PG state-backend creds, but Vault was only reachable via the broken tunnel from the dev box. Worked around with a temporary `/etc/hosts` entry pointing `vault.viktorbarzin.me` → `10.0.20.203` (internal Traefik), removed after the apply.
+
+## Root Cause
+
+A hard-coded LB IP (`10.0.20.200`) in the tunnel origin survived the Traefik dedicated-IP migration. The 2026-05-30 migration updated Traefik's Service and the split-horizon DNS but did not grep for every consumer of the old `.200` HTTPS endpoint. Both the cloudflared tunnel origin and, separately, the containerd forgejo-registry redirect (fixed earlier the same day in `42db69a2`) were missed.
+
+Contributing factors:
+- **Docs described intent as reality.** CLAUDE.md stated cloudflared targets `traefik.traefik.svc.cluster.local:443` "so proxied apps are decoupled from the LB IP." The code used a raw IP. The doc gave false confidence that the decoupling existed.
+- **No guard** tied the tunnel origin to Traefik's actual address; a stale value plans/applies cleanly.
+- **Detection gap (masking).** Split-horizon means LAN users never see external-only breakage. The `[External]` Uptime-Kuma monitors + `ExternalAccessDivergence` alert are the only signal for this failure mode.
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **2026-05-30 08:09** | Commit `0c01adac` — Traefik moves to dedicated LB IP `10.0.20.203`. `.200:443` stops serving HTTPS. Tunnel origin still `.200`. Outage latent from here. |
+| **2026-06-01 ~20:51** | Keel auto-patches the cloudflared image; all 3 pods roll (coincidental — not the cause; the misconfig predates it). |
+| **2026-06-01 ~20:58** | cloudflared logs show every proxied hostname failing: `originService=https://10.0.20.200:443 … no route to host / i/o timeout`. |
+| **2026-06-01 ~21:08** | User reports "no ingress coming in." Investigation starts. |
+| **21:09** | Isolated: origin healthy (direct to `.203` → 200/302), public path → 502. cloudflared logs pin origin to dead `.200:443`. |
+| **21:10** | Confirmed tunnel config is Terraform-managed (`cloudflare_zero_trust_tunnel_cloudflared_config.sof`), origin = `.200` on both ingress rules. |
+| **21:13** | Vault unreachable via public name (circular dep); worked around with temp `/etc/hosts` → `.203`. `tg init -reconfigure` (rotated PG backend creds). |
+| **21:15:25** | Targeted apply: both ingress origins → `https://traefik.traefik.svc.cluster.local:443`. `Apply complete! 1 changed`. |
+| **21:15:34–50** | cloudflared pushes config `version=253`; pods converge. |
+| **21:16** | 10/10 curls to `viktorbarzin.me` → 200; 0 `.200` errors across all pods; `vault.viktorbarzin.me` via real Cloudflare path → 200. Temp hosts entry removed. Resolved. |
+
+## Resolution
+
+Changed both `ingress_rule` blocks in `cloudflare.tf` from `https://10.0.20.200:443` to `https://traefik.traefik.svc.cluster.local:443` (`no_tls_verify = true` retained). Applied surgically with `-target` on the tunnel config resource only, to avoid touching two pre-existing, unrelated drift items the full plan surfaced (see below).
+
+## Pre-existing drift (NOT part of this incident, left untouched)
+
+The full `cloudflared` stack plan showed two extra in-place changes, deliberately **not** applied:
+1. `kubernetes_deployment.cloudflared` — TF would strip Keel's runtime annotations (`keel.sh/policy|pollSchedule|trigger|update-time`). The deployment ignores `dns_config` but not `metadata.annotations`, so Keel's enrollment annotations look like drift. Self-healing (Keel re-adds within its 1h poll), but a clean fix is to add `metadata[0].annotations` (and the template equivalent) to `ignore_changes`, or codify the policy annotation in TF.
+2. `cloudflare_record.mail_domainkey_rspamd` — cosmetic re-chunking of the DKIM TXT record (identical key, different 255-char split). Benign.
+
+## Action Items
+
+- [x] Repoint tunnel origin to Traefik Service DNS (this fix).
+- [x] Post-mortem written; CLAUDE.md networking claim is now actually true.
+- [ ] **Pin exact outage-start** via Uptime-Kuma `[External]` monitor history / `ExternalAccessDivergence` firing time (confirm whether it began at the 05-30 move and went unnoticed, or at a later tunnel re-apply).
+- [ ] **Verify `ExternalAccessDivergence` is wired to a channel that gets seen** — this is the only alert that catches external-only breakage; it apparently did not prompt action for ≤2.5 days.
+- [ ] **Migration checklist**: when an LB IP changes, grep the whole repo for the old IP before declaring done (this and the forgejo redirect were both missed `.200` references on 2026-05-30).
+- [ ] (Optional) Address the cloudflared Keel-annotation drift so the stack plans clean.
+
+## Lessons
+
+- Reference shared infra (Traefik) by **stable Service DNS, not LB IP**, from anything that can use cluster DNS. IPs are migration landmines.
+- Keep docs honest: a doc that describes intended design as current reality hides exactly this class of bug.
+- External-only outages are invisible from the LAN (split-horizon). The `[External]` divergence signal is load-bearing — it must be trustworthy and seen.
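Reviewer note on the "Migration checklist" action item in the post-mortem above: the "grep the whole repo for the old IP" step could be scripted roughly as follows. This is a minimal sketch, not something that exists in the repo; the function name `find_stale_ip` and its return shape are assumptions made for illustration.

```python
import re
from pathlib import Path


def find_stale_ip(root, ip):
    """Return (relative_path, line_number, line) for every line under root
    that mentions the exact dotted address ip."""
    # Guard both sides so 10.0.20.200 does not match 110.0.20.200 or 10.0.20.2000.
    pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\d)")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and anything inside a .git directory.
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits
```

Calling `find_stale_ip(".", "10.0.20.200")` from the repo root before declaring an LB-IP move done would list every remaining consumer. It matches only the full dotted address, so shorthand such as `.200` in prose still needs a manual pass.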
--- a/docs/runbooks/kms-public-exposure.md
+++ b/docs/runbooks/kms-public-exposure.md
@@ -94,11 +94,39 @@ how to tune the rate limit, how to revoke if abused.
  overrides) — in-place edition UPGRADE, **needs a reboot then re-run**, one-way
  (no in-place downgrade). Office → slim ODT `setup.exe /configure` to a VL
  product (default ProPlus2024Volume; `$env:KMS_OFFICE_PRODUCT` overrides) — ~3 GB
-  download, closes Office. Non-interactive runs only proceed with an explicit env
-  override. setup-kms.ps1 stays minimal and points non-VL editions at the
-  bootstrap. NOTE: the changepk/ODT execution paths are unverified on real
-  hardware (no Home/retail test box; the Pro test VM can't be switched reversibly)
-  — syntax-checked + activation regression-tested only.
+  download, closes Office. If an INCOMPATIBLE Click-to-Run Office is installed
+  (retail/M365 — `ProductReleaseIds` not ending in `Volume`), it's named in the
+  prompt and **uninstalled first** via ODT `<Remove>` of just those products (VL
+  products of other families are kept), then the VL product installs. The ODT run
+  is one shared `Invoke-Odt` for both `<Add>` and `<Remove>`. **Removing the bundled
+  consumer Office leaves a pending reboot**, so a VL install in the same run — or a
+  re-run before rebooting — fails with `setup.exe` exit **1603**. Two guards: a
+  hard-reboot (CBS/WU) gate before the ~3 GB download, and a reboot-aware 1603
+  message telling the user to reboot + re-run (idempotent — the incompatible Office
+  is already gone). `Invoke-Odt` checks the setup.exe exit code and on failure
+  captures the C2R log from `%TEMP%` into telemetry; `Wait-OfficeInstalled` polls
+  on-disk state (ospp.vbs + ProductReleaseIds) because `setup.exe` can return before
+  the C2R install finishes. Non-interactive runs only proceed with an explicit env
+  override. setup-kms.ps1 stays minimal and points non-VL editions at the bootstrap.
+  NOTE: real-hardware status (2026-06-01) — the incompatible-uninstall path DID run
+  on a real M365/Office-Home box (`O365HomePremRetail` removed cleanly); the VL
+  install then needs a reboot first (hit 1603, now guided). changepk edition-switch
+  remains untested (no Home test box; the Pro test VM can't be switched reversibly).
+- **Self-hosted ODT bootstrapper**: the Office reinstall path fetches the Office
+  Deployment Tool from `https://kms.viktorbarzin.me/scripts/odt-setup.exe` (a
+  committed copy in `kms-website/static/scripts/`), NOT from Microsoft —
+  `download.microsoft.com`'s ODT URL is build-numbered and rotates every release
+  (the old hardcoded one 404'd). `$env:KMS_ODT_URL` overrides. The bootstrapper
+  self-updates the Office payload, so refresh the committed copy only occasionally.
+- **Client telemetry → Loki**: the scripts POST a small ANONYMOUS diagnostics
+  event per run to `https://kms.viktorbarzin.me/diag` (action, outcome, error +
+  exit codes, EditionID/build/locale, detected Office products, script version;
+  NO hostname/user/keys). Fire-and-forget (3s, swallowed) — never affects
+  activation. `$env:KMS_NO_TELEMETRY=1` opts out; `$env:KMS_DIAG_URL` overrides.
+  Collector: standalone `kms-diag` Deployment (`stacks/kms`, python stdlib HTTP
+  on :9102) reachable via the `/diag` ingress carve-out (bypasses Anubis like
+  `/scripts`); it prints `KMSDIAG <json>` to stdout → Loki. Query in Grafana:
+  `{namespace="kms",pod=~"kms-diag.*"} |= "KMSDIAG"`. Disclosed in the site FAQ.

 ## Where the logs are

--- a/modules/create-template-vm/k8s-node-containerd-setup.sh
+++ b/modules/create-template-vm/k8s-node-containerd-setup.sh
@ -57,8 +57,9 @@ mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me
 cat > /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml <<'FORGEJO'
 server = "https://forgejo.viktorbarzin.me"

-[host."https://10.0.20.200"]
+[host."https://10.0.20.203"]
  capabilities = ["pull", "resolve"]
+  skip_verify = true
 FORGEJO

 # quay.io + registry.k8s.io: include mirror configs that match node4's
--- a/scripts/setup-forgejo-containerd-mirror.sh
+++ b/scripts/setup-forgejo-containerd-mirror.sh
@ -21,8 +21,9 @@ set -euo pipefail
 CERTS_DIR=/etc/containerd/certs.d/forgejo.viktorbarzin.me
 HOSTS_TOML='server = "https://forgejo.viktorbarzin.me"

-[host."https://10.0.20.200"]
+[host."https://10.0.20.203"]
  capabilities = ["pull", "resolve"]
+  skip_verify = true
 '

 NODES=$(kubectl get nodes -o name | sed 's|^node/||')
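Both scripts above carry their own copy of the same `hosts.toml` body, which is how the stale `.200` survived in two places. One way to keep them in step is to render the file from a single parameter, as in the sketch below. The helper name `render_hosts_toml` is hypothetical and is not part of either script.

```python
import ipaddress


def render_hosts_toml(registry: str, mirror_ip: str, skip_verify: bool = True) -> str:
    """Render a containerd hosts.toml that redirects `registry` pulls to `mirror_ip`."""
    ipaddress.ip_address(mirror_ip)  # raises ValueError if this is not an IP literal
    lines = [
        f'server = "https://{registry}"',
        "",
        f'[host."https://{mirror_ip}"]',
        '  capabilities = ["pull", "resolve"]',
    ]
    if skip_verify:
        # The node dials Traefik by IP while the certificate names the registry
        # host, so verification against the IP would fail.
        lines.append("  skip_verify = true")
    return "\n".join(lines) + "\n"
```

`render_hosts_toml("forgejo.viktorbarzin.me", "10.0.20.203")` reproduces the post-change file content shown in the two hunks.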
--- a/stacks/cloudflared/modules/cloudflared/cloudflare.tf
+++ b/stacks/cloudflared/modules/cloudflared/cloudflare.tf
@ -74,18 +74,22 @@ resource "cloudflare_zero_trust_tunnel_cloudflared_config" "sof" {
    warp_routing {
      enabled = true
    }
-    # Wildcard rule routes all subdomains through tunnel to Traefik.
-    # Traefik handles host-based routing via K8s Ingress resources.
+    # Wildcard rule routes all subdomains through the tunnel to Traefik,
+    # which handles host-based routing via K8s Ingress resources.
+    # Origin = in-cluster Traefik Service DNS (NOT a MetalLB LB IP) so the
+    # tunnel is decoupled from LB-IP changes. A raw IP here caused a full-site
+    # 502 on 2026-06-01 when Traefik moved 10.0.20.200 -> .203; see
+    # docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md.
    ingress_rule {
      hostname = "*.viktorbarzin.me"
-      service  = "https://10.0.20.200:443"
+      service  = "https://traefik.traefik.svc.cluster.local:443"
      origin_request {
        no_tls_verify = true
      }
    }
    ingress_rule {
      hostname = "viktorbarzin.me"
-      service  = "https://10.0.20.200:443"
+      service  = "https://traefik.traefik.svc.cluster.local:443"
      origin_request {
        no_tls_verify = true
      }
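The comment added in this hunk says a raw IP origin caused the outage. A small pre-apply check could flag that pattern before it ships again; the sketch below is one possible shape for it. The function names are hypothetical, and extracting the `service` values from the Terraform config is assumed to happen elsewhere.

```python
import ipaddress
from urllib.parse import urlsplit


def origin_is_raw_ip(service_url: str) -> bool:
    """Return True when the tunnel origin URL targets an IP literal, not a DNS name."""
    host = urlsplit(service_url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal, so it is treated as a hostname
    return True


def raw_ip_origins(service_urls):
    """Return the subset of origins that are pinned to an IP address."""
    return [url for url in service_urls if origin_is_raw_ip(url)]
```

Run against the two values in this diff, the old origin `https://10.0.20.200:443` is flagged and `https://traefik.traefik.svc.cluster.local:443` passes.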
Author	SHA1	Message	Date
Viktor Barzin	f807050eb5	cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip] The Cloudflare tunnel routed *.viktorbarzin.me and the apex to https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200 onto its dedicated 10.0.20.203 on 2026-05-30 (commit `0c01adac`). Nothing serves HTTPS on .200:443 anymore, so cloudflared could not reach its origin (no route to host / i/o timeout) and Cloudflare returned 502 for every externally-proxied service. Internal/LAN access (split-horizon -> .203) was unaffected, which masked the outage. Repoint both ingress rules at the in-cluster Traefik Service DNS (https://traefik.traefik.svc.cluster.local:443) -- the design the docs already described but the code never implemented -- so the tunnel is decoupled from the Traefik LB IP and this cannot recur on a future move. Applied live via targeted apply on the tunnel config resource only; [ci skip] because live already matches and a full stack apply would churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk). Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00
Viktor Barzin	30a644d3cd	docs(kms): document reboot-after-uninstall / 1603 handling + real-hardware status The bundled consumer Office removal leaves a pending reboot; a same-run VL install (or re-run before rebooting) fails with setup.exe 1603. Document the two guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and the on-disk completion poll. Record that the uninstall path is now verified on a real M365 box (O365HomePremRetail removed) and the install needs a reboot first.	2026-06-01 21:22:05 +00:00
Viktor Barzin	a382683c0e	infra: fix containerd forgejo-registry redirect .200->.203 (+skip_verify) Traefik moved off shared .200 to its dedicated .203 on 2026-05-30, but the containerd hosts.toml redirect for forgejo.viktorbarzin.me still pointed at the now-dead .200:443 -> every FRESH forgejo pull failed (cached images kept running, so it stayed hidden until a new image tag was pulled). Retarget to .203 and add skip_verify (node dials Traefik by IP; cert is for forgejo.viktorbarzin.me) in both the new-node cloud-init and existing-node deploy scripts. Already rolled to all 7 nodes (rewrite + restart containerd, no drain). Doc fix in .claude/CLAUDE.md.	2026-06-01 21:22:05 +00:00
Viktor Barzin	82855848d1	plans: TopoLVM migration evaluation (Path 3 for LUN-cap relief) Decision-support doc, NOT a commitment. Evaluates whether replacing proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling permanently and at what cost. Key trade-off documented: TopoLVM PVCs are pinned to the node where the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs migrate between VMs when pods reschedule. The data-locality penalty matters most for single-replica stateful services (MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft) absorb it. Three disk-layout options: A. Carve per-VM data disks from sdc — simple, no hardware, IO contention unchanged B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free C. Add a dedicated NVMe — also closes beads code-oflt (IO contention), ~£200 hardware investment Effort estimate: 2.5-3 weeks of focused work for the full migration; covers TopoLVM install, lvmd config, per-VM disk provisioning, LUKS plumbing, 5 migration waves (regenerable → huge PVCs), backup-pipeline rewrite, deprecation. Recommended next step before committing: small pilot on k8s-node5/6 with one non-critical PVC to validate the operational pattern end-to-end. Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap, docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative), beads code-oflt (IO isolation).	2026-06-01 21:22:05 +00:00
Viktor Barzin	599d67db51	docs(kms): self-hosted ODT bootstrapper + anonymous client telemetry (kms-diag/Loki) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00