Commit graph

213 commits

Author SHA1 Message Date
Viktor Barzin
c04efa3d3a k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)

  - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *.
  - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
    (was next_daily_noon_utc).
  - docs (runbook, architecture) + upgrade-state SKILL: schedule references
    updated to 23:00 UTC nightly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:16:32 +00:00
Viktor Barzin
bfb86e653f k8s-version-upgrade: ignore CoreDNS preflight on kubeadm upgrade plan too
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade
apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's
kubeadm binary is on the target version (the first attempt's apt step already
bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under
set -euo pipefail and abort:
  - preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the
    target version; the failing pipeline killed the whole preflight.
  - update_k8s.sh master path line 85 — bare `plan` before the apply.

Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins.
Verified read-only on master: plan exits 0 and still emits
"kubeadm upgrade apply v1.34.9".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:49:06 +00:00
Viktor Barzin
037a609f27 k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").

- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
  `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
  --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
  our custom split-horizon Corefile with kubeadm's default AND downgrade the
  image; --skip-phases leaves CoreDNS 100% untouched while the control plane
  upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
  quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
  claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
  GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:45:05 +00:00
Viktor Barzin
dfa1a12a86 k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall)
The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing.
On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the
devvm) was firing when the daily detection ran; the preflight's "halt on any
critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1).
Two design gaps then turned that blip into a multi-day wedge:

  * the detection guard and spawn_next only checked whether the phase Job
    EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via
    ttlSecondsAfterFinished, so every daily run skipped re-spawning it;
  * the abort happens before the in-flight metric is pushed, so neither
    K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported
    "never ran" while actually being stuck.

Fixes:
  D1 retry-on-failure: detection CronJob (main.tf) and spawn_next
     (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job
     instead of skipping it, so a transient gate self-corrects next cycle
     rather than wedging the pipeline for a week.
  D2 WebterminalTtydUnreachable critical -> warning: a devvm developer
     web-terminal is not cluster infrastructure and must not block upgrades.
  D3 observability: new K8sUpgradeChainJobFailed alert
     (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags
     a Failed chain Job as "chain failed" — closing the pre-in-flight blind
     spot so a wedge is visible immediately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:07:36 +00:00
Viktor Barzin
8a2a3d9eca Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
# Conflicts:
#	scripts/t3-provision-users.sh
2026-06-16 22:32:43 +00:00
Viktor Barzin
0a6ed4b2fe workstation: per-user playwright browser MCP for all users, reproducible from git
Viktor asked that the playwright browser MCP be available for every devvm user
in every directory, with each user running their own server and multiple
concurrent sessions per user.

Before this, playwright was hand-set-up per user (~/.config/systemd/user/
playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired —
emo's and anca's servers ran but their ~/.claude.json had no playwright entry,
so their Claude never connected. None of it was reproducible from git (units,
refresh script, and the Vault snapshot token lived only in user homes), so a
devvm rebuild would silently lose it.

This makes it reproducible and fixes the unwired users:

- roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931,
  allocated for every roster user incl. the admin), emitted in the derive JSON.
- scripts/workstation/playwright/: system-level TEMPLATE units
  (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer},
  User=%i — system manager, so no systemd --user / linger) + the refresh script.
  @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll
  footgun, same rationale as T3_PIN).
- setup-devvm.sh: install the templates + script (9e); stage the chrome-service
  snapshot bearer token from Vault to a root file (8c) — the hourly root
  reconcile has no Vault token, mirrors the Claude OAuth staging in 8a.
- t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes
  PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json
  by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes
  existing/new/admin without rewriting a populated config), and enable --now's the
  instances (idempotent, never restarts a running server). Also hardened the
  section-1 *.env scan to skip the new playwright-*.env files (no T3_PORT -> grep
  no-match would abort under set -e -o pipefail).
- Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit
  commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3.

Supersedes the hand-made per-user --user units (one-time idle-gated migration to
follow on the live host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:33:47 +00:00
Viktor Barzin
044444d328 cluster-health: helm check #18 catches pending/failed releases (helm list -a)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
check_helm_releases used `helm list` without -a, which HIDES pending-upgrade and
failed releases — so on 2026-06-16 check #18 reported "All deployed" while the
prometheus release sat in pending-upgrade for ~4 days, silently blocking every
monitoring terragrunt apply (frozen alert/rule config). Add -a to surface them
and flag pending-* (FAIL, blocks applies) + failed (WARN); deployed/uninstalled/
superseded stay green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 15:39:06 +00:00
Viktor Barzin
e74f4208f5 t3-backup-state: retention 14 -> 6 (bound devvm root fs)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
wizard's state.sqlite grew to ~1.1GB and the new gated nightly tracker adds a
pre-bump snapshot per bump on top of this daily one; 14 x ~1.1GB would fill the
devvm root fs (was trending to ~16GB of wizard backups on a disk with ~9GB
free). 6 is ample — rollback only ever needs the most recent pre-bump backup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:26:03 +00:00
Viktor Barzin
cdd9ecd199 t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
  gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
  gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
  replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
  2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:33:49 +00:00
Viktor Barzin
36521839fc t3: gated nightly tracker (replaces pinned enforcer) + drop timer Persistent
Phase 2 of "track t3 nightly, accept the risk, but make sure session auth works
and revert if it breaks". Rewrites the daily t3-autoupdate from a pinned-version
enforcer into a NIGHTLY TRACKER that gates every bump so a bad build self-heals
instead of repeating 2026-06-09:

- follows the t3@nightly npm dist-tag (T3_TRACK; T3_PIN still works as a hard
  freeze; /etc/t3-autoupdate.freeze is the manual revert switch);
- downgrade-guard (the nightly tag is mutable — never move backward) + channel
  sanity (target must be a -nightly. build);
- pre-bump per-user state.sqlite backup (online VACUUM INTO) BEFORE install, so
  rollback is a restore not sqlite surgery;
- health-check now SEEDS a throwaway instance with a COPY of a real POPULATED
  state.sqlite, exercising the forward MIGRATION (the actual 2026-06-09 failure
  class) + the real mint->exchange->t3_session pairing handshake before trusting
  a build. Scratch dir is on /var/tmp (disk), not the 2G tmpfs /tmp;
- canary rollout: restart idle instances ONE AT A TIME, verify pairing through
  the real dispatch after each, and on the first failure roll back (binary +
  that user's DB from the pre-bump backup) AND self-freeze so it can't re-flap
  onto bad builds. Active-agent instances are deferred, never killed. Rollback
  target is the recorded LAST-GOOD, not "whatever was installed";
- DRY_RUN mode (T3_DRY_RUN=1) previews the gate against a temp-prefix install —
  validated: 0.0.28-nightly.20260616.571 PASSES the populated-DB migration gate.

timer: drop Persistent=true (a missed 04:00 must not fire a real bump on boot
mid-day with users active — a 2026-06-09 contributing factor).
setup-devvm.sh: install t3@nightly on fresh boxes (no state to break), in sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 10:08:12 +00:00
Viktor Barzin
994d305d04 t3: session-auth detection for the gated nightly tracker (dispatch fallback logging + Loki alerts)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Before auto-tracking t3 nightly builds (Viktor's call, risk accepted), stand up
the detection that was missing on 2026-06-09 — when an auto-pulled nightly broke
pairing for ALL users and nothing alerted. Viktor's explicit requirement: make
sure session auth keeps working and revert if the pairing fallback/failure rate
climbs. This is phase 0 (detection) of that work.

- t3-dispatch: exchangeCredential now reports WHICH pairing endpoint answered,
  and autoPair logs every outcome (paired user=.. endpoint=.. fallback=..) — so
  the real-user browser-session->bootstrap fallback rate is observable. A
  non-zero rate flags that a build moved the pairing API (the 2026-06-09 class).
- Loki ruler alerts (devvm journal -> Alertmanager -> Slack): T3PairingBroken
  (real users failing to pair), T3PairFallbackHigh (build moved the pairing API),
  T3AutoUpdateRolledBack / RollbackFailed / Frozen (enforcer outcomes). Closes
  the post-mortem's open "nothing monitors end-to-end pairing" detection gap.

The existing t3-probe only checks GET /api/auth/session==200, which stays 200
even when pairing is dead, so it never caught the outage class.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:56:55 +00:00
Viktor Barzin
b0e8e3599f nfs-mirror: exclude SQLite WAL/SHM sidecars + treat rsync exit 24 as success
NfsMirrorFailing fired ~13% of nights (3/23 runs, all rsync exit 24). Root cause:
calibre-web-automated keeps a WAL-mode SQLite queue.db on /srv/nfs, whose -wal/-shm
sidecars are created/checkpointed/deleted constantly and vanish between rsync's
file-list scan and the transfer ("file has vanished" -> exit 24). The mirror
actually completes every run; only transient files disappear.

Two fixes: (1) exclude *-wal/*-shm/*-journal -- these must never be in a raw mirror
anyway (a WAL without an atomic .db snapshot is useless to restore; daily-backup
makes the consistent SQLite copies). (2) Treat rsync exit 24 as success-with-warning
so the run still appends to the offsite manifest (a code-24 night previously skipped
that, delaying those changes to the monthly full sync) and the alert stops
false-firing.

Deployed to the PVE host via scp to /usr/local/bin/nfs-mirror (host script, not TF).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:34:22 +00:00
1ba453c65d fan-control docs: sync runbook/env/service/design to the HA-actuator + anti-flap model
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The committed docs still described the 2026-06-04 presence-aware daemon. Bring
them in line with what is actually deployed: HA computes the setpoint, the host
is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias,
anti-flap hold-last, and the new HA readout sensors (command/equilibrium/
cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone
lost in the workstation reshuffle; re-created here.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:11:48 +00:00
5bc3d27d1b Merge remote-tracking branch 'forgejo/master' into emo/fan-control-ha-actuator
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-16 08:08:27 +00:00
2cfe338419 fan-control: hold last command through transient HA losses (stop fan flapping)
The actuator dumped the fans to Dell auto on every brief loss of the HA command
(~14% of the time, every few minutes) — crashing them to the ~7100 rpm floor and
bouncing back: the "fans surge then crash then surge" the owner reported. Causes:
the command sensors last_updated going >120s old whenever CPU temp sat flat
(mis-read as stale), plus occasional unavailable blips. Fix: on a missing/stale
command, HOLD the last applied % for up to HA_GRACE_SECS (300s) instead of
falling back, and loosen STALE_SECS 120->1800 (staleness only happens at flat
temp, where the held value is still valid). The 83C CPU CEILING on our own IPMI
read stays the real overheat safety. Verified live: fallback 14% -> 0% over 8h,
command std 16 -> 3, no more rpm floor crashes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:07:52 +00:00
Viktor Barzin
ef555c7e02 workstation: put ~/.local/bin on PATH so the launcher finds native claude
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude
binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in
tmux's NON-login bash env, which doesn't source the user's shell rc where the native
installer put ~/.local/bin on PATH. So `command -v claude` failed there → the
launcher's bootstrap re-ran the native installer → the installer printed the PATH
warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit,
and t3-serve sets PATH in its unit — so only the terminal launcher was affected.)

- skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent),
  before the launch logic — so `claude` is found, no reinstall, no warning.
- setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for
  all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit
  (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile.
- docs/architecture/multi-tenancy.md: documented the three PATH-injection points.

Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:20:03 +00:00
Viktor Barzin
eecd78233b workstation: standardize on the native claude install (drop npm-global + npx)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Question from Viktor: should claude run via the binary or npx? Answer: the native
install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude;
installMethod=native) — and every existing user had already auto-migrated to it, leaving
the npm-global copy empty and the npx fallback dead. "Leave only the recommended setup":

- setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide
  `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and
  just shadowed the per-user native installs).
- t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official
  https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native
  claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap.
- skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via
  the native installer (was an `npx @anthropic-ai/claude-code` fallback).
- docs/architecture/multi-tenancy.md: documented the native-only runtime model.

node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable +
produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:12:05 +00:00
Viktor Barzin
bb3f5f2329 workstation: stop the Claude Code onboarding wizard reappearing for terminal users
All checks were successful
ci/woodpecker/push/default Pipeline was successful
emo reported being "logged out" on terminal.viktorbarzin.me: every new shell
dropped him at the first-run "Choose the text style" wizard, even though he'd
used many sessions and is in fact fully authenticated. Root cause is NOT a
logout — ~/.claude.json is a single file that all of a user's concurrent claude
processes (the ttyd terminal + their t3-serve instance + agent sessions)
read-modify-write, and a stale writer periodically drops top-level keys,
including hasCompletedOnboarding. That bounces the next interactive session back
to onboarding; credentials are safe in the separate ~/.claude/.credentials.json
(which is why T3 kept working). wizard's own ~/.claude.json showed the same key
loss, so this hits any heavy multi-session user.

Fix:
- skel/start-claude.sh: ensure_onboarding() idempotently re-asserts
  hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before
  launching claude. Merge-only (never clobbers other keys), runs as the user, and
  no-ops if jq is missing or the file is empty/corrupt. So even if the race drops
  the flag, the next launch restores it before claude reads it.
- t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh
  into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel
  only seeds the launcher at account creation, so without this the fix (and any
  future launcher edit) would never reach existing users. .tmux.conf is
  deliberately not re-copied — terminal-lobby appends a managed section to it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:37:59 +00:00
082bdfcc77 fan-control: thin actuator — HA computes the setpoint, host only applies it
The R730 fan-control logic now lives entirely in Home Assistant: the curve
thresholds, duty %, bias and asymmetric deadband, plus manual/lock, are set on
the dashboard and published as sensor.r730_fan_command_pct. The host daemon is
reduced to a thin actuator — it reads that one number each loop, validates it
(numeric + not older than STALE_SECS) and applies it over IPMI. Removed the
presence-aware two-curve logic and the garage-door coupling.

Safety stays independent on the host: CPU>=CEILING, repeated IPMI failures, or
HA unreachable/stale all hand the fans back to Dell auto. RPM telemetry now
averages all 6 chassis fans. Deployed and verified live on pve (applies the HA
command; fans follow).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 12:59:57 +00:00
Viktor Barzin
bdea34b992 offinfra-onboard: --dockerfile flag for non-root Dockerfiles
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
claude-memory-mcp's Dockerfile is at docker/Dockerfile, not repo root
(infra#20 build failed: 'open Dockerfile: no such file or directory').
build.yml template gains file: {{DOCKERFILE}} (default ./Dockerfile).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 02:37:25 +00:00
Viktor Barzin
2f3c58dff1 claude-agent-service image -> ghcr across all five consumer stacks (infra#19)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
GHA now builds+pushes ghcr.io/viktorbarzin/claude-agent-service (public
package, anonymous pulls). Repointed: claude-agent-service (deployment +
git-init/seed-beads-agent inits), claude-breakglass, ci-pipeline-health,
beads-server CronJobs, k8s-version-upgrade (tag var 2fd7670d -> latest —
the Forgejo registry lost that sha; node caches were the only thing
keeping those CronJobs alive). publish-gate: vendor-contact emails
(licensing@/legal@/security@/sales@) ruled license-boilerplate, not PII.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 01:47:54 +00:00
Viktor Barzin
8aba3a0179 offinfra-onboard --no-deploy; wealthfolio-sync image -> ghcr (ADR-0002 infra#25)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
broker-sync is a CronJob-only consumer (no deployment): new --no-deploy
mode skips Woodpecker registration and renders build.yml without the
deploy job — :latest+Always CronJobs pick up builds on the next run.
wealthfolio stack: ghcr-credentials pull secret + image base repoint.
The wealthfolio-sync image regains a reproducible rebuild path.

Closes: code-62tm

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 01:39:35 +00:00
Viktor Barzin
e696957ebf ci: ancestor guard on DIFF_BASE; gate allowlists the owner's work email [ci skip]
Restarted infra pipelines after master moved diffed in REVERSE and
re-applied stale trees (pipeline 148 reverted payslip-ingest's fresh
ghcr config — repaired by the wave-2 agent). Only trust
CI_PREV_COMMIT_SHA when it is an ancestor of HEAD. publish-gate:
viktorbarzin@meta.com is the owner's own work email (same class as the
allowlisted personal domain), not blockable PII — unblocks infra#18.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:31:33 +00:00
Viktor Barzin
72b5843e4b publish-gate: exclude package-lock + beads tracker from email heuristic; beadboard image base -> ghcr
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
infra#17: the gate flagged npm deprecation boilerplate (package-lock.json
escapes the *.lock filter) and the upstream fork author's email in tracked
.beads data — both already-public upstream content, ruled false positives.
Lock files excluded properly; .beads moved to the eyeball inventory.
beads-server stack: beadboard image base repointed (deployment image is
KEEL-ignored; no CronJobs use it).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:52:07 +00:00
Viktor Barzin
6b0d42c7bc publish-gate + tuya-bridge ghcr cutover prep (ADR-0002 infra#15)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
publish-gate: gitleaks + trufflehog (full history) + PII heuristics;
CLEAN verdict gates any public flip, DIRTY = stays private. tuya-bridge:
ghcr-credentials pull secret + image base -> ghcr; namespace added to
the ghcr-credentials allowlist as a safety net (new ghcr packages
default PRIVATE even from public repos — prune after visibility flip).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:12:02 +00:00
Viktor Barzin
51682ee939 offinfra-onboard: require clean clone + ff to forgejo master first [ci skip]
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:00:55 +00:00
Viktor Barzin
09bb0b50a1 offinfra-onboard: forgejo token fallback to ~/.git-credentials [ci skip]
job-hunter's clone uses the credential-store helper (no token embedded
in the remote URL, unlike f1-stream).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:59:32 +00:00
Viktor Barzin
6f41de71fa offinfra-onboard: normalize Woodpecker repo to untrusted [ci skip]
Trusted repos get netrc injected into every step container; the
non-root bitnami/kubectl deploy step dies with '//.netrc: Permission
denied' (hit live on f1-stream's reactivated old-era repo 10, which
carried trusted=true; tripit 167 is untrusted and works).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:32:08 +00:00
Viktor Barzin
beac1b57a3 offinfra-onboard: re-activate inactive Woodpecker registrations [ci skip]
Hit live on f1-stream: the old GHA-era ViktorBarzin/f1-stream
registration (repo 10) existed but was deactivated; the lookup matched
it and skipped registration, leaving the deploy POST pointed at an
inactive repo. Now checks .active and re-activates in place via
forge_remote_id.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:28:03 +00:00
Viktor Barzin
baff3d7477 offinfra-onboard: per-repo GHA->ghcr migration tool + f1-stream ghcr pull secret
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ADR-0002 tracer bullet (infra#13), per Viktor's go-ahead. Idempotent
script: GitHub mirror repo (create/unarchive/visibility), GHA secrets
via gh, Forgejo push-mirror (sync_on_commit) + initial sync, Woodpecker
mirror registration, renders build.yml/deploy.yml from templates
(single-manifest provenance:false, svu semver to Forgejo, ghcr keep-10
retention, Slack notify-failure, manual-event deploy), removes the old
in-cluster build pipeline, commits on the Canonical side. f1-stream
stack gains the ghcr-credentials imagePullSecret (first consumer).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:21:22 +00:00
Viktor Barzin
624747cc46 workstation: default Claude model fable-5 → opus-4-8 for all devvm users
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to make Opus the default for new Claude sessions — his own,
Emo's, and Anca's — because Fable 5 is overkill for most daily tasks.

The org-wide default lives in the managed-settings `model` key, which
overrides each user's personal ~/.claude/settings.json model (and no
per-user launcher passes --model anymore). So flipping this one value
makes every user's NEXT session default to Opus 4.8; current sessions
keep their model, and a per-session /model still overrides as before.
The hourly t3-provision-users reconcile deploys it to
/etc/claude-code/managed-settings.json within the cycle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:59:03 +00:00
Viktor Barzin
df332b59e6 break-glass SSH: drop port-knock for exposed key-only :52222; version host config
Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.

Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.

New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000
- it rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.

- Removed knockd (package + config) and the legacy Synology SSH forward
  (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd
  - the stock journalmatch silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
  and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
  .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:39 +00:00
Viktor Barzin
e2788d1b2d workstation: lean managed-settings claudeMd — org red-lines + pointers [ci skip]
Viktor's agent-rules cleanup: the org claudeMd now carries only
governance red-lines (RBAC tiers, per-user secrets, Terraform-only,
git audit-trail rules, code-layout detection) and points to
~/.claude/rules/execution.md for the worktree lifecycle, which was
previously duplicated here in full. Settings precedence and the
model key are unchanged. Also refreshes a .gitignore comment that
cited the old execution.md section numbering.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:02:43 +00:00
Viktor Barzin
c3a63fcd38 apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]
The raw string compare never matched qm config's canonical key order, so
the hourly timer re-issued 'qm set' against every running capped VM,
live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's
devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU
(blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi
controller path with no iothread.

Viktor asked to root-cause the freeze before choosing fixes, then approved
mitigating via VM settings: this commit fixes the hourly trigger and
documents the incident; the controller swap (virtio-scsi-single +
iothread=1 + aio=threads) is staged on VM 102 separately, pending his
cold stop/start.

Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain,
ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md
+ proxmox-inventory.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:00:08 +00:00
Viktor Barzin
9b19caff47 t3: connection logging across the path for drop attribution
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to add connection logs (Traefik/Cloudflare) to catch the
real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean
while real tunnel sessions cycle every 15-35s, so the drop originates
above t3-serve and we need to see which layer cuts the socket.

Traefik (/ws duration) and cloudflared (WS close events) already ship to
Loki; the gap was the devvm side. This adds:

- t3-dispatch logs every /ws open/close with dur_ms + cause:
  downstream_closed (client/CF/Traefik hung up = last-mile/network),
  upstream_closed (t3-serve closed/reset), or graceful. Graceful closes
  previously left no trace (default ReverseProxy only logs on error), so a
  watchdog-driven reconnect was invisible. Helpers unit-tested.
- devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch +
  t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the
  pve/rpi-sofia shippers. devvm was never in Loki (standalone VM).

Joined in Loki the three layers attribute any future drop to a segment
with no repro needed. Runbook + service-catalog updated.
2026-06-11 13:48:10 +00:00
Viktor Barzin
bd60c3d5e0 pve-host/dns: register loki.viktorbarzin.lan CNAME, drop the /etc/hosts pin
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Follow-up to the pve-host Loki shipper (aac807fb). The host reached Loki via an
/etc/hosts pin of the Traefik LB IP — Viktor flagged that as the wrong solution
(no hardcoding; the DNS infra should handle it). Registered loki.viktorbarzin.lan
in Technitium as a CNAME -> ingress.viktorbarzin.lan (the anchor whose A record
auto-tracks the live Traefik LB IP, so it's renumber-proof), via the Technitium
API + zone-sync to all 3 instances. Removed the /etc/hosts pin from the PVE host;
promtail now resolves the name purely via DNS (verified still shipping to Loki).
insecure_skip_verify stays — the internal .lan cert isn't publicly trusted.

Docs (monitoring.md) + the pve-promtail.yaml header updated to drop the pin
references. The DNS record is API-managed (the viktorbarzin.lan zone convention),
not in this repo; auto-managing .lan CNAMEs in technitium-ingress-dns-sync
remains a noted follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 22:55:20 +00:00
Viktor Barzin
93ba67c84a devvm: install prometheus-node-exporter (was never installed)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The monitoring stack now scrapes devvm (job 'devvm') for the t3 drop
attribution work, but the box had no node_exporter at all — installed
via apt and persisted here so reprovisioning keeps it.
2026-06-10 21:29:17 +00:00
Viktor Barzin
a734155fb5 Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-10 21:11:30 +00:00
Viktor Barzin
9b55d53be0 t3: differential drop-attribution probe + devvm metrics
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.

The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.

Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
2026-06-10 21:11:29 +00:00
Viktor Barzin
ecef09ab87 tmux-persist: add single-user restore mode (restore [user])
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The web-terminal will get a "Restore sessions" button (common ask after an
OOM kills a user's tmux server without a reboot, which the boot-only restore
service doesn't catch). The button needs to restore ONE user's saved sessions
on demand, so teach `restore` an optional <user> argument: with no arg it
restores every terminal user (unchanged — the boot service path), with a
<user> arg it validates the name against /etc/ttyd-user-map and restores only
that user. Reuses the existing restore loop (single source of restore truth).

The terminal-lobby tmux-api will invoke this as root via a validated
tmux-restore-user sudo wrapper. Verified: bad user exits 2 (won't fall back to
restoring everyone), no-arg path unchanged, shellcheck clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 21:08:57 +00:00
Viktor Barzin
b5c6639272 t3-serve@: contain agent memory storms; survive child OOM kills
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Same t3-disconnect root-cause work: a runaway claude agent child grew to
10.8G anon RSS inside t3-serve@wizard's cgroup, swap-thrashed devvm off
its spinning disk (system-wide multi-10s freezes = every t3 client's 20s
watchdog firing = the 'frequent disconnects that self-recover'), then
the global OOM at 2026-06-10 19:56 took the whole unit down for 8.5min
because the default OOMPolicy=stop fails the unit when ANY cgroup child
is OOM-killed. Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid
swap so stalls can't smear into minute-long freezes, and OOMPolicy=continue
so a runaway agent dies alone while the WS server keeps serving.
2026-06-10 21:00:06 +00:00
Viktor Barzin
ac6f19dd3b tmux-persist: never let an empty snapshot clobber a saved manifest
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
emo's 5 web-terminal tmux sessions were OOM-killed (the server died, no
reboot), and the 5-minute save tick then overwrote his session manifest with
0 bytes — wiping the record that restore needs. Root cause: the save guard
only checked that the tmux socket *file* existed, but an OOM-killed server
leaves a stale /tmp/tmux-<uid>/default behind; list-panes then returns
nothing and that empty capture was installed over the good manifest. Because
the restore service only runs at boot, an OOM (not a reboot) skips restore
entirely, so the clobbered manifest was the only record left — and it was
already gone.

Fix: only overwrite <user>.tsv when the snapshot captured >=1 live session;
otherwise keep the last good manifest (now covers no-server AND
stale-socket/dead-server). Verified by reproducing the 0-byte clobber on the
old script and confirming the new one preserves the manifest, plus a live
save that still captures every active session.

emo's 5 sessions were recovered from their transcripts and are back; this
keeps the next OOM from destroying the manifest again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 20:38:59 +00:00
Viktor Barzin
8304ef0f70 Merge origin/master (pfsense SNI-routed internal 443) into forgejo/master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Reconciles the two live infra remotes after the pve-host logging change landed
on forgejo (which was a commit behind origin). Non-destructive merge — keeps both
eae35c51 (pfsense webmail SNI routing) and aac807fb (pve-host Loki shipping).
2026-06-10 19:35:55 +00:00
Viktor Barzin
aac807fb3a pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated
shared-root key emo-pve-agent@devvm) so he can manage the host — e.g. the R730
fan daemon — through his agent. To keep an audit trail of what that agent does,
and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its
systemd journal to cluster Loki:

- snoopy logs every execve() to journald (identifier=snoopy), enabled via
  /etc/ld.so.preload; config scripts/pve-snoopy.ini.
- promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"}
  (full host journal; filter identifier="snoopy" for the command audit), and
  relabels sshd auth to {job="sshd-pve"} — which ACTIVATES S1 (it was PENDING
  only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}.

S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to
192.168.1.2, which is already in the S1 source-IP allowlist.

Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan);
follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers.

Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia
promtail — these files are the source of truth. Docs updated: security.md
(S1 LIVE) and monitoring.md ("External host: pve").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 19:31:45 +00:00
Viktor Barzin
eae35c511a pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere
Completes the internal port table of the mail front door (10.0.20.1):
443 was squatted by the pfSense webGUI (self-signed cert expired 2022),
so internal webmail and the kuma [External] mail probe hit the firewall
login instead of Roundcube — the last leg of the mail split-brain name.

Design (Viktor): route by what the client asked for. New HAProxy
frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp):
SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge
pattern, no health check per the PROXY-probe gotcha); SNI of
pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI,
which moved to :8443 (invisible to habits — https://10.0.20.1 still
lands on the login page; :8443 doubles as direct fallback). The
reverse-proxy pfsense ingress now targets :8443 directly.

Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml
backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified:
bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI;
pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me ->
Roundcube with STRICT cert validation; :993 IMAPS untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:41:07 +00:00
Viktor Barzin
5f7c2964ac workstation: session-launch freshen follows the checked-out branch (not just master)
Viktor asked to log Anca into her GitHub account so she can develop on
the devvm and deploy her apps through the existing CI/CD. Her GitHub
repos (Plotting-Your-Dream-Book, travel, My-Wardrobe — now cloned into
her ~/code workspace) default to main, and the launcher freshen only
fast-forwarded master, silently skipping them. ff the current branch's
upstream instead — same safety gates (on a branch, clean tree, upstream
configured, ff-only). Single-layout infra clones behave identically.
[ci skip]
2026-06-10 18:20:59 +00:00
Viktor Barzin
2825cb1703 workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit)
Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone
itself; he wants ~/code to be the directory where all her project repos
(tripit etc.) live side by side, with infra moved to a subdirectory.

- roster.yaml gains per-user 'code_layout: single|workspace' + 'repos',
  validated + derived by roster_engine.py (12 new tests, 40 total).
- t3-provision-users reconcile: auto-migrates a single-layout ~/code to
  ~/code/infra (running processes follow the moved inode), hoists nested
  project clones to the workspace root, clones roster repos from Forgejo
  AS the user (their PAT makes private repos work), and wires the
  documented forgejo remote + forgejo/master upstream into clones that
  predate that contract.
- Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS
  read, shifting later fields left (groups was only safe by being the
  last field) — emit '-' sentinels instead.
- start-claude.sh session freshen is layout-aware (freshens each repo
  under ~/code for workspace users).
- managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md
  updated in the same change.

Applied live: ancamilea = workspace (infra at ~/code/infra, her existing
tripit clone hoisted to ~/code/tripit, master upstream switched to
forgejo/master); emo stays single layout, untouched. [ci skip]
2026-06-10 18:05:31 +00:00
Viktor Barzin
3b6a5c6737 workstation: worktree-first feature work for all agents [ci skip]
Viktor asked that every feature task be developed in its own git worktree
and merged into master when done, enabling multiple agents to work the
same project concurrently. Encode the org rule in the managed claudeMd
(self-deploys to /etc via the hourly reconcile), add the worktree-first
paragraph to the AGENTS.md non-admin landing recipe, and gitignore
.worktrees/ so per-feature worktrees can live at the repo root. Full
lifecycle: ~/.claude/rules/execution.md §3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:49:43 +00:00
Viktor Barzin
00bc1e052d technitium: mirror mail-auth records into internal zone; fix redfish check [ci skip]
Two fixes from the post-DNS-internalization health sweep:

1. The internal viktorbarzin.me zone served only ingress A/CNAME records.
   Since the mailserver pods now resolve the domain through it (CoreDNS
   viktorbarzin.me:53 -> Technitium, 59a531b8), rspamd's SPF checks on
   inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the
   Brevo email-roundtrip probe failed from the 16:20 run onward
   (EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also
   maintains the static mail-auth records (SPF, brevo-code TXT, MX;
   DMARC + DKIM were already present), idempotently. Principle: the
   internal zone must be a SUPERSET of the public zone for every record
   type internal clients consume. Verified in-pod: all four types
   resolve; roundtrip re-probe green.

2. cluster_healthcheck #30 queried instant `up`, which goes stale for
   ~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant
   job -> intermittent false "redfish-idrac=missing". Now uses
   last_over_time(up[15m]) — same answers for fast jobs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:46:37 +00:00
Viktor Barzin
e7fbf986fb workstation: rename tmux persistence out of the t3 namespace [ci skip]
Viktor's correction: this feature is about the tmux web-terminal
sessions, not t3 — t3 auto-saves its own threads (~/.t3 state +
daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist
(units tmux-persist-save.timer / tmux-persist-restore.service, state
/var/lib/tmux-persist), header rescoped to say exactly that. Same
mechanism, correct taxonomy. Old units removed, state migrated,
re-verified live (5 emo + 3 wizard sessions snapshotted).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:42:52 +00:00