Viktor's correction: this feature is about the tmux web-terminal
sessions, not t3 — t3 auto-saves its own threads (~/.t3 state +
daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist
(units tmux-persist-save.timer / tmux-persist-restore.service, state
/var/lib/tmux-persist), header rescoped to say exactly that. Same
mechanism, correct taxonomy. Old units removed, state migrated,
re-verified live (5 emo + 3 wizard sessions snapshotted).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor: emo's open web-terminal sessions must persist across reboots.
Claude conversations were already durable on disk; the volatile part
was the tmux wiring (which named session runs which conversation).
t3-tmux-sessions save (5-min timer) snapshots every roster user's
sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid
taken from argv --resume (self-sustaining once restored) or the
newest transcript in the cwd-slug project dir created after process
start (fresh launcher sessions; claude does NOT hold its transcript
fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore
(boot oneshot, also safe after partial loss) recreates missing
sessions with claude --resume <uuid>. Reconciler self-heals both
units' enablement.
Verified live: emo's 5 sessions snapshotted with correct uuids;
killed R730-cooling -> restore brought it back resuming the same
conversation (context meter identical); other sessions untouched.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
all pfSense-SNAT'd cluster traffic does).
Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.
Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's claudeMd edits must keep reaching every user now that emo is
out of the shared tree. Two reconciler additions:
- sync_managed_config: installs scripts/workstation/managed-settings.json
to /etc/claude-code whenever the repo copy changes — editing the
org claudeMd is now edit + commit, no manual install step
- refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md
(static mirror of the claudeMd; header-guarded so user-customized
files are never clobbered)
Verified live: corrupted emo's mirror -> reconcile restored it;
wizard's stale mirror refreshed; in-sync managed config no-ops.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Local checkout carried the 2026-06-10 DNS/registry architecture series
(pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes
stock) + vzdump/nfs-mirror/workstation-rebuild commits that never
reached the canonical remote, while forgejo master received the
emo-access series via isolated worktrees. Viktor asked to merge.
Conflict resolutions (newest iteration wins in each file):
- stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert
after live retention orphaned OCI indexes; remote had 06-09 enable)
- .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final
registry/DNS architecture + implemented vzdump alerts
- scripts/workstation/setup-devvm.sh: LOCAL — pinned-version,
reproducible-rebuild refactor (kubelogin pin, restructured staging)
- scripts/workstation/managed-settings.json: FORGEJO — the
allow-then-audit claudeMd (matches /etc deployment byte-for-byte)
- scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone
intact
[ci skip]: all stack changes in the local lineage were applied live
this morning — CI would re-walk 100+ stacks via the modules/ fallback
for zero state change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Emo's first direct pushes surfaced three latent CI issues, all fixed
out-of-band today and recorded here: webhook deliveries to
ci.viktorbarzin.me timing out on the public-IP hairpin (hook now
targets the in-cluster woodpecker-server service), repo 82 registered
without the repo-scoped secret set (cloned from repo 1 in the DB), and
empty commits compiling every workflow so missing secrets hard-error.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor: emo may make any change; what matters is tracking what changed
and why. ebarzin added to master push+merge whitelists (force-push
stays disabled — append-only history). Tracking enforced three ways:
- agent instructions (managed claudeMd + AGENTS.md): commit body MUST
carry the user's plain-language intent; commits land on master
directly; [ci skip] forbidden for non-admins
- new notify-nonadmin-push step in .woodpecker/default.yml: Slack
message for every non-admin master push (admin pushes silent)
- PR flow remains the fallback for non-whitelisted users
Accepted consequence (informed): emo's pushes auto-apply changed
stacks via CI. Offboard runbook gains whitelist-removal step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Non-admins (emo) need current master without manual pulls. Two layers:
- t3-provision-users reconcile gains refresh_locked_clone: fetch all
remotes + ff-only master, guarded (on master, clean tree, upstream
set); dirty/diverged clones are left alone with a WARN.
- start-claude.sh freshens ~/code at session launch, 15s-capped so an
offline remote never delays the session.
Verified live on emo's clone: stale clone ff'd to tip by the
reconciler; launcher snippet ff's when clean and refuses while a
dirty file exists. Deployed to /usr/local/bin/t3-provision-users,
/etc/skel/start-claude.sh, and emo's launcher.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ADR-0004's premise was wrong: pushing master fires the Woodpecker apply
pipeline (require_approval=forks only), so master pushes ARE deploys.
Added Forgejo branch protection on master (push/merge whitelist=viktor,
deploy keys allowed); non-admins contribute via branches + PRs.
emo (ebarzin): write collaborator on viktor/infra, PAT in
~/.git-credentials, forgejo remote + upstream in his locked clone.
Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they
ARE the skel shared-base mechanism — plan step 4c obsolete).
Offboard runbook: revoke PAT + collaborator + group steps added.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).
Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.
NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup
dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms
(added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the
02:00 nfs-mirror run silently deleted both successful 40G devvm images
(verified: dir gone, 40G freed, despite status=0 success logs). Add
--exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/
sqlite-backup excludes that exist for exactly this reason. TDD-proven with an
isolated rsync --delete -n -v. backup-dr.md notes the dependency.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):
- pfSense Unbound domain override viktorbarzin.me -> Technitium
10.0.20.201 (applied via php write_config, backup on-box). Every
Unbound client on every VLAN now gets the internal split-horizon
answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
the ETP=Local LB IP pfSense now returns), all other .me names kept on
public resolvers (pods' pre-existing behavior). Replaces the .:53
forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
bullet, post-mortem final-architecture addendum.
Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.
Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.
hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).
Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).
vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.
Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
/api/auth/bootstrap on 404 — one binary pairs across both versions and any
rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
before, green after). Built, deployed, verified live on 0.0.24 (all three
users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
state.sqlite (was the only copy, unbacked) -> the one-way forward schema
migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).
- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").
Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.
[ci skip]
Co-Authored-By: Claude <noreply@anthropic.com>
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).
vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.
Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
/api/auth/bootstrap on 404 — one binary pairs across both versions and any
rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
before, green after). Built, deployed, verified live on 0.0.24 (all three
users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
state.sqlite (was the only copy, unbacked) -> the one-way forward schema
migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).
- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").
Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.
[ci skip]
Co-Authored-By: Claude <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Supersedes the MEMBERSHIP model of the 2026-06-07 design (roster.yaml SSoT). Key principle: workstation access (T3 Users group membership) is decoupled from cluster authorization (k8s_users + kubernetes-* groups, untouched). A user is defined once in Authentik: email + T3 Users membership + optional os_user attribute. Provisioner reconciles accounts from the Authentik API; roster.yaml retires. v1 foundation (config inheritance, locked clone, kubeconfig, swap, hardening, emo cutover) unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Spike GO (claude 2.1.168): managed claudeMd reaches a session; no managed-skills key exists so skills/rules inherit via per-user ~/.claude symlinks to the base (seeded in /etc/skel). Records the settings.json 0664->0600 leak fix.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser
challenges break native clients). Bot defense is layered instead:
- Traefik rate-limit Middleware on a path-scoped /register ingress carve-out,
keyed on request Host (GLOBAL /register cap) not source IP — the host is
reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE
tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min,
burst 20, per replica; CrowdSec is the hard backstop on both paths.
- Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing
#security Slack receiver (matches "registered on this server", never the
rejection line). tuwunel's admin bot also posts signups to the admin room.
Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert).
Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so
[ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
devvm multi-user Claude Code workstation: role-driven profiles (admin/power-user/namespace-owner) off one git roster (single source of truth, full onboard->offboard lifecycle); config inheritance via Claude's native machine-wide managed layer; per-user writable git-crypt-locked infra clone (ungated, apply-time is the boundary); per-tier OIDC kubectl; per-user secrets/auth (memory isolation deferred); incremental, emo-safe migration; capacity prereqs. Folds in gap-analysis findings verified live 2026-06-08. Designed, not yet implemented.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel
uses native password auth, so nothing consumed it. Deleted application `matrix`
+ OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale
Matrix rows from the SSO reference tables and update the plan's residual list.
Doc-only [ci skip].
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).
Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.
Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.)
- scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access).
- scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions).
- docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery.
- docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resource changes/deletions are now attributable (the novelapp deletion this week
was untraceable because apiserver audit was off). Low-write policy: drops
reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into
the kube-apiserver static-pod manifest + kubeadm-config (v1beta4
extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails
/var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}.
Root cause that had silently blocked this AND OIDC for weeks: a stray
kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate
static-pod manifest kubelet ran instead of the real one, dropping every flag
added to the real manifest. Removed it. Runbook added.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to
return 200 instead of proxying to the scaled-to-0 poison-fountain.
- security.md Layer 1 + tarpit description + troubleshooting (fix stale
stacks/platform path -> traefik stack; drop misleading
restart-poison-fountain step).
- .claude/CLAUDE.md: add matrix to PG rotation list; document that
startup-read secret consumers need a Reloader annotation (matrix root
cause, found via Loki 2026-06-05).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC
mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on
single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS
(tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots.
- storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history,
updated cap-relief levers
- topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the
decision doc
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs):
- Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app
factory modules (never existed); name the real four (ingress_factory,
nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local
/ flat distinction; flag vestigial modules/kubernetes/<app> dirs.
- Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers);
reserve "tier" for State tier + Namespace tier only.
- Add local-path entry (cluster default SC; node-local footgun warning).
- Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico.
- Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC).
- Fix node count 5 -> 7 (k8s-master + k8s-node1..6).
Doc-sync (same commit per repo rules):
- overview.md: replace fictional factory modules with the real shared
modules + the flat/stack-local pattern.
- .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision
table + stale cross-reference (vault migrated off it 2026-04-25).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC
is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden
proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see
docs/plans/2026-06-05-block-storage-harden-nfs-design.md.
- swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC)
- data copied + verified (HTTP 200 on NFS volume); block PVC removed
- block LV retained per SC policy (orphan cleanup tracked in code-dfjn)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.
Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage,
fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120,
~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST
Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac
scrape), scan_interval 30.
This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` →
`prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to
`/api/v1/query` (read-only instant-query only — not the UI/admin/federation).
ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from
a REST sensor), so this mirrors the existing local-only `.lan` exporter
ingresses HA already queries.
The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`)
was edited in place (auto-version-controlled by the HA version-control add-on;
pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The
Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan`
was added manually via the API — like the other `.lan` exporter hosts it is NOT
auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me`
records). Follow-up (already noted for the Loki sensor): extend that sync to
manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now
vestigial (HA no longer reads it).
Verified: all 7 HA sensors report correct fresh values from Prometheus (fan
10800 rpm, CPU 62.0C, power 280W, PSU 230/240V).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the 2026-06-05 cluster-wide log observability work: the Alloy
local.file_match fix (loki.source.file doesn't expand globs) + stage.cri, the
new "Cluster Logs" Grafana dashboard, the ha-sofia cluster-log-health REST
sensors, and the loki.viktorbarzin.lan Technitium-record follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.
- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
(coolingDeviceReading + location lookup) and an amperageProbeLocationName
lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.
SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Fan % and Fan RPM sensor-graph cards had identical trend shapes (RPM ∝ %),
so merge them into one "Fan speed" card: % trend (stable Pushgateway sensor) +
RPM beneath. RPM reads sensor.r730_fan_speed (Redfish) but falls back to the
calibrated estimate (rpm≈160·%+1520, shown with a "~" prefix) when that sensor
is unavailable — it blips out intermittently, so the readout never goes blank.
The Override readout likewise shows both "% · rpm". HA-side only; daemon
unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add an "External host: rpi-sofia" section to docs/architecture/monitoring.md
covering the 2026-06-05 setup: node_exporter + vcgencmd textfile metrics; the
full-journal promtail->Loki shipping (job=rpi-sofia-journal — kernel/dmesg via
the (none) unit + all systemd units, labeled by unit/level); the RPi Sofia
alert group; the dashboard; and the systemd watchdog. Notes the SD-card root
cause and that the Pi-side config is hand-managed + backed up off-box.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>