Commit graph

309 commits

Author SHA1 Message Date
Viktor Barzin
8304ef0f70 Merge origin/master (pfsense SNI-routed internal 443) into forgejo/master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Reconciles the two live infra remotes after the pve-host logging change landed
on forgejo (which was a commit behind origin). Non-destructive merge — keeps both
eae35c51 (pfsense webmail SNI routing) and aac807fb (pve-host Loki shipping).
2026-06-10 19:35:55 +00:00
Viktor Barzin
aac807fb3a pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated
shared-root key emo-pve-agent@devvm) so he can manage the host — e.g. the R730
fan daemon — through his agent. To keep an audit trail of what that agent does,
and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its
systemd journal to cluster Loki:

- snoopy logs every execve() to journald (identifier=snoopy), enabled via
  /etc/ld.so.preload; config scripts/pve-snoopy.ini.
- promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"}
  (full host journal; filter identifier="snoopy" for the command audit), and
  relabels sshd auth to {job="sshd-pve"} — which ACTIVATES S1 (it was PENDING
  only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}.

S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to
192.168.1.2, which is already in the S1 source-IP allowlist.

Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan);
follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers.

Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia
promtail — these files are the source of truth. Docs updated: security.md
(S1 LIVE) and monitoring.md ("External host: pve").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 19:31:45 +00:00
Viktor Barzin
eae35c511a pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere
Completes the internal port table of the mail front door (10.0.20.1):
443 was squatted by the pfSense webGUI (self-signed cert expired 2022),
so internal webmail and the kuma [External] mail probe hit the firewall
login instead of Roundcube — the last leg of the mail split-brain name.

Design (Viktor): route by what the client asked for. New HAProxy
frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp):
SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge
pattern, no health check per the PROXY-probe gotcha); SNI of
pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI,
which moved to :8443 (invisible to habits — https://10.0.20.1 still
lands on the login page; :8443 doubles as direct fallback). The
reverse-proxy pfsense ingress now targets :8443 directly.

Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml
backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified:
bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI;
pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me ->
Roundcube with STRICT cert validation; :993 IMAPS untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:41:07 +00:00
Viktor Barzin
2825cb1703 workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit)
Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone
itself; he wants ~/code to be the directory where all her project repos
(tripit etc.) live side by side, with infra moved to a subdirectory.

- roster.yaml gains per-user 'code_layout: single|workspace' + 'repos',
  validated + derived by roster_engine.py (12 new tests, 40 total).
- t3-provision-users reconcile: auto-migrates a single-layout ~/code to
  ~/code/infra (running processes follow the moved inode), hoists nested
  project clones to the workspace root, clones roster repos from Forgejo
  AS the user (their PAT makes private repos work), and wires the
  documented forgejo remote + forgejo/master upstream into clones that
  predate that contract.
- Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS
  read, shifting later fields left (groups was only safe by being the
  last field) — emit '-' sentinels instead.
- start-claude.sh session freshen is layout-aware (freshens each repo
  under ~/code for workspace users).
- managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md
  updated in the same change.

Applied live: ancamilea = workspace (infra at ~/code/infra, her existing
tripit clone hoisted to ~/code/tripit, master upstream switched to
forgejo/master); emo stays single layout, untouched. [ci skip]
2026-06-10 18:05:31 +00:00
Viktor Barzin
daddafd279 docs: superset rule for the internal viktorbarzin.me zone (mail-auth records) [ci skip]
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:47:31 +00:00
Viktor Barzin
e7fbf986fb workstation: rename tmux persistence out of the t3 namespace [ci skip]
Viktor's correction: this feature is about the tmux web-terminal
sessions, not t3 — t3 auto-saves its own threads (~/.t3 state +
daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist
(units tmux-persist-save.timer / tmux-persist-restore.service, state
/var/lib/tmux-persist), header rescoped to say exactly that. Same
mechanism, correct taxonomy. Old units removed, state migrated,
re-verified live (5 emo + 3 wizard sessions snapshotted).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:42:52 +00:00
Viktor Barzin
2e4f48f3fc workstation: tmux sessions survive devvm reboots (save timer + boot restore)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor: emo's open web-terminal sessions must persist across reboots.
Claude conversations were already durable on disk; the volatile part
was the tmux wiring (which named session runs which conversation).

t3-tmux-sessions save (5-min timer) snapshots every roster user's
sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid
taken from argv --resume (self-sustaining once restored) or the
newest transcript in the cwd-slug project dir created after process
start (fresh launcher sessions; claude does NOT hold its transcript
fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore
(boot oneshot, also safe after partial loss) recreates missing
sessions with claude --resume <uuid>. Reconciler self-heals both
units' enablement.

Verified live: emo's 5 sessions snapshotted with correct uuids;
killed R730-cooling -> restore brought it back resuming the same
conversation (context meter identical); other sessions untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:39:32 +00:00
Viktor Barzin
59a531b8e0 coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip]
Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
all pfSense-SNAT'd cluster traffic does).

Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.

Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:21:34 +00:00
Viktor Barzin
35c89fa90c workstation: managed Claude config self-deploys from the repo [ci skip]
Viktor's claudeMd edits must keep reaching every user now that emo is
out of the shared tree. Two reconciler additions:
- sync_managed_config: installs scripts/workstation/managed-settings.json
  to /etc/claude-code whenever the repo copy changes — editing the
  org claudeMd is now edit + commit, no manual install step
- refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md
  (static mirror of the claudeMd; header-guarded so user-customized
  files are never clobbered)

Verified live: corrupted emo's mirror -> reconcile restored it;
wizard's stale mirror refreshed; in-sync managed config no-ops.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:03:24 +00:00
Viktor Barzin
8cfd0e5e5c Merge forgejo/master: reconcile diverged lineages [ci skip]
Local checkout carried the 2026-06-10 DNS/registry architecture series
(pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes
stock) + vzdump/nfs-mirror/workstation-rebuild commits that never
reached the canonical remote, while forgejo master received the
emo-access series via isolated worktrees. Viktor asked to merge.

Conflict resolutions (newest iteration wins in each file):
- stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert
  after live retention orphaned OCI indexes; remote had 06-09 enable)
- .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final
  registry/DNS architecture + implemented vzdump alerts
- scripts/workstation/setup-devvm.sh: LOCAL — pinned-version,
  reproducible-rebuild refactor (kubelogin pin, restructured staging)
- scripts/workstation/managed-settings.json: FORGEJO — the
  allow-then-audit claudeMd (matches /etc deployment byte-for-byte)
- scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone
  intact

[ci skip]: all stack changes in the local lineage were applied live
this morning — CI would re-walk 100+ stacks via the modules/ fallback
for zero state change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:21:50 +00:00
Viktor Barzin
a34f9ff3b8 docs: infra Woodpecker repo-82 ops — in-cluster webhook, secret parity, empty-commit gotcha [ci skip]
Emo's first direct pushes surfaced three latent CI issues, all fixed
out-of-band today and recorded here: webhook deliveries to
ci.viktorbarzin.me timing out on the public-IP hairpin (hook now
targets the in-cluster woodpecker-server service), repo 82 registered
without the repo-scoped secret set (cloned from repo 1 in the DB), and
empty commits compiling every workflow so missing secrets hard-error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:09:17 +00:00
Viktor Barzin
a49d1eadf6 workstation: emo direct master push — allow-then-audit [ci skip]
Viktor: emo may make any change; what matters is tracking what changed
and why. ebarzin added to master push+merge whitelists (force-push
stays disabled — append-only history). Tracking enforced three ways:
- agent instructions (managed claudeMd + AGENTS.md): commit body MUST
  carry the user's plain-language intent; commits land on master
  directly; [ci skip] forbidden for non-admins
- new notify-nonadmin-push step in .woodpecker/default.yml: Slack
  message for every non-admin master push (admin pushes silent)
- PR flow remains the fallback for non-whitelisted users

Accepted consequence (informed): emo's pushes auto-apply changed
stacks via CI. Offboard runbook gains whitelist-removal step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:53:43 +00:00
Viktor Barzin
2e5af5dc0e workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip]
Non-admins (emo) need current master without manual pulls. Two layers:
- t3-provision-users reconcile gains refresh_locked_clone: fetch all
  remotes + ff-only master, guarded (on master, clean tree, upstream
  set); dirty/diverged clones are left alone with a WARN.
- start-claude.sh freshens ~/code at session launch, 15s-capped so an
  offline remote never delays the session.

Verified live on emo's clone: stale clone ff'd to tip by the
reconciler; launcher snippet ff's when clean and refuses while a
dirty file exists. Deployed to /usr/local/bin/t3-provision-users,
/etc/skel/start-claude.sh, and emo's launcher.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:41:38 +00:00
Viktor Barzin
5d9417fbaa workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip]
ADR-0004's premise was wrong: pushing master fires the Woodpecker apply
pipeline (require_approval=forks only), so master pushes ARE deploys.
Added Forgejo branch protection on master (push/merge whitelist=viktor,
deploy keys allowed); non-admins contribute via branches + PRs.

emo (ebarzin): write collaborator on viktor/infra, PAT in
~/.git-credentials, forgejo remote + upstream in his locked clone.
Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they
ARE the skel shared-base mechanism — plan step 4c obsolete).
Offboard runbook: revoke PAT + collaborator + group steps added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:30:41 +00:00
Viktor Barzin
a1b7b0ca53 forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip]
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).

Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:22:47 +00:00
Viktor Barzin
e49c91e60c monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.

NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:10:46 +00:00
Viktor Barzin
d9ea7812f5 nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly
nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup
dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms
(added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the
02:00 nfs-mirror run silently deleted both successful 40G devvm images
(verified: dir gone, 40G freed, despite status=0 success logs). Add
--exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/
sqlite-backup excludes that exist for exactly this reason. TDD-proven with an
isolated rsync --delete -n -v. backup-dr.md notes the dependency.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:04:57 +00:00
Viktor Barzin
2b8c0def30 dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip]
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):

- pfSense Unbound domain override viktorbarzin.me -> Technitium
  10.0.20.201 (applied via php write_config, backup on-box). Every
  Unbound client on every VLAN now gets the internal split-horizon
  answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
  forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
  the ETP=Local LB IP pfSense now returns), all other .me names kept on
  public resolvers (pods' pre-existing behavior). Replaces the .:53
  forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
  node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
  for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
  bullet, post-mortem final-architecture addendum.

Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 08:32:34 +00:00
Viktor Barzin
1ee1bf0817 forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip]
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.

Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.

hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:56:31 +00:00
Viktor Barzin
b6976ce014 forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip]
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).

Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:15:24 +00:00
Viktor Barzin
7330cb6a0b backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
dacd9d2d8a t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
baac46415f t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
41c11216da t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
1edccedb1f workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
83f418159a backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:30:19 +00:00
Viktor Barzin
bccaa08d8e t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
5ea238c707 t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
2125651aaa t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
fbcc330214 workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
05b50d2b96 workstation: v2 membership design — Authentik-group-driven, email-identified [ci skip]
Supersedes the MEMBERSHIP model of the 2026-06-07 design (roster.yaml SSoT). Key principle: workstation access (T3 Users group membership) is decoupled from cluster authorization (k8s_users + kubernetes-* groups, untouched). A user is defined once in Authentik: email + T3 Users membership + optional os_user attribute. Provisioner reconciles accounts from the Authentik API; roster.yaml retires. v1 foundation (config inheritance, locked clone, kubeconfig, swap, hardening, emo cutover) unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 07:19:25 +00:00
Viktor Barzin
37626cb89b workstation: docs — mark RBAC + Authentik gate applied [ci skip]
multi-tenancy.md + service-catalog.md status: per-user OIDC kubeconfig, oidc-power-user-readonly ClusterRole, emo k8s_users entry, and the Authentik T3 Users edge gate are now applied + verified. Remaining: emo cutover (Phase 5, held), offboarding apply-side (Phase 7), per-user MCP injection, roster-reconciled group membership.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:51:44 +00:00
Viktor Barzin
c611ecf84d workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip]
multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:27:17 +00:00
Viktor Barzin
3feb69e379 workstation: pin verified config-inheritance mechanism in design §4 [ci skip]
Spike GO (claude 2.1.168): managed claudeMd reaches a session; no managed-skills key exists so skills/rules inherit via per-user ~/.claude symlinks to the base (seeded in /etc/skel). Records the settings.json 0664->0600 leak fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:09:13 +00:00
Viktor Barzin
6504911a77 matrix: open (tokenless) registration + bot mitigations + #security alert
User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser
challenges break native clients). Bot defense is layered instead:
- Traefik rate-limit Middleware on a path-scoped /register ingress carve-out,
  keyed on request Host (GLOBAL /register cap) not source IP — the host is
  reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE
  tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min,
  burst 20, per replica; CrowdSec is the hard backstop on both paths.
- Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing
  #security Slack receiver (matches "registered on this server", never the
  rejection line). tuwunel's admin bot also posts signups to the admin room.

Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert).
Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so
[ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 13:27:02 +00:00
Viktor Barzin
bb7bcf803b multi-user-workstation: design + phased implementation plan
devvm multi-user Claude Code workstation: role-driven profiles (admin/power-user/namespace-owner) off one git roster (single source of truth, full onboard->offboard lifecycle); config inheritance via Claude's native machine-wide managed layer; per-user writable git-crypt-locked infra clone (ungated, apply-time is the boundary); per-tier OIDC kubectl; per-user secrets/auth (memory isolation deferred); incremental, emo-safe migration; capacity prereqs. Folds in gap-analysis findings verified live 2026-06-08. Designed, not yet implemented.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:58:29 +00:00
Viktor Barzin
3d6c5b8bc7 matrix/authentik: remove orphaned Matrix OAuth2 app + provider (post-tuwunel)
The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel
uses native password auth, so nothing consumed it. Deleted application `matrix`
+ OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale
Matrix rows from the SSO reference tables and update the plan's residual list.

Doc-only [ci skip].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:32:49 +00:00
Viktor Barzin
23602f393e matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated)
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).

Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.

Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 11:58:17 +00:00
Viktor Barzin
d4ec5768b2 vault-token-renew: version the devvm renewer + user units in the repo
The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.)

- scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access).

- scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions).

- docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery.

- docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 22:10:06 +00:00
Viktor Barzin
551412488b apiserver: enable audit logging (low-write Metadata) + ship to Loki
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Resource changes/deletions are now attributable (the novelapp deletion this week
was untraceable because apiserver audit was off). Low-write policy: drops
reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into
the kube-apiserver static-pod manifest + kubeadm-config (v1beta4
extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails
/var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}.

Root cause that had silently blocked this AND OIDC for weeks: a stray
kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate
static-pod manifest kubelet ran instead of the real one, dropping every flag
added to the real manifest. Removed it. Runbook added.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
9529eedfe0 docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip]
Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to
return 200 instead of proxying to the scaled-to-0 poison-fountain.
- security.md Layer 1 + tarpit description + troubleshooting (fix stale
  stacks/platform path -> traefik stack; drop misleading
  restart-poison-fountain step).
- .claude/CLAUDE.md: add matrix to PG rotation list; document that
  startup-read secret consumers need a Reloader annotation (matrix root
  cause, found via Loki 2026-06-05).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
d808694af4 docs(storage): record harden-half shipped (orphan cleanup + ghost-reconcile)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2a orphan cleanup (67 Released PVs + 475 LVs removed, VG pve 997->~410) + 2b
csi-ghost-reconcile CronJob done — ghost-disk doom loop closed by construction,
beads code-dfjn retireable. Cap kept at 28 (lowering would reverse the
2026-05-25 eviction-cascade post-mortem fix). Phase-1: insta2spotify migrated
(noted its 3.26GB image re-pull blip on node reschedule).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:39:36 +00:00
Viktor Barzin
63182730f9 docs(storage): record Wave-2 NFS migration + harden-proxmox-csi decision (option 1)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC
mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on
single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS
(tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots.

- storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history,
  updated cap-relief levers
- topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the
  decision doc

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:15:21 +00:00
Viktor Barzin
c24b4a21d8 docs(architecture): fix stale 5-node claim -> 7 nodes (k8s-node1..6) [ci skip]
Cluster grew to 7 nodes (k8s-master + node1..6; node5/6 added ~10d ago)
but several docs still said "5 nodes". Corrected with live specs:

- overview.md: 7-node enumeration; node1 is 16c/48GB (doc wrongly said
  32GB), node2-6 are 8c/32GB general workers
- compute.md: "5-node" -> "7-node" cluster description
- dns.md: NodeLocal DNSCache DaemonSet "5 nodes" -> "7 nodes"
- mailserver.md: HAProxy backend diagram "node1..4" -> "node1..6"

Illustrative "0/5 nodes available" scheduler-error examples left as-is.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:03:58 +00:00
Viktor Barzin
52f5de905d docs(context): freshen infra glossary (modules, tiers, new concepts) [ci skip]
Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs):

- Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app
  factory modules (never existed); name the real four (ingress_factory,
  nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local
  / flat distinction; flag vestigial modules/kubernetes/<app> dirs.
- Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers);
  reserve "tier" for State tier + Namespace tier only.
- Add local-path entry (cluster default SC; node-local footgun warning).
- Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico.
- Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC).
- Fix node count 5 -> 7 (k8s-master + k8s-node1..6).

Doc-sync (same commit per repo rules):
- overview.md: replace fictional factory modules with the real shared
  modules + the flat/stack-local pattern.
- .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision
  table + stale cross-reference (vault migrated off it 2026-04-25).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:34:49 +00:00
Viktor Barzin
aa948be581 storage: migrate tandoor off proxmox-lvm to NFS (LUN-cap relief)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC
is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden
proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see
docs/plans/2026-06-05-block-storage-harden-nfs-design.md.

- swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC)
- data copied + verified (HTTP 200 on NFS volume); block PVC removed
- block LV retained per SC policy (orphan cleanup tracked in code-dfjn)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:34:47 +00:00
Viktor Barzin
bc33cd5ac4 monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook
The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.

Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:18:31 +00:00
Viktor Barzin
dbe115910f monitoring: add local-only prometheus-query.lan ingress for ha-sofia SNMP sensors
ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage,
fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120,
~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST
Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac
scrape), scan_interval 30.

This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` →
`prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to
`/api/v1/query` (read-only instant-query only — not the UI/admin/federation).
ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from
a REST sensor), so this mirrors the existing local-only `.lan` exporter
ingresses HA already queries.

The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`)
was edited in place (auto-version-controlled by the HA version-control add-on;
pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The
Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan`
was added manually via the API — like the other `.lan` exporter hosts it is NOT
auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me`
records). Follow-up (already noted for the Loki sensor): extend that sync to
manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now
vestigial (HA no longer reads it).

Verified: all 7 HA sensors report correct fresh values from Prometheus (fan
10800 rpm, CPU 62.0C, power 280W, PSU 230/240V).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:25:06 +00:00