Drift guard section rewritten: admin-capable clobbers now self-heal at the
nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure;
manual re-mint is only the weak-clobber recovery now.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC
login the docs prescribe kept clobbering ~/.vault-token with a 7-day token,
and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry
loop, twice in June). On drift the renewer now re-mints the periodic token
with the clobbering token's own authority (Vault's 403 is the judge — no
policy guessing), sanity-checks it, replaces the file atomically, and
revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still
fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic
decides which old token-devvm-wizard tokens a heal may revoke (never the
just-minted one, never foreign tokens, nothing when the keeper is unknown).
TDD red-green for the heal branch that lands next.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Task-by-task TDD plan for the approved self-heal design: pure-function
tests first, then the heal branch, runbook update, deploy + live clobber
simulation, landing and memory updates.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to make 'vault login -method=oidc' work seamlessly on devvm:
today any OIDC login clobbers the permanent periodic token in
~/.vault-token, the drift guard only logs the drift, and his access
effectively expires weekly. Approved design: the nightly renewer re-mints
the periodic token from any admin-capable clobber (weak clobbers keep
failing loudly) and revokes stale periodic tokens after each heal.
Implementation follows on this branch.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Docs-with-change convention: the new tasks stack (Reminders-style PWA over
Nextcloud CalDAV) gets its catalog entry — what it is, its CNPG db + Vault
static role, the auth=required/X-authentik-username trust model with the
SEC-1 NetworkPolicy, and the ADR-0002 CI/CD path — and tasks joins the
Cloudflare proxied hostname list.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Deploys the Reminders-style tasks app at tasks.viktorbarzin.me: namespace,
ExternalSecrets (fernet_key from secret/tasks; TASKS_DB_DSN composed from
the pg-tasks static-creds password the tripit way), single-replica
Deployment of ghcr.io/viktorbarzin/tasks:latest (image ignore_changes per
the fleet set-image pattern; Reloader restarts it on the 7-day DB password
rotation; /healthz probes on 8000; Europe/Sofia local tz; DEV_USER
deliberately absent — security invariant), Service on 8000, and an
ingress_factory host with auth=required + dns_type=proxied since Authentik
forward-auth is the app's only gate. NetworkPolicy tasks-ingress (SEC-1)
limits pod ingress to the traefik namespace plus monitoring on 8000 for
/metrics, so the trusted X-authentik-username header cannot be spoofed by
other pods.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The new tasks PWA (Reminders-style front-end over Nextcloud CalDAV, per
tasks/docs/2026-07-03-tasks-pwa-design.md) needs its own Postgres database
for Connected Accounts and sync state. Follows the tripit/job_hunter
pattern exactly: idempotent null_resource creates role+db on the CNPG
primary with a placeholder password, and the Vault database engine static
role pg-tasks (added to the postgresql connection allowed_roles) rotates
the real password every 7 days, consumed by the tasks stack via a
vault-database ExternalSecret.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's first real forward carried the invoice PDF with
Content-Disposition: inline (Apple Mail does this for real documents),
and the attachments-only rules consumed nothing — recorded
PROCESSED_WO_CONSUMPTION, which also blocks reprocessing. Flipped all 5
rules to attachment_type=2 (process inline) via the API and documented
the trade-off + the ProcessedMail unblock step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Completes the ADR-0018 cutover. The stack is emptied to a tombstone so
CI destroys nginx, the NFS content volume, the ingress, the per-site
gdrive-sync CronJob and the namespace; serving + sync are owned by
stacks/valia-sites since the cutover commits. Catalog + runbook updated
to the migrated state (incl. the one-time 42.9→21.4MB video compression
Viktor approved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Creates the public proxied CNAME stem95su -> stem95su.pages.dev and
adds the internal split-horizon entry via the valia-sites-dns
ConfigMap (the sync's update pass repoints the existing internal
record). Completes the ADR-0018 cutover; the old in-cluster serving
stack is retired in a follow-up.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First half of the ADR-0018 stem95su cutover: the tunnel-target CNAME is
destroyed so stacks/valia-sites can create the Pages-target record for
the same name (Cloudflare allows one CNAME per name; the follow-up
commit flips manage_dns=true there). stem_video.mp4 was compressed to
21.4MB with Viktor's explicit OK, clearing the 25MB Pages cap; content
is already deployed on the stem95su Pages project. Brief public
NXDOMAIN window between the two applies is accepted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Hit during the docs@ rollout: after a pod restart postsrsd came up
spinning without binding its TCP ports, so postfix cleanup tempfailed
every message with 451 queue file write error. Document the signature
and the supervisorctl-restart / pod-recreate fix.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CI pipeline 469 failed with 'Invalid index' on the postfix_virtual alias
filter: terraform only short-circuits &&/|| from v1.6, and the older
terraform in the infra-ci image still evaluated split(" ", line)[1] for
the blank and comment lines that have been in extra/aliases.txt since the
plans@ block. The devvm's newer terraform short-circuits, which is why the
local apply of the same commit passed. A conditional expression is lazy on
every terraform version, so move the length guard into a ternary.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A root-run kubectl exec mail_fetcher downloads attachments root-owned into
the scratch dir and the celery consumer (uid 1000) fails with
PermissionError — found during the build E2E. Document s6-setuidgid usage
and the recovery step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to forward arbitrary emails with PDF attachments into
paperless-ngx, with the forwarding sender mapping 1:1 to the paperless
account that owns the document. paperless-ngx's built-in IMAP consumer
already does the sender->owner mapping, so the infra half is a dedicated
real mailbox docs@viktorbarzin.me: an explicit self-alias (the @domain
catch-all would otherwise divert it into the TripIt-swept spam@ mailbox,
whose sweeper LLM-parses and auto-replies to mail from linked senders)
plus a per-user Dovecot sieve that discards non-family senders at
delivery (chosen behaviour for unmatched senders: ignore and delete;
also keeps spam out of the guessable address). The mailbox credential
was added to Vault secret/platform.mailserver_accounts. Paperless-side
mail account + 5 per-sender rules are DB state, configured via the API
per the new runbook docs/runbooks/paperless-mail-ingest.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor wants the traffic-flow view as a colored excalidraw instead of
the ASCII block (which was the only thing rendering after the earlier
VLAN-tagging SVG commit failed to push — a locally-masked non-fast-
forward this session, not a merge clobber). Ships both the editable
.excalidraw scene and a hand-drawn-style SVG export embedded in the
Traffic-on-the-trunk section: two lanes showing where the 802.1Q tag
is added, carried (only P5<->vmbr0) and stripped, L2 membership drops
vs L3 firewall verdicts.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Runbook covers add/update/retire (one map entry; internal DNS now
cleans up after itself), content rules for Valia's folders, and the
failure modes incl. both token re-mint paths. dns.md superset-rule
paragraph now describes the declarative ConfigMap reconcile instead of
hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row
notes its Pages cutover is parked on the 42.9MB stem_video.mp4
exceeding the 25MB Pages per-file cap.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two fixes from the first live runs. (1) The sync job now skips a whole
site when any file exceeds Cloudflare Pages' 25MB per-file cap, leaving
current serving untouched — stem95su's stem_board.html references a
42.9MB stem_video.mp4, which made every run fail; the guard turns that
into a loud skip so bridge keeps syncing. (2) The CI terraform is older
than 1.7 and rejects removed{} blocks anywhere (pipelines 461/464), so
the bridge record handoff was completed with a one-time manual
'tg state rm module.cloudflared.cloudflare_record.bridge_pages' from
the main checkout; the block is deleted and the module comment records
the manual step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked for one diagram showing just the physical connections
between nodes, separate from the logical/VLAN topology: ISP->AX6000,
the in-wall apartment->garage run into P1, 4G router (cellular OOB),
UPS mgmt, the PoE cat6 to the camera, the LAN1 cable to eno1, dark
eno2 fallback + free eno3/4, iDRAC on shared-LOM, and the note that
everything else on the R730 is virtual. Referenced from the ADR next
to the logical SVG.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pipeline 461 failed terraform init: the removed{} handoff block sat in
the stack-local module, but Terraform only allows removed blocks in the
root module. Same intent, correct position (from =
module.cloudflared.cloudflare_record.bridge_pages, destroy=false).
Without this the stale state entry would make the next cloudflared
apply destroy the record valia-sites now owns.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
While reviewing the single-switch design Viktor asked whether both the
home LAN and the camera VLAN 'go via pfSense which forwards upstream' -
a natural misreading a future reader would repeat. Added a section
spelling out the vmbr0 fork: untagged home LAN is L2-bridged past
pfSense (gateway stays the AX6000, rack outage does not affect it, OOB
via 4G survives), while tagged-30 can only land on the dCCTV interface,
making a pfSense bypass impossible by construction. Includes a compact
ASCII topology for terminal readers alongside the SVG.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Valia keeps asking Viktor to host 1-page sites from her Drive folders;
this makes it one map entry. New stacks/valia-sites: per site a CF Pages
project + custom domain + proxied CNAME (bridge adopted via import{}),
a ConfigMap feed (valia-sites-dns) the technitium ingress-dns-sync
script now reconciles internal CNAMEs from (add/update/REMOVE — fixes
the add-only stale-record gotcha), and one shared 10-min CronJob that
mirrors each Content folder (rclone, drive.readonly, stem95su's guards)
and wrangler-deploys ONLY on manifest change (free-tier deploy cap).
Scoped CF Pages token + shared rclone conf in secret/valia-sites; the
Global API Key never enters a pod. cloudflared forgets bridge's record
via removed{} (no destroy). stem95su is in the map dns-parked
(manage_dns=false) until its cutover commit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Grill session with Viktor: his mother Valia will keep asking for 1-page
site hosting, so the pattern is being made repeatable. Decisions: all
Valia sites serve off-infra on Cloudflare Pages (survive homelab
outages); one shared in-cluster CronJob mirrors her Drive folders every
10 min and redeploys on change; English subdomain names picked by
Viktor; failed-Job-only visibility; stem95su migrates onto the pattern.
CONTEXT.md gains Valia site / Content folder / Entry file; full
rationale and rejected options in ADR-0018.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to rename the 'мост' school static site to 'bridge'.
New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already
deployed and the custom domain attached; this renames the public CNAME
(TF resource most_pages -> bridge_pages, destroy+create swaps the
record) and the internal split-horizon static CNAME in the
ingress-dns-sync CronJob. The old 'most' Pages project and the stale
internal 'most' record are removed out-of-band after this applies.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The internal split-horizon zone is authoritative for viktorbarzin.me,
so the new Cloudflare Pages site (most.viktorbarzin.me, added for
Viktor's 'мост' school static site) NXDOMAINed for every internal
client — LAN, VLANs and pods — while resolving fine externally.
Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev)
in the ingress-dns-sync CronJob next to the mail-auth records, and
document the off-infra-site case in dns.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to host a static HTML site (the 'мост' school project,
ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages
with a custom domain, as a try-out of Pages hosting. The site content
is deployed off-infra via wrangler to the Pages project 'most'
(most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it.
The custom domain is already attached to the Pages project and is
waiting on this DNS record to validate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor prefers not running two switches, so the TL-SG105PE takes over
all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV
segment moves onto a managed tagged trunk over the existing LAN1 cable:
pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same
MAC so vtnet3/dCCTV survived untouched). This is safe where the original
802.1Q rejection was not, because the managed switch is the only device
on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the
documented fallback. Old SG105E retires to cold spare; PE inherits
192.168.1.6. Glossary Segment term updated (all three segments are now
bridge-tags feeding untagged pfSense vNICs).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to verify free ports on the garage switch (192.168.1.6)
before finalizing. Logging into it showed it is NOT the TL-SG105PE from
the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use
(apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch
port-VLAN design written earlier today was based on conflating the two
devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2
uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched,
and no VLAN config exists anywhere. ADR, topology SVG and networking.md
updated to match.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked for place photos on the tripit Trip board. The app-side
work (add-time photo fetch, board place cards) shipped in tripit
v0.106.0, but prod never set PLACE_PHOTO_PROVIDER, so the fake provider
would store placeholder PNGs for every hand-added place. Same class of
fake-default gap as PLACE_RESOLVER_MODE (set explicitly for the same
reason); the ADR-0035 rollout had left both the env flip and its
backfill cron undone.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor and emo are adding the first owned camera at the Sofia site (HiLook
IPC-T241H-C watching the garage / server rack). Viktor asked to finalize
emo's plan; the grilling session resolved emo's five open decisions and
replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated
physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24),
port-based VLAN split on the shared TL-SG105PE, camera default-deny with
NTP-only egress, Frigate + ha-sofia as the only consumers.
The PVE bridge, pfSense interface, Kea subnet and firewall rules were
applied live this session (hand-managed hosts, backed up). This commit
records the decision (ADR-0017), the glossary terms (Segment / CCTV
segment), the as-built architecture doc, and bumps Frigate's ADR-0016
VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Emil asked his Sofia Portal Mini photo-frame to show only the past
year of photos rolling from today, instead of the last two years.
Changes ImagesFromDays 730 -> 365 in the frame-emo Settings.yml.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned
12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G)
band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel
throttled every task in the cgroup indefinitely (memory.pressure full ~80%,
oom_kill never fired) - the t3 event loop starved, the accept queue rotted,
and the terminal was dead until the hog was SIGKILLed by hand.
The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog
that stabilises between high and max never OOMs, so the throttle band is a
livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh
is now explicitly infinity on all three work cgroup definitions (t3-serve@
unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-
killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3
server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.
Applied live to the devvm the same day (daemon-reload + runtime set-property
on running cgroups, no session restarts). Post-mortem addendum + runbook
updated in the same commit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Session wrap-up doc sync: the Immich note still claimed the shared T4 had no
VRAM isolation. Record the gpumem budget/watchdog shipped earlier today, that
the watchdog is observe-only, and that budgets need a retune (llama-swap's
real 16k-ctx resident is ~7GB, not 4.35) before arming.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor saw dawarich throwing 429s through Traefik and asked to loosen
the burst for it. The access log confirms the burst pattern: one page
load fires the whole fingerprinted-asset tail (SVG store badges,
favicons, webmanifest) from a single client IP and trips the default
10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429).
Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and
authentik: dedicated dawarich-rate-limit middleware (average 100 /
burst 1000) + skip_default_rate_limit on the dawarich ingress. Also
updates the networking.md middleware enumerations (adding the
previously undocumented tripit/health limiters alongside dawarich).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The image was still built by hand and pushed to DockerHub (v1..v4),
predating the all-builds-off-infra doctrine; Viktor chose to move it
onto the standard pipeline while shipping the export/rename feature
rather than keep the manual flow.
Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml
(go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns
added to the Kyverno ghcr-credentials allowlist (package is PRIVATE),
deployment now pins ghcr :latest with pullPolicy Always + pull secret,
Keel force/match-tag/5m annotations seed the metadata (live values win
via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays
frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image
lists updated (also backfilled the missing k8s-portal rows in ci-cd.md).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Users couldn't see Excalidraw's built-in Save as / Export image options:
the app's custom toolbar was drawn exactly on top of the native hamburger
menu button, hiding it. Removed the overlay and integrated Back to
Library / Save now / Rename into the native menu, so the native export
formats (.excalidraw file, PNG, SVG, clipboard) are now reachable.
Viktor asked for exports to work via the native Excalidraw feature and
for drawings to be renameable by clicking their name.
Rename: new PATCH /api/drawings/{id} endpoint (server-side name
sanitization, 409 on conflict) + click-to-rename title pill in the
editor (updates URL in place) + Rename button/modal in the dashboard.
Existing GET/PUT/DELETE semantics unchanged for API compatibility
(emo's upload pipeline). Added main_test.go (httptest) covering rename
+ existing handler behavior; dashboard rows now DOM-built (XSS-safe).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes,
migration guide and release discussion #29439 reviewed — no config-breaking
changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD
vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over
HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement).
The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream
v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2';
Immich upgrades the extension itself at startup). Both photo frames switch
to ImmichFrame's immich_v3 compatibility tag because every versioned
ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API
responses; repin to a versioned tag once upstream ships stable v3 support.
Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so
this commit is the source-of-truth record; the live rollout happens via
kubectl set image in the same session. Pre-upgrade pg_dumpall taken
(job postgresql-backup-pre-v3).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to fix emo's permission so his Claude can upload to the
Excalidraw service. emo's recent sessions show the documented upload
recipe (kubectl port-forward svc/draw + X-Authentik-Username header,
from his ~/.claude/CLAUDE.md) failing with:
pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser
in namespace excalidraw
because his default kubeconfig is the read-only emo-browser SA (its
port-forward grant covers only chrome-service) and his old admin
kubeconfig at /home/emo/code/config expired and was removed.
Add a namespace-scoped Role (pods/portforward create) + RoleBinding for
that SA in the excalidraw namespace, mirroring the 2026-06-28
chrome-service grant. Trade-off (any-user drawings via the trusted
username header) documented in the file and accepted.
Also record the grant in docs/architecture/chrome-service.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor doesn't want a Slack message for every CI run — only failures.
The infra apply pipeline posted a status line to #general on every push,
and the renew-tls / postmortem-todos / registry-config-sync /
pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine
messages a week). Now: the apply pipeline's success post is gone
(notify-failure already covers failures), all cron notifies are
status:[failure] with explicit FAILED texts, and drift-detection is
silent when all stacks are clean (still posts drift findings and errors,
and gains a hard-failure catch step it previously lacked). Kept:
notify-nonadmin-push (org audit feed) and the actionable provision-user
post. Per-app deploy template in ci-cd.md updated to match.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor was getting upgrade-error Slack messages every hour and wants
update notifications at most weekly. Root cause of the errors: Keel kept
trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's
require-trusted-registries denied it — gotenberg/* (and apache/*, which
tika will hit next) were never allowlisted, and Keel's Slack notifier at
info level re-posted the identical failure to #general on every hourly
poll since Jun 28.
Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly;
disable Keel's direct Slack notifier and replace failure visibility with
a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification
plus the daily digest, never an hourly drip); remove diun's Slack
notifier whose default message @channel-pinged #image-updates for every
new upstream tag every 6h (the n8n upgrade-agent webhook feed is
untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC).
Paperless-ngx itself stays paused (keel policy=never, user-managed) while
the ingest runs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A GitGuardian audit of the infra repo showed the recent alerts were test
fixtures (false positives), but surfaced a real historical leak: a
cluster-admin kubeconfig was once committed as stacks/f1-stream/.../.config
(now expired, reachable only via a GitHub PR ref). The .gitignore already had
a `config` rule for kubeconfigs but missed the dotfile form `.config` — which
is exactly how that file slipped onto the public mirror.
Close the gap in two layers:
- .gitignore: also ignore `.config`, `kubeconfig`, `*.kubeconfig`,
`admin.conf`, `.kube/` so they're never staged by accident.
- .gitattributes: route `.config`, `kubeconfig`, `*.kubeconfig`, `admin.conf`
through git-crypt so a force-add or rename still lands as ciphertext (never
plaintext) on the public GitHub mirror.
No tracked files match these names today, so there is zero retroactive impact
— purely forward-looking prevention.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The nightly ibkr sync failed with 'No such command ibkr': every broker-sync
CronJob still pulled viktorbarzin/broker-sync:latest from DockerHub, which
nothing has pushed to since the ADR-0002 move to GHA->ghcr on 2026-06-13 —
the jobs were silently running a frozen pre-ibkr build. The migration had
allowlisted only the wealthfolio namespace for the private
ghcr.io/viktorbarzin/wealthfolio-sync image, so broker-sync also lacked
pull credentials. Repoint the image, add ghcr-credentials imagePullSecrets
to all eight CronJobs, and allowlist the broker-sync namespace (wealthfolio
stays — its own monthly sync pulls the same image). Related: code-9ko8.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First apply of ADR-0016 failed: terraform local-exec defaults to /bin/sh,
which on Ubuntu is dash — 'set -euo pipefail' exits 2 before running kubectl.
Pin the interpreter to bash. Everything else in the gpumem apply succeeded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
All broker-sync CronJobs share one RWO proxmox-lvm volume. With free
scheduling the nightly 02:00-04:15 runs land on different nodes, forcing
a detach/attach cycle whose QMP hotplug intermittently ghost-attaches on
disk-heavy VMs — every job then sits in ContainerCreating for hours
(happened 2026-06-30, 07-01 and again 07-02; fires
PodsStuckContainerCreating and skips the day's trade syncs). Pinning all
seven volume-mounting jobs to k8s-node4 (fewest CSI disks, 11) makes the
volume attach once and stay put — no hotplug dance, no wedge.
version_probe mounts nothing and stays unpinned. Durable fix for the
recurrence tracked in beads code-9ko8.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
From the flaw-hunt workflow (all verified):
- Projected-FIRE-date panels (solo/household/family) now guard savings £/yr:
0 / empty / negative all render "Set savings £/yr" instead of a blank tile,
a SQL error, or a nonsensical past date ("Jan 1849"). Verified across cases.
- New "Tax regime" panel surfaces the per-country jurisdiction — 14/22 countries
fall back to the neutral 'nomad' 1% assumption, which was previously invisible.
- Intro no longer hard-codes "£139k pension" (contradicted the £328k tranche
panel); pension value is now only shown data-bound in the tranche panel.
- Intro adds caveats: Anca's spend is an estimate (pending live re-pull), and
non-modelled countries use the nomad tax fallback.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The document server had been deliberately scaled to 0/0 for 184 days, but
its ingress kept the uptime-kuma monitors alive, so 'onlyoffice down'
showed up in every daily alert digest. Viktor approved tearing it down.
terragrunt destroy ran clean (11 resources) before this commit; the kuma
monitors auto-prune with the ingress. Also drops the onlyoffice/* image
prefix from the kyverno trusted-registries allowlist, the service-catalog
rows, and updates the nextcloud collabora comment. Document data (if any)
remains on the PVE NFS share.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor is getting daily Slack alert noise; these two were the recurring
generators. The postiz-postgres-backup CronJob still dumped from the old
in-namespace postiz-postgresql service that was removed in the CNPG
migration (2026-06-28) — it failed every night at 03:00 and re-fired
BackupCronJobFailed each day. The postiz DB now lives on the shared CNPG
cluster and is already covered by the dbaas per-db dumps, so the CronJob
(and its NFS backup volume) is redundant and removed rather than repaired.
portal-stt/portal-tts advertised prometheus.io scrape annotations that
never worked: the deployed Speaches build 404s /metrics, and openai-edge-tts
has no metrics at all (its annotation pointed at a JSON endpoint, which
fails exposition parsing regardless). Both produced a permanently firing
ScrapeTargetDown. Annotations removed until the apps actually serve metrics.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The current, still-accruing calendar year read misleadingly high (e.g. 2026
at 5 months showed £149/h gross, above all of 2025) because the full-year
bonus - paid every March - plus front-loaded quarterly RSU vests get divided
by only the months worked so far. It settles lower as the year completes.
Split each line into a solid series (complete years) and a dashed series
(the latest, still-accruing year), so the provisional point is visually
flagged. The split auto-detects the in-progress year (latest year with
< 12 months of payslips), so it needs no per-year maintenance. Panel
description now explains the caveat.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted to see, on the wealth dashboard, the hourly wage he earned
each year - both gross and net - with year on the X axis.
New timeseries (line) panel "Effective hourly rate - gross vs net":
- hourly = annual pay / hours worked; hours = contractual 40h/week
(2,080h per full year, confirmed from the Facebook/Meta UK offer letter:
Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually
worked so partial years (2019, 2020, 2026) read correctly.
- Gross = gross_pay incl. notional RSU vest; Net = take-home.
- timeFrom 10y so all years show under the dashboard's default 180d range.
Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload
of doc 33) was removed separately, so 2023 is no longer double-counted; this
also corrects the existing net-pay panel.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MySQL device-write investigation (code-oflt): after the nextcloud webcal
throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients),
MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s,
~55 pages/s) is the doublewrite buffer writing every flushed page twice.
Redo is negligible (0.01 MB/s), no temp-table spilling.
Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the
cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer
(~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps
torn-page DETECTION metadata — a crash-torn page is flagged on recovery
(restore from the daily mysqldump) rather than silently corrupt. Chosen
over full OFF: same write saving, keeps detection, and OFF requires a
shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable
risk given the PERC BBU cache + UPS (in-flight writes complete on power
loss) + daily per-db backups.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Node "container churn" investigation (code-oflt): container logs (~30 KB/s)
and overlayfs (~17 KB/s) are negligible; the node OS-disk churn is ext4
journal (jbd2) metadata writes driven mostly by journald's continuous
appends. node4 + node5 had drifted to uncapped persistent journald (4 GB
each, ~100 KB/s); master/node1-3 were correctly capped at 500M.
Node + pod journals already ship to Loki (alloy loki.source.journal), so
on-disk journald is pure write-IOPS overhead on the IOPS-bound sdc. Switch
journald to Storage=volatile (RAM, RuntimeMaxUse=200M) fleet-wide:
- cloud_init.yaml: drop-in 90-oflt-volatile.conf for new nodes (replaces
the old persistent seds).
- running nodes (master + node1-5): pushed the same drop-in via qm guest
exec + journald restart + cleared /var/log/journal.
Verified node5: OS-disk writers jbd2/sda1-8 931->46 KB/s, systemd-journal
gone (~94% drop); ~4 GB freed each on node4/node5. Logs stay queryable in
Loki. Trade-off: a hard crash loses the last unshipped journal.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its
~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's
onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking
recruiter-responder for ~5h. Viktor asked for memory protection so we don't
overallocate GPU memory, and chose to do it at the scheduling level (no
device-plugin swap) after weighing HAMi and MPS.
Make the scheduler VRAM-aware and add runtime teeth, all repo-native,
time-slicing untouched:
- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a
reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000,
llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300
<= advertised) so the scheduler refuses to co-schedule past the card
(overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when
actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to
false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown --
the 2026-06-02 post-mortem's never-built free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing
glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment.
HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node
shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect the code-oflt MySQL write-reduction work (commit 82c9e69b + the
nextcloud webcal app-data throttle):
- MySQL row: buffer pool 1->2Gi, mem limit 4->6Gi, and the nextcloud
webcal calendar churn that was ~60% of MySQL's writes (now throttled
in oc_calendarsubscriptions.refreshrate — app-data, can regress).
- CNPG apply-gotcha note: the mysql_standalone VCT-annotation drift no
longer needs -target dodging (now ignore_changes'd on the STS VCT).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt).
Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live
attribution found ~60% of its writes were nextcloud webcal calendar churn
(throttled separately at the app layer); this addresses write amplification
on the remainder:
- innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi
hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free
page -> constant flush-to-make-room write IOPS).
- container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already
at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the
headroom. One-time MySQL pod restart to apply.
- ignore_changes on the StatefulSet volume_claim_template: the VCT is
immutable post-creation and pvc-autoresizer rewrites its annotations on
the live object, so TF's desired VCT could never apply and errored every
broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the
long-standing need to -target around it.
Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy,
24 DBs reachable, restart clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP
(10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc
still listed only four IPs in use / three dedicated. Add the .204 row to
the allocation table, bump the counts (five in use, four dedicated, 5-IP
layout), and add a LB-IP renumber-checklist entry for the out-of-band
consumers (the go2rtc WebRTC candidate on the frigate config PVC and the
HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE
candidates, so the Service annotation is the single source of truth.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
HA live video from the cluster Frigate hangs/fails because the only path
to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203),
which cannot carry RTSP or WebRTC. The container already listens on
8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was
never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB
IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 +
WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN
endpoint for native HLS/WebRTC live (parity with the Hikvision NVR).
Companion non-Terraform steps are in the PR body.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Memory limits were committed (batch 2/3) but pipeline #427 was killed mid-apply and the local homelab tf apply hit a stale backend-init; this comment-only diff re-triggers a clean CI apply for the three stacks so live matches master (job-hunter 768Mi, wealthfolio 512Mi, f1-stream 384Mi).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add --enable-vault-token-cache to the ESO controller (a graduated,
non-experimental flag in chart 2.6.0). Until now ESO authenticated to
Vault with login -> lookup-self -> revoke-self on *every* secret fetch.
Across 92 ExternalSecrets refreshing every 15m that measured ~0.22
logins/s + ~0.22 revoke-self/s on the active Vault member, and each
cycle is a token create+revoke (plus its lease) written to the Raft log
on all three members. Those fsync-heavy writes land on the contended
PVE RAID1 7200rpm HDD (sdc) -- one of the write sources behind the
recurring control-plane flaps (code-oflt write-reduction).
The eso kubernetes-auth role already issues a 240h periodic, unlimited-
use token, so the churn was pure waste: ESO discarded a perfectly good
token after a single use. With token caching ESO mints one token and
reuses/renews it, collapsing logins from ~13/min to a handful per token
lifetime. Verified live: vault cache initialized, 112/113 ExternalSecrets
Ready (the one failure, instagram-poster, is pre-existing data drift
unrelated to auth), logins dropped to ~0 after warm-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
shlink 512->704Mi, linkwarden 1Gi->1280Mi, chrome-service 2Gi->2624Mi, forgejo 4Gi->5Gi, f1-stream 256->384Mi. All were request==limit with 30d peak at 91-100% of the ceiling — a spike would OOM-kill them. Raising the limit (now Burstable, request<limit) gives real burst headroom. This is the genuine 'don't OOM on occasional spike' fix. Small add (~2.2Gi limits) vs the ~20Gi of fat removed in batches 1-2, so net overcommit keeps dropping.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflects the write-reduction params applied in c3553731, and documents the
null_resource trigger-bump + targeted-apply gotcha so the next agent doesn't
hit the inert-change / mysql-VCT-drift traps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Part of code-oflt (cut sdc write IOPS before the SSD move; analysis #6922).
- archive_timeout 300->0: CNPG forces archive_mode=on but .spec.backup is empty
(no ObjectStore), so a 16MB WAL segment switch every 5min shipped NOWHERE =
~4.6 GB/day of pure-waste WAL on the contended sdc. archive_mode stays CNPG-on
(reserved); 0 just stops the timed switch. Daily pg_dump cron unchanged.
- commit_delay 0->2500us: group-commit coalesces concurrent fsyncs. SAFE for
every DB incl financial -- data still fsynced before COMMIT acks, only <=2.5ms
added latency under concurrency.
- wal_compression pglz->zstd: ~30-50% smaller full-page images.
All sighup-reloadable. Applied via targeted apply of
module.dbaas.null_resource.pg_cluster (trigger bumped) to avoid the pre-existing
mysql VCT drift that breaks broad dbaas applies.
Refs: code-oflt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
claude-breakglass 4Gi->512Mi, stirling-pdf 1536->512Mi, insta2spotify 2Gi->256Mi, recruiter-responder 768->256Mi. These idle/utility services had memory LIMITS sitting 4-15x above their 30d peak, inflating cluster limit-overcommit to 142% across the 5 post-node6 nodes. Burstable (request<limit), limits capped at ~peak x1.5 (never below peak), so no OOM risk (verified zero OOMKills cluster-wide in 30d). Reduces phantom limit overcommit + frees scheduler requests.
Follows the 3-reviewer adversarial review: raising limits on an already-overcommitted cluster worsens correlated node-OOM; the real fix is trimming the fat. Limits only lowered where peak is far below; tuned/DB/GPU limits untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Part of the sdc IOPS-reduction work (code-oflt). 462 daily thin snapshots
(66 PVCs x 7d) drive ~10-34 w/s of thin-pool metadata (tmeta) CoW writes on
the contended sdc spindle and pin ~2TB in the 70%-full pool. Halving to 3
days roughly halves both. Instant-restore window shrinks 7->3d; daily-backup
still keeps 4 weeks of file-level PVC history, so DR coverage is unchanged.
Deployed to the PVE host via scp (these host scripts are scp-deployed, not
TF-managed). Doc updated in .claude/CLAUDE.md.
Refs: code-oflt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).
Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
a2c8f906 added checkpoint_timeout=15min + max/min_wal_size to the CNPG
Cluster YAML, but the cluster is applied via null_resource.pg_cluster +
local-exec kubectl apply, which only re-runs when its `triggers` change.
The YAML edit didn't bump a trigger, so the change was inert and never
applied (incl. via CI). Bump the pg_params trigger so the kubectl apply
re-runs and CNPG hot-reloads the new params (reloadable, no restart).
Landing it via a targeted apply (-target=null_resource.pg_cluster) to avoid
3 pre-existing unrelated drifts in this stack -- notably a mysql_standalone
volumeClaimTemplate annotation diff the apiserver rejects as immutable,
which is what fails broad dbaas applies (and silently blocked a2c8f906).
Refs: code-oflt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reusable Workflow script that audits whether the cluster is memory-overcommitted and whether a single k8s worker can be removed to return RAM to the PVE host without sacrificing N-1 failover. Read-only throughout: gathers PVE host memory (qm config / free / KSM via SSH), k8s per-node capacity + cluster 30d peak working set, and per-workload right-sizing, then models N-1 two ways (physical actual-usage and scheduling-by-request) and adversarially verifies the conclusion with 3 skeptics.
Sizes requests (scheduling reservation) and limits (OOM ceiling) as SEPARATE knobs — an earlier ad-hoc pass conflated them by sizing requests to 30d peak, which manufactured a false N-1 shortfall. Invoke via Workflow {scriptPath}, or by name when cwd is the infra repo.
Requested by Viktor: identify memory overcommit and whether deployment requests can be trimmed to free PVE host RAM by removing a node, without sacrificing service reliability.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc
IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints
fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each
triggering a full-page-write burst + flush onto the contended 7200rpm sdc
spindle -- a top write-IOPS source after etcd.
Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so
checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled
rather than churned. All three are sighup-reloadable -> CNPG applies them
without a restart or failover. checkpoint_completion_target stays 0.9 so
each checkpoint's IO is still smeared across the interval. Bounded
recovery-time tradeoff (more WAL to replay on crash), acceptable for the
write relief. wal_compression left at pglz ('on') pending image
zstd-support verification.
Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed
shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB /
2560MB / 3Gi).
Refs: code-oflt (etcd/sdc IO isolation).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held
gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt
buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security)
uptake now lags up to 7 days instead of <=1.
- var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h
after the Sunday check, so nightly-report.py's ~25h staleness threshold stays
valid AND still flags a missed weekly run; no STALE_SECONDS change needed)
The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename
= churn). Cadence wording updated across main.tf comments, nightly-report.py
docstring, and the runbook.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor flagged not wanting to wear the single non-RAID SSD with useless etcd
writes if etcd moves there. Investigation found the avoidable load is kyverno
reporting: the 2026-06-12 etcd-load-reduction disabled the report *features*
but left the reports-controller running (default --enableReporting +
--validatingAdmissionPolicyReports=true), so the 2026-06-21 kyverno upgrade
left a one-time pile of ~10.5k cluster/namespaced ephemeralreports (~114MB in
etcd) that nothing reaps (aggregation off). Listing that range starves etcd's
fdatasync enough to flap the apiserver (observed live 2026-06-28).
Disable the reports-controller outright (reportsController.enabled=false),
completing the 2026-06-12 intent. Reports are not consumed (violations surface
via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are
independent of it. The ~10.5k stale reports already in etcd are cleared
separately (throttled, out-of-band) since bulk-deleting them is itself
etcd-heavy.
Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`.
The pfSense egress-monitoring apply (commit 7fe2d978, CI pipeline #414) was
cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources
applied (probes green, rules loaded) but the Terraform state write and the helm
release finalize were lost, leaving the prometheus release stuck in
pending-upgrade (manually unstuck). This commit re-applies the unchanged
monitoring stack so state matches live, with zero resource changes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor plans to live to 100, so the portfolio must last that long. The
fire-targets CronJob was solving a 60-year horizon (≈ to age 88); set it to 72
(retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years
to fund). A one-off in-cluster job re-solves the existing rows at the new horizon.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.
- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
cloudflared replica metric is blind to tunnel-connection loss. Threshold
calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.
All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status
CR (immutable status.node) flapped the PG load-balancer VIP and silently
broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error
"Cannot read PG creds" masked the real cause for ~25 days). Written when
the incident closed (beads code-aoxk, 2026-05-26) but never committed;
landing it so the RCA + stuck-CR cleanup procedure live in the repo.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The fire-planner-pg Grafana datasource baked the rotating fire_planner DB
password into its provisioning ConfigMap at terraform plan-time, so on every
7-day static-role rotation the password went stale and ALL fire-planner-pg
dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown)
silently failed with "password authentication failed for user fire_planner"
until the next stack apply.
Switch to the same live-env pattern wealth-pg / payslips-pg already use:
- new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader
match) mirrors the rotating Vault static-creds/pg-fire-planner password
- datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD}
- Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on
rotation so the provisioned datasource never goes stale
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly
whenever someone actively opened chrome.viktorbarzin.me — the view connected
then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify
framebuffer/encode buffers spike past the 96Mi cap when streaming the
1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi
(Burstable, aux tier). Already applied live via a transient kubectl patch
(Recreate rollout, verified 0 restarts since); this lands the durable state
so the next apply / daily drift-detection doesn't revert it to 96Mi.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to give emo access to the cluster's headed Chrome so he can fill
in forms and get past anti-bot / captcha pages. emo was deliberately locked
out of chrome-service (noVNC Authentik allowlist was Viktor-only + his
power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE
his existing browser rather than stand up an isolated per-user instance,
accepting that emo can therefore reach Viktor's warmed logged-in sessions
(CDP has no per-context auth, so the single shared persistent profile is
reachable by anyone who can drive the browser). emo's CLI use is hands-off
(his agent can run it unattended).
- authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED
so the admin-services-restriction policy admits him to chrome.viktorbarzin.me
(noVNC). Reverses the prior Viktor-only lock; comment updated to record why.
- chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token
(dashboard-sa.tf pattern), a chrome-service-portforward Role granting
pods/portforward, and a cluster read-only binding (oidc-power-user-readonly)
so the SA can resolve the Service and emo's normal read access doesn't regress.
- t3-provision-users.sh: install_browser_kubeconfig installs a dual-context
kubeconfig for any user with a <user>-browser SA — SA token as the default
context (non-interactive, works headless), personal OIDC retained as the
oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the
headless agent session that homelab browser needs.
- docs/architecture/chrome-service.md: document the shared-browser multi-user
access model, the session-exposure trade-off, and how to grant/revoke a user.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6 OCR workers crept past the 8Gi per-container memory cap over ~6h and
OOMKilled paperless at 15:00 during the Emo bulk import. The import
auto-recovered (the consume dir lives on the PVC, so a restart re-scans
and reprocesses — nothing lost), but it left the queue inflated with
re-queued duplicates and spiked etcd on each restart.
The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth
raising for one namespace. 4 workers fit with headroom (4 measured
~1.3Gi). Matches the value applied live via `kubectl set env` during
incident response; this removes the drift so the next apply keeps it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik
OIDC + static-DB password) came from the never-merged branch
`wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt
apply` would have DESTROYED the stack. This lands that postiz config to
master so HCL == state == live (CI green; destroy-landmine gone).
Kept PARKED (postiz + temporal replicas = 0): IG-via-postiz is Meta-
blocked (it hardcodes retired Instagram scopes → OAuth "Invalid Scopes"),
which is why it was parked; IG runs via the instagram-poster service. To
revive later: flip postiz `replicaCount` + temporal `replicas` back to 1
and re-check image pins.
Notes captured in this reconcile:
- ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the
live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24";
caught + rolled back during this work).
- The 4 Authentik resources (app/provider/group/binding) were re-imported
into state (adopted, not recreated — no duplicate AK objects); the
obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its
synced secret was kept).
- Vault-side cleanup (removing the unused pg-postiz rotated role) is
deliberately NOT included here — deferred, postiz uses a static
secret/postiz database_url.
State was already reconciled by a local `scripts/tg apply`; this commit is
the HCL catch-up (CI re-apply is a no-op).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Old-browser users on the SFE who have a password but no MFA device hit the
default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE
cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error).
emo (Google-only, iPadOS 15) hit this on the password path.
Document the two no-MFA-downgrade fixes: (1) social login, whose source flow
(default-source-authentication) has no MFA stage, so the SFE's social button
always completes; (2) enrolling TOTP, which the SFE can validate (unlike
WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was
enrolled for emo and stored in his Vaultwarden authentik item; verified
end-to-end (a Bitwarden-generated code is accepted by authentik).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two follow-ups to patch2 (both in patch-compat-sfe.py, guarded):
1. compat_needs_sfe() now also serves the SFE to ANY iOS browser on iOS<=16.3,
not just Safari. iOS Chrome/Firefox are WebKit skins (Apple mandate) reporting
a non-Safari UA family, so the Safari-only check missed them and they still got
the blank modern SPA. Added an os.family=="iOS" + version<=16.3 branch.
2. Inject static social-login <a> links (Continue with Google/GitHub/Facebook ->
/source/oauth/login/<slug>/) into the SFE shell (flow-sfe.html). The SFE
architecturally can't render Identification-stage sources (authentik docs), and
emo's account (emil.barzin@gmail.com) is Google-only with NO password — so the
SFE's username/password form was a dead end. The links are plain redirects that
work on any browser. Slugs are static; re-verify on source changes.
Tag -> 2026.2.4-patch3; values repoint + docs land once GHA builds it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a "FIRE Countdown" section to the wealth Grafana dashboard plus a monthly
CronJob that computes the targets it reads.
Viktor wanted a £ countdown to retirement in today's money, per life-case
(Solo / Household / Family) and per country, with progress, a projected date,
runway, and his safety guardrails — so he can see how close he is to FIRE
(ideally lean) without ever coming back to work.
- wealth.json: new country / with_home / savings_per_year template vars + a
per-Case row (target NW at the 99% GK bar, progress gauge, still-needed,
projected FIRE date, runway) and safety-valve panels (re-entry trigger vs
£1.0M, 2.5yr cash buffer, pension tranche @57, Anca-bridge note). Reads
fire_planner.fire_target via the fire-planner-pg datasource (Mixed).
- fire-planner stack: fire-planner-fire-targets CronJob (monthly, 2nd 10:00
UTC) runs `recompute-fire-targets --countries all`.
Targets come from the solver shipped in fire-planner edb4d11.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Old Safari/WebKit (<=16.3, e.g. iPadOS<=16.3) can't parse authentik's modern
ES2022 flow SPA and gets a COMPLETELY BLANK login — exactly what emo's iPadOS-15.8
iPad hit. authentik already ships a no-JS Simplified Flow Executor (SFE, ES5) and
serves it via compat_needs_sfe(), but only for IE/old-Edge/PKeyAuth. Extend that
to old Safari so those clients get the REAL authentik login (password + MFA +
reputation, identity preserved — NO auth downgrade, no new credential store).
Chosen over a Traefik basic-auth fallback after an adversarial review: that route
would put a single, spoofable-UA password in front of vbarzin->wizard (passwordless
root on the cluster-controlling devvm) — an MFA->single-factor path to cluster root.
SFE keeps full authentik auth and is generic for any old browser.
Shipped as patch #2 in the existing overlay image (patch-compat-sfe.py — guarded:
asserts the upstream anchor + ast-parses; verified against the live interface.py).
Tag -> 2026.2.4-patch2; the values repoint lands once GHA builds the image.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):
1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
push. The two applies race each other for the per-stack PG state lock →
"Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
whole pipeline with no retry.
Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
(they live on repo 1), so we de-dup the apply without deactivating the
registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.
Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).
Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`homelab vault` only spoke to Vaultwarden (the password manager), but the
name reads as HashiCorp Vault (the infra secrets store — actually OpenBao
here). Make the two unmistakable and support both.
Distinction (no breakage — the existing Vaultwarden verbs are unchanged):
- bare `homelab vault` help now LEADS with the two-stores split;
- every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`;
- HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group.
New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store):
- `kv get <path> [--field K]` — read; --field → one value (TTY-aware
clipboard/stdout), no field → full secret JSON (refuses a bare TTY).
- `kv list <path>` — list sub-paths (no values).
- `kv put <path> <key>` — write one key; value via stdin (piped or
no-echo prompt, never argv); creates the path or merges (never
clobbers siblings; uses kv patch -method=rw so no `patch` cap needed).
Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token /
$VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to
claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR
but never inject the scoped token. Access is whatever the policy grants.
Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg
builders/kvGet/List/Put; file header documents the credential split).
CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2
envelope strip, help names both systems. Verified e2e against live Vault
(read key-names-only + a scratch put/merge/cleanup).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked that emo be able to edit his own secrets with full access.
emo's personal-emo policy was read-only (read on data, read/list on
metadata), so he could view but not change his personal secrets.
Widen it to the same self-service capability set every namespace-owner
already has over their own tree: create/read/update/delete/list on
secret/data/emo(+/*) and list/read/delete on secret/metadata/emo(+/*).
Scope is unchanged — still only emo's own secret/emo subtree, still a
named exception that does not widen the power-user tier in general.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified
anonymously pullable). Point global.image at it (repository + tag pinned
explicitly so neither helm's appVersion default nor Keel can downgrade it — the
2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so
Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the
Dockerfile FROM + this tag together on each authentik version bump.
Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so
the cold login-flow first-load stops being slow. (Does NOT affect old-browser
clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable
server-side.) Docs: .claude/CLAUDE.md Authentik row.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The login flow's identification stage runs a bare select_subclasses() that
LEFT-JOINs every Source subtype table — ~1.4s server-side on every cold login
(verified live: 1527ms vs 14ms). Narrow it to only the subtypes that render a UI
login button (oauth/saml/plex/telegram/kerberos — not the sync-only ldap/scim),
via django-model-utils string accessors so no import is needed. Byte-identical
output, ~100x faster, robust to adding new login source types.
Shipped as a thin overlay over the official image (mirrors the diun/excalidraw
precedent): stacks/authentik/Dockerfile (FROM ghcr.io/goauthentik/server:2026.2.4
+ a guarded sed) built by .github/workflows/build-authentik.yml -> ghcr.io/
viktorbarzin/authentik-server:2026.2.4-patch1. The values repoint + Keel freeze
land in a follow-up commit once the image is built. Upstream bug still present in
main (no fix/PR) — drop this overlay once upstream narrows the query.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`bw unlock` only decrypts the LOCAL cache, so a persisted (already
logged-in) session served stale data — a password changed in the web
vault wouldn't appear until the next fresh login. Add a best-effort
`bw sync` in openSession (the chokepoint every read shares: get, get
--all, list, code, status), so reads reflect current server-side values.
Best-effort by design: a transient sync failure warns on stderr and
falls back to the cached vault rather than failing the read (an AFK
agent shouldn't break on a network blip). status keeps its own explicit
sync so a reachability failure still surfaces in its report.
CLI v0.10.1. Tests assert the sync runs after unlock and before the read,
and that a read still succeeds when sync fails.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect the classification change in the operational runbook: the gate's three
refusal classes (actionable/waiting/pinned), held wins on a mix, refusals now
Complete cleanly (no Failed Job), k8s_upgrade_held gauge + the deliberate
no-alert-for-held, the dropped K8sUpgradeChainJobFailed suppression clause, the
nightly report ⏸️ HELD outcome, and the detector's silent nightly re-evaluation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.
compat-gate.py now classifies each blocker:
- ACTIONABLE: a newer addon version in addon-compat.json supports the target
-> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
nightly report).
- WAITING-on-upstream: no released version supports the target yet -> held.
- PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.
Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.
nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).
Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`homelab vault get` could only fetch one of five allow-listed fields and
had no way to see what fields an item even has — in particular it could
not reach arbitrary user-defined custom fields. Add a `--all` flag that
dumps the whole item as a normalized JSON object
(`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a
Claude session can discover and read every field, custom ones included,
in a single call.
Security model preserved:
- Like `get --json`, the dump is all secret values, so it refuses a bare
TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout.
- The TOTP *seed* is reduced to a presence flag (`"totp": true`) and
never emitted — the seed is more powerful than a one-time code, so the
only seed-derived path stays the specially-audited `vault code`. Tests
assert the seed and password-history never appear in the dump.
- Op-log uses a distinct `get-all` verb (item name still never logged) so
a bulk dump is distinguishable from a single-field read.
`normalizeItem` is a pure, unit-tested core; `getItem` is the
session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog,
onboarding runbook, design spec §16.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked
every night for the 1.36 target, even though the block is unactionable: no
kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned
(NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell
'we can fix this' apart from 'nothing to do but wait', and stop the nightly
Failed-Job + alert noise for the latter.
This documents the design: classify each blocker as actionable / waiting-
upstream / pinned, keep the alert only for actionable, quiet the held case to
the nightly report, and make deliberate gate decisions Complete cleanly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Makes the goldmane_edges east-west trail (ADR-0014) reachable during incident
investigations without remembering the DB/creds/SQL. New top-level verb:
homelab edges --ns <ns> edges touching <ns> (either direction)
homelab edges --src/--dst <ns> directional egress / ingress peers
homelab edges --peers-of <ns> distinct peer namespaces of <ns>
homelab edges --new-since 24h first seen since a duration or date (YYYY-MM-DD)
homelab edges --denied only action='deny' (blocked / lateral movement)
homelab edges --json --limit N machine-readable / row cap (default 200)
Filters render to a single read-only SELECT against the `edge` table, run via
the dbaas CNPG primary pod (same exec path as `k8s db`). Namespace values are
validated to the k8s name charset (injection guard) before they reach SQL.
TDD: edges_test.go covers flag parsing, query building (each filter, AND
combination, peers-of shape, JSON wrapper), the new-since duration/date parser,
and namespace-validation / injection rejection. Smoke-tested live: --peers-of,
--new-since 24h, --denied, and --json all return correct rows.
Docs: runbook query section now leads with the CLI; cli/README gains a v0.9
section. VERSION v0.8.2 -> v0.9.0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
emo's Claude Code sessions hit "UserPromptSubmit hook error" on almost every
prompt. Root cause: the homelab-memory-recall.py UserPromptSubmit hook runs
`homelab memory recall <prompt>` and strict-decodes its stdout. printMemories
truncated each memory's preview with a BYTE slice (c[:240]), which cuts through
the middle of a 2-byte Cyrillic character and emits invalid UTF-8 (a dangling
0xd0 lead byte). The hook's subprocess.run(text=True) then raised
UnicodeDecodeError — not caught by its `except (TimeoutExpired, OSError)` — so
the hook exited non-zero and Claude surfaced the error. It is Cyrillic-specific
(ASCII has no multibyte chars to split), so it bit emo (Bulgarian prompts) every
turn while English users almost never saw it.
Two-layer fix:
- cli: truncatePreview() now counts RUNES, not bytes, so the preview never
splits a character. Regression test asserts valid UTF-8 on a long Cyrillic
string. Fixes the root for every consumer of `memory recall` / `memory list`.
- hook: subprocess.run gains errors="replace" and the except is broadened to
honor the script's own "best-effort, exit 0" contract — so a truncated or
otherwise odd payload can never again surface as a hook error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.
Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.
Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)
Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
blank screen.
Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).
- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
mirroring the existing health/tripit carve-outs). The authentik / and /static
ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
`traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
Narrowed it to keep the counter while still dropping the high-cardinality
`*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
for the episodic all-3-server-pods-NotReady 502/503/504 cascade.
Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The onboarding runbook's "rebuild the binary" command stamped the version
from `git describe --tags --always`, but setup-devvm.sh stamps it from
`cli/VERSION`. The v0.8.1 tag is no longer reachable from master, so the
describe form silently produced a bare commit sha — diverging from what a
provisioner reconcile stamps. Match the canonical source.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Setting up emo's Bitwarden access via `homelab vault`, his one-time
`homelab vault setup` failed with an opaque "exit status 2". Two latent
CLI bugs, both of which any non-admin AFK invocation can hit:
1. The CLI set VAULT_TOKEN but never VAULT_ADDR, relying on the ambient
value. It IS in /etc/environment (login shells), but emo runs his
agents from long-lived tmux / non-login shells that never sourced it,
so every `vault` child hit the 127.0.0.1:8200 default -> connection
refused. claude-auth-sync already self-defaults VAULT_ADDR; the CLI
now does the same.
2. Token precedence was env > ~/.vault-token > scoped. A power-user who
ran `vault login -method=oidc` carries a read-only ~/.vault-token
(policy `default`, capability `deny` on their workstation path), which
shadowed the purpose-built scoped token -> 403 permission denied on
the user's OWN path. This tool only ever touches
secret/workstation/claude-users/<user>, which the scoped token covers
exactly, so precedence is now env > scoped > ~/.vault-token. Verified
the scoped tokens for both emo and wizard hold create/read/update on
their own paths, so admins are unaffected.
Also stop swallowing the shelled `vault`/`bw` stderr: errors now carry
the real message (connection refused / permission denied) instead of a
bare "exit status N" — without that, (1) and (2) were indistinguishable.
Verified end-to-end as emo (VAULT_ADDR unset + his read-only
~/.vault-token present): writeCreds now succeeds.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)
Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
(calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-
connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
and does not restart.
Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A heavy user (emo) runs 8+ always-on `claude` agents + their t3-serve instance,
all sharing one ~/.claude/.credentials.json. When the shared access token expires
the processes refresh simultaneously; OAuth refresh-token rotation makes the
losing writer persist an EMPTY refresh token, logging the user out roughly every
access-token lifetime (~8h). Re-issuing the credential never sticks — the race
recurs (this is why emo's "standalone token" fix kept regressing).
Fix: an opt-in, per-user, non-rotating setup-token (sk-ant-oat01, ~1y, scope
user:inference) kept in the user's OWN Vault path (field `setup_token`).
claude-auth-sync materializes it to a user-owned
~/.config/claude-auth-sync/claude-oauth.env and, while it is present, SKIPS the
rotating-credential validate/backup/restore (so no false
WorkstationClaudeAuthInvalid). start-claude.sh and t3-serve@.service load it as
CLAUDE_CODE_OAUTH_TOKEN, so every session of that user uses the non-rotating
token and there is nothing to race on.
Fail-safe + opt-in: with no `setup_token` in Vault, every path is a no-op, so
users on the normal per-user Enterprise-SSO flow are unaffected. This is each
user's OWN identity, never the forbidden shared CLAUDE_CODE_OAUTH_TOKEN. Runbook
documents enable/disable/rotate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The prior commit set the limit to 10Gi, but the shared tier-defaults
LimitRange caps per-container memory at 8Gi, so the rollout's new pod was
forbidden (FailedCreate) and paperless was briefly down. 8Gi is ample for
6 workers anyway (4 workers measured ~1.3Gi under full OCR load). Restored
service live via kubectl patch; this commit matches TF to the live 8Gi so
drift detection won't re-revert it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Emo's ~13.7k-document import was going through the API upload path, which
stages each file on the pod's EPHEMERAL scratch before queuing it. Any
paperless pod or redis restart therefore destroyed all in-flight work
(the "File not found" failures we hit) and required manual re-uploads.
Move bulk ingest to paperless's consume directory placed on the encrypted
PVC, with PAPERLESS_CONSUMER_POLLING so the whole folder is re-scanned
periodically (and on startup) with a file-stability check. Files now live
on durable storage and survive any restart — the folder is the queue and
self-heals, so we can copy everything in fast and let it process over
time with zero retry/integrity risk. RECURSIVE preserves the source tree
(avoids basename collisions); owner+tag come from a consumption workflow.
Bump TASK_WORKERS 4->6 to speed the OCR/convert-bound processing (node6
has the core headroom for one pod) and mem limit 8->10Gi for the extra
workers. Revert workers/mem/consume envs to defaults once the import ends.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bottleneck found: single Gotenberg 503s under concurrent workers (office docs
failing + slow). Cluster is otherwise idle (sdc 0.5% util, etcd ~1/min), so:
- Gotenberg 1->3 + Tika 1->2 (Service load-balances; fixes the 503s, parallel
office conversion).
- paperless TASK_WORKERS 2->4, THREADS_PER_WORKER 2->1, mem limit 4->8Gi (avoid
OOM with 4 concurrent OCR). Requests kept low to stay within tier-quota
(requests.memory 3840/4096Mi).
- PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text: skip redundant archive for born-
digital/office docs (big IO saver for the work-doc set).
Guard + etcd watch stay in place; revert to defaults after the import.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Getting Started portal only walked through the heaviest path (local VPN + kubectl + Vault + sops install) and never mentioned the two zero-setup routes that users actually reach first. Restructure onboarding to lead with all three, recommendation first: (A) the t3 web terminal, which drops you into a ready shell with kubectl/Vault/repos preinstalled; (B) the k8s web dashboard, auto-authenticated per user; and (C) the existing own-machine setup. Flag the dashboard/terminal as the fallback when CLI OIDC login is unavailable, reframe the misleading home-page 'VPN required' banner (only path C needs it), add the access endpoints to the service catalog, and fix a stale Vaultwarden URL (was vault.viktorbarzin.me, which is actually HashiCorp Vault; Vaultwarden is vaultwarden.viktorbarzin.me).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Emo's ~13.7k-doc import is OCR-bound on a single celery worker (~10s/doc =
multi-day). Bump PAPERLESS_TASK_WORKERS=2 + THREADS_PER_WORKER=2 for ~2x
throughput, and the memory limit 2Gi->4Gi to fit two concurrent OCR jobs.
Kept deliberately modest: archive writes hit the shared sdc HDD that etcd
also lives on (IO-storm risk, code-oflt) — watch etcd apply latency and
revert workers to 1 if it degrades. Revert to defaults once the import done.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ASCII flow of the migrated plotting-book pipeline (GHA build in Anca's
repo → private ghcr.io/passionprojectsanca/book-plotter → Woodpecker
redeploy hook → in-cluster pull via ghcr-credentials), plus the Kyverno
admission / Keel backstop / Vault pull-cred supporting cast and the
serving path. Appended to the existing plotting-book section.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The nightly drift-detection cron and every vault-touching push apply have
been failing because CI runs terragrunt plan/apply on the Tier-0 `vault`
stack, which manages Vault's own transit mount + ACL policies. The CI
`ci` Vault role intentionally lacks those admin perms (sys/mounts,
sys/policies/acl), so the run always errors:
- apply: 403 on vault_mount.transit + vault_policy.personal_emo, plus an
Invalid for_each (local.k8s_users from secret/platform is deferred)
- drift: terragrunt plan exits 1 → fails the whole nightly run
vault is Tier-0 = human-applied via OIDC. Skip it in both pipelines:
- default.yml: skip `vault` in the platform-apply loop (kept in
PLATFORM_STACKS so the app-stack detector still excludes it)
- drift-detection.yml: skip `vault` in the per-stack plan loop
- ci-cd.md: document the exclusion on both pipeline rows
Found during a CI-health sweep (user reported many failures): GitHub
Actions all green; all Woodpecker repos green except this recurring
infra-repo failure, doubled by the legacy repo-1 + repo-82 dual
registration.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Anca's plotting-book app now builds its image in her own GitHub repo to
the private package ghcr.io/passionprojectsanca/book-plotter (off public
DockerHub viktorbarzin/book-plotter). Wire the cluster to pull it:
- stacks/plotting-book: point the deployment baseline image at the ghcr
package and add imagePullSecrets {ghcr-credentials} so the pod can pull
the private image (the live tag is still CI-owned via ignore_changes).
- stacks/kyverno: add the plotting-book namespace to the ghcr-credentials
allowlist so the Kyverno generate policy clones the pull secret into it.
Verified the shared ghcr_pull_token (Viktor, repo-admin on Anca's repo)
can read the private package before wiring this.
Docs: correct ci-cd.md (it wrongly listed plotting-book as already on
ghcr — it was on DockerHub) and note the special arrangement; amend
ADR-0003 to record that this GitHub-first repo builds to its own org's
ghcr namespace.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the backoff_limit=0 + CF-429 soft-skip hardening alongside the
cf-sync architecture description, with the why (the backoff_limit=2
retry-storm that escalated Cloudflare's Lists-API throttle into a stuck
state). Follow-up to 5b49634f.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The edge-ban sync was failing every 2 min on Cloudflare HTTP 429
(rate-limited) and never recovering, leaving the crowdsec_ban list empty.
Root cause: backoff_limit=2 made k8s re-run a failing pod up to 3x within
seconds, so each */2 cycle fired a burst of POSTs into Cloudflare's
per-60s Lists-API write limit. That kept the throttle perpetually tripped
(it stopped clearing even after minutes of quiet) — a self-inflicted DoS.
Two changes make the sync gentle and self-healing:
- backoff_limit 2 -> 0: one attempt per */2 cycle (the schedule IS the
retry cadence), no rapid-fire burst.
- lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next
cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s
retry. Any other CF error still fails loud.
Found during a cluster health check (AIOStreams CSI + pfSense SSH issues
handled separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
emo (power-user tier) had no Vault policy granting his personal secret
path, so `vault kv get secret/emo` failed. Viktor asked to give him that
access. Adds a read-only `personal-emo` policy (read on secret/data/emo +
metadata) and attaches it to emo's OIDC identity by adopting the
entity/alias Vault auto-created on his first login. Scoped explicitly to
emo; does not widen the power-user tier (which stays secret-less).
Verified live: a personal-emo token reads secret/emo, is denied writes,
and is denied other paths (secret/viktor -> 403).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Permissions-test change requested by Viktor: slow Emo's Sofia photo-frame
slideshow from 30s to 45s per image.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Second ImmichFrame instance cloned from the London frame (frame.tf), scoped to Emo's Immich account (emil.barzin) with Sofia weather coords and last-2-years photos. Drives Emo's Meta Portal Mini in Sofia via the portal-immich-frame app. Dedicated API key minted on Emo's account and stored in Vault (secret/immich -> frame_api_key_emo).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Emo's import scope now includes his work-PC document set (C/Documents,
Project Management, Service & MRO, etc. on the NAS), which is ~4.9k Office
files (.doc/.docx/.xls/.xlsx/.ppt/.pptx) on top of Emo shared. Paperless
can only archive/OCR/index those if it can convert them, so add the standard
Apache Tika (text+metadata) + Gotenberg (-> PDF) sidecar deployments + their
services in the paperless-ngx namespace and point PAPERLESS_TIKA_* at them.
Pinned images (gotenberg 8.25, tika 3.3.1.0), single replica, no PVC.
Total in-scope document set across all NAS locations is now ~13,700 PDF+Office
files / ~13.7GB source (~30GB once OCR'd + archived), so raise the data PVC
autoresize ceiling 30Gi -> 80Gi for comfortable headroom. The topolvm
autoresizer grows on demand up to the ceiling.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Preparing Paperless for Emo's document import from the NAS. His archive is
Bulgarian (Cyrillic) + English, but OCR was English-only (tesseract had no
'bul' pack and PAPERLESS_OCR_LANGUAGE was unset/defaulted to eng), so scanned
BG documents would OCR to garbage and be unsearchable. Add bul to the install
list and set OCR_LANGUAGE=bul+eng.
Also raise the data PVC autoresize ceiling from 5Gi to 30Gi: everything
(originals + archive via PAPERLESS_MEDIA_ROOT=../data) lives on the single
encrypted PVC, and the ~2.7GB in-scope import would blow past the 5Gi cap
mid-ingest. The topolvm autoresizer grows the volume on demand up to the
ceiling; 30Gi gives ample headroom.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two remaining gaps to let non-admins (emo) use `homelab vault`:
- setup-devvm.sh installed `@bitwarden/cli` only when `command -v bw`
failed, which an admin's own ~/.local/bin/bw satisfied — so the
system-wide copy was never installed and non-admins had no `bw`
backend. Install to the npm /usr prefix and guard on the system path
(/usr/bin/bw) instead.
- Add docs/runbooks/homelab-vault-onboarding.md (per-user setup, the
shared Organization/Collection flow for sharing passwords, admin
deploy + verification, security model) and repoint the two code
comments that cited a design-spec path which never existed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`homelab vault` was effectively admin-only: two bugs blocked every
non-admin (e.g. emo) from using it for their own Vaultwarden vault.
1. Token: the CLI relied purely on ambient `vault` auth (~/.vault-token
/ $VAULT_TOKEN), which only admins have. Non-admins carry a scoped
token at ~/.config/claude-auth-sync/vault-token (policy
workstation-claude-<user>). Add ensureVaultToken(): explicit env >
~/.vault-token > scoped fallback, wired into every vault verb. Admins
are unaffected (their ambient token wins).
2. Write capability: `homelab vault setup` used plain `vault kv patch`,
which needs the `patch` capability the scoped policy does not grant
(only create/read/update) — so setup 403'd for non-admins. Switch to
`kv patch -method=rw` (read-modify-write; same approach
claude-auth-sync already uses), with `kv put` only when the path
doesn't exist yet. Preserves co-located keys (claude_ai_oauth_json).
Enables onboarding emo onto the per-user Vaultwarden access tool.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pipeline #366 (the SHA-pin apply, commit 7b4a8ba8) was SIGKILLed mid-apply by
Woodpecker cancel-previous when I pushed the next commit (#367, docs) while it
was still running — the apply log ends at '[chrome-service] Starting apply...'
with no 'Apply complete!', so the terraform state write did not finish. The live
deployment is correct (image = the supervised SHA, verified, self-healing), but
the stored state may be stale; this commit re-triggers a clean changed-stack
apply to reconcile it (comment-only change → 0 resource changes, no rollout).
Also adds a caution to the novnc image comment: after bumping the SHA, WAIT for
the apply pipeline to finish before pushing again (memory id=1957).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to change the infra apply guidance: instead of 'never apply
locally, always rely on CI', the policy is now 'you MAY apply locally, but
always commit the change to the infra repo'.
- .claude/CLAUDE.md (Critical Rule: Terraform Only): new bullet making local
apply explicit (scripts/tg apply / homelab tf apply) from the MAIN checkout
(not a worktree — git-crypt'd tfvars read as ciphertext there), with a hard
requirement that every applied change is committed + pushed to master the same
session so the repo stays the source of truth and CI drift-detection doesn't
revert it. Spells out the apply<->commit ordering both ways.
- AGENTS.md (non-admin workstation land steps): step 5 now notes local apply as
an option alongside CI auto-apply, with the same 'always committed, never
applied uncommitted' rule.
Note: the org-managed settings block also frames CI auto-apply but is not
editable from a workstation clone.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deploys the self-heal fix from the previous commit. Keel is off for this
deployment (keel.sh/policy=never, because the browser container's playwright
image is version-pinned to f1-stream) and the novnc image was :latest with
imagePullPolicy=IfNotPresent, so a rebuilt :latest would NOT be re-pulled on a
rollout — the supervised entrypoint would never reach the running pod.
Pin novnc to :19d0f0933a (the build of the prior
commit; ghcr digest sha256:5b783ac6, == :latest) so the stack apply rolls the
sidecar onto the new image. Future novnc entrypoint changes deploy by bumping
this digest after build-chrome-service-novnc.yml publishes a new SHA tag.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc
sidecar) attaches to the browser container's Xvfb over localhost:6099, and when
that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X
connection and exited. Because the entrypoint ran x11vnc as an unsupervised
background child and then exec'd websockify as PID 1, the dead x11vnc was never
relaunched — :5900 stayed dead (a defunct zombie), websockify kept returning
'Connection refused', and the view was black until a manual pod restart.
Fix: the entrypoint now runs both x11vnc and websockify as supervised background
children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts
the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge
now self-heals across browser-container restarts. Mirrors the android-emulator
stack's supervision pattern. Architecture doc updated with the new failure mode,
diagnosis, immediate-recovery, and SHA-pin deploy note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT
ATS, the Барзини → Статус dashboard), and only about the cluster when it's
breaking those. Rewrite his vendored cluster-health into an ha-sofia-focused,
read-only variant:
- leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the
cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only;
- fixes the script path to emo's own clone (/home/emo/code) — he can't read
wizard's tree — and runs it --no-fix (he's cluster read-only);
- loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45)
actually run for him; documents the host-SSH/Vault checks that skip;
- triages: cluster FAIL/WARN matters only if on his chain; everything else is
a one-line "admin's area"; escalate via /file-issue since he can't fix.
This snapshot copy is now an emo-specific variant, intentionally diverged
from the canonical 47-check admin skill — README updated to say "do not
re-sync from canonical".
Token: a dedicated long-lived HA token (client_name emo-cluster-health) was
minted on ha-sofia via the admin account and stored emo-readable at
/home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope
(HA only mints tokens for the authenticating account); independently revocable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make cluster-health a user-global skill for emo (the lone entry in the
provisioner's SKILL_USERS allowlist), so it's available from any directory
— not only when working inside the infra clone where it already exists as a
project skill (.claude/skills/cluster-health). install_skills() in
t3-provision-users.sh copies the vendored snapshot into ~/.agents/skills/ and
symlinks ~/.claude/skills/, so this is the durable, rebuild-surviving path.
cluster-health is homelab-local (vendored from this repo's own
.claude/skills/), unlike the other snapshot entries which mirror upstream
mattpocock/skills + vercel-labs/skills; README documents its provenance and
the explicit re-sync step so the vendored copy doesn't silently drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.
Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value
was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
.claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
"invite the app / flip back to #security" caveats and state the
#security abandonment + #alerts consolidation as the current routing.
Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The SSO restore script backed up the live manifest with
`cp "$MANIFEST" "$MANIFEST.bak.$TS"` — i.e. INSIDE /etc/kubernetes/manifests/.
The kubelet treats every file in that dir as a static pod, so the .bak became a
SECOND kube-apiserver static pod. While both copies were identical it was
harmless, but the instant `kubeadm upgrade` changed the real manifest's image to
v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped
(pod attempt count hit 13) — the new apiserver never stabilised, so kubeadm timed
out on "static Pod hash did not change after 5m" and rolled back. THIS was the
real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a
downstream symptom of the flip-flopping apiserver hammering etcd).
Fix: write backups to a dedicated dir OUTSIDE the static-pod dir
(/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The
stray .bak that planted the landmine on 2026-06-18 was moved out manually
2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh,
which is the same script) from ever re-creating it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The postiz-postgres-backup CronJob still dumped from the chart's bundled
`postiz-postgresql` host with a hardcoded `postiz-password`. That bundled
PostgreSQL was removed when postiz migrated to the shared CNPG cluster, so
the host no longer resolves (NXDOMAIN) and every nightly run failed —
firing BackupCronJobFailed, and leaving the postiz DB with no logical dump
in the offsite pipeline.
Connect via the app's own DATABASE_URL (from the postiz-secrets Secret,
postgresql://postiz:…@pg-cluster-rw.dbaas.svc.cluster.local/postiz) instead
of a hardcoded host/user/password, so the backup tracks the live DB and
credentials. Verified with a one-off test job: psql + pg_dump 16.4 connect
to CNPG 16.9 and produce a 180K custom-format dump.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kube-state-metrics had no explicit resources, so the monitoring-namespace
LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles
around 45Mi but momentarily spikes past 256Mi during a full object relist
(450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM
blacks out the KSM-exported series that ~10 alert rules read, so they all
fire false "<svc>Down" criticals at once and self-resolve when KSM recovers
~5 min later — exactly the alert storm seen at 2026-06-26 08:42 UTC.
Set explicit Burstable resources: keep the request low (64Mi, just above
idle) so we don't reserve memory we don't use, and raise only the limit to
512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cas_backup did `vault kv put secret/workstation/claude-users/<user>`, a full
KV-v2 replace that rewrote the document with only its 3 OAuth keys. Because
`homelab vault setup` co-locates the user's vaultwarden_* credentials on that
same path, every six-hourly sync silently deleted them — so `homelab vault`
reported "not configured" within hours of each setup. (Reported as: homelab
vault "keeps getting reset / logged out", set up 3 times.)
Switch the backup to a merge: `kv patch -method=rw` (read+update, needs no
`patch` capability) when the path exists, and `kv put` only to create it on the
first backup. Add a regression test with a fake vault asserting a pre-existing
sibling key survives a backup, and document the merge requirement in the
renewal runbook.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor (owner) wants agents to stop refusing file reads the OS already permits. wizard holds passwordless root ((ALL) NOPASSWD: ALL), so the managed-settings rule 'never read another user's ~/.claude' was stricter than the OS itself. The managed-settings policy (/etc/claude-code/managed-settings.json) was updated out-of-band to defer to OS/sudo authorization with no extra prompt; backup kept at .bak-2026-06-26. This ADR records the decision, its symmetry across sudo-holders, and the larger blast radius.
ADR-0011's usage-telemetry design is unchanged; only the cross-user privacy norm it referenced is superseded. The original ask was to delete ADR-0011 — superseded instead to preserve the audit trail and the ADR-0012/0013 references.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 2026-06-22 external-secrets v1 migration made the ESO controller the
server-side-apply owner of .spec.refreshInterval on every ExternalSecret, so any
stack defining one via kubernetes_manifest fails `terraform apply` with a
field-manager conflict the next time it's applied (instagram-poster + grafana hit
this on 2026-06-24; it was latent across the whole fleet). Add
field_manager { force_conflicts = true } to all 101 remaining ExternalSecret
manifests across 70 stacks, matching the fix already on grafana / woodpecker /
traefik / k8s-version-upgrade / instagram-poster. TF and ESO set the same value,
so it's stable (no perpetual drift). Defuses the landmine before each stack's
next apply trips it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):
- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
(the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
(prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
#62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
edge table (feeds code-8ywc; enforce-flips out of scope).
Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).
A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.
Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.
Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.
Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.
Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
(kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
diff` and BLOCKS+alerts (never drains the master) if --authentication-config
would still be dropped.
- Post-mortem + runbook updated.
The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.
Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).
Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
Turns on Calico 3.30's native east-west flow observability so we can see which
Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs
directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the
Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist
and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker
notifications=Disabled so the UI doesn't call the external Tigera endpoint.
Applied supervised: creating the Goldmane CR re-rendered calico-node with the
FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual
FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy,
goldmane is receiving flows from all nodes, Whisker UI serves.
Durable Loki persistence is NOT included here: the Goldmane emitter is Calico
Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override
only name+resources, not env), so a durable trail needs a small custom gRPC
consumer of goldmane:7443 — tracked in issue #58.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a Loki ruler group (lane=security -> #security) for the homelab vault
op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and
VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine
(Vault audit device, reads of secret/data/workstation/claude-users/*) is
already captured. True CLI-bypass detection needs cross-stream correlation
(follow-up).
C1 (critical): setup wrote the master password + API client_secret as
`vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to
same-UID processes. Now written via stdin (key=- form); only email +
client_id (non-credentials) remain in argv.
I1: `get --json` refused on a TTY (was dumping the secret to scrollback).
M1: vaultLock now holds the per-user flock (it mutates bw state).
M4: bw login-detection parses status JSON instead of substring matching.
M5: clipboard path refuses when stderr is not a TTY (was silently failing).
M6: realRunner trims only trailing newline, preserving secret whitespace;
secret prompts likewise.
Adds security-property tests: no secret in argv across the get flow,
clipboard decision matrix, --json TTY gate, bw status parsing.
Records the design reached in a /grill-with-docs session: how to track which
Service talks to which as more Services are added, using k8s-native options.
Decision: service identity = the workload's namespace (primary) plus a
`service-identity` label only in the few multi-Service namespaces; east-west
observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7,
currently disabled) emitting to Loki for a durable trail; enforcement reuses the
existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and
a service mesh / mTLS / SPIFFE rejected — the trust model needs attribution-grade
forensics on a trusted, etcd-constrained cluster, not cryptographic
non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit
flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy
enricher) are recorded with rationale.
Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Caught live-testing the previous commit: every sleeper run exited 141
(SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause:
`set -o pipefail` + `dumpsys power | awk '...; exit'` — awk closes the pipe
after the first match while `kubectl exec` is still streaming dumpsys, so
the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the
script before any echo. (My earlier dry-run missed it because it didn't run
under `set -euo pipefail`.)
Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with
awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and
a failed/booting exec falls through to the fail-safe "do not sleep" branch.
Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe
to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the
`sh -c` wrapper.
Verified live: ran the corrected script as the gate ServiceAccount against
the stuck emulator (idle ~120h) — it logged "idle >= 6h ... scaling to zero"
and patched the deployment to replicas=0. The 6+ day pod is now asleep.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The emulator was meant to scale to zero when idle but had been up 6+ days
straight despite ~5 days with no real use. Two bugs:
1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC
ports. A forgotten `adb connect` (no disconnect) holds that transport
open forever, so every 15-min run saw "active" and reset the counter --
it never reached the sleep branch. (Right now: 4 such stale transports
from pods on k8s-node3/node4.)
2. Even when it did reach the sleep branch, `kubectl scale --replicas=0`
failed Forbidden -- the gate ServiceAccount can patch `deployments` but
not `deployments/scale`.
Switch the sleeper to measure actual use: time since last user activity
(taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest
uptime. No interaction for 6h -> sleep. This ignores idle/forgotten
connections entirely. Scale down with a direct replicas patch on the named
deployment (same path the wake gate scales up), so it needs only the
existing `deployments` patch grant -- no `deployments/scale`. Now stateless
(drops the idle-counter annotation; gate.py no longer sets it) and lighter
on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep.
Requested by Viktor: turn the dev-only emulator off when it hasn't been
used for 6h.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration
worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan`
with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current
minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort.
That gate worked for patch upgrades but never for minors. Fix: pass the explicit
`v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits
"kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the
ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job.
Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of
the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added
field_manager.force_conflicts=true (benign — interval is semantically identical).
This pattern affects all 104 migrated ESs fleet-wide (follow-up).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi with a ~266Mi
startup spike (re-listing resources to build informer caches), both at/over the
256Mi limit, so the first time the pod restarted it could never finish startup
(exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine — the limit
was always too tight; it only bit once the pod restarted. Data plane was never
affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom
(now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration
(which never touched calico); cluster churn was at most the trigger that exposed
the tight limit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.
Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.
Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Record the new reconcile step alongside install_memory/install_playwright:
vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo),
why it's vendored not npx (upstream drift), and that if-absent keys on the
user's own copy so it heals a stale/cross-user ~/.claude/skills symlink
(emo's grill-me pointed into the admin's home).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up to the vendored-skills change, from verifying the emo rollout:
- The if-absent guard treated ANY pre-existing ~/.claude/skills/<name> entry
as "installed", so a manual cross-user symlink emo already had (grill-me ->
/home/wizard/.claude/skills/grill-me) was skipped — leaving the requested
skill depending on the admin's home instead of emo's own copy. The guard now
keys on the user's OWN copy (a real dir under ~/.agents/skills) and (re)points
the ~/.claude/skills symlink at it, healing a stale/cross-user link while
still never clobbering a real dir.
- install -d left the intermediate ~/.agents owned by root; now owned by the user.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make the admin's Claude Code agent skills available to the `emo` devvm user.
Viktor asked to install Matt Pocock's skills for emo, starting with grill-me
but covering the full set the admin already uses.
The `npx skills` upstream has drifted off that set (diagnose -> diagnosing-bugs
and write-a-skill -> writing-great-skills were renamed; caveman + zoom-out are
no longer published), so reproducing it via npx is impossible and would also
spray ~70 agent dirs into the user's home + add a GitHub-clone + unpinned-CLI
dependency to the hourly root reconcile. Instead vendor a point-in-time
snapshot of the 16 skills (scripts/workstation/claude-skills/) and copy them
per-user, mirroring install_memory: install_skills() copies each skill into
~/.agents/skills/<name> (owned by the user) and symlinks
~/.claude/skills/<name> -> ../../.agents/skills/<name>. if-absent, additive,
best-effort, scoped to the SKILL_USERS allowlist (emo).
find-skills is from vercel-labs/skills (not Matt Pocock) but included since it
is part of the admin's current set.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Point the chrome-service container at the new chrome-service-browser image and
launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes
MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the
noVNC view — bundled Chromium has those codecs compiled out; only real Chrome
carries them. connect_over_cdp callers (tripit fare scrape, homelab browser,
snapshot-harvester) attach over raw CDP (version-tolerant) — validated after
rollout. Image is built off-infra on GHA (prior commit) → public ghcr.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA
build workflow. The bundled Chromium ships proprietary codecs compiled out, so
H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with
MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs
(libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds
the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips
main.tf's launch to it once the image exists + is public.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate
blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1
and v1, so this is the safe window — MUST land before 0.17 removes v1beta1
(there is no conversion webhook). Pure apiVersion bump, schema is byte-identical:
106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database)
across 73 .tf files, v1beta1 -> v1, no other field changes.
Validated live first on tandoor (single, non-coupled, synced ES): the
kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is
cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced
from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods
keep their mounted copy through the sub-second blip. All 110 target Secrets were
snapshotted to /tmp first as a backstop.
CI applies the changed stacks serially (staged rollout); watching aggregate ES
sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest).
Next: Phase 3 climb 0.16.2 -> 2.6.0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The chrome-service noVNC exposes Viktor's live logged-in browser sessions
(Instagram etc. — he'll sign in there for homelab browser to reuse). It was
auth="required" = any authenticated user, and "Home Server Admins" includes emo
(emil.barzin@gmail.com), so the admin group is not a sufficient gate. Add a
host-specific case to the domain-wide forward-auth restriction allowing only
Viktor's accounts (vbarzin@gmail.com + akadmin break-glass); everyone else,
incl. emo, is denied at the noVNC. emo's AGENT already can't reach the browser
(read-only RBAC blocks port-forward); this closes the human noVNC path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The noVNC view showed the browser in the top-left with the rest of the
framebuffer black. Cause: Chrome launched with no --window-size, and there's no
window manager, so it opened at its profile-persisted (smaller) size inside the
1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window
fills the screen on every launch (fresh pods/profiles too). Live windows were
already resized via CDP as a stopgap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.
Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
so the cap applies deterministically on rollout even though the image is
:latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).
Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.
(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
REEL_PROVIDER was unset, so the reel pipeline used FakeReelExtractor, which returns
a CANNED caption — every pasted (tripit #120) or forwarded reel produced a DUMMY
Saved Place instead of reading the real reel. Set REEL_PROVIDER=anonymous in app_env
(covers the web Deployment + the ingest CronJob) so AnonymousReelExtractor does the
real anonymous read. Verified live from the cluster: yt-dlp fetched a real IG /p/
caption (no IG_GRAPHQL_DOC_ID needed — the internal-API path is an optional
optimisation; yt-dlp fallback works). LLM extraction + Nominatim POI geocoding were
already real (prior commits); this was the last fake link in the chain.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The web Deployment ran LLM_MODE=fake with no reel geocoder — only the ingest-plans
CronJob had real providers. The in-app reel-URL paste feature (tripit #120) runs
ingest_reel IN the web pod (BackgroundTask), so the Deployment now needs real
extraction: LLM_MODE=llamacpp (qwen3vl-8b; qwen3-8b segfaults on the current
llama-swap image) with the ADR-0033 claude-agent-service fallback, plus
REEL_GEOCODER_PROVIDER=nominatim for venue->city/country POI geocoding. Set in
app_env (feeds the Deployment; the CronJobs already had these via extra_env). Bonus:
this also un-fakes the in-app booking *share* import, which used the same fake LLM.
MAIL_INGEST_ENABLED stays false on the Deployment (only the CronJob polls mail).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The qwen3-8b GGUF segfaults on load on the current llama-swap :cuda image
("common_init_from_params: failed to create context"; llama-swap returns 502),
which broke ALL tripit mail ingest text extraction — booking emails AND forwarded
reels (status=failed, "no place could be read"). The GGUF isn't corrupt (valid
header, full size, worked for weeks) — it's a llama.cpp/image regression. Rather
than pin the SHARED llama-swap image (cross-user blast radius), repoint the
ingest-plans CronJob at qwen3vl-8b, an already-provisioned 8B model that loads
fine and extracts flight numbers + places reliably. Restores the auto-path
(reels resolve via the Nominatim geocoder; bookings parse again). The broken
qwen3-8b GGUF is a separate, non-urgent llama-cpp cleanup.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Forwarded reels (tripit ADR-0031) geocode their venue to map a Saved Place to a
country + city, but the reel route was wired to the global geocoder, which here is
GEOCODER_PROVIDER=openmeteo (city-level, name-based). OpenMeteo returns nothing for
a venue query like "Time Out Market, Lisbon" so reels never resolved and no Saved
Place was created. The app fix (tripit 3c62d596) gave the reel route its own
geocoder behind REEL_GEOCODER_PROVIDER; set it to nominatim on the ingest-plans
CronJob (the only one running the reel route) so forwarded reels resolve to real
venue coords + city + country. Isolated from the global geocoder, which stays
openmeteo for weather/tours. Verified Nominatim resolves the venue from the cluster.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).
- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
the :9222 NetworkPolicy), assert non-headless via /json/version,
connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
outside the cluster" section.
Closes: code-nepg
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.
Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.
The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.
Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.
This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.
Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).
Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wire-memory-hooks.py now PRUNES any settings.json hook still pointing at the
retired claude-memory plugin (plugins/claude-memory/hooks/) before the additive
pass. install_memory() rm -rf's that dir, so those entries are dangling — and a
missing UserPromptSubmit hook exits 2, a BLOCKING error that erases the prompt
and froze emo's sessions (2026-06-22). The plugin shares basenames with the
homelab hooks, so the old additive-only logic saw the dead plugin path as
"already present" and skipped installing the real ~/.claude/hooks/ copy; pruning
first fixes that. Verified against emo's exact original config: yields the
correct 4-hook set, drops the dead PermissionRequest entry, idempotent on rerun.
auto-learn.py now stores via the `homelab memory` CLI only — dropped the direct
HTTP path and the local-SQLite fallback (memory is homelab-CLI-only per Viktor;
never local files). No-ops silently when no API key is in env (e.g. ancamilea).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-land Phase 2 after the first attempt's two failure modes, both fixed:
- tempo.resources set under the correct single-binary chart key (was OOMKilled on
the namespace LimitRange default when mis-placed at top level).
- atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install
auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479).
Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp ->
redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo
derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
multi-tenancy.md never mentioned the homelab-memory hooks rollout and still
listed claude_memory credential injection as purely "future". Document what is
actually true now: install_memory provisions the recall/auto-learn/compaction
hooks per user, the provisioner binary self-deploys from the repo (step 0), the
set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI
defaults the URL) — emo has a key, ancamilea is keyless until one is minted.
Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing
edits self-deploy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Root cause of emo's lost memory: nothing redeployed /usr/local/bin/t3-provision-users
except the manual setup-devvm.sh, so the homelab-memory rollout (44562535/9aa2438e,
Jun 21) sat committed-but-undeployed for a day — the hourly reconcile kept running the
pre-memory binary and never wired the new memory hooks for emo/anca.
Close the gap the same way the script already treats managed-settings.json and
start-claude.sh (sync_managed_config / deploy_user_launcher): the repo is the
authoring surface. At the top of the run, if the repo copy differs from the deployed
binary, install it and re-exec the fresh one. Guards: a re-exec env flag (no loop),
bash -n (never deploy a broken script), DRY_RUN (no mutation), cmp (no churn when
unchanged). Verified across all four paths in isolation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
install_memory (added in 44562535) ended with `[[ -d <plugin-dir> ]] && rm && log`
and guarded a chmod with a bare `[[ -f settings ]] && chmod`. When the plugin dir
or settings file is absent — the normal case for users who never had the
claude-memory plugin — those return non-zero, and under `set -euo pipefail` the
function returns non-zero and kills the whole hourly reconcile after the FIRST
user, before the rest are processed.
It never fired before because the rollout was committed but the deployed
/usr/local/bin/t3-provision-users was never updated, so install_memory had never
run. On first real run it aborted right after ancamilea, so emo (and wizard)
never got their memory hooks wired — the reason emo's sessions lost memory. Wrap
the cleanup in an if-block, guard the chmod, and end the function with return 0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pipeline #315 failed: tempo-0 CrashLoopBackOff / OOMKilled (exit 137). The
single-binary grafana/tempo chart (v1.24.4) takes container resources at
tempo.resources, not a top-level resources: — so my block was ignored and the pod
fell to the namespace LimitRange default and OOMed. Set tempo.resources explicitly
(req 256Mi / limit 2Gi). tripit + existing monitoring were unaffected throughout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry
spans (Phase 1, already live in prod) export and correlate with logs:
- Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d)
- OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo)
- Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the
Loki datasource (no uid change, so existing dashboards are unaffected)
- tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector
Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline
'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a
local plan as non-admin).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2
(both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces
a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret
ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets
+ empirically validate GC-survival on the first live ES + per-stack two-phase
-target apply (fallback: state rm + import). Corrected the doc's k8s assumption
(cluster is on 1.34; whole climb stays on 1.34, no interleave).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First half of the ESO 0.12->2.6 migration (docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md),
clearing the LAST k8s-1.35 compat-gate blocker. Stepped one minor at a time on
k8s 1.34 (no k8s interleave — cluster already on 1.34, ESO bands are conservative
tested ranges not hard limits): 0.12.1 -> 0.13.0 -> 0.14.4 -> 0.15.1 -> 0.16.2.
Each hop applied + verified: controller healthy, all 108 live ExternalSecrets
stayed SecretSynced (2 pre-existing dead — instagram-poster, payslip-ingest —
missing Vault data, untouched). Added atomic=true + timeout=600 (ESO had no
rollback safety net). 0.16.2 serves BOTH v1beta1 AND v1 (storedVersions now
["v1beta1","v1"]) — the safe window to rewrite all 104 CRs to v1 (Phase 2) before
0.17 removes v1beta1. State auto-committed per hop by scripts/tg (Tier-0 SOPS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The photo date overlay was showing US-style MM/dd/yyyy — ImmichFrame's built-in default when PhotoDateFormat is unset. Viktor wants UK day/month/year ordering instead. Pin PhotoDateFormat to the date-fns pattern "dd/MM/yyyy" (uppercase MM = month; lowercase mm would render minutes). The config map carries reloader.stakater.com/match, so Reloader restarts the immich-frame pod automatically on apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked whether the spend-down numbers were inflation-adjusted —
they were not (all nominal). He chose to switch the card to today's
money, so every row now shows constant purchasing power for life.
Each row is a die-with-zero annuity at the REAL rate (1+g)/1.03−1
(3% inflation), spending a constant inflation-adjusted amount (the
actual pounds withdrawn rise with inflation) until net worth hits £0
at age 100:
• No growth (0%) → £12/day, £370/mo, £4,446/yr (negative real: loses to inflation)
• Inflation (3%) → £43/day, £1,315/mo, £15,776/yr (0% real: holds value)
• Market (7%) → £130/day, £3,942/mo, £47,300/yr (~3.9% real)
Title now flags "(today's £)". Same panel/layout; only the SQL, title,
and tooltip changed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to switch the Immich photo-frame shown on the Portal
kitchen appliance to a 24-hour clock. immichFrame defaults ClockFormat
to 'hh:mm' (12-hour) and we never overrode it, so the frame was showing
12-hour time. Set ClockFormat: "HH:mm" (date-fns 24h token) in the
frame Settings.yml ConfigMap; Reloader restarts the pod on apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted the spend-down card to compare three portfolio-growth
scenarios rather than the previous floor-vs-4%-real pair.
The table now has three rows, each a die-with-zero annuity (drain net
worth to £0 by age 100) spending a constant number of ACTUAL (nominal)
pounds, differing only by the assumed nominal growth rate:
• No growth (0%) → £43/day, £1,315/mo, £15,776/yr (= NW ÷ years)
• Inflation (3%) → £106/day, £3,233/mo, £38,792/yr (NEW)
• Avg market (7%) → £220/day, £6,703/mo, £80,435/yr
This keeps the £43 no-growth floor he anchored on. The old third row
was "4% real" (£133) expressed in today's money; it's replaced by the
7%-nominal market row (£220, actual pounds) so all three rows share one
basis (nominal pounds) and are directly comparable. 3%/7% are hardcoded
(one-line SQL edit). Table height 4→5 for the extra row; panels below
shifted down 1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted the six separate spend-down stat tiles consolidated into a
single, more compact card with the figures laid out as rows.
Replaces stat panels 9220-9225 with one table panel (id 9220) in the
Overview row: 2 rows (Floor / 4% real) × 3 columns (per day / month /
year). Same underlying math and live values (£43/£1,315/£15,776 floor;
£133/£4,039/£48,463 at 4% real). w=9 instead of the full-width tile row,
so it takes ~a third of the width.
Note: this intentionally overrides the "table panels live at the bottom"
layout convention — Viktor chose to keep this headline KPI glanceable at
the top of the dashboard rather than scroll for it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Now that the native-auth rollout is complete: (1) AUTH_MODE hybrid->normal — the legacy Authentik OIDC-bearer + forward-auth arms were removed in #96, and 'hybrid' already resolved to 'normal' via backward-compat parsing; this makes it explicit and corrects the now-false comment. (2) SMTP_FROM plans@->trips@ — the dedicated native-auth sender; the trips@->spam@ send-as alias is live + verified (RCPT 250). (3) TRUST_FORWARDED_FOR=true — so #95's per-IP signup rate-limit keys on the real client behind Traefik, not the shared ingress pod IP. Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this does NOT touch the running image. Reloader restarts the pod to pick up the new env.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted a glanceable number on the Wealth dashboard for how much
he can spend for the rest of his life — spending the whole net worth
down to zero by age 100.
Adds a third line of six stat tiles to the Overview section, two
equations × three cadences (per day / month / year):
• FLOOR — net worth ÷ time remaining to age 100. Treats the money as
cash (no growth, no inflation): a conservative lower bound.
≈ £43/day, £1.3k/mo, £15.8k/yr.
• 4% REAL — die-with-zero annuity: the constant, inflation-adjusted
spend that drains the balance to £0 at 100 while it keeps earning
4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr.
Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04),
computed live so the figures tick as net worth and the horizon move.
Net worth reuses the existing latest-per-account dav_corrected math, so
the tiles always agree with the "Net worth (current)" stat (pension
included; target £0). The 4% real rate is hard-coded per his "keep it
simple, just a number" steer — a one-line SQL edit to change later.
Layout: tiles inserted at y=9; all sections below shifted down 4 rows.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
install_memory ran the JSON-merge helper via 'runuser -u $user', but the helper
lives under the admin's mode-700 home ($WORKSTATION_DIR) which non-admin users
can't traverse -> wiring silently failed for emo/anca (hooks copied but never
wired into settings.json). Run the helper as root (it reads both the repo helper
and the user's home) and chown the result back to the user. Verified by the live
all-users rollout: emo + anca now wired correctly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Roll the wizard MCP->homelab-CLI memory migration out to every devvm user. Adds
install_memory() to t3-provision-users.sh (mirrors install_playwright: per-user,
idempotent, if-absent, as-the-user): installs the 4 memory hook scripts into
~/.claude/hooks, wires them into settings.json additively (wire-memory-hooks.py
never touches env / the per-user MEMORY_API_KEY), and removes ONLY the
claude_memory MCP + plugin if present. Reuses each user's existing key (no
minting; per-user isolation stays deferred per the 2026-06-07 design). The
homelab CLI hits the same remote HTTP API the MCP used; recall runs via the
homelab-memory-recall.py UserPromptSubmit hook. Shared instructions (rules/skills
symlinked from base; root+infra CLAUDE.md) already cover all users.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now
the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared
to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all
104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten
to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE
crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at
a time (no skipping); chart==app version; downstream Secrets survive. 5-phase
ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target
gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the
homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the
`homelab memory` CLI, which hits the same remote HTTP API). Updates the
.claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool
CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo
CLAUDE.md + ~/.claude/rules/execution.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the
autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35.
Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes):
3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified
supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore
(admissions stay open mid-roll) kept it safe. Values schema confirmed compatible
across 3.6->3.8 (forceFailurePolicyIgnore still under features:).
Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller
2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook
live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs
1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x
(v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").
Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.
What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
running version, detector freshness, detected target, outcome (no-op /
blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
`k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
(or any future helper) can't false-trip the chain-wedged alarm.
Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.
Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted the freshness tile to cover all three main holdings
(META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so
the stat renders one value per held position (worst-first), switched the
tile to horizontal orientation so the three values sit side-by-side, and
updated the description. Each value is coloured by its own age threshold
(META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource
change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a
cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware
CRD and the broken Yaegi bouncer plugin can be removed without orphaning any
router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static
config + initContainer download + state.json entry), the captcha template
ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts,
and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the
catch-all error-pages IngressRoute chain (the one remaining CRD-level reference,
which an Ingress-annotation grep does not surface) so that router is not orphaned
when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled
out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) +
Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is
deliberately preserved (still used by paperless-mcp).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.
This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
allowlist comment claimed council-complaints as the last referencer;
rewrite it (no live workload pulls from that registry now; only stale
completed Job records still carry the ref). The allowlist line itself
is kept (registry-scoped, not app-specific).
Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two follow-ups Viktor asked for on the Price freshness panel:
- Expand every section by default. Grafana's collapsed rows hide their
child panels; just flipping collapsed=false leaves a non-canonical shape
(confirmed via the Grafana API that it keeps the panels nested rather
than hoisting them), so each row is now collapsed=false + panels=[] with
its children hoisted to top-level -- the exact form Grafana writes when
you expand-and-save. Row headers revert to their original y (the child
y-coords were already expanded-layout coordinates).
- Stop the freshness stat from taking its own line. It's now the 6th tile
in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4
at y=5; the collapsed-row y-shift from the previous commit is undone.
No query or threshold changes. The large diff is mechanical: 12 child
panels re-indent from nested to top-level.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor was worried about stale prices silently distorting net worth.
Confirmed it's real: META's quote has been frozen at 2026-04-17 (65 days
old) while the dashboard keeps valuing the ~55-share position at that
stale close; the Vanguard ETFs are current. Nothing flagged it.
Adds one compact stat to the Overview row showing the most out-of-date
HELD position's quote age (symbol + humanised age), colour-coded: green
<=4d (weekend/bank-holiday tolerant), amber 5-9d, red >=10d. Pure read of
the quote_latest mirror via the wealth-pg datasource, held positions
only, LEFT JOIN so a held symbol with no quote at all sorts as max-stale.
The six collapsed rows below shift down 4 grid units to make room; no
other panel touched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.
This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).
Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.
- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The first PR1 commit only dropped the ingress_factory reference + the 8
exclude_crowdsec call sites. But the crowdsec middleware is ALSO hard-coded
(not via the variable) in 6 more ingresses that build their middleware chain by
hand: owntracks, the monitoring Helm values (grafana + prometheus +
alertmanager), and the reverse-proxy module + its own separate ingress factory.
Remove all 6 so that after the full-cluster apply NO live ingress references
traefik-crowdsec@kubernetescrd — the precondition for PR2 deleting the CRD.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't
succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".
Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.
Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).
Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Traefik CrowdSec (Yaegi) bouncer plugin enforces nothing on Traefik 3.7.5
(handler never invoked) and is fully superseded by the cs-firewall-bouncer
(in-kernel nftables drop on direct hosts) + the Cloudflare IP-List/WAF rule
(proxied hosts). Drop the `traefik-crowdsec@kubernetescrd` middleware from the
ingress_factory chain and the 8 explicit `exclude_crowdsec = true` call sites,
and delete the now-unused `exclude_crowdsec` variable.
This is PR1 of a 2-phase removal: the reference is removed FIRST (a shared-module
change → full-cluster apply re-renders every ingress without the middleware) so
that PR2 can delete the `crowdsec` Middleware CRD + the plugin itself WITHOUT
leaving any ingress pointing at a missing middleware (which would error those
routers). PR2 MUST NOT land until this has fully applied and zero live ingresses
reference traefik-crowdsec@kubernetescrd.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
One-node validation on k8s-node2 passed: kernel nftables sets created in both
input and forward chains (policy accept), ~31k decisions loaded, a known banned
scanner confirmed in the drop set, pod stable 4h+ with no collateral. Remove the
nodeSelector so the DaemonSet runs on every node — direct-host enforcement now
survives a MetalLB VIP failover to any worker.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The edge CF IP List can't hold the ~31k CAPI community blocklist (already
enforced in-kernel by the firewall-bouncer), so the sync now skips origin=CAPI
and carries only high-signal local/curated decisions (+ a 9000 safety cap).
Also fixes the list-items GET: per_page=1000 returned a misleading CF 400
'invalid or expired cursor' (10027); the endpoint max is 500. Verified live:
crowdsec_ban populates (4 IPs) and the sync exits 0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.
Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
non-interactive ssh to the HA host using the invoking user's key.
Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
(previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
(CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).
Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The first live renewal run showed systemd could not create state beneath a read-only home sandbox. Provision each user's writable state directory before enabling the timer so automatic renewal can run.
The approved Claude token renewal deployment could not run because setup-devvm passed inline package comments to apt as package names. Strip inline comments so the persisted all-user setup remains reproducible.
TripIt's native signup-verification + account-recovery mail (ADR-0028) sends From: trips@viktorbarzin.me while authenticating SMTP as spam@. With SPOOF_PROTECTION on, Postfix smtpd_sender_login_maps requires an EXPLICIT alias (the @domain catch-all doesn't satisfy it) — mirrors the existing plans@->spam@ grant. Must be applied + verified before TripIt flips SMTP_FROM to trips@, else every verification/recovery send is rejected 550.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CF account hard-limits to 1 Rules List, so proxied enforcement uses one crowdsec_ban
list + one WAF block rule; the sync writes both ban and captcha decisions into it
(captcha downgraded to block at the edge). Drops the second list + managed_challenge
rule. Trivial touch to firewall_bouncer.tf to make CI re-apply crowdsec and recreate
the DaemonSet (tar fix already in master; stale orphan was cleared).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v4.52.7 import id must be zone/<zone_id>/<ruleset_id>; add depends_on so the
crowdsec_ban/captcha lists exist before the WAF rules reference them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- initContainer used GNU tar --wildcards which fails on the busybox curl image (pod Init:Error); switch to extract-all + cp via shell glob.
- cloudflare_ruleset hit the per-zone singleton conflict; import the existing 'default' http_request_firewall_custom ruleset and manage all rules — CrowdSec ban/captcha first, the pre-existing disabled skip rule preserved verbatim.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tg plan verified the agent's guess 'Account Filter Lists Edit/Read' is not a key in the v4.52.7 permission-group map; the live CF API lists the correct account-scoped groups as 'Account Rule Lists Read'/'Write'.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
910d5892 landed the [git.timeout] + [git.config] env in master, but the CI apply
skipped stacks/forgejo (the changed-stack-diff race after a sync-merge), so the
Forgejo deployment never picked it up. A trivial comment touch to force a clean
apply of the stack so the durable push-mirror fix actually takes effect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a
zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists
(crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks
`(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`.
No per-request Worker, no cookie machinery — the rybbit Worker stays
analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI
(fail-safe: a LAPI blip skips the run and freezes the last-known-good block set;
serializes CF bulk ops since CF allows one pending op per account). A
least-privilege CF API token (Account Filter Lists Edit) is minted in TF.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drops banned source IPs in-kernel via nftables (hooks input+forward, so DNAT'd
LoadBalancer traffic is caught before reaching Traefik) for DIRECT hosts — the
direct-side replacement for the dead Traefik plugin, zero per-request hop.
No published image exists, so an initContainer fetches the pinned official
static binary (v0.0.34) onto a stock debian-slim base (nftables backend uses
netlink directly, no nft CLI needed). hostNetwork + NET_ADMIN/NET_RAW (not
privileged). Config (with api_key) in a Secret, Reloader-annotated. crowdsec ns
is already in the Kyverno wave-1 exclude list, so the privileged/hostNetwork pod
is admitted. Pinned to k8s-node2 (runs a Traefik pod) for one-node validation
before the nodeSelector is removed to roll cluster-wide. Fail-open by element
timeout if the bouncer stops.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New stack for the realtime voice agent — v2 of the portal-assistant brain
path. One persistent WebSocket per conversation: continuous mic audio ->
Silero VAD turn-taking -> Whisper STT (portal-stt) -> streaming Claude brain
(claude-agent-service) -> edge-tts (portal-tts) -> audio out, with barge-in.
Reuses all three upstream cluster services; nothing new is spun up.
Public Cloudflare ingress (proxied, WebSocket) at portal-realtime.viktorbarzin.me
with the app's own DEVICE_TOKEN as the edge gate (auth="app" — Authentik would
break the native Portal client). No buffering middleware: it would break the
streaming WebSocket. Image ghcr.io/viktorbarzin/portal-assistant-realtime
(private ghcr, pulled with ghcr_pull_token). Sibling to the v1 portal-assistant
gateway, which stays live.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Introduced the first pytest file in the tree
(stacks/k8s-version-upgrade/scripts/test_compat_gate.py); running it leaves an
untracked __pycache__/ dir. Ignore the standard Python build artifacts so test
runs don't show up as working-tree noise or get committed by accident.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two corrections to the runbook matching today's code fixes:
- The next-minor *patch* probe (GET .../Packages) also needs `-L`; it lacked it
until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes
now follow the 302.
- The compat gate's addon check is scoped to minor jumps — patches within the
running minor are never addon-blocked (target_minor <= running_minor returns
early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks
a 1.34.x patch.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The compat gate compared every addon's matrix ceiling against the target
k8s minor unconditionally. That is correct for a minor JUMP, but it also
blocked patch upgrades within the minor the cluster is ALREADY running:
ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of
1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31;
target 1.34 exceeds it" — even though the running cluster is itself proof ESO
0.12 works on 1.34. That silently defeats autonomous patching (it would have
bitten the moment a 1.34.10 was published).
Fix: a target at or below the running minor crosses into no new k8s minor, so
every installed addon is already empirically proven on it — check_addons now
returns no reasons when target_minor <= running_minor. Added running_minor()
(oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override
for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks
on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally
inert for patches (no API removal / containerd floor inside a minor) and keep
running as defence. Added test_compat_gate.py (8 cases) covering both paths.
Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe),
target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall)
from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring.
These are the two halves of the real enforcement that replaces the dead Yaegi
plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker),
firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io
302-redirects every request to a backing host, so without -L curl returned
an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell
through to "No upgrade needed". That is exactly why last night's 23:00 chain
no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing
it to the compat gate. `curl -sfL` follows the redirect and returns the
Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same
-L fix already applied to the Release availability probe (-sILo) above.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file
--batch-all-objects` over the NFS-backed repo exceeded the default git deadline
once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so
pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout]
(DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set
[git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos
packed (the real fix). A one-off forced gc already unblocked tripit; this prevents
recurrence across all repos.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that
projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace
the edge Worker reads on each proxied request. Full-reconcile per run so an
un-ban clears from the edge within one interval; fail-safe (a LAPI read error
skips the run and leaves existing bans to expire by TTL = fail-open, never a
stale all-block). TF wiring (KV namespace + CronJob + key registration) follows.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied
hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the
real source IP, so if an internal/RFC1918 address ever ended up in a ban
decision it could blackhole legitimate internal traffic. Whitelisting the
cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the
CrowdSec parser layer makes that structurally impossible — a trusted source
can never produce a decision in the first place. Public IP already whitelisted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).
- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
"exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
auth. ADR docs/adr/0011.
Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is
already unsupported on k8s 1.34). Manage ONLY the operator via the official
tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false
keeps the live Installation CR operator-managed so Helm never touches calico-node.
Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding
pre-stamped with Helm ownership metadata (transient migration step), then the
release imported via a plan-verified create (1 to add, 0 destroy). Verified clean:
calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption
(code-3ad) for the operator without adopting the Installation CR.
After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory
cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce:
a ban present in the LAPI stream AND pulled by the plugin still let the banned IP
through. Stream-mode decision matching is unreliable under Traefik's Yaegi
interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI
synchronously per request (result cached per-IP for defaultDecisionSeconds), which
enforces reliably and picks up new decisions immediately. LAPI is 3-replica +
in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output
revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true
cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client
cannot reach the cache under Traefik's Yaegi interpreter — even though
redis-master.redis.svc is reachable AND writable from the traefik namespace
(SET/GET verified via busybox; no NetworkPolicies; no auth). Same interpreter
-incompat class as the stream goroutine itself. With redisCacheUnreachableBlock
=false the bouncer then failed open and enforced nothing.
Disable the redis cache so the plugin uses its in-memory decision store (works
under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off:
captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst
an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but
its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI
stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration),
and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable
from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies),
so CrowdSec enforced nothing despite the bouncer now being registered. This is
the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body
here. v1.4.2 predates Nov 2025; latest is v1.6.0.
Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins
version). Config-verified compatible: every key we use survives (crowdsecMode,
crowdsecLapiKey/Host, updateMaxFailure, redisCache*, clientTrustedIPs, all
captcha* incl. turnstile); v1.6.0 also moves logging to slog/trace for future
diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and
plugin bumps must be tested against the running Traefik/Yaegi).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub
auto-registration (zero-click sign-up) is on. Document why (global auto-reg +
Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks
account-linking) and how to re-enable Authentik later.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per Viktor: GitHub sign-up must work zero-click (account created on first login,
no form). This global [oauth2_client] setting enables it. It conflicts with
Authentik (preferred_username is an email → invalid Forgejo username → 500 on
auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his
Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the
Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed,
out-of-band) per his directive. Forgejo sign-in is now GitHub + native login.
Committed via API to land on origin without pushing a concurrent agent's unpushed
local commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources).
On Authentik sign-in, Forgejo auto-created an account and derived the username
from Authentik's preferred_username claim — which is the user's email
(vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed
→ 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik
broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up
still works via a one-step register form. Committed via API to touch only main.tf
(forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.
Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).
- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
the proxied-app coverage caveat.
Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The traefik helm_release had no chart version pin, so a refreshed helm repo
index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema
rejects this stack's `logs` block ("Additional property logs is not allowed") —
an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the
deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic;
chart bumps must be deliberate with a values migration.
Follow-up to fd0c7493 (Turnstile captcha), which was applied with this pin
already in live TF state — this lands the pin in git to remove the drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.
- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
--provider github` (name "github", matching the callback registered on
the GitHub OAuth App). Like the existing Authentik source, it lives in
Forgejo's DB rather than Terraform — there's no clean TF resource for
login sources. Client id/secret mirrored to Vault secret/viktor
(forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
[oauth2_client], so a first GitHub sign-in creates the account directly
("sign up with GitHub") instead of a link-to-existing detour. The
GitHub identity is the trust gate for this path; Turnstile + email
confirmation still gate the native form.
Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.
Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.
Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:
- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
/captcha via the chart `volumes` value — the pulled Yaegi plugin does not
expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
placeholder reCAPTCHA keys; referenced by zero .tf).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:
- Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
is managed in Terraform (turnstile.tf) via the CF Global API key, so
the sitekey/secret are IaC, not a dashboard artifact.
- Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
(mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
credential Authentik uses (email-secret.tf ESO -> secret/authentik
smtp_password).
Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.
Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.
Runbook: docs/runbooks/forgejo-open-signups.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The remaining verbs that pass the "saves reasoning, not just typing" test the
user posed mid-session: each encodes the non-obvious which-endpoint-reached-how
resolution otherwise re-derived every time. (Same test deprioritized node-ssh
and secret-get aliasing — thin wrappers over commands already known.)
- net check <host> [path]: two-legged reachability — external (public DNS→CF)
vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
(prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.
All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster-
only endpoints (Alertmanager v2) deliberately out of scope. Verified live before
building; all five smoke-tested green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
Slack for why" signal. (Until monitoring is applied, a block still surfaces via
the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
matrix maintenance, and the detector minor-probe fix.
After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain
attempts every upgrade but refuses unless it can prove the target is safe. A
refusal is a BLOCK (not a crash) — it halts the chain and signals for attention.
- compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's
running version doesn't support the target k8s minor, (b) an in-use deprecated
API (apiserver_requested_deprecated_apis) is removed at/before the target, or
(c) a node's containerd is below the target's floor. Validated against the live
cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno
1.16 (all behind), which is exactly the auto-halt we want until they're bumped.
- addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO,
kyverno, gpu-operator + containerd floor), sourced from each project's compat
docs (2026-06-19). The keystone data the gate reads; keep current.
- upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation);
block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts.
- main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io
resolves to 200 — minors were never being detected). Gated behind the compat
gate above, so enabling minor detection can't roll an unsafe minor.
Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight +
runbook (next commit) so the detector fix only goes live with the full net.
Adds the verb-group that kills the single biggest reasoning sink in agent
sessions — watching a build/deploy to completion (proven the session that built
it: hours hand-rolling Woodpecker polling + DB-schema spelunking for one CI
incident).
- ci status/watch: Woodpecker REST API (version-stable, not its DB schema),
reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me
so the cert verifies — the Go form of the house `curl --resolve` pattern),
token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with
retries that ride Woodpecker's intermittent empty responses. watch matches the
HEAD/given commit (avoids the post-push race) and exits non-zero on failure.
- deploy wait: image-sha match THEN rollout status (rollout status alone returns
success on the old ReplicaSet); kubectl-based.
- work land now auto-watches CI to green on the landed commit (--no-ci-watch to
skip), closing the v0.1 gap.
- ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least
reliable; status/watch use the working list endpoint).
Live-verified ci status/watch against the live API.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`k8s db <app>` (Postgres path) execed `pg-cluster-rw`, which is the CNPG
read-write SERVICE, not a pod — so kubectl exec failed with
`pods "pg-cluster-rw" not found`. The unit test only checked the plan; the verb
was never fired at live state (the gap flagged in v0.2), so it shipped broken.
Fix: the PG plan now carries a label selector (cnpg.io/instanceRole=primary)
instead of a pod name, and k8s db resolves the actual primary POD via
`kubectl get pod -l <selector>` before exec. MySQL path (real pod
mysql-standalone-0) unchanged. Live-verified both paths (psql + mysql).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Infra pipelines were failing intermittently across all authors (e.g. #241-244,
#247) with the git clone step exiting 128:
git fetch --depth=1 --filter=tree:0 ... (partial/treeless clone)
git reset --hard <sha>
fatal: could not fetch <tree-sha> from promisor remote
remote: 404 page not found
The plugin-git clone defaulted to a partial (treeless) clone. The initial ref
fetch carries credentials, but the lazy *promisor* object fetch triggered by
`git reset --hard` hits the PRIVATE Forgejo repo without creds -> 404 -> exit
128. Whether it fired was luck-of-the-draw, hence the ~50% intermittent failures
fleet-wide (not specific to any commit).
Fix: set `partial: false` on every clone block so all objects for the (still
shallow) commit are fetched upfront with creds — no fragile lazy promisor fetch.
Diagnosed against the woodpecker Postgres DB (steps/log_entries) since the
Woodpecker HTTP API was itself flapping. Earlier "permission for ViktorBarzin"
log lines were an unrelated cross-forge red herring.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a
few of these lately" — and it disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.
Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.
Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR-0028 #96 (website half): /api drops Authentik forward-auth so the browser can carry a TripIt session cookie (the outpost 302'd cookie-only requests). The app self-authenticates (TripIt-session-first in get_current_user); no session -> 401 -> SPA landing. strip-auth-headers is REQUIRED now: with forward-auth gone, the hybrid forward-auth arm would otherwise trust a client-injected X-authentik-email — stripping inbound X-authentik-* closes that. /metrics split into its own still-gated ingress. Shell keeps Authentik bearers on tripit-api.* until #94; full AUTH_MODE collapse follows then. Verified live: no-session->401, valid TripIt cookie->200, injected header->401, Shell->200.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pipeline 244 failed: data.authentik_flow.default_provider_invalidation resolved null in CI (goauthentik 2024.x provider vs 2026.2 server), silently blocking every tripit-stack apply incl. the ADR-0028 #90 signing-key + redirect-URI delivery. Pin the literal UUID (what the slug resolves to) — matches the data-source-skew workaround used for the Vault binding.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds SESSION_SIGNING_KEY (Vault secret/tripit -> tripit-secrets ExternalSecret -> env_from) so TripIt's own session JWTs are signed with a real key (the app fails closed under the dev default until this lands), and adds the website OIDC redirect URI https://tripit.viktorbarzin.me/api/auth/callback/authentik to the public tripit-app provider so 'Log in with Authentik' works. Reuses the Shell's existing public OAuth2 app.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops
the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the
k8s dashboard) until someone manually re-applied the rbac stack. That manual step
ran after every control-plane upgrade — the one thing keeping autonomous patch
upgrades from being truly hands-off (it bit us this cycle: an earlier master bump
left SSO broken until we noticed).
Automate it: the rbac stack now publishes its existing OIDC restore script (the
same one its null_resource runs) to a kube-system/apiserver-oidc-restore
ConfigMap, and the upgrade chain's phase_master re-runs it on master right after
the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add
apiserver restart can't crashloop it. The script is idempotent and health-gates
/livez with auto-rollback; the step is non-fatal (a failure only lags SSO until
the next rbac apply, it won't abort the upgrade). phase_master already self-skips
when master is at target, so this only fires when master was actually upgraded.
The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the
manual restore is now a documented fallback (command corrected — it needs
-replace, since the null_resource trigger hash never changes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Lets agents search/navigate memory via the CLI, as the first step toward
deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just
one frontend); homelab memory is a thin Bearer-auth HTTP client over the same
API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works
even when the MCP frontend is down — the recurring disconnect that took the MCP
offline for this whole session.
Verbs: recall (server-side semantic search), list, categories, tags, stats,
secret (read); store, update, delete (write). Validated against the live API
including a store→recall→delete round-trip — full data-plane parity with the MCP.
The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to
the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after
the CLI is proven in the hooks — see docs/adr/0008.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver
note), add docs/adr/0007 (resolver, read/write split, config-mutation stays
raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the
Kubernetes surface.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mining the post-v0.1 corpus showed kubectl is the dominant remaining domain by
far: 11,291 commands across 243 sessions (more than everything else combined).
This adds the full k8s verb-group built on an app→namespace→pod resolver (most
namespaces hold one app, so <app> defaults to the namespace and the target
defaults to deploy/<app>, letting kubectl resolve the pod; -n/--pod/-c/-l/--tty
override).
Read: status (pods + non-Normal events), get, logs, describe, debug (one-shot
triage), pf, rollout-status. Write/operational: db (the dbaas psql/mysql exec
pattern — PG via pg-cluster-rw -c postgres, MySQL via mysql-standalone-0 with the
env-password bash wrapper, never inline), exec, rm-pod (pods/jobs ONLY), restart.
Config-mutation verbs (apply/edit/patch/scale/create) are deliberately NOT
exposed — they stay raw per the Terraform-only policy.
Smoke-verified read verbs against the live cluster (get/logs/rollout-status);
write verbs are unit-tested (resolver, db-plan, shell-quoting) but not fired at
live state.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes v0.1: documentation, build/install path, and version stamping.
- cli/VERSION (v0.1.0) stamped into the binary via ldflags.
- cli/README.md rewritten as the homelab overview (verbs + tiers, manifest,
build, the preserved legacy webhook use-cases).
- docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a
separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the
work/tf behaviour (native worktree entry, verification-gated auto-land,
presence-coupled apply).
- setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run
(t3-dispatch pattern), so every devvm user gets the current binary.
- AGENTS.md: discovery pointer under Common Operations.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the infra-loop verb surface. work start creates .worktrees/<topic>
on <user>/<topic> off <remote>/master (git-crypt-aware, ensures .worktrees is
ignored) and prints the path for native EnterWorktree entry. work land fetches,
merges master in, verifies, pushes HEAD:master with non-fast-forward retry, and
falls back to pushing the feature branch for a PR when the direct push is
rejected (branch protection). work clean removes the worktree + branch.
Safety: work land REFUSES to push when it cannot verify (no --verify-cmd and no
auto-detected suite) unless --no-verify is passed. This was added after an
accidental smoke-test invocation pushed unverified WIP to master (benign — the
infra CI applied 0 stacks since the diff was cli/-only — but the gate makes an
unverified land a deliberate choice, not the default).
Known v0.1 limitation: land does not yet block on CI to green; that arrives with
the ci/deploy watch verbs. It prints a reminder to follow the pipeline manually.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the tf verb-group and the resolver substrate beneath it, continuing the
v0.1 infra-loop build.
- substrate: findInfraRoot (walk up to terragrunt.hcl + stacks/), stack→dir
resolver, and repo/remote/git-crypt detection (preferRemote forgejo>origin,
hasGitCryptAttr, gitCryptFlags) — the last is for `work` next.
- tf plan/validate/fmt/force-unlock/apply, resolving the stack from cwd and
delegating to scripts/tg (which owns state decrypt/encrypt, the Vault lock,
and the ingress auth-comment check) rather than calling terragrunt directly.
- tf apply is presence-coupled: claims stack:<name>, ALWAYS releases on exit
(normal, error, or SIGINT/SIGTERM via sync.Once + signal handler) — fixing
the documented ~200-claim leak — and prints an out-of-band reminder since CI
applies canonically on push.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Begin evolving the existing infra/cli into the agent-facing "homelab" CLI
decided in the design/grilling session: one composable, JSON-capable surface
for the operations agents run over and over (mined from 51k commands across
2,225 past sessions; the infra inner-loop is ~29% of them). v0.1 targets that
loop — work/tf/claim — and ships here, in place, in infra/cli.
This first slice:
- command registry + dispatcher (longest-prefix verb matching) and a
`manifest`/`manifest --json` progressive-discovery entrypoint; every verb
declares a read|write tier so write-gating can be added later (everything is
allowed for now).
- claim/release verbs wrapping the existing presence script (not reimplemented),
with label-taxonomy validation.
- main() front-dispatches the homelab verb surface but falls through to the
legacy webhook -use-case path verbatim, so the in-cluster infra-cli image is
unaffected.
- fix a pre-existing vet error (glog.Infof missing format directive) that
blocked `go test`.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The autonomous 1.34.9 version-upgrade chain has been failing its preflight every
night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on
1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an
already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line,
so the parsed target came back empty and the `!= requested` check aborted the
whole chain before any worker was touched. Deterministic — it self-cleaned and
re-failed identically each night, so it would have failed again tonight, leaving
node2-6 stuck on the old patch.
Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION
— the same at-target self-skip that phase_master and phase_worker already do.
The remaining workers are still validated by their own per-node phases, and the
detector already confirmed the target is installable via apt-cache. This lets
tonight's unattended chain resume and finish node2-6 -> 1.34.9.
Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents
writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The claude-memory MCP backend ran as a single replica with no PDB, so every
voluntary disruption took it to zero for ~30-90s — which surfaced as the
memory MCP "keeps getting disconnected" problem. Disruption sources hitting
the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization —
caught evicting it live), Keel image bumps, Reloader restarts on the 7-day
DB-password rotation, node drains, and CI deploys.
The local stdio MCP subprocess itself was proven healthy (fast non-blocking
startup, stderr suppressed, graceful degradation), so the fault was purely
backend availability, not the MCP plumbing.
Fix: run 2 replicas (the backend is stateless FastAPI over shared CNPG
Postgres and already has hostname anti-affinity) + restore the PDB at
minAvailable=1 (safe now — the drain deadlock that justified removing it
only existed at 1 replica) + descheduler evict=false to stop the needless
5-min churn. All five disruption sources become zero-downtime rolling events.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to turn lodging prices on and stop using the fake provider.
Mirrors the existing FARE_PROVIDER wiring: point the Booking.com/Airbnb lodging
scraper at the shared chrome-service browser over CDP (the namespace is already
admitted through chrome-service's NetworkPolicy for the fare scrape). The lodging
code (ADR-0025, tripit #78) is live in tripit 03973b5, so the env lands after
that rollout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the new paperless-ai service and the two non-obvious operational
facts: runtime config lives in the PVC .env (not TF env, which would shadow
it), and Qwen3 needs /no_think for parseable tagging output.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Auto-tagging silently no-op'd: the container env vars set in the deployment
shadowed the app's own /app/data/.env, because paperless-ai's dotenv loader
does not override process.env. A stale PROCESS_PREDEFINED_DOCUMENTS=yes (with
no TAGS) made the scan select zero documents.
Strip the wizard-owned behavioural config (Paperless URL, AI provider, model,
scan interval, tagging flags) from the container env, keeping only
infrastructural env (PUID/PGID/port/RAG/HF cache) and the Vault-sourced
secret refs. The app's setup-written .env on the PVC is now authoritative,
so processing runs and tags all documents. Qwen3 thinking is disabled via
SYSTEM_PROMPT=/no_think in that .env to keep the model's JSON output parseable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted real semantic search over his ~300 Paperless documents and
preferred a ready-made solution over building one. paperless-ai provides
local-embedding RAG (ChromaDB + sentence-transformers, GPU-free) plus
LLM-driven auto-analysis/tagging.
Wiring:
- LLM (chat answers + tagging) -> in-cluster llama-swap qwen3-8b
(OpenAI-compatible); embeddings + vector store are local on the PVC.
- Reads Paperless over the internal service via a dedicated `paperless-ai`
superuser token (Vault secret/paperless-ai); app-admin creds also in Vault.
- Encrypted PVC for /app/data (SQLite + ChromaDB + model cache).
- Ingress paperless-ai.viktorbarzin.me behind Authentik (auth=required).
- Third-party image pinned (docker.io/clusterzx/paperless-ai:3.0.9), no Keel.
Runtime config persists to the PVC .env via the app's one-time setup; the
deployment env vars are pre-fill/documentation only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The edge-tts apply was blocked by the require-trusted-registries Kyverno policy —
a bare `travisvn/openai-edge-tts` isn't in the allowlist. The policy blanket-
trusts `docker.io/*`, so prefixing the image with `docker.io/` passes admission
with no policy change. Verified live: bg synth round-trips through Whisper
verbatim and a full gateway /v1/talk bg turn returns a coherent spoken Bulgarian
reply ("Добър ден! Добре съм, благодаря!...").
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The landed portal-stt source still declared the setup_tls_secret module +
tls_secret_name variable, which file()-reads secrets/fullchain.pem — a file this
stack does not ship. portal-stt is ClusterIP-only (no ingress; the Gateway is the
sole externally-exposed component, ADR-0001), so it needs no TLS secret. The live
deployment never had it (removed during the original apply); this aligns the
source with reality so CI applies cleanly. Fixes the pipeline-229 portal-stt
apply failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The portal-assistant voice-assistant stacks (portal-tts, portal-stt,
portal-assistant) were applied to the live cluster from feature branches but
never landed on master — the GitOps source of truth. This lands all three and,
in portal-tts, fixes Bulgarian speech.
Bulgarian was unintelligible: the local Piper voice (bg_BG-dimitar-medium via
espeak-ng) mangles Bulgarian consonants — a synth->Whisper round-trip turned
"Добър ден" into "Обърден", and a user heard pure gibberish. English was fine.
portal-tts now runs openai-edge-tts (Microsoft edge-tts neural voices) for BOTH
languages instead of Piper — ADR-0003 always named edge-tts as the online
Bulgarian-quality fallback. Validated before landing: edge bg round-trips
through Whisper verbatim ("Добър ден! Как сте днес? ..."). The gateway maps
detected language bg/en to the edge voice names via new TTS_VOICE_BG /
TTS_VOICE_EN env (bg-BG-KalinaNeural / en-US-AvaNeural). No GPU, no NFS model
store, no secrets — edge fetches voices from Microsoft per request (egress
verified). The assistant already needs the internet for the Claude brain, so an
online TTS adds no new failure mode.
The brain stays Sonnet with no extended thinking (already the default — a live
turn answers directly in ~3.4s), per the latency-over-smartness ask.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New, empty-cache clients (the repurposed Meta Portal running the HA companion
app) cold-load the whole HA frontend at once - dozens of frontend_latest/*.js +
MDI icon chunks. ha-london had no per-service rate limit, so it fell back to the
global 10/s burst 50 and 429'd those chunks, leaving every dashboard blank
(Settings, which loads less, worked). Give ha-london its own 200/500 middleware
(skip_global_rate_limit, mirroring ha-sofia, with depends_on to avoid the
dangling-middleware 404 window) and bump ha-sofia 100/200 -> 200/500 so a cold
Portal load of Sofia doesn't hit the same wall.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New k8s nodes were only getting the personal `wizard` key in authorized_keys —
not the automated k8s-version-upgrade pipeline's key (Vault
secret/k8s-upgrade/ssh_key_pub). So a freshly provisioned node is invisible to
the upgrade chain (it SSHes in as `wizard` to drain+upgrade): node4/5/6 all hit
"Permission denied (publickey)" on 2026-06-17 and had to have the key pushed by
hand. Bake the public key into the cloud-init template so every new node gets it
on first boot.
(unattended-upgrades is already in this template — node4/node5 missed it only
because the LIVE PVE cloud-init snippet lagged this source: it deploys via a
Tier-0 `stacks/infra` apply that hadn't run since before their 2026-05-26
provision. Same lesson applies to THIS change — it reaches new nodes only after
`stacks/infra` is applied to refresh the snippet on the PVE host.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:59:59 +00:00
413 changed files with 42023 additions and 11254 deletions
@ -166,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
@ -177,6 +178,13 @@ Notes:
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
## WebAuthn / Passkeys (2026-06-20)
- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no**`webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- **Outpost (2026-06-10):**`log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control.
author: Claude Code
version: 2.0.0
date: 2026-02-07
version: 2.1.0
date: 2026-06-24
---
# Home Assistant Control
@ -44,6 +44,12 @@ There are **two** Home Assistant instances:
- Environment variables for each instance:
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
- If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
## homelab CLI (preferred — works from any directory)
- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/` (requires `sudo` for file access)
- **Platform**: Raspberry Pi 4, HA OS
- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@ -440,12 +464,17 @@ Named plugs with power/energy tracking:
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components
- **cowboy**: Cowboy e-bike integration (HACS)
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
### Custom Components (HACS integrations)
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Platform (HAOS — ignore any legacy `docker run` snippet)
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
description:'Read-only: assess PVE host + k8s memory overcommit, right-size deployment REQUESTS (scheduling) and LIMITS (OOM) separately from 30d usage, then test whether one worker node can be removed while preserving N-1 by BOTH a physical-usage and a scheduling-request model. Emits a gated plan.',
constUSAGE_PROMPT=`Read-only memory RIGHT-SIZING from 30-day usage. CRITICAL: requests and limits are DIFFERENT knobs - size them separately. Do NOT set requests to peak (that is what a flawed earlier run did; it manufactured a false capacity shortfall).
constTOPO_PROMPT=`Read-only reliability-topology audit: which worker is safest to remove? Candidates: k8s-node2..node6 (NOT master, NOT GPU node1). For each worker (k8s-node1..6): sticky_pods (StatefulSet members; pods with local/hostPath PVCs; single-replica critical), local_pv_count, volumeattachments, cnpg_primary (CNPG 'pg-cluster' PRIMARY here? check pod role labels), gpu_workloads, evac_difficulty (easy|medium|hard)+evac_notes.
log(`K8s: requests ${G(k8s.cluster_requests_mib)} / 30d-peak-usage ${G(clusterPeak)} / allocatable ${G(k8s.cluster_allocatable_mib)} -> requests are ${(k8s.cluster_requests_mib/clusterPeak).toFixed(2)}x real peak`);
log(`Request right-sizing: ${G(T.net_request_reclaim_mib)} of over-provisioned requests can be trimmed (${T.count_request_shrink} workloads); ${T.count_limit_raise_oom} workloads are OOM-tight on LIMITS (raise regardless).`);
?`${best.removeName} can be removed while preserving N-1: cluster 30d peak usage ${G(clusterPeak)} fits on survivors-minus-one (${G(best.phys_capacityN1_mib)}); after trimming over-provisioned requests, scheduling also fits (${G(best.sched_margin_mib)} margin). Frees ~${G(best.host_freed_mib)} to the PVE host.`
:`No worker can be removed while preserving N-1 by BOTH physical-usage and scheduling-request models.`;
{key:'math',ask:'Recompute BOTH N-1 models independently. Physical: cluster 30d peak vs (sum survivor allocatable - largest survivor). Scheduling: reschedulable recommended REQUESTS (not limits, not peak) vs (sum survivor freeGeneral - largest). Verify GPU node reserve uses REAL gpu requests, allocatable not capacity, DaemonSets are per-node fixed load. Are pool selection and numbers right?'},
{key:'temporal',ask:'Challenge the 30-DAY peak window and the request shrinks. Could a monthly/quarterly peak exceed cluster_peak_30d (compare a 90d peak)? Are the shrunk REQUESTS safe given each workload keeps a limit above its peak (Burstable)? Name any shrink or any still-tight limit that is reckless.'},
{key:'stateful',ask:'Check the chosen candidate for STRANDED state and drain blockers: CSI PV pinning (do volumes reattach anywhere?), CNPG primary, VolumeAttachment caps, anti-affinity/topologySpread unsatisfiable at one fewer worker, PDBs that block drain (disruptionsAllowed=0). Is removal actually safe, and what drain ORDERING is required?'},
];
constverdicts=(awaitparallel(lenses.map((l)=>()=>
agent(`Adversarial reviewer. Try to REFUTE:\n"${headline}"\n\nLens: ${l.ask}\n\nData (read-only). Verify LIVE: kubectl, Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=...'), ${SSH} '<cmd>'.\n\n${verifyData}\n\nDefault refuted=true if evidence does not clearly hold. Give concrete corrections.`,
# retry. Deliberately NOT helm atomic-timeouts or config errors
# (missing arg, invalid index) — those must fail fast, retry can't fix
# them and can worsen a stuck helm release.
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
- **Ask before `git push`** — always confirm with the user first
## Execution
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
@ -56,6 +56,28 @@ _Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
_Avoid_: bare "user", "tenant".
### GPU sharing
**GPU slice**:
One unit of `nvidia.com/gpu` on the time-sliced Tesla T4 — a **scheduling turn, NOT a memory allocation**. The device plugin advertises the card ×100; a pod requesting `nvidia.com/gpu: 1` gets GPU *access*, with zero guarantee about how much of the 16 GB VRAM it may use. "Overallocate GPU memory" is a real failure precisely because a slice carries no memory accounting.
_Avoid_: reading a GPU slice as a memory reservation or a fraction of the card; "vGPU" (we run no vGPU/MIG/MPS — see ADR-0016).
**GPU memory budget**:
The custom node-level extended resource **`viktorbarzin.me/gpumem`** (integer MiB) that makes the scheduler VRAM-aware (ADR-0016). The GPU node advertises a total (~14000 MiB = physical minus driver/context slack); each GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"`; being non-overcommittable, the scheduler refuses to co-schedule past the card (overflow → `Pending`). A *schedule-time* reservation, **not** a runtime cap — it stops pile-on, not a single tenant's runaway.
_Avoid_: treating it as a hard CUDA cap (it isn't — that's what the **GPU watchdog** is for); confusing it with the `nvidia.com/gpu` slice (orthogonal axes: access vs memory accounting).
**GPU watchdog**:
The `gpu-vram-watchdog` CronJob (nvidia ns) that supplies the runtime teeth the **GPU memory budget** lacks: when *actual* free VRAM (`gpu_pod_memory_used_bytes`) drops below a floor, it recycles the biggest tenant that is **over its declared budget**. Enforces the budget as a contract, acts only under pressure (so bursting into genuine slack is fine), and is what bounds the 2026-06-02 immich-ml runaway class.
_Avoid_: expecting it to act on priority (it enforces the *budget*, since co-tenants often share one PriorityClass); expecting instant prevention (it corrects with a detection lag — soft, by design).
**GPU demand-gate**:
The scale-0↔1 admission CronJobs (`stacks/tts`) that bring a best-effort *batch* GPU tenant (chatterbox-tts) up only when free VRAM ≥ a floor and idle it back down — letting on-demand tenants fill real slack without holding a reserved **GPU memory budget** seat.
_Avoid_: using it for interactive tenants (cold-load lag — portal-stt is warm-resident instead); conflating it with the **GPU watchdog** (gate = admit on free VRAM; watchdog = recycle on over-budget pressure).
**gpu-workload priority**:
The `gpu-workload` PriorityClass (1,200,000) auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority` policy — the exclude list (`tts`) drops to `tier-2-gpu` (600,000) so it loses node-pressure eviction first. Governs *Kubernetes node* eviction order, **not** VRAM (VRAM is the budget + watchdog's job).
_Avoid_: assuming it protects VRAM; it is a scheduling/eviction priority on node memory/CPU pressure.
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
_Avoid_: bare "lan", "private", "intranet".
**Segment**:
One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q.
_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept).
**CCTV segment**:
The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017).
_Avoid_: "camera VLAN", "CCTV LAN".
**Ingress auth**:
The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
_Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
@ -117,9 +147,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
_Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.
**Calico**:
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
_Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.
**Service identity**:
How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage
**proxmox-lvm-encrypted**:
@ -199,6 +237,20 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**.
**Anubis**:
A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
### Externally-authored sites
**Valia site**:
A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `<name>.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`.
_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**.
**Content folder**:
The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site.
_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root).
**Entry file**:
The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring.
_Avoid_: asking Valia to rename her files to fit hosting conventions.
## Relationships
- A **Service** is defined by exactly one **Stack** — **flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
@ -210,6 +262,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
- Tier-1 **State tier** state and ~12 app databases share one **CNPG**`pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra.
All read/write paths are validated against the live API (incl. a
store→recall→delete round-trip). This gives full data-plane parity with the MCP;
the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up** —
see `docs/adr/0008`.
### v0.4 verbs — ci / deploy
Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
remote, with retries that ride Woodpecker's intermittent empty responses.
| Command | Tier | What it does |
|---|---|---|
| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
`work land` now calls `ci watch` on the landed commit automatically (skip with
`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
the least reliable; `status`/`watch` use the list endpoint that works.
### v0.5 verbs — net / dns / metrics / logs
Reachability + observability probes. Their value is *endpoint resolution* — the
non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
otherwise re-derive every time — not the HTTP call itself. All reach internal
ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
| Command | Tier | What it does |
|---|---|---|
| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
firing set is reachable via `ALERTS` instead.)
### v0.6 — usage telemetry (`usage top`)
Makes "which verbs are actually used, by everyone" a query instead of a guess —
so adding the *next* verb is evidence-driven, not shaped by one person's habits.
Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
the shared Loki, aggregate usage is queryable **without reading anyone's home** —
the privacy-preserving answer to "what does the team use."
| Command | Tier | What it does |
|---|---|---|
| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
### v0.7 verbs — Home Assistant
Cover exactly the two things the `ha`**MCP server can't**: resolving the
long-lived API token out of the cluster, and SSH to the HA host for host-level
work (config files, docker, add-ons). Entity state and control (`turn_on`,
`get_state`, services) stay with the MCP — *actions an MCP already encodes are
out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
the non-obvious *which secret, which host, which key, which flags* you'd
otherwise re-derive every session — agents were hand-rolling a
`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
every run because the existing `home-assistant-sofia.py` needs an env var set
and a cwd-relative path, neither of which holds in an arbitrary session.
| Command | Tier | What it does |
|---|---|---|
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
not tied to whoever first wrote the workflow (the user's key must be enrolled on
Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
from the devvm over CDP, for sites that detect and block headless automation. The
headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
the gated action (submit/login) silently fails — the motivating case was the
Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
injects the same `stealth.js` the in-cluster callers use, and submits first try.
The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
agent supplies the Playwright script — judgment stays out of the CLI.
| Command | Tier | What it does |
|---|---|---|
| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
Default context is a **fresh incognito** one (closed on exit) — safe for the
shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
reuses the warmed persistent profile when a pre-logged-in session is needed.
`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
that gates in-cluster callers — no namespace label needed. The node CDP client is
pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
(Chromium 130; protocol changes between minors) and is installed once, lazily,
into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
runs on the devvm, `setInputFiles` streams local files to the remote browser over
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv
handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
its own path). Access is whatever your policy grants. Writes are merge-only;
`put` (replace) / `delete` are out of scope — use the raw `vault` CLI.
## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning
(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
stamped from `cli/VERSION` via ldflags. Manual build:
```
cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
go test ./...
```
## Legacy webhook use-cases (preserved)
This binary is also the in-cluster `infra-cli` image. Invocations starting with
`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
original flag-based path unchanged, so the webhook handler is unaffected.
## Design
See `infra/docs/adr/0004`–`0013` for the architecture decisions.
fmt.Fprintln(os.Stderr,"copied to clipboard; clearing in 30s")
clearClipboardAfter(30)
default:// refuse
fmt.Fprintln(os.Stderr,"refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
}
}
// clearClipboardAfter spawns a detached background clear so the secret doesn't
@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.`Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
## Considered options
- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
## Consequences
- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
## As-built (2026-06-25)
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
<textx="40"y="42"font-size="26"font-weight="700"fill="#0b0b0b">ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable</text>
<textx="40"y="66"font-size="15"fill="#52514e">Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1</text>
@ -128,7 +128,7 @@ The agent handles all three version patterns in Terraform:
- **Slack**: All upgrade events reported (start, success, failure, rollback)
- **Git**: Detailed commit messages with changelog summaries, risk level, backup status
- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)
- **DIUN Slack**: REMOVED 2026-07-02 (per-tag @channel pings in #image-updates; human cadence is the weekly upgrade report). The n8n webhook feed to the upgrade agent is unchanged.
## Bulk Upgrades
@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes.
- `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
- `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
- `K8sUpgradeChainJobFailed` — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured).
- `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.**Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).
## NodeLocal DNSCache
@ -368,6 +368,7 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
@ -513,6 +514,7 @@ For external `.viktorbarzin.me` records:
1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
**Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
| # | Source | Event | Severity |
|---|---|---|---|
@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
@ -541,12 +541,18 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}`**symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}`**symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
**Agent skills — vendored own-copies for an allowlist (2026-06-23):** beyond the config-inheritance base (above, which symlinks the admin's `~/.claude/skills` into every user), the reconcile's `install_skills` gives users on the `SKILL_USERS` allowlist (currently `emo`) their OWN copies of a curated skill set vendored in-repo at `scripts/workstation/claude-skills/` (16: the admin's 15 `mattpocock/skills` + `find-skills` from `vercel-labs/skills`). It copies each into `~/.agents/skills/<name>` (owned by the user, parent `~/.agents` chowned too — `install -d` leaves intermediates root-owned) and points `~/.claude/skills/<name>` at it with a **relative** symlink (`../../.agents/skills/<name>` — the layout `skills add -g` produces; Claude Code reads `~/.claude/skills/`). **Vendored, NOT `npx skills add`:** upstream drifted off this exact set (`diagnose`→`diagnosing-bugs`, `write-a-skill`→`writing-great-skills` renamed; `caveman` + `zoom-out` unpublished), so npx can't reproduce it — and a per-reconcile GitHub clone + unpinned-CLI dependency has no place in the hourly root job; refresh by re-snapshotting (`claude-skills/README.md`). **if-absent keys on the user's OWN copy** (a real dir under `~/.agents/skills`), so a steady-state reconcile is a no-op AND a stale or cross-user `~/.claude/skills` symlink is healed to the own copy — emo had `grill-me`/`file-issue` symlinked into the admin's home; `grill-me` is now emo's own (`file-issue` is outside the set, left as-is). A real dir squatting a name is never clobbered. Best-effort tail (`return 0`, like `install_memory`). Extend coverage = edit `SKILL_USERS`.
**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
**Claude authentication — per-user, self-renewing, Vault-recoverable (2026-06-20):** every roster user logs in with their OWN Enterprise identity; shared `CLAUDE_CODE_OAUTH_TOKEN` injection was removed because environment auth outranks local login and collapses identity/audit/quota. Claude owns access-token refresh in `~/.claude/.credentials.json`. A system template timer (`claude-auth-sync@<user>.timer`, every 6h) renews a dedicated 32-day periodic Vault token, validates Claude with real non-persistent Haiku inference (`auth status` can lie during a 401), backs up only `claudeAiOauth` to `secret/workstation/claude-users/<os_user>`, and performs one atomic Vault restore/retry on failure while preserving `mcpOAuth`. Vault policy `workstation-claude-<os_user>` isolates every path; the roster generates policies for present and future users. A hard refresh-token revocation still requires the affected person to complete SSO—there is no supported noninteractive bypass. Loki alert `WorkstationClaudeAuthInvalid` surfaces exhausted recovery. Runbook: `../runbooks/claude-auth-renew-workstation.md`.
**Per-user browser MCP — playwright, reproducible from git (2026-06-16):** every user (incl. the admin) gets their OWN isolated `@playwright/mcp` server so their concurrent Claude sessions don't fight over tabs (`--isolated` → a fresh browser context per MCP connection), wired into Claude in **every directory** via a user-scope `~/.claude.json` entry (`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`). Mechanism: **system-level template units**`playwright-mcp@<user>.service` + `playwright-snapshot-refresh@<user>.{service,timer}` (`User=%i`, sourced from `scripts/workstation/playwright/`, installed by `setup-devvm.sh` §9e — system manager, so NO systemd --user / linger). `roster_engine.py` allocates a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); the reconcile's `install_playwright()` writes it, seeds the chrome-service snapshot token if-absent (staged from Vault `secret/chrome-service` to `/etc/t3-serve/chrome-service-token` by `setup-devvm.sh` §8c, since the hourly root reconcile has no Vault token), wires `~/.claude.json` by running `claude mcp add --scope user` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and `enable --now`s the instances (idempotent — never restarts a running server). The `@playwright/mcp` version is **pinned** in the unit (the `@latest`-silently-rolls-the-fleet footgun — see `T3_PIN`). Replaced the earlier hand-made `~/.config/systemd/user/playwright-*` units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their `.claude.json`). Cookie-warming pipeline + ops: `../runbooks/chrome-service-snapshot.md`.
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>`**as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):**`ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).
@ -561,7 +567,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**Web-terminal session persistence (2026-06-10):** the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — `tmux-persist-save.timer` (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to `/var/lib/tmux-persist/<user>.tsv`, and `tmux-persist-restore.service` recreates missing sessions at boot with `claude --resume <uuid>` (per-session idempotent; also handles partial loss). The web terminal also exposes an **on-demand "Restore sessions" button** (terminal-lobby: `tmux-api``POST /restore` → the validated root `tmux-restore-user` wrapper → `tmux-persist restore <user>`, a single-user mode of the same script): the boot-only restore service never fires when an **OOM kills a user's tmux server *without* a reboot** (the common case under multi-user memory pressure), so the button covers that gap. This is a **tmux/terminal-surface** feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (`~/.t3` state, plus the daily `t3-backup-state` dump), and Claude conversations themselves were always durable (`~/.claude/projects/`) — what this adds is the volatile tmux wiring.
**Status (2026-06-10):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, and**per-user `code_layout` with the ancamilea workspace cutover (infra → `~/code/infra`, `tripit` alongside, 2026-06-10)**. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept** (they ARE the shared-base delivery mechanism — the plan's step to remove them is obsolete). **Remaining (held / future):** the offboarding apply-side (Phase 7), the rest of per-user MCP/auth injection (`ha` + `claude_memory` + `.credentials.json` + beads Dolt cred — **per-user playwright browser MCP done 2026-06-16**, see above), and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
**Status (2026-06-20):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, **per-user `code_layout` with the ancamilea workspace cutover**, per-user playwright browser MCP, and per-user Claude OAuth renewal/Vault recovery. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept**. **Remaining (held / future):** the offboarding apply-side (Phase 7), per-user `ha`/`claude_memory`/beads credential injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed)
Last updated: 2026-07-02 (dCCTV segment added — dedicated pfSense leg for the garage camera, ADR-0017)
## Overview
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
The homelab network is built on three isolated segments behind pfSense (management VLAN 10, Kubernetes VLAN 20, and the physically-legged dCCTV camera segment — see ADR-0017) with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
Isolated camera segment for owned cameras at the Sofia site (first: `vermont-garage`, HiLook IPC-T241H-C at the garage entrance). Decision + rejected alternatives: `docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md`.
**Physical path (rev 3, single switch)**: camera → TL-SG105PE PoE port (untagged VLAN 30) → trunk port (home LAN untagged + CCTV **tagged 30**) → the existing LAN1 cable → R730 `eno1` → `vmbr0` (vlan-aware) → pfSense `net3`/vtnet3 = `vmbr0 tag=30` = interface **dCCTV `10.0.30.1/24`**. The TL-SG105PE **replaces** the old garage TL-SG105E (retired to cold spare) and carries everything: apartment uplink, 4G router `192.168.1.7`, UPS mgmt (VLAN 1), camera (VLAN 30), trunk — all 5 ports used. VLAN-30 membership is {camera port, trunk port} only, so tagged injection from other ports is dropped. `eno2`/`vmbr2` remain dormant as the fallback physical leg (rev 2).
**Addressing**: Kea DHCP pool `10.0.30.100-199`; devices get MAC reservations (camera `10.0.30.70`; the PE switch mgmt inherits the retired switch's `192.168.1.6` on the home LAN). Kea DDNS auto-registers names in Technitium; `phpipam-pfsense-import` picks up leases hourly.
**Firewall** (all on pfSense):
- dCCTV in: pass `udp OPT4-net → 10.0.30.1:123` (NTP) — everything else hits the interface's default deny. Cameras cannot reach LAN, other segments, or the internet.
- WAN in (home LAN side): pass `192.168.1.8` (ha-sofia) → `10.0.30.70:80` (ISAPI/hikvision_next) and `:554` (RTSP), reply-to disabled on both.
- dKubernetes is allow-all, so cluster Frigate/go2rtc pulls RTSP with no extra rule (pod egress SNATs to node IPs).
- Home-LAN clients need the **AX6000 static route**`10.0.30.0/24 via 192.168.1.2` (camera-day step) to reach the camera UI.
**Consumers**: cluster Frigate (`/srv/nfs/frigate/config/config.yml` — NOT Terraform) pulls `rtsp://10.0.30.70:554` main+sub as `vermont-garage`; HA integrates via Frigate plus direct hikvision_next for tamper events.
## IPAM & DNS Auto-Registration
Devices are automatically discovered, named, and registered in DNS without manual intervention.
@ -208,24 +236,31 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up
### Ingress Flow
CrowdSec is **not** a step in this chain — banned IPs are dropped before the
Every ingress created by the `ingress_factory` module follows this chain:
CrowdSec IP-reputation enforcement is **not** in this chain — it is out-of-band
(host nftables on direct hosts; the Cloudflare edge WAF `crowdsec_ban` rule on
proxied hosts), so banned IPs never reach the chain and there is no per-request
CrowdSec hop. Every ingress created by the `ingress_factory` module follows this
Traefik chain:
1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages.
1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients), tripit (`tripit-rate-limit`, 100/1000, photo-tab thumbnail bursts), health (`health-rate-limit`, 100/1000, SPA shell + API burst per page), and dawarich (`dawarich-rate-limit`, 100/1000 — the Rails app self-serves all fingerprinted assets and the map adds an API burst per load; the default burst 429'd the asset tail and risked dropping OwnTracks/mobile location POSTs on the same host).
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
Additional middleware:
- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents.
- **HTTP/3 (QUIC)**: Enabled globally on Traefik.
### Entrypoint Transport Timeouts
@ -268,9 +306,9 @@ The `websecure` entrypoint sets `respondingTimeouts` in `stacks/traefik/modules/
### MetalLB & Load Balancing
MetalLB v0.15.3 allocates IPs from `10.0.20.200-10.0.20.220` (21 IPs) in **Layer 2 mode**; **four are in use**. Most LoadBalancer services share **10.0.20.200** (`metallb.io/allow-shared-ip: shared`, `externalTrafficPolicy: Cluster`). **Three services hold dedicated IPs with `externalTrafficPolicy: Local`** to preserve the real client source IP (and, for Traefik, to make QUIC/HTTP3 work — a shared IP forbids the mixed ETP the UDP listener needs).
MetalLB v0.15.3 allocates IPs from `10.0.20.200-10.0.20.220` (21 IPs) in **Layer 2 mode**; **five are in use**. Most LoadBalancer services share **10.0.20.200** (`metallb.io/allow-shared-ip: shared`, `externalTrafficPolicy: Cluster`). **Four services hold dedicated IPs with `externalTrafficPolicy: Local`** to preserve the real client source IP (and, for Traefik, to make QUIC/HTTP3 work — a shared IP forbids the mixed ETP the UDP listener needs).
> **Why not consolidate to fewer IPs?** The three dedicated IPs can't be merged. MetalLB L2 only lets `ETP=Local` services share an IP if they have *identical pod selectors* (Traefik/KMS/Technitium don't), and a shared `ETP=Local` IP announces from a single node — blackholing any service whose pods aren't on it. Traefik additionally can never leave a dedicated IP (QUIC needs the UDP listener on its own ETP=Local IP). Merging would cost client-IP preservation or HA, so the 4-IP layout is deliberate — not sprawl. Full analysis: `docs/plans/2026-06-03-lb-ip-hygiene-design.md`.
> **Why not consolidate to fewer IPs?** The four dedicated IPs can't be merged. MetalLB L2 only lets `ETP=Local` services share an IP if they have *identical pod selectors* (Traefik/KMS/Technitium/Frigate don't), and a shared `ETP=Local` IP announces from a single node — blackholing any service whose pods aren't on it. Traefik additionally can never leave a dedicated IP (QUIC needs the UDP listener on its own ETP=Local IP). Merging would cost client-IP preservation or HA, so the 5-IP layout is deliberate — not sprawl. Full analysis: `docs/plans/2026-06-03-lb-ip-hygiene-design.md`.
| IP | ETP | Services (ns/name → ports) |
|----|-----|----------------------------|
@ -278,6 +316,7 @@ MetalLB v0.15.3 allocates IPs from `10.0.20.200-10.0.20.220` (21 IPs) in **Layer
| **10.0.20.201** (dedicated) | Local | technitium/technitium-dns→53 UDP+TCP |
| **10.0.20.202** (dedicated)¹ | Local | kms/windows-kms→1688 |
**Mailserver does NOT use a LB IP** — inbound mail enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}` → NodePorts `30125-30128` (PROXY-v2; see "Mail Server" below). (Earlier revisions of this table wrongly listed mailserver on `.200` and KMS on `.200` — both corrected 2026-06-03.)
@ -292,6 +331,7 @@ These IPs are referenced by consumers that do **not** auto-follow when an IP mov
- **`.201` Technitium:** assigner `stacks/technitium/modules/technitium/main.tf` · DNS records `config.tfvars` (ns1/ns2/`viktorbarzin.lan`, dnscrypt forwarder) · `modules/create-template-vm/cloud_init.yaml` FallbackDNS · `scripts/provision-k8s-worker` · pfSense NAT 53 (**literal `10.0.20.201`**, not the `technitium_dns` alias — known inconsistency).
- **`.202` KMS:** assigner `stacks/kms/main.tf` · pfSense NAT 1688 → `k8s_kms_lb` · Cloudflare `vlmcs` public A → WAN → `.202`.
- **`.204` Frigate go2rtc:** assigner `stacks/frigate/main.tf` · go2rtc WebRTC ICE candidate in Frigate `config.yml` (on the `frigate-config` PVC, OOB — `webrtc.candidates: [10.0.20.204:8555]`) · HA-sofia Frigate integration `rtsp_url_template` (OOB — `rtsp://10.0.20.204:8554/{{ name }}`). **No DNS indirection**: go2rtc inserts the literal into the ICE host candidate and won't resolve a hostname (verified in go2rtc source), so the Service annotation is the single source of truth for this IP.
| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) |
@ -436,13 +476,30 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare.
### Why Fail-Open on CrowdSec Bouncer?
### Why CrowdSec Enforcement Is Out-of-Band (and Fails Open)
**Alternatives considered**:
1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic.
2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages.
CrowdSec used to enforce inline as a Traefik middleware (the
`crowdsec-bouncer-traefik-plugin`). On Traefik 3.7.5 the Yaegi plugin handler was
never invoked, so it enforced nothing; the plugin was removed and enforcement
moved off the request path entirely (full history in
`docs/architecture/security.md`). It now runs on two surfaces:
**Decision**: Availability > strict bot blocking. CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on.
- **Direct hosts** → `cs-firewall-bouncer` DaemonSet drops banned IPs in the host
nftables, in **both the `input` and `forward` hooks**. The `forward` hook is
the load-bearing one: with Traefik on a dedicated LB IP at
`externalTrafficPolicy=Local`, client packets are DNAT'd to the Traefik **pod**
and transit the node's `forward` chain (not `input`) — which is exactly why the
ingress must preserve the **real client IP** end-to-end (ETP=Local + PROXY-v2
for IPv6; see the Traefik LB IP and IPv6 ingress notes above). Without the real
client IP the firewall-bouncer (and the CF edge rule) would have nothing to
match on.
- **Proxied hosts** → a Cloudflare edge WAF rule (`ip.src in $crowdsec_ban`) fed
by the `crowdsec-cf-sync` CronJob.
Both **fail open**: if LAPI is unreachable, the firewall-bouncer simply stops
receiving new decisions (existing drops persist) and the CF sync skips a run —
**Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware.
@ -519,7 +577,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, and tripit/health/authentik/dawarich each 100/1000 (SPA or asset-heavy page loads bursting past the default from one client IP).
### Large Downloads or Uploads Truncate / Fail Partway
Some files were not shown because too many files have changed in this diff
Show more