Compare commits

..

388 commits

Author SHA1 Message Date
Viktor Barzin
114a7743ac backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Rollernet's free tier failed the validation gates before any DNS change
(200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces —
worse than no backup MX; free accounts being discontinued). Viktor
chose to stay free, so the backup MX becomes a Postfix store-and-forward
relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20),
draining via port 2526 through the existing pfSense HAProxy frontend
since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side
drain enablement moved to the layers that actually reject (unknown-
client-hostname, spoof protection, anvil limits, rspamd reject tier ->
external_relay + action cap, never backscatter), monitoring moved off
the nonexistent cluster->tailnet path to allowlisted public-IP scrapes,
bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level
iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only
postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:38:39 +00:00
Viktor Barzin
c1ffed17a9 backup-mx design: credentials to Vaultwarden, not Vault KV
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for the Rollernet account credentials to live in
Vaultwarden (the personal password manager) rather than HashiCorp
Vault. Item 'Rollernet (backup MX)' created; doc updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 12:55:43 +00:00
Viktor Barzin
c311a6a3c9 tasks: public ingress carve-out for PWA icons; adopt orphaned stack state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
macOS Safari's Add to Dock (and iOS/Android home-screen installs) fetch
the app icon and web manifest without any session cookies, so the
Authentik forward-auth 302 on tasks.viktorbarzin.me made Safari fall
back to a letter monogram instead of the real icon. Viktor asked for an
ingress carve-out so exactly these five static PWA assets are publicly
fetchable: /apple-touch-icon.png, /favicon.png, /pwa-192x192.png,
/pwa-512x512.png, /manifest.webmanifest.

A second ingress_factory instance (auth=none, dns_type=none, same host)
routes only those paths straight to the tasks service; the SPA shell and
/api stay behind Authentik exactly as before. The new carve-out is also
registered in the Authentik walling-off probe so a future regression
(anything 302-ing these paths to Authentik again) alarms, and the
service catalog entry records the exception.

stacks/tasks/imports.tf adopts the live tasks resources into Terraform
state first: the stack's first-ever apply (pipeline 477, 2026-07-03)
died mid-apply after creating the resources but before the pg state
write, leaving tasks.states empty — without the import blocks this (and
every future) tasks apply would create-fail with 'already exists'. Same
pattern as the monitoring alert-digest adoption.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 10:14:44 +00:00
Viktor Barzin
c91fa881e6 docs: design + ADR-0019 — free backup MX via Roller Network secondary MX
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants inbound email to survive homelab outages without loss;
delayed delivery is acceptable and the budget is zero, which rules out
the previously doc-flagged Dynu option. Design adopts Roller Network's
free Secondary MX (3-week store-and-forward queue, no forced filtering,
catch-all-compatible) with our-side postscreen/rspamd whitelisting,
five validation gates before any DNS change, and a live failover test.
Also records the dangling-MTA-STS finding (TXT published, policy host
absent) as a follow-up. Implementation starts only after Viktor reviews
these docs; account will use rollernet@viktorbarzin.me.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 09:59:16 +00:00
Viktor Barzin
1a63fee4e4 cloudflare: drop 6 dead legacy DNS names (zone at Free-plan 200-record cap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
authelia, immich-powertools, loki, mcaptcha, nfty, whiteboard removed from
cloudflare_proxied_names — all verified dead (no HTTP response, no cluster
route; authelia superseded by Authentik, nfty was a typo of ntfy, whiteboard
was excalidraw's old name). The cap blocked the new drone-logbook stack's
dronelog record (Cloudflare error 81045). Records already destroyed via
targeted local apply; Viktor approved the removal. Zone now at 195/200.
2026-07-04 09:31:32 +00:00
Viktor Barzin
7e49bf394d Merge remote-tracking branch 'forgejo/master' into wizard/drone-logbook
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-07-04 08:44:04 +00:00
Viktor Barzin
c52cdd1f68 Merge branch 'master' of https://forgejo.viktorbarzin.me/viktor/infra
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
2026-07-04 08:43:55 +00:00
Viktor Barzin
c868ef3332 nfs_directories: add drone-logbook sync-logs + backup dirs
Drop folder for the new drone-logbook stack's auto-import (SYNC_LOGS_PATH)
and its daily backup target. Both created on 192.168.1.127 (root:www-data,
2777 — root-squash-writable like vaultwarden-backup).
2026-07-04 08:43:38 +00:00
Viktor Barzin
50778d47d3 drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me
Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro
(his fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog).
Upstream ghcr image with Keel auto-upgrade, DuckDB data on an encrypted
proxmox-lvm PVC (GPS traces = sensitive), NFS /sync-logs drop folder imported
every 8h, daily backup CronJob to /srv/nfs/drone-logbook-backup (vaultwarden
pattern), Authentik-gated ingress, PROFILE_CREATION_PASS from Vault via ESO.
Design + plan in docs/plans/; service-catalog updated.
2026-07-04 08:42:53 +00:00
Viktor Barzin
d9717a53bf vault-token-renew runbook: document the self-heal behavior
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Drift guard section rewritten: admin-capable clobbers now self-heal at the
nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure;
manual re-mint is only the weak-clobber recovery now.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:20:44 +00:00
Viktor Barzin
4a7b6db806 vault-token-renew: self-heal the periodic token on admin-capable clobber
Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC
login the docs prescribe kept clobbering ~/.vault-token with a 7-day token,
and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry
loop, twice in June). On drift the renewer now re-mints the periodic token
with the clobbering token's own authority (Vault's 403 is the judge — no
policy guessing), sanity-checks it, replaces the file atomically, and
revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still
fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:20:00 +00:00
Viktor Barzin
8631709ca2 vault-token-renew: pure helpers for the self-heal revoke filter
vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic
decides which old token-devvm-wizard tokens a heal may revoke (never the
just-minted one, never foreign tokens, nothing when the keeper is unknown).
TDD red-green for the heal branch that lands next.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:19:09 +00:00
Viktor Barzin
029b65ff93 state(vault): update encrypted state 2026-07-03 20:14:54 +00:00
Viktor Barzin
c48ce73c80 state(vault): update encrypted state 2026-07-03 20:14:35 +00:00
Viktor Barzin
b03a295397 state(vault): update encrypted state 2026-07-03 20:14:18 +00:00
Viktor Barzin
a07a603b80 docs/plans: vault-token self-heal implementation plan
Task-by-task TDD plan for the approved self-heal design: pure-function
tests first, then the heal branch, runbook update, deploy + live clobber
simulation, landing and memory updates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:09:36 +00:00
Viktor Barzin
e2bfb20c84 docs/plans: vault-token self-heal design (devvm renewer)
Viktor asked to make 'vault login -method=oidc' work seamlessly on devvm:
today any OIDC login clobbers the permanent periodic token in
~/.vault-token, the drift guard only logs the drift, and his access
effectively expires weekly. Approved design: the nightly renewer re-mints
the periodic token from any admin-capable clobber (weak clobbers keep
failing loudly) and revokes stale periodic tokens after each heal.
Implementation follows on this branch.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:02:53 +00:00
Viktor Barzin
6698018ab6 service-catalog: add tasks row + tasks to the proxied-domains list
Some checks failed
ci/woodpecker/push/default Pipeline failed
Docs-with-change convention: the new tasks stack (Reminders-style PWA over
Nextcloud CalDAV) gets its catalog entry — what it is, its CNPG db + Vault
static role, the auth=required/X-authentik-username trust model with the
SEC-1 NetworkPolicy, and the ADR-0002 CI/CD path — and tasks joins the
Cloudflare proxied hostname list.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:42 +00:00
Viktor Barzin
02640df620 stacks/tasks: new stack for the tasks PWA (Authentik-gated, CNPG-backed)
Deploys the Reminders-style tasks app at tasks.viktorbarzin.me: namespace,
ExternalSecrets (fernet_key from secret/tasks; TASKS_DB_DSN composed from
the pg-tasks static-creds password the tripit way), single-replica
Deployment of ghcr.io/viktorbarzin/tasks:latest (image ignore_changes per
the fleet set-image pattern; Reloader restarts it on the 7-day DB password
rotation; /healthz probes on 8000; Europe/Sofia local tz; DEV_USER
deliberately absent — security invariant), Service on 8000, and an
ingress_factory host with auth=required + dns_type=proxied since Authentik
forward-auth is the app's only gate. NetworkPolicy tasks-ingress (SEC-1)
limits pod ingress to the traefik namespace plus monitoring on 8000 for
/metrics, so the trusted X-authentik-username header cannot be spoofed by
other pods.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:27 +00:00
Viktor Barzin
e0db1054e7 dbaas+vault: provision tasks CNPG database, role and rotating password
The new tasks PWA (Reminders-style front-end over Nextcloud CalDAV, per
tasks/docs/2026-07-03-tasks-pwa-design.md) needs its own Postgres database
for Connected Accounts and sync state. Follows the tripit/job_hunter
pattern exactly: idempotent null_resource creates role+db on the CNPG
primary with a placeholder password, and the Vault database engine static
role pg-tasks (added to the postgresql connection allowed_roles) rotates
the real password every 7 days, consumed by the tasks stack via a
vault-database ExternalSecret.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:13 +00:00
Viktor Barzin
9dcd3b0d5d Merge remote-tracking branch 'forgejo/master' into wizard/stem95su-cutover
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 15:27:04 +00:00
Viktor Barzin
5367d4a055 paperless-mail-ingest: rules process inline attachments (Apple Mail lesson)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's first real forward carried the invoice PDF with
Content-Disposition: inline (Apple Mail does this for real documents),
and the attachments-only rules consumed nothing — recorded
PROCESSED_WO_CONSUMPTION, which also blocks reprocessing. Flipped all 5
rules to attachment_type=2 (process inline) via the API and documented
the trade-off + the ProcessedMail unblock step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:25:44 +00:00
Viktor Barzin
21c6e7112e stem95su: retire the in-cluster serving stack — now a Valia site on Pages
Completes the ADR-0018 cutover. The stack is emptied to a tombstone so
CI destroys nginx, the NFS content volume, the ingress, the per-site
gdrive-sync CronJob and the namespace; serving + sync are owned by
stacks/valia-sites since the cutover commits. Catalog + runbook updated
to the migrated state (incl. the one-time 42.9→21.4MB video compression
Viktor approved).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:22:32 +00:00
Viktor Barzin
974c9976e3 valia-sites: take over stem95su DNS (manage_dns=true) — cutover half 2
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Creates the public proxied CNAME stem95su -> stem95su.pages.dev and
adds the internal split-horizon entry via the valia-sites-dns
ConfigMap (the sync's update pass repoints the existing internal
record). Completes the ADR-0018 cutover; the old in-cluster serving
stack is retired in a follow-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:21:18 +00:00
Viktor Barzin
5c8e9daabd stem95su: release the public CNAME (dns_type=none) for the Pages cutover
All checks were successful
ci/woodpecker/push/default Pipeline was successful
First half of the ADR-0018 stem95su cutover: the tunnel-target CNAME is
destroyed so stacks/valia-sites can create the Pages-target record for
the same name (Cloudflare allows one CNAME per name; the follow-up
commit flips manage_dns=true there). stem_video.mp4 was compressed to
21.4MB with Viktor's explicit OK, clearing the 25MB Pages cap; content
is already deployed on the stem95su Pages project. Brief public
NXDOMAIN window between the two applies is accepted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:21:18 +00:00
Viktor Barzin
c1ee6863b3 mailserver docs: troubleshooting entry for the postsrsd 100%-CPU spin
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Hit during the docs@ rollout: after a pod restart postsrsd came up
spinning without binding its TCP ports, so postfix cleanup tempfailed
every message with 451 queue file write error. Document the signature
and the supervisorctl-restart / pod-recreate fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:39:13 +00:00
Viktor Barzin
4ee4d1927d mailserver: guard alias filter against short lines with a lazy ternary
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
CI pipeline 469 failed with 'Invalid index' on the postfix_virtual alias
filter: terraform only short-circuits &&/|| from v1.6, and the older
terraform in the infra-ci image still evaluated split(" ", line)[1] for
the blank and comment lines that have been in extra/aliases.txt since the
plans@ block. The devvm's newer terraform short-circuits, which is why the
local apply of the same commit passed. A conditional expression is lazy on
every terraform version, so move the length guard into a ternary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:38:30 +00:00
Viktor Barzin
68b9858eff paperless-mail-ingest runbook: manual mail_fetcher must drop to the paperless user
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A root-run kubectl exec mail_fetcher downloads attachments root-owned into
the scratch dir and the celery consumer (uid 1000) fails with
PermissionError — found during the build E2E. Document s6-setuidgid usage
and the recovery step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:26:12 +00:00
Viktor Barzin
77fcb08e8e mailserver: add docs@ paperless ingest mailbox (sieve sender allowlist)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor asked to forward arbitrary emails with PDF attachments into
paperless-ngx, with the forwarding sender mapping 1:1 to the paperless
account that owns the document. paperless-ngx's built-in IMAP consumer
already does the sender->owner mapping, so the infra half is a dedicated
real mailbox docs@viktorbarzin.me: an explicit self-alias (the @domain
catch-all would otherwise divert it into the TripIt-swept spam@ mailbox,
whose sweeper LLM-parses and auto-replies to mail from linked senders)
plus a per-user Dovecot sieve that discards non-family senders at
delivery (chosen behaviour for unmatched senders: ignore and delete;
also keeps spam out of the guessable address). The mailbox credential
was added to Vault secret/platform.mailserver_accounts. Paperless-side
mail account + 5 per-sender rules are DB state, configured via the API
per the new runbook docs/runbooks/paperless-mail-ingest.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:06:19 +00:00
Viktor Barzin
f5187806f9 ADR-0017: replace ASCII trunk diagram with excalidraw VLAN-tagging diagram
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants the traffic-flow view as a colored excalidraw instead of
the ASCII block (which was the only thing rendering after the earlier
VLAN-tagging SVG commit failed to push — a locally-masked non-fast-
forward this session, not a merge clobber). Ships both the editable
.excalidraw scene and a hand-drawn-style SVG export embedded in the
Traffic-on-the-trunk section: two lanes showing where the 802.1Q tag
is added, carried (only P5<->vmbr0) and stripped, L2 membership drops
vs L3 firewall verdicts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 13:21:59 +00:00
Viktor Barzin
316cdb7441 docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Runbook covers add/update/retire (one map entry; internal DNS now
cleans up after itself), content rules for Valia's folders, and the
failure modes incl. both token re-mint paths. dns.md superset-rule
paragraph now describes the declarative ConfigMap reconcile instead of
hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row
notes its Pages cutover is parked on the 42.9MB stem_video.mp4
exceeding the 25MB Pages per-file cap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:46:24 +00:00
Viktor Barzin
4a3c8287c3 Merge remote-tracking branch 'forgejo/master' into wizard/valia-sites
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 12:43:28 +00:00
Viktor Barzin
e0991853e4 valia-sites: 25MB Pages-limit guard; cloudflared: drop removed{} (CI TF <1.7)
Two fixes from the first live runs. (1) The sync job now skips a whole
site when any file exceeds Cloudflare Pages' 25MB per-file cap, leaving
current serving untouched — stem95su's stem_board.html references a
42.9MB stem_video.mp4, which made every run fail; the guard turns that
into a loud skip so bridge keeps syncing. (2) The CI terraform is older
than 1.7 and rejects removed{} blocks anywhere (pipelines 461/464), so
the bridge record handoff was completed with a one-time manual
'tg state rm module.cloudflared.cloudflare_record.bridge_pages' from
the main checkout; the block is deleted and the module comment records
the manual step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:43:13 +00:00
Viktor Barzin
348f64d34d ADR-0017: add physical-cabling diagram (wires only)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for one diagram showing just the physical connections
between nodes, separate from the logical/VLAN topology: ISP->AX6000,
the in-wall apartment->garage run into P1, 4G router (cellular OOB),
UPS mgmt, the PoE cat6 to the camera, the LAN1 cable to eno1, dark
eno2 fallback + free eno3/4, iDRAC on shared-LOM, and the note that
everything else on the R730 is virtual. Referenced from the ADR next
to the logical SVG.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:40:29 +00:00
Viktor Barzin
126cf4c88e Merge origin/master into wizard/cctv-adr-trunk
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 12:32:00 +00:00
Viktor Barzin
695e020111 cloudflared: move bridge removed{} to stack root — removed blocks are root-module-only
Some checks failed
ci/woodpecker/push/default Pipeline failed
Pipeline 461 failed terraform init: the removed{} handoff block sat in
the stack-local module, but Terraform only allows removed blocks in the
root module. Same intent, correct position (from =
module.cloudflared.cloudflare_record.bridge_pages, destroy=false).
Without this the stale state entry would make the next cloudflared
apply destroy the record valia-sites now owns.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:31:53 +00:00
Viktor Barzin
5d16a18cf4 ADR-0017: document trunk traffic semantics + ASCII topology
While reviewing the single-switch design Viktor asked whether both the
home LAN and the camera VLAN 'go via pfSense which forwards upstream' -
a natural misreading a future reader would repeat. Added a section
spelling out the vmbr0 fork: untagged home LAN is L2-bridged past
pfSense (gateway stays the AX6000, rack outage does not affect it, OOB
via 4G survives), while tagged-30 can only land on the dCCTV interface,
making a pfSense bypass impossible by construction. Includes a compact
ASCII topology for terminal readers alongside the SVG.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:31:48 +00:00
Viktor Barzin
8b80b4cc41 valia-sites: registry stack for Valia's Pages sites + declarative internal DNS (ADR-0018)
Some checks failed
Build valia-sites-sync / build (push) Waiting to run
ci/woodpecker/push/default Pipeline failed
Valia keeps asking Viktor to host 1-page sites from her Drive folders;
this makes it one map entry. New stacks/valia-sites: per site a CF Pages
project + custom domain + proxied CNAME (bridge adopted via import{}),
a ConfigMap feed (valia-sites-dns) the technitium ingress-dns-sync
script now reconciles internal CNAMEs from (add/update/REMOVE — fixes
the add-only stale-record gotcha), and one shared 10-min CronJob that
mirrors each Content folder (rclone, drive.readonly, stem95su's guards)
and wrangler-deploys ONLY on manifest change (free-tier deploy cap).
Scoped CF Pages token + shared rclone conf in secret/valia-sites; the
Global API Key never enters a pod. cloudflared forgets bridge's record
via removed{} (no destroy). stem95su is in the map dns-parked
(manage_dns=false) until its cutover commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:28:06 +00:00
Viktor Barzin
5c42155b81 docs: Valia-sites domain language + ADR-0018 (off-infra Pages, in-cluster sync)
Grill session with Viktor: his mother Valia will keep asking for 1-page
site hosting, so the pattern is being made repeatable. Decisions: all
Valia sites serve off-infra on Cloudflare Pages (survive homelab
outages); one shared in-cluster CronJob mirrors her Drive folders every
10 min and redeploys on change; English subdomain names picked by
Viktor; failed-Job-only visibility; stem95su migrates onto the pattern.
CONTEXT.md gains Valia site / Content folder / Entry file; full
rationale and rejected options in ADR-0018.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:17:45 +00:00
Viktor Barzin
e1bd111562 rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to rename the 'мост' school static site to 'bridge'.
New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already
deployed and the custom domain attached; this renames the public CNAME
(TF resource most_pages -> bridge_pages, destroy+create swaps the
record) and the internal split-horizon static CNAME in the
ingress-dns-sync CronJob. The old 'most' Pages project and the stale
internal 'most' record are removed out-of-band after this applies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:52:30 +00:00
Viktor Barzin
7dd80b6c7c technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal split-horizon zone is authoritative for viktorbarzin.me,
so the new Cloudflare Pages site (most.viktorbarzin.me, added for
Viktor's 'мост' school static site) NXDOMAINed for every internal
client — LAN, VLANs and pods — while resolving fine externally.
Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev)
in the ingress-dns-sync CronJob next to the mail-auth records, and
document the off-infra-site case in dns.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:10:46 +00:00
Viktor Barzin
217a54be9d cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to host a static HTML site (the 'мост' school project,
ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages
with a custom domain, as a try-out of Pages hosting. The site content
is deployed off-infra via wrangler to the Pages project 'most'
(most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it.
The custom domain is already attached to the Pages project and is
waiting on this DNS record to validate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:06:33 +00:00
Viktor Barzin
be80ef23bb ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable
Viktor prefers not running two switches, so the TL-SG105PE takes over
all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV
segment moves onto a managed tagged trunk over the existing LAN1 cable:
pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same
MAC so vtnet3/dCCTV survived untouched). This is safe where the original
802.1Q rejection was not, because the managed switch is the only device
on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the
documented fallback. Old SG105E retires to cold spare; PE inherits
192.168.1.6. Glossary Segment term updated (all three segments are now
bridge-tags feeding untagged pfSense vNICs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 09:15:52 +00:00
Viktor Barzin
4082934bc1 Merge origin/master into wizard/cctv-two-switch
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 08:37:34 +00:00
Viktor Barzin
e11bd6e893 ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere
Viktor asked to verify free ports on the garage switch (192.168.1.6)
before finalizing. Logging into it showed it is NOT the TL-SG105PE from
the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use
(apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch
port-VLAN design written earlier today was based on conflating the two
devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2
uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched,
and no VLAN config exists anywhere. ADR, topology SVG and networking.md
updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 08:37:15 +00:00
Viktor Barzin
08fb65827c tripit: set PLACE_PHOTO_PROVIDER=wikipedia — real place preview photos
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for place photos on the tripit Trip board. The app-side
work (add-time photo fetch, board place cards) shipped in tripit
v0.106.0, but prod never set PLACE_PHOTO_PROVIDER, so the fake provider
would store placeholder PNGs for every hand-added place. Same class of
fake-default gap as PLACE_RESOLVER_MODE (set explicitly for the same
reason); the ADR-0035 rollout had left both the env flip and its
backfill cron undone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 21:57:21 +00:00
Viktor Barzin
b761701994 ADR-0017: add network topology diagram (SVG) next to the decision
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for a reviewable network visualization committed alongside
the CCTV-segment ADR. Hand-drawn SVG (renders on Forgejo, validated
palette): physical path camera -> TL-SG105PE port-VLANs -> eno2/vmbr2 ->
pfSense dCCTV, the firewall flows (Frigate RTSP, ha-sofia ISAPI/RTSP,
NTP-only egress, default deny), and the dashed camera-day steps (patch
cable, cat6 run, AX6000 static route).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:25:28 +00:00
Viktor Barzin
248e186dce CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor and emo are adding the first owned camera at the Sofia site (HiLook
IPC-T241H-C watching the garage / server rack). Viktor asked to finalize
emo's plan; the grilling session resolved emo's five open decisions and
replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated
physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24),
port-based VLAN split on the shared TL-SG105PE, camera default-deny with
NTP-only egress, Frigate + ha-sofia as the only consumers.

The PVE bridge, pfSense interface, Kea subnet and firewall rules were
applied live this session (hand-managed hosts, backed up). This commit
records the decision (ADR-0017), the glossary terms (Segment / CCTV
segment), the as-built architecture doc, and bumps Frigate's ADR-0016
VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:01:45 +00:00
3a5194c9d4 Merge pull request 'immich(frame-emo): show photos from the last 365 days (was 730)' (#18) from emo/frame-emo-1year into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reviewed-on: #18
2026-07-02 19:05:31 +00:00
9e253d409a immich(frame-emo): show photos from the last 365 days (was 730)
Emil asked his Sofia Portal Mini photo-frame to show only the past
year of photos rolling from today, instead of the last two years.
Changes ImagesFromDays 730 -> 365 in the frame-emo Settings.yml.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 19:05:31 +00:00
Viktor Barzin
4c532dbf97 devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned
12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G)
band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel
throttled every task in the cgroup indefinitely (memory.pressure full ~80%,
oom_kill never fired) - the t3 event loop starved, the accept queue rotted,
and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog
that stabilises between high and max never OOMs, so the throttle band is a
livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh
is now explicitly infinity on all three work cgroup definitions (t3-serve@
unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-
killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3
server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.
Applied live to the devvm the same day (daemon-reload + runtime set-property
on running cgroups, no session restarts). Post-mortem addendum + runbook
updated in the same commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 16:59:38 +00:00
Viktor Barzin
684ca4527c docs(CLAUDE.md): T4 now has a VRAM budget + watchdog (ADR-0016, dry-run); note llama-swap budget miscalibration
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Session wrap-up doc sync: the Immich note still claimed the shared T4 had no
VRAM isolation. Record the gpumem budget/watchdog shipped earlier today, that
the watchdog is observe-only, and that budgets need a retune (llama-swap's
real 16k-ctx resident is ~7GB, not 4.35) before arming.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 15:20:06 +00:00
Viktor Barzin
21afae85c9 dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor saw dawarich throwing 429s through Traefik and asked to loosen
the burst for it. The access log confirms the burst pattern: one page
load fires the whole fingerprinted-asset tail (SVG store badges,
favicons, webmanifest) from a single client IP and trips the default
10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429).
Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and
authentik: dedicated dawarich-rate-limit middleware (average 100 /
burst 1000) + skip_default_rate_limit on the dawarich ingress. Also
updates the networking.md middleware enumerations (adding the
previously undocumented tripit/health limiters alongside dawarich).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 15:03:08 +00:00
Viktor Barzin
91d0213d1a Merge remote-tracking branch 'forgejo/master' into wizard/excalidraw-export-rename
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build excalidraw-library / build (push) Has been cancelled
2026-07-02 14:29:34 +00:00
Viktor Barzin
8fc657f431 excalidraw: migrate image build to GHA -> private ghcr (ADR-0002)
The image was still built by hand and pushed to DockerHub (v1..v4),
predating the all-builds-off-infra doctrine; Viktor chose to move it
onto the standard pipeline while shipping the export/rename feature
rather than keep the manual flow.

Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml
(go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns
added to the Kyverno ghcr-credentials allowlist (package is PRIVATE),
deployment now pins ghcr :latest with pullPolicy Always + pull secret,
Keel force/match-tag/5m annotations seed the metadata (live values win
via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays
frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image
lists updated (also backfilled the missing k8s-portal rows in ci-cd.md).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:23 +00:00
Viktor Barzin
1cbc1e962b excalidraw: native export menu + drawing rename
Users couldn't see Excalidraw's built-in Save as / Export image options:
the app's custom toolbar was drawn exactly on top of the native hamburger
menu button, hiding it. Removed the overlay and integrated Back to
Library / Save now / Rename into the native menu, so the native export
formats (.excalidraw file, PNG, SVG, clipboard) are now reachable.
Viktor asked for exports to work via the native Excalidraw feature and
for drawings to be renameable by clicking their name.

Rename: new PATCH /api/drawings/{id} endpoint (server-side name
sanitization, 409 on conflict) + click-to-rename title pill in the
editor (updates URL in place) + Rename button/modal in the dashboard.
Existing GET/PUT/DELETE semantics unchanged for API compatibility
(emo's upload pipeline). Added main_test.go (httptest) covering rename
+ existing handler behavior; dashboard rows now DOM-built (XSS-safe).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:10 +00:00
Viktor Barzin
d94f267c93 immich: upgrade v2.7.5 → v3.0.0 (postgres → vectorchord 0.4.3, frames → immich_v3 tag)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes,
migration guide and release discussion #29439 reviewed — no config-breaking
changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD
vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over
HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement).

The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream
v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2';
Immich upgrades the extension itself at startup). Both photo frames switch
to ImmichFrame's immich_v3 compatibility tag because every versioned
ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API
responses; repin to a versioned tag once upstream ships stable v3 support.

Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so
this commit is the source-of-truth record; the live rollout happens via
kubectl set image in the same session. Pre-upgrade pg_dumpall taken
(job postgresql-backup-pre-v3).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:18:22 +00:00
Viktor Barzin
6f03ccd1aa excalidraw: grant emo-browser SA port-forward for drawing uploads
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to fix emo's permission so his Claude can upload to the
Excalidraw service. emo's recent sessions show the documented upload
recipe (kubectl port-forward svc/draw + X-Authentik-Username header,
from his ~/.claude/CLAUDE.md) failing with:

  pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser
  in namespace excalidraw

because his default kubeconfig is the read-only emo-browser SA (its
port-forward grant covers only chrome-service) and his old admin
kubeconfig at /home/emo/code/config expired and was removed.

Add a namespace-scoped Role (pods/portforward create) + RoleBinding for
that SA in the excalidraw namespace, mirroring the 2026-06-28
chrome-service grant. Trade-off (any-user drawings via the trusted
username header) documented in the file and accepted.

Also record the grant in docs/architecture/chrome-service.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:08:28 +00:00
Viktor Barzin
88c86e2109 ci: Slack-notify failed pipeline runs only
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor doesn't want a Slack message for every CI run — only failures.
The infra apply pipeline posted a status line to #general on every push,
and the renew-tls / postmortem-todos / registry-config-sync /
pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine
messages a week). Now: the apply pipeline's success post is gone
(notify-failure already covers failures), all cron notifies are
status:[failure] with explicit FAILED texts, and drift-detection is
silent when all stacks are clean (still posts drift findings and errors,
and gains a hard-failure catch step it previously lacked). Kept:
notify-nonadmin-push (org audit feed) and the actionable provision-user
post. Per-app deploy template in ci-cd.md updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:27:43 +00:00
Viktor Barzin
a64d2ba2b9 upgrades: fix hourly gotenberg error + cap update notifications at weekly
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor was getting upgrade-error Slack messages every hour and wants
update notifications at most weekly. Root cause of the errors: Keel kept
trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's
require-trusted-registries denied it — gotenberg/* (and apache/*, which
tika will hit next) were never allowlisted, and Keel's Slack notifier at
info level re-posted the identical failure to #general on every hourly
poll since Jun 28.

Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly;
disable Keel's direct Slack notifier and replace failure visibility with
a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification
plus the daily digest, never an hourly drip); remove diun's Slack
notifier whose default message @channel-pinged #image-updates for every
new upstream tag every 6h (the n8n upgrade-agent webhook feed is
untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC).
Paperless-ngx itself stays paused (keel policy=never, user-managed) while
the ingest runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:16:50 +00:00
Viktor Barzin
5d5d9752cb guard: ignore + git-crypt kubeconfig files so they can't leak to the public mirror
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A GitGuardian audit of the infra repo showed the recent alerts were test
fixtures (false positives), but surfaced a real historical leak: a
cluster-admin kubeconfig was once committed as stacks/f1-stream/.../.config
(now expired, reachable only via a GitHub PR ref). The .gitignore already had
a `config` rule for kubeconfigs but missed the dotfile form `.config` — which
is exactly how that file slipped onto the public mirror.

Close the gap in two layers:
- .gitignore: also ignore `.config`, `kubeconfig`, `*.kubeconfig`,
  `admin.conf`, `.kube/` so they're never staged by accident.
- .gitattributes: route `.config`, `kubeconfig`, `*.kubeconfig`, `admin.conf`
  through git-crypt so a force-add or rename still lands as ciphertext (never
  plaintext) on the public GitHub mirror.

No tracked files match these names today, so there is zero retroactive impact
— purely forward-looking prevention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:14:58 +00:00
Viktor Barzin
dab307f9f8 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-02 05:39:15 +00:00
Viktor Barzin
f1e81772d5 broker-sync: repoint image to ghcr (was frozen on pre-migration DockerHub)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The nightly ibkr sync failed with 'No such command ibkr': every broker-sync
CronJob still pulled viktorbarzin/broker-sync:latest from DockerHub, which
nothing has pushed to since the ADR-0002 move to GHA->ghcr on 2026-06-13 —
the jobs were silently running a frozen pre-ibkr build. The migration had
allowlisted only the wealthfolio namespace for the private
ghcr.io/viktorbarzin/wealthfolio-sync image, so broker-sync also lacked
pull credentials. Repoint the image, add ghcr-credentials imagePullSecrets
to all eight CronJobs, and allowlist the broker-sync namespace (wealthfolio
stays — its own monthly sync pulls the same image). Related: code-9ko8.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 05:31:00 +00:00
Viktor Barzin
ac41e7c017 nvidia: run advertise-gpumem provisioner under bash (dash rejects pipefail)
First apply of ADR-0016 failed: terraform local-exec defaults to /bin/sh,
which on Ubuntu is dash — 'set -euo pipefail' exits 2 before running kubectl.
Pin the interpreter to bash. Everything else in the gpumem apply succeeded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 05:21:47 +00:00
Viktor Barzin
968b2b9c64 Merge remote-tracking branch 'origin/master' into wizard/gpu-vram-budget 2026-07-02 05:18:34 +00:00
Viktor Barzin
a12b09af04 broker-sync: pin data-mounting CronJobs to k8s-node4 (stop nightly RWO wedge)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All broker-sync CronJobs share one RWO proxmox-lvm volume. With free
scheduling the nightly 02:00-04:15 runs land on different nodes, forcing
a detach/attach cycle whose QMP hotplug intermittently ghost-attaches on
disk-heavy VMs — every job then sits in ContainerCreating for hours
(happened 2026-06-30, 07-01 and again 07-02; fires
PodsStuckContainerCreating and skips the day's trade syncs). Pinning all
seven volume-mounting jobs to k8s-node4 (fewest CSI disks, 11) makes the
volume attach once and stay put — no hotplug dance, no wedge.
version_probe mounts nothing and stays unpinned. Durable fix for the
recurrence tracked in beads code-9ko8.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 05:16:38 +00:00
Viktor Barzin
3c85af2dc2 fire-countdown dashboard: SQL guards + tax regime + honesty fixes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
From the flaw-hunt workflow (all verified):
- Projected-FIRE-date panels (solo/household/family) now guard savings £/yr:
  0 / empty / negative all render "Set savings £/yr" instead of a blank tile,
  a SQL error, or a nonsensical past date ("Jan 1849"). Verified across cases.
- New "Tax regime" panel surfaces the per-country jurisdiction — 14/22 countries
  fall back to the neutral 'nomad' 1% assumption, which was previously invisible.
- Intro no longer hard-codes "£139k pension" (contradicted the £328k tranche
  panel); pension value is now only shown data-bound in the tranche panel.
- Intro adds caveats: Anca's spend is an estimate (pending live re-pull), and
  non-modelled countries use the nomad tax fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 22:44:17 +00:00
Viktor Barzin
339f5d89b9 onlyoffice: decommission (stack destroyed, dir removed)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The document server had been deliberately scaled to 0/0 for 184 days, but
its ingress kept the uptime-kuma monitors alive, so 'onlyoffice down'
showed up in every daily alert digest. Viktor approved tearing it down.
terragrunt destroy ran clean (11 resources) before this commit; the kuma
monitors auto-prune with the ingress. Also drops the onlyoffice/* image
prefix from the kyverno trusted-registries allowlist, the service-catalog
rows, and updates the nextcloud collabora comment. Document data (if any)
remains on the PVE NFS share.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-01 22:35:22 +00:00
Viktor Barzin
3c476dab32 postiz+portal: remove broken alert sources (stale backup CronJob, bogus scrape annotations)
Viktor is getting daily Slack alert noise; these two were the recurring
generators. The postiz-postgres-backup CronJob still dumped from the old
in-namespace postiz-postgresql service that was removed in the CNPG
migration (2026-06-28) — it failed every night at 03:00 and re-fired
BackupCronJobFailed each day. The postiz DB now lives on the shared CNPG
cluster and is already covered by the dbaas per-db dumps, so the CronJob
(and its NFS backup volume) is redundant and removed rather than repaired.

portal-stt/portal-tts advertised prometheus.io scrape annotations that
never worked: the deployed Speaches build 404s /metrics, and openai-edge-tts
has no metrics at all (its annotation pointed at a JSON endpoint, which
fails exposition parsing regardless). Both produced a permanently firing
ScrapeTargetDown. Annotations removed until the apps actually serve metrics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-01 22:35:21 +00:00
Viktor Barzin
5a312563c6 monitoring/wealth: dash the in-progress year on the hourly-rate panel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The current, still-accruing calendar year read misleadingly high (e.g. 2026
at 5 months showed £149/h gross, above all of 2025) because the full-year
bonus - paid every March - plus front-loaded quarterly RSU vests get divided
by only the months worked so far. It settles lower as the year completes.

Split each line into a solid series (complete years) and a dashed series
(the latest, still-accruing year), so the provisional point is visually
flagged. The split auto-detects the in-progress year (latest year with
< 12 months of payslips), so it needs no per-year maintenance. Panel
description now explains the caveat.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:45:51 +00:00
Viktor Barzin
28984dda9a monitoring/wealth: add per-year effective hourly-rate panel (gross vs net)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted to see, on the wealth dashboard, the hourly wage he earned
each year - both gross and net - with year on the X axis.

New timeseries (line) panel "Effective hourly rate - gross vs net":
- hourly = annual pay / hours worked; hours = contractual 40h/week
  (2,080h per full year, confirmed from the Facebook/Meta UK offer letter:
  Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually
  worked so partial years (2019, 2020, 2026) read correctly.
- Gross = gross_pay incl. notional RSU vest; Net = take-home.
- timeFrom 10y so all years show under the dashboard's default 180d range.

Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload
of doc 33) was removed separately, so 2023 is no longer double-counted; this
also corrects the existing net-pay panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:28:46 +00:00
Viktor Barzin
82371d1ef8 dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
MySQL device-write investigation (code-oflt): after the nextcloud webcal
throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients),
MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s,
~55 pages/s) is the doublewrite buffer writing every flushed page twice.
Redo is negligible (0.01 MB/s), no temp-table spilling.

Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the
cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer
(~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps
torn-page DETECTION metadata — a crash-torn page is flagged on recovery
(restore from the daily mysqldump) rather than silently corrupt. Chosen
over full OFF: same write saving, keeps detection, and OFF requires a
shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable
risk given the PERC BBU cache + UPS (in-flight writes complete on power
loss) + daily per-db backups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:47:09 +00:00
Viktor Barzin
fbae573664 state(dbaas): update encrypted state 2026-06-30 08:46:45 +00:00
Viktor Barzin
71501be408 nodes: journald -> volatile (RAM) to cut sdc write-IOPS
Some checks failed
ci/woodpecker/push/default Pipeline failed
Node "container churn" investigation (code-oflt): container logs (~30 KB/s)
and overlayfs (~17 KB/s) are negligible; the node OS-disk churn is ext4
journal (jbd2) metadata writes driven mostly by journald's continuous
appends. node4 + node5 had drifted to uncapped persistent journald (4 GB
each, ~100 KB/s); master/node1-3 were correctly capped at 500M.

Node + pod journals already ship to Loki (alloy loki.source.journal), so
on-disk journald is pure write-IOPS overhead on the IOPS-bound sdc. Switch
journald to Storage=volatile (RAM, RuntimeMaxUse=200M) fleet-wide:
- cloud_init.yaml: drop-in 90-oflt-volatile.conf for new nodes (replaces
  the old persistent seds).
- running nodes (master + node1-5): pushed the same drop-in via qm guest
  exec + journald restart + cleared /var/log/journal.

Verified node5: OS-disk writers jbd2/sda1-8 931->46 KB/s, systemd-journal
gone (~94% drop); ~4 GB freed each on node4/node5. Logs stay queryable in
Loki. Trade-off: a hard crash loses the last unshipped journal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:15:38 +00:00
Viktor Barzin
74819d4061 feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its
~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's
onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking
recruiter-responder for ~5h. Viktor asked for memory protection so we don't
overallocate GPU memory, and chose to do it at the scheduling level (no
device-plugin swap) after weighing HAMi and MPS.

Make the scheduler VRAM-aware and add runtime teeth, all repo-native,
time-slicing untouched:
- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a
  reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000,
  llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300
  <= advertised) so the scheduler refuses to co-schedule past the card
  (overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when
  actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to
  false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown --
  the 2026-06-02 post-mortem's never-built free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing
  glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment.

HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node
shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:57:40 +00:00
Viktor Barzin
1afe41880e docs: MySQL buffer-pool/limit + nextcloud webcal throttle; VCT drift fixed
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reflect the code-oflt MySQL write-reduction work (commit 82c9e69b + the
nextcloud webcal app-data throttle):
- MySQL row: buffer pool 1->2Gi, mem limit 4->6Gi, and the nextcloud
  webcal calendar churn that was ~60% of MySQL's writes (now throttled
  in oc_calendarsubscriptions.refreshrate — app-data, can regress).
- CNPG apply-gotcha note: the mysql_standalone VCT-annotation drift no
  longer needs -target dodging (now ignore_changes'd on the STS VCT).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:56:04 +00:00
Viktor Barzin
82c9e69b77 dbaas/mysql: 2Gi InnoDB buffer pool + 6Gi limit + ignore VCT drift
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt).
Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live
attribution found ~60% of its writes were nextcloud webcal calendar churn
(throttled separately at the app layer); this addresses write amplification
on the remainder:

- innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi
  hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free
  page -> constant flush-to-make-room write IOPS).
- container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already
  at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the
  headroom. One-time MySQL pod restart to apply.
- ignore_changes on the StatefulSet volume_claim_template: the VCT is
  immutable post-creation and pvc-autoresizer rewrites its annotations on
  the live object, so TF's desired VCT could never apply and errored every
  broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the
  long-standing need to -target around it.

Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy,
24 DBs reachable, restart clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:55:18 +00:00
Viktor Barzin
29bf275cef state(dbaas): update encrypted state 2026-06-30 07:53:48 +00:00
Viktor Barzin
308a174ad6 docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP
(10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc
still listed only four IPs in use / three dedicated. Add the .204 row to
the allocation table, bump the counts (five in use, four dedicated, 5-IP
layout), and add a LB-IP renumber-checklist entry for the out-of-band
consumers (the go2rtc WebRTC candidate on the frigate config PVC and the
HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE
candidates, so the Service annotation is the single source of truth.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:42:27 +00:00
469cdd7507 frigate: expose go2rtc on a dedicated MetalLB LB IP (RTSP 8554 + WebRTC 8555)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
HA live video from the cluster Frigate hangs/fails because the only path
to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203),
which cannot carry RTSP or WebRTC. The container already listens on
8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was
never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB
IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 +
WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN
endpoint for native HLS/WebRTC live (parity with the Hikvision NVR).
Companion non-Terraform steps are in the PR body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:15:22 +00:00
Viktor Barzin
9ea9cae073 rightsize: reconcile batch-2/3 stacks blocked by killed #427 (job-hunter, wealthfolio, f1-stream)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Memory limits were committed (batch 2/3) but pipeline #427 was killed mid-apply and the local homelab tf apply hit a stale backend-init; this comment-only diff re-triggers a clean CI apply for the three stacks so live matches master (job-hunter 768Mi, wealthfolio 512Mi, f1-stream 384Mi).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:59:41 +00:00
Viktor Barzin
7cc9cde5b1 external-secrets: enable ESO Vault token cache to cut sdc write churn
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Add --enable-vault-token-cache to the ESO controller (a graduated,
non-experimental flag in chart 2.6.0). Until now ESO authenticated to
Vault with login -> lookup-self -> revoke-self on *every* secret fetch.
Across 92 ExternalSecrets refreshing every 15m that measured ~0.22
logins/s + ~0.22 revoke-self/s on the active Vault member, and each
cycle is a token create+revoke (plus its lease) written to the Raft log
on all three members. Those fsync-heavy writes land on the contended
PVE RAID1 7200rpm HDD (sdc) -- one of the write sources behind the
recurring control-plane flaps (code-oflt write-reduction).

The eso kubernetes-auth role already issues a 240h periodic, unlimited-
use token, so the churn was pure waste: ESO discarded a perfectly good
token after a single use. With token caching ESO mints one token and
reuses/renews it, collapsing logins from ~13/min to a handful per token
lifetime. Verified live: vault cache initialized, 112/113 ExternalSecrets
Ready (the one failure, instagram-poster, is pre-existing data drift
unrelated to auth), logins dropped to ~0 after warm-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:32:37 +00:00
Viktor Barzin
5e384ed762 state(external-secrets): update encrypted state 2026-06-29 15:32:37 +00:00
Viktor Barzin
bc626a2d89 rightsize: raise OOM-tight memory limits (batch 3/N — spike protection)
Some checks failed
ci/woodpecker/push/default Pipeline failed
shlink 512->704Mi, linkwarden 1Gi->1280Mi, chrome-service 2Gi->2624Mi, forgejo 4Gi->5Gi, f1-stream 256->384Mi. All were request==limit with 30d peak at 91-100% of the ceiling — a spike would OOM-kill them. Raising the limit (now Burstable, request<limit) gives real burst headroom. This is the genuine 'don't OOM on occasional spike' fix. Small add (~2.2Gi limits) vs the ~20Gi of fat removed in batches 1-2, so net overcommit keeps dropping.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:28:11 +00:00
Viktor Barzin
418d1efb4b rightsize: trim over-provisioned memory (batch 2/N)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
claude-agent-service 12Gi->3Gi (peak 585Mi — the single biggest fat, ~9Gi of limit-overcommit removed), job-hunter 1280->768Mi (kept chromium headroom; 30d peak 118Mi), fire-planner 1024->320Mi, wealthfolio 1Gi->512Mi (kept history-growth headroom). Burstable, limits kept >= generous peak headroom, never below peak. ~10.7Gi of limit overcommit removed. paperless-ai intentionally LEFT at 4Gi (documented in-process RAG model load).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:27:17 +00:00
Viktor Barzin
a3f2c2947a docs: refresh CNPG tuning note (archive_timeout=0, commit_delay, zstd) + apply gotcha
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reflects the write-reduction params applied in c3553731, and documents the
null_resource trigger-bump + targeted-apply gotcha so the next agent doesn't
hit the inert-change / mysql-VCT-drift traps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:17:38 +00:00
Viktor Barzin
ec04963bfe state(dbaas): update encrypted state
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
2026-06-29 15:16:50 +00:00
Viktor Barzin
c3553731c7 dbaas: CNPG write-reduction — archive_timeout=0, commit_delay, wal_compression=zstd
Part of code-oflt (cut sdc write IOPS before the SSD move; analysis #6922).
- archive_timeout 300->0: CNPG forces archive_mode=on but .spec.backup is empty
  (no ObjectStore), so a 16MB WAL segment switch every 5min shipped NOWHERE =
  ~4.6 GB/day of pure-waste WAL on the contended sdc. archive_mode stays CNPG-on
  (reserved); 0 just stops the timed switch. Daily pg_dump cron unchanged.
- commit_delay 0->2500us: group-commit coalesces concurrent fsyncs. SAFE for
  every DB incl financial -- data still fsynced before COMMIT acks, only <=2.5ms
  added latency under concurrency.
- wal_compression pglz->zstd: ~30-50% smaller full-page images.
All sighup-reloadable. Applied via targeted apply of
module.dbaas.null_resource.pg_cluster (trigger bumped) to avoid the pre-existing
mysql VCT drift that breaks broad dbaas applies.

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:16:38 +00:00
Viktor Barzin
5d059786a1 rightsize: trim over-provisioned memory limits+requests (batch 1/N)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
claude-breakglass 4Gi->512Mi, stirling-pdf 1536->512Mi, insta2spotify 2Gi->256Mi, recruiter-responder 768->256Mi. These idle/utility services had memory LIMITS sitting 4-15x above their 30d peak, inflating cluster limit-overcommit to 142% across the 5 post-node6 nodes. Burstable (request<limit), limits capped at ~peak x1.5 (never below peak), so no OOM risk (verified zero OOMKills cluster-wide in 30d). Reduces phantom limit overcommit + frees scheduler requests.

Follows the 3-reviewer adversarial review: raising limits on an already-overcommitted cluster worsens correlated node-OOM; the real fix is trimming the fat. Limits only lowered where peak is far below; tuned/DB/GPU limits untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 14:46:58 +00:00
Viktor Barzin
4473b469e3 lvm-pvc-snapshot: cut retention 7->3 days (reduce sdc thin-pool CoW IOPS + free ~1TB)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Part of the sdc IOPS-reduction work (code-oflt). 462 daily thin snapshots
(66 PVCs x 7d) drive ~10-34 w/s of thin-pool metadata (tmeta) CoW writes on
the contended sdc spindle and pin ~2TB in the 70%-full pool. Halving to 3
days roughly halves both. Instant-restore window shrinks 7->3d; daily-backup
still keeps 4 weeks of file-level PVC history, so DR coverage is unchanged.

Deployed to the PVE host via scp (these host scripts are scp-deployed, not
TF-managed). Doc updated in .claude/CLAUDE.md.

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:59:16 +00:00
Viktor Barzin
256122ff5b monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).

Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:34:01 +00:00
Viktor Barzin
6c3619c9c6 state(dbaas): update encrypted state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-29 12:26:21 +00:00
Viktor Barzin
682b982c78 state(dbaas): update encrypted state 2026-06-29 12:25:53 +00:00
Viktor Barzin
c0e0911afa dbaas: bump pg_cluster trigger so the checkpoint/WAL params actually apply
a2c8f906 added checkpoint_timeout=15min + max/min_wal_size to the CNPG
Cluster YAML, but the cluster is applied via null_resource.pg_cluster +
local-exec kubectl apply, which only re-runs when its `triggers` change.
The YAML edit didn't bump a trigger, so the change was inert and never
applied (incl. via CI). Bump the pg_params trigger so the kubectl apply
re-runs and CNPG hot-reloads the new params (reloadable, no restart).

Landing it via a targeted apply (-target=null_resource.pg_cluster) to avoid
3 pre-existing unrelated drifts in this stack -- notably a mysql_standalone
volumeClaimTemplate annotation diff the apiserver rejects as immutable,
which is what fails broad dbaas applies (and silently blocked a2c8f906).

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:25:37 +00:00
Viktor Barzin
bebe8fbd74 workflows: add read-only memory-overcommit + node-removal capacity analysis
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reusable Workflow script that audits whether the cluster is memory-overcommitted and whether a single k8s worker can be removed to return RAM to the PVE host without sacrificing N-1 failover. Read-only throughout: gathers PVE host memory (qm config / free / KSM via SSH), k8s per-node capacity + cluster 30d peak working set, and per-workload right-sizing, then models N-1 two ways (physical actual-usage and scheduling-by-request) and adversarially verifies the conclusion with 3 skeptics.

Sizes requests (scheduling reservation) and limits (OOM ceiling) as SEPARATE knobs — an earlier ad-hoc pass conflated them by sizing requests to 30d peak, which manufactured a false N-1 shortfall. Invoke via Workflow {scriptPath}, or by name when cwd is the infra repo.

Requested by Viktor: identify memory overcommit and whether deployment requests can be trimmed to free PVE host RAM by removing a node, without sacrificing service reliability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:06:17 +00:00
Viktor Barzin
a2c8f906ec dbaas: stretch CNPG checkpoint timer 5->15min + raise WAL size (cut sdc write IOPS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc
IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints
fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each
triggering a full-page-write burst + flush onto the contended 7200rpm sdc
spindle -- a top write-IOPS source after etcd.

Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so
checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled
rather than churned. All three are sighup-reloadable -> CNPG applies them
without a restart or failover. checkpoint_completion_target stays 0.9 so
each checkpoint's IO is still smeared across the interval. Bounded
recovery-time tradeoff (more WAL to replay on crash), acceptable for the
write relief. wal_compression left at pglz ('on') pending image
zstd-support verification.

Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed
shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB /
2560MB / 3Gi).

Refs: code-oflt (etcd/sdc IO isolation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 11:41:09 +00:00
Viktor Barzin
3398873a16 k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held
gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt
buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security)
uptake now lags up to 7 days instead of <=1.

- var.schedule:        0 23 * * *  ->  0 23 * * 0   (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * *   ->  7 6 * * 1    (report: Monday 06:07 UTC, ~7h
  after the Sunday check, so nightly-report.py's ~25h staleness threshold stays
  valid AND still flags a missed weekly run; no STALE_SECONDS change needed)

The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename
= churn). Cadence wording updated across main.tf comments, nightly-report.py
docstring, and the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 06:22:20 +00:00
Viktor Barzin
e43e64c666 kyverno: disable reports-controller to stop etcd ephemeralreport load
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor flagged not wanting to wear the single non-RAID SSD with useless etcd
writes if etcd moves there. Investigation found the avoidable load is kyverno
reporting: the 2026-06-12 etcd-load-reduction disabled the report *features*
but left the reports-controller running (default --enableReporting +
--validatingAdmissionPolicyReports=true), so the 2026-06-21 kyverno upgrade
left a one-time pile of ~10.5k cluster/namespaced ephemeralreports (~114MB in
etcd) that nothing reaps (aggregation off). Listing that range starves etcd's
fdatasync enough to flap the apiserver (observed live 2026-06-28).

Disable the reports-controller outright (reportsController.enabled=false),
completing the 2026-06-12 intent. Reports are not consumed (violations surface
via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are
independent of it. The ~10.5k stale reports already in etcd are cleared
separately (throttled, out-of-band) since bulk-deleting them is itself
etcd-heavy.

Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 05:35:36 +00:00
Viktor Barzin
cf42042cba monitoring: re-trigger apply to persist state after CI cancel-race
All checks were successful
ci/woodpecker/push/default Pipeline was successful
No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`.
The pfSense egress-monitoring apply (commit 7fe2d978, CI pipeline #414) was
cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources
applied (probes green, rules loaded) but the Terraform state write and the helm
release finalize were lost, leaving the prometheus release stuck in
pending-upgrade (manually unstuck). This commit re-applies the unchanged
monitoring stack so state matches live, with zero resource changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:58:49 +00:00
Viktor Barzin
f92075b7c5 fire-planner: solve FIRE targets to age 100 (horizon 60→72)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor plans to live to 100, so the portfolio must last that long. The
fire-targets CronJob was solving a 60-year horizon (≈ to age 88); set it to 72
(retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years
to fund). A one-off in-cluster job re-solves the existing rows at the new horizon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:49:20 +00:00
Viktor Barzin
7fe2d9780e monitoring: add pfSense WAN/egress alerting + probes
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00
Viktor Barzin
279b88d2bc docs: add MetalLB L2Status-immutable PG-VIP-flap post-mortem (code-aoxk)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status
CR (immutable status.node) flapped the PG load-balancer VIP and silently
broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error
"Cannot read PG creds" masked the real cause for ~25 days). Written when
the incident closed (beads code-aoxk, 2026-05-26) but never committed;
landing it so the RCA + stuck-CR cleanup procedure live in the repo.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:25:10 +00:00
Viktor Barzin
6f042ee239 fix(fire-planner): grafana fire-planner-pg datasource survives pw rotation
Some checks failed
ci/woodpecker/push/default Pipeline failed
The fire-planner-pg Grafana datasource baked the rotating fire_planner DB
password into its provisioning ConfigMap at terraform plan-time, so on every
7-day static-role rotation the password went stale and ALL fire-planner-pg
dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown)
silently failed with "password authentication failed for user fire_planner"
until the next stack apply.

Switch to the same live-env pattern wealth-pg / payslips-pg already use:
- new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader
  match) mirrors the rotating Vault static-creds/pg-fire-planner password
- datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD}
- Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on
  rotation so the provisioned datasource never goes stale

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:14:42 +00:00
Viktor Barzin
35c0057d83 chrome-service: raise noVNC sidecar memory limit 96Mi->256Mi (fix OOMKill)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly
whenever someone actively opened chrome.viktorbarzin.me — the view connected
then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify
framebuffer/encode buffers spike past the 96Mi cap when streaming the
1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi
(Burstable, aux tier). Already applied live via a transient kubectl patch
(Recreate rollout, verified 0 restarts since); this lands the durable state
so the next apply / daily drift-detection doesn't revert it to 96Mi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:39:17 +00:00
Viktor Barzin
2e50c1235c chrome-service: grant emo shared browser access (noVNC + homelab browser CLI)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to give emo access to the cluster's headed Chrome so he can fill
in forms and get past anti-bot / captcha pages. emo was deliberately locked
out of chrome-service (noVNC Authentik allowlist was Viktor-only + his
power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE
his existing browser rather than stand up an isolated per-user instance,
accepting that emo can therefore reach Viktor's warmed logged-in sessions
(CDP has no per-context auth, so the single shared persistent profile is
reachable by anyone who can drive the browser). emo's CLI use is hands-off
(his agent can run it unattended).

- authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED
  so the admin-services-restriction policy admits him to chrome.viktorbarzin.me
  (noVNC). Reverses the prior Viktor-only lock; comment updated to record why.
- chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token
  (dashboard-sa.tf pattern), a chrome-service-portforward Role granting
  pods/portforward, and a cluster read-only binding (oidc-power-user-readonly)
  so the SA can resolve the Service and emo's normal read access doesn't regress.
- t3-provision-users.sh: install_browser_kubeconfig installs a dual-context
  kubeconfig for any user with a <user>-browser SA — SA token as the default
  context (non-interactive, works headless), personal OIDC retained as the
  oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the
  headless agent session that homelab browser needs.
- docs/architecture/chrome-service.md: document the shared-browser multi-user
  access model, the session-exposure trade-off, and how to grant/revoke a user.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:20:07 +00:00
Viktor Barzin
50077b43d4 paperless-ngx: drop TASK_WORKERS 6->4 (6 OOMKilled the pod mid-import)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
6 OCR workers crept past the 8Gi per-container memory cap over ~6h and
OOMKilled paperless at 15:00 during the Emo bulk import. The import
auto-recovered (the consume dir lives on the PVC, so a restart re-scans
and reprocesses — nothing lost), but it left the queue inflated with
re-queued duplicates and spiked etcd on each restart.

The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth
raising for one namespace. 4 workers fit with headroom (4 measured
~1.3Gi). Matches the value applied live via `kubectl set env` during
incident response; this removes the drift so the next apply keeps it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:06:46 +00:00
Viktor Barzin
8236ae309d postiz: reconcile HCL to live (adopt unmerged stack config), keep parked
All checks were successful
ci/woodpecker/push/default Pipeline was successful
postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik
OIDC + static-DB password) came from the never-merged branch
`wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt
apply` would have DESTROYED the stack. This lands that postiz config to
master so HCL == state == live (CI green; destroy-landmine gone).

Kept PARKED (postiz + temporal replicas = 0): IG-via-postiz is Meta-
blocked (it hardcodes retired Instagram scopes → OAuth "Invalid Scopes"),
which is why it was parked; IG runs via the instagram-poster service. To
revive later: flip postiz `replicaCount` + temporal `replicas` back to 1
and re-check image pins.

Notes captured in this reconcile:
- ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the
  live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24";
  caught + rolled back during this work).
- The 4 Authentik resources (app/provider/group/binding) were re-imported
  into state (adopted, not recreated — no duplicate AK objects); the
  obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its
  synced secret was kept).
- Vault-side cleanup (removing the unused pg-postiz rotated role) is
  deliberately NOT included here — deferred, postiz uses a static
  secret/postiz database_url.

State was already reconciled by a local `scripts/tg apply`; this commit is
the HCL catch-up (CI re-apply is a no-op).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:54:59 +00:00
Viktor Barzin
250d0fc334 docs(authentik): document SFE forced-WebAuthn escape hatches (TOTP + social)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Old-browser users on the SFE who have a password but no MFA device hit the
default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE
cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error).
emo (Google-only, iPadOS 15) hit this on the password path.

Document the two no-MFA-downgrade fixes: (1) social login, whose source flow
(default-source-authentication) has no MFA stage, so the SFE's social button
always completes; (2) enrolling TOTP, which the SFE can validate (unlike
WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was
enrolled for emo and stored in his Vaultwarden authentik item; verified
end-to-end (a Bitwarden-generated code is accepted by authentik).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:24:40 +00:00
Viktor Barzin
e518ada3d4 authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:26 +00:00
Viktor Barzin
4fc09b7a61 Merge remote-tracking branch 'origin/master' into wizard/authentik-sfe-social
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Build Custom Authentik Image / build (push) Has been cancelled
2026-06-28 11:53:04 +00:00
Viktor Barzin
916516eeab authentik overlay patch3: SFE for ALL old iOS browsers + social-login links
Two follow-ups to patch2 (both in patch-compat-sfe.py, guarded):

1. compat_needs_sfe() now also serves the SFE to ANY iOS browser on iOS<=16.3,
   not just Safari. iOS Chrome/Firefox are WebKit skins (Apple mandate) reporting
   a non-Safari UA family, so the Safari-only check missed them and they still got
   the blank modern SPA. Added an os.family=="iOS" + version<=16.3 branch.

2. Inject static social-login <a> links (Continue with Google/GitHub/Facebook ->
   /source/oauth/login/<slug>/) into the SFE shell (flow-sfe.html). The SFE
   architecturally can't render Identification-stage sources (authentik docs), and
   emo's account (emil.barzin@gmail.com) is Google-only with NO password — so the
   SFE's username/password form was a dead end. The links are plain redirects that
   work on any browser. Slugs are static; re-verify on source changes.

Tag -> 2026.2.4-patch3; values repoint + docs land once GHA builds it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:03 +00:00
Viktor Barzin
08bdf32aa0 feat(fire-planner): FIRE Countdown dashboard section + monthly target solve
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add a "FIRE Countdown" section to the wealth Grafana dashboard plus a monthly
CronJob that computes the targets it reads.

Viktor wanted a £ countdown to retirement in today's money, per life-case
(Solo / Household / Family) and per country, with progress, a projected date,
runway, and his safety guardrails — so he can see how close he is to FIRE
(ideally lean) without ever coming back to work.

- wealth.json: new country / with_home / savings_per_year template vars + a
  per-Case row (target NW at the 99% GK bar, progress gauge, still-needed,
  projected FIRE date, runway) and safety-valve panels (re-entry trigger vs
  £1.0M, 2.5yr cash buffer, pension tranche @57, Anca-bridge note). Reads
  fire_planner.fire_target via the fire-planner-pg datasource (Mixed).
- fire-planner stack: fire-planner-fire-targets CronJob (monthly, 2nd 10:00
  UTC) runs `recompute-fire-targets --countries all`.

Targets come from the solver shipped in fire-planner edb4d11.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:52:17 +00:00
Viktor Barzin
6ba60cbb2d authentik: repoint to overlay patch2 (SFE for old Safari) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:39:29 +00:00
Viktor Barzin
5fb2004de5 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
Some checks are pending
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:38:07 +00:00
Viktor Barzin
f10bb71562 authentik overlay: serve the no-JS SFE login to old Safari (patch #2)
Old Safari/WebKit (<=16.3, e.g. iPadOS<=16.3) can't parse authentik's modern
ES2022 flow SPA and gets a COMPLETELY BLANK login — exactly what emo's iPadOS-15.8
iPad hit. authentik already ships a no-JS Simplified Flow Executor (SFE, ES5) and
serves it via compat_needs_sfe(), but only for IE/old-Edge/PKeyAuth. Extend that
to old Safari so those clients get the REAL authentik login (password + MFA +
reputation, identity preserved — NO auth downgrade, no new credential store).

Chosen over a Traefik basic-auth fallback after an adversarial review: that route
would put a single, spoofable-UA password in front of vbarzin->wizard (passwordless
root on the cluster-controlling devvm) — an MFA->single-factor path to cluster root.
SFE keeps full authentik auth and is generic for any old browser.

Shipped as patch #2 in the existing overlay image (patch-compat-sfe.py — guarded:
asserts the upstream anchor + ast-parses; verified against the live interface.py).
Tag -> 2026.2.4-patch2; the values repoint lands once GHA builds the image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:38:05 +00:00
Viktor Barzin
ec681ba6e1 ci(infra): stop double-apply + stop counting PG lock-waits as failures
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
   AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
   push. The two applies race each other for the per-stack PG state lock →
   "Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
   ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
   lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
   whole pipeline with no retry.

Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
  the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
  (they live on repo 1), so we de-dup the apply without deactivating the
  registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.

Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).

Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:37:18 +00:00
Viktor Barzin
69e35efd95 Merge remote-tracking branch 'origin/master' into wizard/vault-kv
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build infra CLI / build (push) Has been cancelled
2026-06-28 11:09:38 +00:00
Viktor Barzin
e03e4719ad vault: distinguish Vaultwarden vs HashiCorp Vault, add vault kv
`homelab vault` only spoke to Vaultwarden (the password manager), but the
name reads as HashiCorp Vault (the infra secrets store — actually OpenBao
here). Make the two unmistakable and support both.

Distinction (no breakage — the existing Vaultwarden verbs are unchanged):
- bare `homelab vault` help now LEADS with the two-stores split;
- every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`;
- HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group.

New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store):
- `kv get <path> [--field K]` — read; --field → one value (TTY-aware
  clipboard/stdout), no field → full secret JSON (refuses a bare TTY).
- `kv list <path>` — list sub-paths (no values).
- `kv put <path> <key>` — write one key; value via stdin (piped or
  no-echo prompt, never argv); creates the path or merges (never
  clobbers siblings; uses kv patch -method=rw so no `patch` cap needed).

Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token /
$VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to
claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR
but never inject the scoped token. Access is whatever the policy grants.

Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg
builders/kvGet/List/Put; file header documents the credential split).
CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2
envelope strip, help names both systems. Verified e2e against live Vault
(read key-names-only + a scratch put/merge/cleanup).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:09:33 +00:00
Viktor Barzin
460f2ad42f state(vault): update encrypted state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:07:22 +00:00
Viktor Barzin
87a450e9a3 vault: grant emo full read/write on his own secret/emo tree
Viktor asked that emo be able to edit his own secrets with full access.
emo's personal-emo policy was read-only (read on data, read/list on
metadata), so he could view but not change his personal secrets.

Widen it to the same self-service capability set every namespace-owner
already has over their own tree: create/read/update/delete/list on
secret/data/emo(+/*) and list/read/delete on secret/metadata/emo(+/*).
Scope is unchanged — still only emo's own secret/emo subtree, still a
named exception that does not widen the power-user tier in general.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:07:22 +00:00
Viktor Barzin
a1cf7ccaf6 authentik: repoint to the SLOW-1a overlay image + un-enroll Keel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified
anonymously pullable). Point global.image at it (repository + tag pinned
explicitly so neither helm's appVersion default nor Keel can downgrade it — the
2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so
Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the
Dockerfile FROM + this tag together on each authentik version bump.

Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so
the cold login-flow first-load stops being slow. (Does NOT affect old-browser
clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable
server-side.) Docs: .claude/CLAUDE.md Authentik row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:46:21 +00:00
Viktor Barzin
7ec64ed5ff authentik: custom-image overlay to fix the 1.4s login-flow query (SLOW-1a)
Some checks are pending
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
The login flow's identification stage runs a bare select_subclasses() that
LEFT-JOINs every Source subtype table — ~1.4s server-side on every cold login
(verified live: 1527ms vs 14ms). Narrow it to only the subtypes that render a UI
login button (oauth/saml/plex/telegram/kerberos — not the sync-only ldap/scim),
via django-model-utils string accessors so no import is needed. Byte-identical
output, ~100x faster, robust to adding new login source types.

Shipped as a thin overlay over the official image (mirrors the diun/excalidraw
precedent): stacks/authentik/Dockerfile (FROM ghcr.io/goauthentik/server:2026.2.4
+ a guarded sed) built by .github/workflows/build-authentik.yml -> ghcr.io/
viktorbarzin/authentik-server:2026.2.4-patch1. The values repoint + Keel freeze
land in a follow-up commit once the image is built. Upstream bug still present in
main (no fix/PR) — drop this overlay once upstream narrows the query.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:42:58 +00:00
Viktor Barzin
12a45fa94e vault: bw sync on every read so reads show the latest values
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`bw unlock` only decrypts the LOCAL cache, so a persisted (already
logged-in) session served stale data — a password changed in the web
vault wouldn't appear until the next fresh login. Add a best-effort
`bw sync` in openSession (the chokepoint every read shares: get, get
--all, list, code, status), so reads reflect current server-side values.

Best-effort by design: a transient sync failure warns on stderr and
falls back to the cached vault rather than failing the read (an AFK
agent shouldn't break on a network blip). status keeps its own explicit
sync so a reachability failure still surfaces in its report.

CLI v0.10.1. Tests assert the sync runs after unlock and before the read,
and that a read still succeeds when sync fails.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:19:54 +00:00
Viktor Barzin
3d948c7033 Merge remote-tracking branch 'origin/master' into wizard/upgrade-gate-held
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 10:09:42 +00:00
Viktor Barzin
2880fe1c29 docs: update k8s-version-upgrade runbook for actionable-vs-held gate
Reflect the classification change in the operational runbook: the gate's three
refusal classes (actionable/waiting/pinned), held wins on a mix, refusals now
Complete cleanly (no Failed Job), k8s_upgrade_held gauge + the deliberate
no-alert-for-held, the dropped K8sUpgradeChainJobFailed suppression clause, the
nightly report ⏸️ HELD outcome, and the detector's silent nightly re-evaluation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:09:34 +00:00
Viktor Barzin
eebb6c8594 k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case
The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:08:20 +00:00
Viktor Barzin
ccee443790 vault: add get --all to browse every field of an item
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`homelab vault get` could only fetch one of five allow-listed fields and
had no way to see what fields an item even has — in particular it could
not reach arbitrary user-defined custom fields. Add a `--all` flag that
dumps the whole item as a normalized JSON object
(`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a
Claude session can discover and read every field, custom ones included,
in a single call.

Security model preserved:
- Like `get --json`, the dump is all secret values, so it refuses a bare
  TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout.
- The TOTP *seed* is reduced to a presence flag (`"totp": true`) and
  never emitted — the seed is more powerful than a one-time code, so the
  only seed-derived path stays the specially-audited `vault code`. Tests
  assert the seed and password-history never appear in the dump.
- Op-log uses a distinct `get-all` verb (item name still never logged) so
  a bulk dump is distinguishable from a single-field read.

`normalizeItem` is a pure, unit-tested core; `getItem` is the
session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog,
onboarding runbook, design spec §16.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:01:49 +00:00
Viktor Barzin
afcd463f39 k8s-upgrade: design doc for actionable-vs-held compat-gate classification
The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked
every night for the 1.36 target, even though the block is unactionable: no
kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned
(NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell
'we can fix this' apart from 'nothing to do but wait', and stop the nightly
Failed-Job + alert noise for the latter.

This documents the design: classify each blocker as actionable / waiting-
upstream / pinned, keep the alert only for actionable, quiet the held case to
the nightly report, and make deliberate gate decisions Complete cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:01:36 +00:00
Viktor Barzin
b3c419e108 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 09:55:25 +00:00
Viktor Barzin
9a1ab6247b cli: add homelab edges — who-talks-to-whom investigation helper (v0.9.0)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Makes the goldmane_edges east-west trail (ADR-0014) reachable during incident
investigations without remembering the DB/creds/SQL. New top-level verb:

  homelab edges --ns <ns>         edges touching <ns> (either direction)
  homelab edges --src/--dst <ns>  directional egress / ingress peers
  homelab edges --peers-of <ns>   distinct peer namespaces of <ns>
  homelab edges --new-since 24h   first seen since a duration or date (YYYY-MM-DD)
  homelab edges --denied          only action='deny' (blocked / lateral movement)
  homelab edges --json --limit N  machine-readable / row cap (default 200)

Filters render to a single read-only SELECT against the `edge` table, run via
the dbaas CNPG primary pod (same exec path as `k8s db`). Namespace values are
validated to the k8s name charset (injection guard) before they reach SQL.

TDD: edges_test.go covers flag parsing, query building (each filter, AND
combination, peers-of shape, JSON wrapper), the new-since duration/date parser,
and namespace-validation / injection rejection. Smoke-tested live: --peers-of,
--new-since 24h, --denied, and --json all return correct rows.

Docs: runbook query section now leads with the CLI; cli/README gains a v0.9
section. VERSION v0.8.2 -> v0.9.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:51:41 +00:00
Viktor Barzin
0fa5852ec6 homelab v0.8.2: fix memory recall truncating multibyte UTF-8 mid-character
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
emo's Claude Code sessions hit "UserPromptSubmit hook error" on almost every
prompt. Root cause: the homelab-memory-recall.py UserPromptSubmit hook runs
`homelab memory recall <prompt>` and strict-decodes its stdout. printMemories
truncated each memory's preview with a BYTE slice (c[:240]), which cuts through
the middle of a 2-byte Cyrillic character and emits invalid UTF-8 (a dangling
0xd0 lead byte). The hook's subprocess.run(text=True) then raised
UnicodeDecodeError — not caught by its `except (TimeoutExpired, OSError)` — so
the hook exited non-zero and Claude surfaced the error. It is Cyrillic-specific
(ASCII has no multibyte chars to split), so it bit emo (Bulgarian prompts) every
turn while English users almost never saw it.

Two-layer fix:
- cli: truncatePreview() now counts RUNES, not bytes, so the preview never
  splits a character. Regression test asserts valid UTF-8 on a long Cyrillic
  string. Fixes the root for every consumer of `memory recall` / `memory list`.
- hook: subprocess.run gains errors="replace" and the except is broadened to
  honor the script's own "best-effort, exit 0" contract — so a truncated or
  otherwise odd payload can never again surface as a hook error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:40:51 +00:00
Viktor Barzin
a3eb309e26 calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.

Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.

Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:32:28 +00:00
Viktor Barzin
385dfff0e7 authentik: fix episodic blank-screen + 30s-hang login (reliability R2)
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:17:05 +00:00
Viktor Barzin
b84b0021c2 authentik: dedicated rate-limit carve-out + per-router 5xx observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:10:34 +00:00
Viktor Barzin
65a09dcbc4 docs(homelab-vault): rebuild snippet uses cli/VERSION, not git describe
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The onboarding runbook's "rebuild the binary" command stamped the version
from `git describe --tags --always`, but setup-devvm.sh stamps it from
`cli/VERSION`. The v0.8.1 tag is no longer reachable from master, so the
describe form silently produced a bare commit sha — diverging from what a
provisioner reconcile stamps. Match the canonical source.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:05:49 +00:00
Viktor Barzin
c53e7839e1 Merge remote-tracking branch 'origin/master' into wizard/vault-addr-default
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was canceled
2026-06-28 09:04:43 +00:00
Viktor Barzin
0525f0b12d homelab vault: self-default VAULT_ADDR + prefer scoped token over ~/.vault-token
Setting up emo's Bitwarden access via `homelab vault`, his one-time
`homelab vault setup` failed with an opaque "exit status 2". Two latent
CLI bugs, both of which any non-admin AFK invocation can hit:

1. The CLI set VAULT_TOKEN but never VAULT_ADDR, relying on the ambient
   value. It IS in /etc/environment (login shells), but emo runs his
   agents from long-lived tmux / non-login shells that never sourced it,
   so every `vault` child hit the 127.0.0.1:8200 default -> connection
   refused. claude-auth-sync already self-defaults VAULT_ADDR; the CLI
   now does the same.

2. Token precedence was env > ~/.vault-token > scoped. A power-user who
   ran `vault login -method=oidc` carries a read-only ~/.vault-token
   (policy `default`, capability `deny` on their workstation path), which
   shadowed the purpose-built scoped token -> 403 permission denied on
   the user's OWN path. This tool only ever touches
   secret/workstation/claude-users/<user>, which the scoped token covers
   exactly, so precedence is now env > scoped > ~/.vault-token. Verified
   the scoped tokens for both emo and wizard hold create/read/update on
   their own paths, so admins are unaffected.

Also stop swallowing the shelled `vault`/`bw` stderr: errors now carry
the real message (connection refused / permission denied) instead of a
bare "exit status N" — without that, (1) and (2) were indistinguishable.

Verified end-to-end as emo (VAULT_ADDR unset + his read-only
~/.vault-token present): writeCreds now succeeds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:04:28 +00:00
Viktor Barzin
8d1d2fb999 calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)

Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
  (calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-
  connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
  avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
  and does not restart.

Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:59:31 +00:00
Viktor Barzin
c70810a51b workstation: per-user long-lived Claude token to end concurrent-refresh logout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A heavy user (emo) runs 8+ always-on `claude` agents + their t3-serve instance,
all sharing one ~/.claude/.credentials.json. When the shared access token expires
the processes refresh simultaneously; OAuth refresh-token rotation makes the
losing writer persist an EMPTY refresh token, logging the user out roughly every
access-token lifetime (~8h). Re-issuing the credential never sticks — the race
recurs (this is why emo's "standalone token" fix kept regressing).

Fix: an opt-in, per-user, non-rotating setup-token (sk-ant-oat01, ~1y, scope
user:inference) kept in the user's OWN Vault path (field `setup_token`).
claude-auth-sync materializes it to a user-owned
~/.config/claude-auth-sync/claude-oauth.env and, while it is present, SKIPS the
rotating-credential validate/backup/restore (so no false
WorkstationClaudeAuthInvalid). start-claude.sh and t3-serve@.service load it as
CLAUDE_CODE_OAUTH_TOKEN, so every session of that user uses the non-rotating
token and there is nothing to race on.

Fail-safe + opt-in: with no `setup_token` in Vault, every path is a no-op, so
users on the normal per-user Enterprise-SSO flow are unaffected. This is each
user's OWN identity, never the forbidden shared CLAUDE_CODE_OAUTH_TOKEN. Runbook
documents enable/disable/rotate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:07:43 +00:00
Viktor Barzin
3cc8f9f661 paperless-ngx: keep mem limit at 8Gi (tier LimitRange caps containers)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The prior commit set the limit to 10Gi, but the shared tier-defaults
LimitRange caps per-container memory at 8Gi, so the rollout's new pod was
forbidden (FailedCreate) and paperless was briefly down. 8Gi is ample for
6 workers anyway (4 workers measured ~1.3Gi under full OCR load). Restored
service live via kubectl patch; this commit matches TF to the live 8Gi so
drift detection won't re-revert it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 19:37:59 +00:00
Viktor Barzin
21d20dccf8 paperless-ngx: bulk-import via PVC consume dir (restart-safe) + 6 workers
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Emo's ~13.7k-document import was going through the API upload path, which
stages each file on the pod's EPHEMERAL scratch before queuing it. Any
paperless pod or redis restart therefore destroyed all in-flight work
(the "File not found" failures we hit) and required manual re-uploads.

Move bulk ingest to paperless's consume directory placed on the encrypted
PVC, with PAPERLESS_CONSUMER_POLLING so the whole folder is re-scanned
periodically (and on startup) with a file-stability check. Files now live
on durable storage and survive any restart — the folder is the queue and
self-heals, so we can copy everything in fast and let it process over
time with zero retry/integrity risk. RECURSIVE preserves the source tree
(avoids basename collisions); owner+tag come from a consumption workflow.

Bump TASK_WORKERS 4->6 to speed the OCR/convert-bound processing (node6
has the core headroom for one pod) and mem limit 8->10Gi for the extra
workers. Revert workers/mem/consume envs to defaults once the import ends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 19:35:10 +00:00
Viktor Barzin
2cb37d51d4 paperless-ngx: scale Gotenberg x3 + Tika x2, 4 workers, skip-archive — speed the Emo import
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Bottleneck found: single Gotenberg 503s under concurrent workers (office docs
failing + slow). Cluster is otherwise idle (sdc 0.5% util, etcd ~1/min), so:
- Gotenberg 1->3 + Tika 1->2 (Service load-balances; fixes the 503s, parallel
  office conversion).
- paperless TASK_WORKERS 2->4, THREADS_PER_WORKER 2->1, mem limit 4->8Gi (avoid
  OOM with 4 concurrent OCR). Requests kept low to stay within tier-quota
  (requests.memory 3840/4096Mi).
- PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text: skip redundant archive for born-
  digital/office docs (big IO saver for the work-doc set).
Guard + etcd watch stay in place; revert to defaults after the import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 18:45:25 +00:00
Viktor Barzin
d6bd9486e3 Merge remote-tracking branch 'origin/master' into wizard/portal-onboarding-paths
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build k8s-portal / build (push) Has been cancelled
2026-06-27 16:34:44 +00:00
Viktor Barzin
fca948a23d k8s-portal: document all three cluster-access paths in onboarding
The Getting Started portal only walked through the heaviest path (local VPN + kubectl + Vault + sops install) and never mentioned the two zero-setup routes that users actually reach first. Restructure onboarding to lead with all three, recommendation first: (A) the t3 web terminal, which drops you into a ready shell with kubectl/Vault/repos preinstalled; (B) the k8s web dashboard, auto-authenticated per user; and (C) the existing own-machine setup. Flag the dashboard/terminal as the fallback when CLI OIDC login is unavailable, reframe the misleading home-page 'VPN required' banner (only path C needs it), add the access endpoints to the service catalog, and fix a stale Vaultwarden URL (was vault.viktorbarzin.me, which is actually HashiCorp Vault; Vaultwarden is vaultwarden.viktorbarzin.me).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 16:34:36 +00:00
Viktor Barzin
9599beadc9 paperless-ngx: 2 task workers + 2 threads/worker + 4Gi limit for the Emo bulk import
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Emo's ~13.7k-doc import is OCR-bound on a single celery worker (~10s/doc =
multi-day). Bump PAPERLESS_TASK_WORKERS=2 + THREADS_PER_WORKER=2 for ~2x
throughput, and the memory limit 2Gi->4Gi to fit two concurrent OCR jobs.
Kept deliberately modest: archive writes hit the shared sdc HDD that etcd
also lives on (IO-storm risk, code-oflt) — watch etcd apply latency and
revert workers to 1 if it degrades. Revert to defaults once the import done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 16:33:43 +00:00
d4f564e8d5 Merge pull request 'docs(ci-cd): plotting-book build→ghcr→deploy flow diagram' (#16) from wizard/plotting-doc into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 15:50:02 +00:00
Viktor Barzin
0097bddf9f docs(ci-cd): add plotting-book build→ghcr→deploy flow diagram
ASCII flow of the migrated plotting-book pipeline (GHA build in Anca's
repo → private ghcr.io/passionprojectsanca/book-plotter → Woodpecker
redeploy hook → in-cluster pull via ghcr-credentials), plus the Kyverno
admission / Keel backstop / Vault pull-cred supporting cast and the
serving path. Appended to the existing plotting-book section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:49:58 +00:00
Viktor Barzin
bbc797b30e ci(woodpecker): stop applying/planning the Tier-0 vault stack in CI
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The nightly drift-detection cron and every vault-touching push apply have
been failing because CI runs terragrunt plan/apply on the Tier-0 `vault`
stack, which manages Vault's own transit mount + ACL policies. The CI
`ci` Vault role intentionally lacks those admin perms (sys/mounts,
sys/policies/acl), so the run always errors:
  - apply: 403 on vault_mount.transit + vault_policy.personal_emo, plus an
    Invalid for_each (local.k8s_users from secret/platform is deferred)
  - drift: terragrunt plan exits 1 → fails the whole nightly run

vault is Tier-0 = human-applied via OIDC. Skip it in both pipelines:
- default.yml: skip `vault` in the platform-apply loop (kept in
  PLATFORM_STACKS so the app-stack detector still excludes it)
- drift-detection.yml: skip `vault` in the per-stack plan loop
- ci-cd.md: document the exclusion on both pipeline rows

Found during a CI-health sweep (user reported many failures): GitHub
Actions all green; all Woodpecker repos green except this recurring
infra-repo failure, doubled by the legacy repo-1 + repo-82 dual
registration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:48:41 +00:00
81c2b14e29 Merge pull request 'plotting-book: pull image from private ghcr instead of public DockerHub' (#15) from wizard/plotting-ghcr into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 15:32:35 +00:00
Viktor Barzin
c13a3f1694 plotting-book: pull image from private ghcr instead of public DockerHub
Anca's plotting-book app now builds its image in her own GitHub repo to
the private package ghcr.io/passionprojectsanca/book-plotter (off public
DockerHub viktorbarzin/book-plotter). Wire the cluster to pull it:

- stacks/plotting-book: point the deployment baseline image at the ghcr
  package and add imagePullSecrets {ghcr-credentials} so the pod can pull
  the private image (the live tag is still CI-owned via ignore_changes).
- stacks/kyverno: add the plotting-book namespace to the ghcr-credentials
  allowlist so the Kyverno generate policy clones the pull secret into it.
  Verified the shared ghcr_pull_token (Viktor, repo-admin on Anca's repo)
  can read the private package before wiring this.

Docs: correct ci-cd.md (it wrongly listed plotting-book as already on
ghcr — it was on DockerHub) and note the special arrangement; amend
ADR-0003 to record that this GitHub-first repo builds to its own org's
ghcr namespace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:32:19 +00:00
Viktor Barzin
bf40409141 docs(security): note crowdsec-cf-sync rate-limit resilience
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Document the backoff_limit=0 + CF-429 soft-skip hardening alongside the
cf-sync architecture description, with the why (the backoff_limit=2
retry-storm that escalated Cloudflare's Lists-API throttle into a stuck
state). Follow-up to 5b49634f.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:27:44 +00:00
Viktor Barzin
5b49634fe0 rybbit/crowdsec-cf-sync: stop Cloudflare Lists-API retry-storm (429 self-DoS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The edge-ban sync was failing every 2 min on Cloudflare HTTP 429
(rate-limited) and never recovering, leaving the crowdsec_ban list empty.

Root cause: backoff_limit=2 made k8s re-run a failing pod up to 3x within
seconds, so each */2 cycle fired a burst of POSTs into Cloudflare's
per-60s Lists-API write limit. That kept the throttle perpetually tripped
(it stopped clearing even after minutes of quiet) — a self-inflicted DoS.

Two changes make the sync gentle and self-healing:
- backoff_limit 2 -> 0: one attempt per */2 cycle (the schedule IS the
  retry cadence), no rapid-fire burst.
- lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next
  cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s
  retry. Any other CF error still fails loud.

Found during a cluster health check (AIOStreams CSI + pfSense SSH issues
handled separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:23:42 +00:00
Viktor Barzin
7c72368243 state(vault): update encrypted state
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-27 13:54:23 +00:00
Viktor Barzin
f92ab04dae vault: grant emo read-only access to his own secret/emo
emo (power-user tier) had no Vault policy granting his personal secret
path, so `vault kv get secret/emo` failed. Viktor asked to give him that
access. Adds a read-only `personal-emo` policy (read on secret/data/emo +
metadata) and attaches it to emo's OIDC identity by adopting the
entity/alias Vault auto-created on his first login. Scoped explicitly to
emo; does not widen the power-user tier (which stays secret-less).

Verified live: a personal-emo token reads secret/emo, is denied writes,
and is denied other paths (secret/viktor -> 403).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 13:35:57 +00:00
Viktor Barzin
90f5425cdc state(vault): update encrypted state 2026-06-27 13:33:34 +00:00
a7117e0bfe immich(frame-emo): bump photo-frame Interval 30->45s
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Permissions-test change requested by Viktor: slow Emo's Sofia photo-frame
slideshow from 30s to 45s per image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 13:07:00 +00:00
Viktor Barzin
d50962b00e immich: add Immich photo-frame for Emo's Portal (highlights-immich-emo)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Second ImmichFrame instance cloned from the London frame (frame.tf), scoped to Emo's Immich account (emil.barzin) with Sofia weather coords and last-2-years photos. Drives Emo's Meta Portal Mini in Sofia via the portal-immich-frame app. Dedicated API key minted on Emo's account and stored in Vault (secret/immich -> frame_api_key_emo).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 12:40:29 +00:00
Viktor Barzin
e8b72019b5 paperless-ngx: deploy Tika + Gotenberg for Office ingest + raise PVC ceiling to 80Gi
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Emo's import scope now includes his work-PC document set (C/Documents,
Project Management, Service & MRO, etc. on the NAS), which is ~4.9k Office
files (.doc/.docx/.xls/.xlsx/.ppt/.pptx) on top of Emo shared. Paperless
can only archive/OCR/index those if it can convert them, so add the standard
Apache Tika (text+metadata) + Gotenberg (-> PDF) sidecar deployments + their
services in the paperless-ngx namespace and point PAPERLESS_TIKA_* at them.
Pinned images (gotenberg 8.25, tika 3.3.1.0), single replica, no PVC.

Total in-scope document set across all NAS locations is now ~13,700 PDF+Office
files / ~13.7GB source (~30GB once OCR'd + archived), so raise the data PVC
autoresize ceiling 30Gi -> 80Gi for comfortable headroom. The topolvm
autoresizer grows on demand up to the ceiling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 12:02:04 +00:00
Viktor Barzin
041aedc486 Merge remote-tracking branch 'origin/master' into wizard/paperless-emo
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 08:17:28 +00:00
Viktor Barzin
7988a690ed paperless-ngx: add Bulgarian OCR (bul+eng) + raise data PVC ceiling to 30Gi
Preparing Paperless for Emo's document import from the NAS. His archive is
Bulgarian (Cyrillic) + English, but OCR was English-only (tesseract had no
'bul' pack and PAPERLESS_OCR_LANGUAGE was unset/defaulted to eng), so scanned
BG documents would OCR to garbage and be unsearchable. Add bul to the install
list and set OCR_LANGUAGE=bul+eng.

Also raise the data PVC autoresize ceiling from 5Gi to 30Gi: everything
(originals + archive via PAPERLESS_MEDIA_ROOT=../data) lives on the single
encrypted PVC, and the ~2.7GB in-scope import would blow past the 5Gi cap
mid-ingest. The topolvm autoresizer grows the volume on demand up to the
ceiling; 30Gi gives ample headroom.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:17:13 +00:00
Viktor Barzin
6415f77fed Merge remote-tracking branch 'origin/master' into wizard/emo-vault-onboard
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was canceled
2026-06-27 08:17:06 +00:00
Viktor Barzin
b371ae6eee homelab vault: install bw system-wide + onboarding runbook
Two remaining gaps to let non-admins (emo) use `homelab vault`:

- setup-devvm.sh installed `@bitwarden/cli` only when `command -v bw`
  failed, which an admin's own ~/.local/bin/bw satisfied — so the
  system-wide copy was never installed and non-admins had no `bw`
  backend. Install to the npm /usr prefix and guard on the system path
  (/usr/bin/bw) instead.

- Add docs/runbooks/homelab-vault-onboarding.md (per-user setup, the
  shared Organization/Collection flow for sharing passwords, admin
  deploy + verification, security model) and repoint the two code
  comments that cited a design-spec path which never existed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:16:52 +00:00
Viktor Barzin
51dc5d031c homelab vault: make it work for non-admin workstation users
`homelab vault` was effectively admin-only: two bugs blocked every
non-admin (e.g. emo) from using it for their own Vaultwarden vault.

1. Token: the CLI relied purely on ambient `vault` auth (~/.vault-token
   / $VAULT_TOKEN), which only admins have. Non-admins carry a scoped
   token at ~/.config/claude-auth-sync/vault-token (policy
   workstation-claude-<user>). Add ensureVaultToken(): explicit env >
   ~/.vault-token > scoped fallback, wired into every vault verb. Admins
   are unaffected (their ambient token wins).

2. Write capability: `homelab vault setup` used plain `vault kv patch`,
   which needs the `patch` capability the scoped policy does not grant
   (only create/read/update) — so setup 403'd for non-admins. Switch to
   `kv patch -method=rw` (read-modify-write; same approach
   claude-auth-sync already uses), with `kv put` only when the path
   doesn't exist yet. Preserves co-located keys (claude_ai_oauth_json).

Enables onboarding emo onto the per-user Vaultwarden access tool.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:15:42 +00:00
Viktor Barzin
82a7b2585b chrome-service: reconcile state after pipeline #366 was killed mid-apply + document cancel-previous hazard
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline #366 (the SHA-pin apply, commit 7b4a8ba8) was SIGKILLed mid-apply by
Woodpecker cancel-previous when I pushed the next commit (#367, docs) while it
was still running — the apply log ends at '[chrome-service] Starting apply...'
with no 'Apply complete!', so the terraform state write did not finish. The live
deployment is correct (image = the supervised SHA, verified, self-healing), but
the stored state may be stale; this commit re-triggers a clean changed-stack
apply to reconcile it (comment-only change → 0 resource changes, no rollout).

Also adds a caution to the novnc image comment: after bumping the SHA, WAIT for
the apply pipeline to finish before pushing again (memory id=1957).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:15:41 +00:00
Viktor Barzin
006f97ef58 docs: bless local terragrunt apply, but require committing every applied change
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to change the infra apply guidance: instead of 'never apply
locally, always rely on CI', the policy is now 'you MAY apply locally, but
always commit the change to the infra repo'.

- .claude/CLAUDE.md (Critical Rule: Terraform Only): new bullet making local
  apply explicit (scripts/tg apply / homelab tf apply) from the MAIN checkout
  (not a worktree — git-crypt'd tfvars read as ciphertext there), with a hard
  requirement that every applied change is committed + pushed to master the same
  session so the repo stays the source of truth and CI drift-detection doesn't
  revert it. Spells out the apply<->commit ordering both ways.
- AGENTS.md (non-admin workstation land steps): step 5 now notes local apply as
  an option alongside CI auto-apply, with the same 'always committed, never
  applied uncommitted' rule.

Note: the org-managed settings block also frames CI auto-apply but is not
editable from a workstation clone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:10:20 +00:00
Viktor Barzin
7b4a8ba867 chrome-service: pin noVNC image to the x11vnc-supervision build
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Deploys the self-heal fix from the previous commit. Keel is off for this
deployment (keel.sh/policy=never, because the browser container's playwright
image is version-pinned to f1-stream) and the novnc image was :latest with
imagePullPolicy=IfNotPresent, so a rebuilt :latest would NOT be re-pulled on a
rollout — the supervised entrypoint would never reach the running pod.

Pin novnc to :19d0f0933a (the build of the prior
commit; ghcr digest sha256:5b783ac6, == :latest) so the stack apply rolls the
sidecar onto the new image. Future novnc entrypoint changes deploy by bumping
this digest after build-chrome-service-novnc.yml publishes a new SHA tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:04:55 +00:00
Viktor Barzin
19d0f0933a chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc
sidecar) attaches to the browser container's Xvfb over localhost:6099, and when
that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X
connection and exited. Because the entrypoint ran x11vnc as an unsupervised
background child and then exec'd websockify as PID 1, the dead x11vnc was never
relaunched — :5900 stayed dead (a defunct zombie), websockify kept returning
'Connection refused', and the view was black until a manual pod restart.

Fix: the entrypoint now runs both x11vnc and websockify as supervised background
children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts
the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge
now self-heals across browser-container restarts. Mirrors the android-emulator
stack's supervision pattern. Architecture doc updated with the new failure mode,
diagnosis, immediate-recovery, and SHA-pin deploy note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:03:29 +00:00
Viktor Barzin
abb15cd49d devvm: personalize emo's cluster-health skill for ha-sofia
All checks were successful
ci/woodpecker/push/default Pipeline was successful
emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT
ATS, the Барзини → Статус dashboard), and only about the cluster when it's
breaking those. Rewrite his vendored cluster-health into an ha-sofia-focused,
read-only variant:
- leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the
  cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only;
- fixes the script path to emo's own clone (/home/emo/code) — he can't read
  wizard's tree — and runs it --no-fix (he's cluster read-only);
- loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45)
  actually run for him; documents the host-SSH/Vault checks that skip;
- triages: cluster FAIL/WARN matters only if on his chain; everything else is
  a one-line "admin's area"; escalate via /file-issue since he can't fix.

This snapshot copy is now an emo-specific variant, intentionally diverged
from the canonical 47-check admin skill — README updated to say "do not
re-sync from canonical".

Token: a dedicated long-lived HA token (client_name emo-cluster-health) was
minted on ha-sofia via the admin account and stored emo-readable at
/home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope
(HA only mints tokens for the authenticating account); independently revocable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:03:14 +00:00
Viktor Barzin
fc83595f5e devvm: vendor cluster-health into per-user agent-skill snapshot
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Make cluster-health a user-global skill for emo (the lone entry in the
provisioner's SKILL_USERS allowlist), so it's available from any directory
— not only when working inside the infra clone where it already exists as a
project skill (.claude/skills/cluster-health). install_skills() in
t3-provision-users.sh copies the vendored snapshot into ~/.agents/skills/ and
symlinks ~/.claude/skills/, so this is the durable, rebuild-surviving path.

cluster-health is homelab-local (vendored from this repo's own
.claude/skills/), unlike the other snapshot entries which mirror upstream
mattpocock/skills + vercel-labs/skills; README documents its provenance and
the explicit re-sync step so the vendored copy doesn't silently drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 15:20:19 +00:00
Viktor Barzin
fd33d1a447 monitoring: consolidate all Slack alerting to #alerts, abandon #security
Some checks are pending
ci/woodpecker/push/default Pipeline is running
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.

Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
  title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value
  was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
  .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
  "invite the app / flip back to #security" caveats and state the
  #security abandonment + #alerts consolidation as the current routing.

Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 13:29:44 +00:00
Viktor Barzin
196d0db4bd rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The SSO restore script backed up the live manifest with
`cp "$MANIFEST" "$MANIFEST.bak.$TS"` — i.e. INSIDE /etc/kubernetes/manifests/.
The kubelet treats every file in that dir as a static pod, so the .bak became a
SECOND kube-apiserver static pod. While both copies were identical it was
harmless, but the instant `kubeadm upgrade` changed the real manifest's image to
v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped
(pod attempt count hit 13) — the new apiserver never stabilised, so kubeadm timed
out on "static Pod hash did not change after 5m" and rolled back. THIS was the
real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a
downstream symptom of the flip-flopping apiserver hammering etcd).

Fix: write backups to a dedicated dir OUTSIDE the static-pod dir
(/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The
stray .bak that planted the landmine on 2026-06-18 was moved out manually
2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh,
which is the same script) from ever re-creating it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 10:29:19 +00:00
Viktor Barzin
5d33327c30 postiz: repoint postgres-backup CronJob at CNPG (was failing on removed host)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The postiz-postgres-backup CronJob still dumped from the chart's bundled
`postiz-postgresql` host with a hardcoded `postiz-password`. That bundled
PostgreSQL was removed when postiz migrated to the shared CNPG cluster, so
the host no longer resolves (NXDOMAIN) and every nightly run failed —
firing BackupCronJobFailed, and leaving the postiz DB with no logical dump
in the offsite pipeline.

Connect via the app's own DATABASE_URL (from the postiz-secrets Secret,
postgresql://postiz:…@pg-cluster-rw.dbaas.svc.cluster.local/postiz) instead
of a hardcoded host/user/password, so the backup tracks the live DB and
credentials. Verified with a one-off test job: psql + pg_dump 16.4 connect
to CNPG 16.9 and produce a 180K custom-format dump.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:34:42 +00:00
Viktor Barzin
1bca799bb4 monitoring: give kube-state-metrics a 512Mi memory limit (Burstable)
Some checks failed
ci/woodpecker/push/default Pipeline failed
kube-state-metrics had no explicit resources, so the monitoring-namespace
LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles
around 45Mi but momentarily spikes past 256Mi during a full object relist
(450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM
blacks out the KSM-exported series that ~10 alert rules read, so they all
fire false "<svc>Down" criticals at once and self-resolve when KSM recovers
~5 min later — exactly the alert storm seen at 2026-06-26 08:42 UTC.

Set explicit Burstable resources: keep the request low (64Mi, just above
idle) so we don't reserve memory we don't use, and raise only the limit to
512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:06:31 +00:00
Viktor Barzin
d105713ae7 fix(workstation): claude-auth-sync must merge, not overwrite, the shared Vault path
All checks were successful
ci/woodpecker/push/default Pipeline was successful
cas_backup did `vault kv put secret/workstation/claude-users/<user>`, a full
KV-v2 replace that rewrote the document with only its 3 OAuth keys. Because
`homelab vault setup` co-locates the user's vaultwarden_* credentials on that
same path, every six-hourly sync silently deleted them — so `homelab vault`
reported "not configured" within hours of each setup. (Reported as: homelab
vault "keeps getting reset / logged out", set up 3 times.)

Switch the backup to a merge: `kv patch -method=rw` (read+update, needs no
`patch` capability) when the path exists, and `kv put` only to create it on the
first backup. Add a regression test with a fake vault asserting a pre-existing
sibling key survives a backup, and document the merge requirement in the
renewal runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:33:41 +00:00
Viktor Barzin
6f1951af93 fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:25:33 +00:00
Viktor Barzin
8121d8a4ac docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor (owner) wants agents to stop refusing file reads the OS already permits. wizard holds passwordless root ((ALL) NOPASSWD: ALL), so the managed-settings rule 'never read another user's ~/.claude' was stricter than the OS itself. The managed-settings policy (/etc/claude-code/managed-settings.json) was updated out-of-band to defer to OS/sudo authorization with no extra prompt; backup kept at .bak-2026-06-26. This ADR records the decision, its symmetry across sudo-holders, and the larger blast radius.

ADR-0011's usage-telemetry design is unchanged; only the cross-user privacy norm it referenced is superseded. The original ask was to delete ADR-0011 — superseded instead to preserve the audit trail and the ADR-0012/0013 references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:22:29 +00:00
Viktor Barzin
ebc8b6588f ESO: add force_conflicts to all ExternalSecret manifests (fleet sweep)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The 2026-06-22 external-secrets v1 migration made the ESO controller the
server-side-apply owner of .spec.refreshInterval on every ExternalSecret, so any
stack defining one via kubernetes_manifest fails `terraform apply` with a
field-manager conflict the next time it's applied (instagram-poster + grafana hit
this on 2026-06-24; it was latent across the whole fleet). Add
field_manager { force_conflicts = true } to all 101 remaining ExternalSecret
manifests across 70 stacks, matching the fix already on grafana / woodpecker /
traefik / k8s-version-upgrade / instagram-poster. TF and ESO set the same value,
so it's stable (no perpetual drift). Defuses the landmine before each stack's
next apply trips it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 21:28:11 +00:00
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
Viktor Barzin
306cdd4cb3 state(dbaas): update encrypted state 2026-06-25 17:31:03 +00:00
Viktor Barzin
9c68d147e0 k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)
Some checks failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline failed
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:23:15 +00:00
Viktor Barzin
60a1cb9a25 k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
b858561bd0 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-24 20:59:39 +00:00
Viktor Barzin
a7704f46a6 deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014)
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.

Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).

Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:59:39 +00:00
Viktor Barzin
aa510e3600 instagram-poster: force_conflicts on ESO manifests (fix apply)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:49:53 +00:00
Viktor Barzin
53834deb24 instagram-poster: scale to 0 (unused, dead ExternalSecret)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:45:30 +00:00
Viktor Barzin
8dd9a3978d Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:25:52 +00:00
Viktor Barzin
65b2df1222 fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
2026-06-24 12:25:36 +00:00
Viktor Barzin
1d0388da12 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:22:58 +00:00
Viktor Barzin
92361f36db calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability)
Turns on Calico 3.30's native east-west flow observability so we can see which
Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs
directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the
Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist
and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker
notifications=Disabled so the UI doesn't call the external Tigera endpoint.

Applied supervised: creating the Goldmane CR re-rendered calico-node with the
FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual
FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy,
goldmane is receiving flows from all nodes, Whisker UI serves.

Durable Loki persistence is NOT included here: the Goldmane emitter is Calico
Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override
only name+resources, not env), so a durable trail needs a small custom gRPC
consumer of goldmane:7443 — tracked in issue #58.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:22:48 +00:00
Viktor Barzin
e711b2f971 feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Build infra CLI / build (push) Has been cancelled
Adds a Loki ruler group (lane=security -> #security) for the homelab vault
op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and
VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine
(Vault audit device, reads of secret/data/workstation/claude-users/*) is
already captured. True CLI-bypass detection needs cross-stream correlation
(follow-up).
2026-06-24 10:31:32 +00:00
Viktor Barzin
64104e56e9 feat(devvm): install Bitwarden CLI for homelab vault 2026-06-24 10:29:57 +00:00
Viktor Barzin
15643d1f44 feat(cli): bare homelab vault help command 2026-06-24 10:29:32 +00:00
Viktor Barzin
772aed5370 fix(cli): vault security review fixes
C1 (critical): setup wrote the master password + API client_secret as
`vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to
same-UID processes. Now written via stdin (key=- form); only email +
client_id (non-credentials) remain in argv.
I1: `get --json` refused on a TTY (was dumping the secret to scrollback).
M1: vaultLock now holds the per-user flock (it mutates bw state).
M4: bw login-detection parses status JSON instead of substring matching.
M5: clipboard path refuses when stderr is not a TTY (was silently failing).
M6: realRunner trims only trailing newline, preserving secret whitespace;
    secret prompts likewise.
Adds security-property tests: no secret in argv across the get flow,
clipboard decision matrix, --json TTY gate, bw status parsing.
2026-06-24 10:28:31 +00:00
Viktor Barzin
5a864cf19c feat(cli): homelab vault setup onboarding (one-time, self-service) 2026-06-24 10:21:57 +00:00
Viktor Barzin
e20033855d feat(cli): vault list/search/code/status/lock 2026-06-24 10:21:07 +00:00
Viktor Barzin
365340b37d feat(cli): homelab vault get with TTY-aware return 2026-06-24 10:20:05 +00:00
Viktor Barzin
2dd12fc6be feat(cli): vault session bootstrap with per-user flock + no-coredump 2026-06-24 10:18:36 +00:00
Viktor Barzin
5bae2a3907 feat(cli): privacy-aware vault op-log (process, never the secret) 2026-06-24 10:17:50 +00:00
Viktor Barzin
81122f8607 feat(cli): TTY-aware return + OSC52 clipboard with terminal gating 2026-06-24 10:17:13 +00:00
Viktor Barzin
06f4b87af1 feat(cli): vault bw engine env/arg builders + unlock 2026-06-24 10:16:19 +00:00
Viktor Barzin
cd44ca5921 feat(cli): vault creds loading from per-user Vault path 2026-06-24 10:15:32 +00:00
Viktor Barzin
6c53ee10b1 feat(cli): register homelab vault command group skeleton 2026-06-24 10:14:24 +00:00
Viktor Barzin
ae0d7984c4 docs: ADR-0014 + glossary — service identity (namespace+label) & Calico Goldmane observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Records the design reached in a /grill-with-docs session: how to track which
Service talks to which as more Services are added, using k8s-native options.

Decision: service identity = the workload's namespace (primary) plus a
`service-identity` label only in the few multi-Service namespaces; east-west
observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7,
currently disabled) emitting to Loki for a durable trail; enforcement reuses the
existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and
a service mesh / mTLS / SPIFFE rejected — the trust model needs attribution-grade
forensics on a trusted, etcd-constrained cluster, not cryptographic
non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit
flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy
enricher) are recorded with rationale.

Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:00:36 +00:00
Viktor Barzin
0293b5c634 android-emulator: fix idle-sleeper dying with SIGPIPE before it could sleep
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Caught live-testing the previous commit: every sleeper run exited 141
(SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause:
`set -o pipefail` + `dumpsys power | awk '...; exit'` — awk closes the pipe
after the first match while `kubectl exec` is still streaming dumpsys, so
the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the
script before any echo. (My earlier dry-run missed it because it didn't run
under `set -euo pipefail`.)

Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with
awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and
a failed/booting exec falls through to the fail-safe "do not sleep" branch.
Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe
to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the
`sh -c` wrapper.

Verified live: ran the corrected script as the gate ServiceAccount against
the stuck emulator (idle ~120h) — it logged "idle >= 6h ... scaling to zero"
and patched the deployment to replicas=0. The 6+ day pod is now asleep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:57:36 +00:00
Viktor Barzin
839fdb33c2 android-emulator: sleep after 6h idle (activity-based), fix never-sleeping
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The emulator was meant to scale to zero when idle but had been up 6+ days
straight despite ~5 days with no real use. Two bugs:

1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC
   ports. A forgotten `adb connect` (no disconnect) holds that transport
   open forever, so every 15-min run saw "active" and reset the counter --
   it never reached the sleep branch. (Right now: 4 such stale transports
   from pods on k8s-node3/node4.)
2. Even when it did reach the sleep branch, `kubectl scale --replicas=0`
   failed Forbidden -- the gate ServiceAccount can patch `deployments` but
   not `deployments/scale`.

Switch the sleeper to measure actual use: time since last user activity
(taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest
uptime. No interaction for 6h -> sleep. This ignores idle/forgotten
connections entirely. Scale down with a direct replicas patch on the named
deployment (same path the wake gate scales up), so it needs only the
existing `deployments` patch grant -- no `deployments/scale`. Now stateless
(drops the idle-counter annotation; gate.py no longer sets it) and lighter
on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep.

Requested by Viktor: turn the dev-only emulator off when it hasn't been
used for 6h.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:49:23 +00:00
Viktor Barzin
566447a698 k8s-upgrade: preflight kubeadm-plan gate must pass explicit target (minor-upgrade fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration
worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan`
with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current
minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort.
That gate worked for patch upgrades but never for minors. Fix: pass the explicit
`v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits
"kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the
ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job.

Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of
the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added
field_manager.force_conflicts=true (benign — interval is semantically identical).
This pattern affects all 104 migrated ESs fleet-wide (follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 06:06:14 +00:00
Viktor Barzin
98d2b89614 calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi with a ~266Mi
startup spike (re-listing resources to build informer caches), both at/over the
256Mi limit, so the first time the pod restarted it could never finish startup
(exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine — the limit
was always too tight; it only bit once the pod restarted. Data plane was never
affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom
(now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration
(which never touched calico); cluster churn was at most the trigger that exposed
the tight limit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 12:46:28 +00:00
Viktor Barzin
68c240b8de Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-23 09:56:25 +00:00
Viktor Barzin
7d297dc6b1 eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.

Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.

Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:55:51 +00:00
Viktor Barzin
ff4b01a674 state(external-secrets): update encrypted state 2026-06-23 09:53:36 +00:00
Viktor Barzin
e1a85dd727 state(external-secrets): update encrypted state 2026-06-23 09:52:30 +00:00
Viktor Barzin
af22416d6f state(external-secrets): update encrypted state 2026-06-23 09:51:21 +00:00
Viktor Barzin
c75982f408 state(external-secrets): update encrypted state 2026-06-23 09:50:11 +00:00
Viktor Barzin
0407e3c578 state(external-secrets): update encrypted state 2026-06-23 09:48:33 +00:00
Viktor Barzin
dab8f9446f state(external-secrets): update encrypted state 2026-06-23 09:47:24 +00:00
Viktor Barzin
e815bb0295 state(external-secrets): update encrypted state 2026-06-23 09:46:17 +00:00
Viktor Barzin
8412cd7d54 state(external-secrets): update encrypted state 2026-06-23 09:45:04 +00:00
Viktor Barzin
f2956e1e62 state(external-secrets): update encrypted state 2026-06-23 09:43:57 +00:00
Viktor Barzin
bf2f865eee state(external-secrets): update encrypted state 2026-06-23 09:42:52 +00:00
Viktor Barzin
6f3cfb18c7 state(external-secrets): update encrypted state 2026-06-23 09:41:46 +00:00
Viktor Barzin
6e8e066215 state(external-secrets): update encrypted state 2026-06-23 09:40:14 +00:00
Viktor Barzin
de1fb04d9f state(external-secrets): update encrypted state 2026-06-23 09:39:12 +00:00
Viktor Barzin
606cfdb544 state(external-secrets): update encrypted state 2026-06-23 09:38:12 +00:00
Viktor Barzin
72464e7880 state(external-secrets): update encrypted state 2026-06-23 09:37:11 +00:00
Viktor Barzin
e88ea50304 docs(multi-tenancy): document install_skills (vendored per-user agent skills)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Record the new reconcile step alongside install_memory/install_playwright:
vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo),
why it's vendored not npx (upstream drift), and that if-absent keys on the
user's own copy so it heals a stale/cross-user ~/.claude/skills symlink
(emo's grill-me pointed into the admin's home).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:30:27 +00:00
Viktor Barzin
1c8dc6bd6c t3-provision-users: install_skills heals stale symlinks + owns ~/.agents
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Follow-up to the vendored-skills change, from verifying the emo rollout:

- The if-absent guard treated ANY pre-existing ~/.claude/skills/<name> entry
  as "installed", so a manual cross-user symlink emo already had (grill-me ->
  /home/wizard/.claude/skills/grill-me) was skipped — leaving the requested
  skill depending on the admin's home instead of emo's own copy. The guard now
  keys on the user's OWN copy (a real dir under ~/.agents/skills) and (re)points
  the ~/.claude/skills symlink at it, healing a stale/cross-user link while
  still never clobbering a real dir.
- install -d left the intermediate ~/.agents owned by root; now owned by the user.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:27:31 +00:00
Viktor Barzin
987fdd16db t3-provision-users: vendor agent skills + per-user install_skills (emo)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Make the admin's Claude Code agent skills available to the `emo` devvm user.
Viktor asked to install Matt Pocock's skills for emo, starting with grill-me
but covering the full set the admin already uses.

The `npx skills` upstream has drifted off that set (diagnose -> diagnosing-bugs
and write-a-skill -> writing-great-skills were renamed; caveman + zoom-out are
no longer published), so reproducing it via npx is impossible and would also
spray ~70 agent dirs into the user's home + add a GitHub-clone + unpinned-CLI
dependency to the hourly root reconcile. Instead vendor a point-in-time
snapshot of the 16 skills (scripts/workstation/claude-skills/) and copy them
per-user, mirroring install_memory: install_skills() copies each skill into
~/.agents/skills/<name> (owned by the user) and symlinks
~/.claude/skills/<name> -> ../../.agents/skills/<name>. if-absent, additive,
best-effort, scoped to the SKILL_USERS allowlist (emo).

find-skills is from vercel-labs/skills (not Matt Pocock) but included since it
is part of the admin's current set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:23:37 +00:00
Viktor Barzin
59f2beda21 chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Point the chrome-service container at the new chrome-service-browser image and
launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes
MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the
noVNC view — bundled Chromium has those codecs compiled out; only real Chrome
carries them. connect_over_cdp callers (tripit fare scrape, homelab browser,
snapshot-harvester) attach over raw CDP (version-tolerant) — validated after
rollout. Image is built off-infra on GHA (prior commit) → public ghcr.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:15:36 +00:00
Viktor Barzin
df1ec1879d chrome-service: build a real-Chrome browser image (H.264/AAC codecs)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-browser / build (push) Has been cancelled
Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA
build workflow. The bundled Chromium ships proprietary codecs compiled out, so
H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with
MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs
(libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds
the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips
main.tf's launch to it once the image exists + is public.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:01:17 +00:00
Viktor Barzin
7061b1dfc6 state(external-secrets): update encrypted state 2026-06-22 20:55:27 +00:00
Viktor Barzin
e2f328ff4a state(external-secrets): update encrypted state 2026-06-22 20:45:24 +00:00
Viktor Barzin
a735be9ba4 state(external-secrets): update encrypted state 2026-06-22 20:45:08 +00:00
Viktor Barzin
c670cb7118 eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1
Some checks failed
ci/woodpecker/push/default Pipeline failed
The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate
blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1
and v1, so this is the safe window — MUST land before 0.17 removes v1beta1
(there is no conversion webhook). Pure apiVersion bump, schema is byte-identical:
106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database)
across 73 .tf files, v1beta1 -> v1, no other field changes.

Validated live first on tandoor (single, non-coupled, synced ES): the
kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is
cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced
from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods
keep their mounted copy through the sub-second blip. All 110 target Secrets were
snapshotted to /tmp first as a backstop.

CI applies the changed stacks serially (staged rollout); watching aggregate ES
sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest).
Next: Phase 3 climb 0.16.2 -> 2.6.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 19:13:04 +00:00
Viktor Barzin
98cd535b97 authentik: lock chrome.viktorbarzin.me noVNC to Viktor only
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chrome-service noVNC exposes Viktor's live logged-in browser sessions
(Instagram etc. — he'll sign in there for homelab browser to reuse). It was
auth="required" = any authenticated user, and "Home Server Admins" includes emo
(emil.barzin@gmail.com), so the admin group is not a sufficient gate. Add a
host-specific case to the domain-wide forward-auth restriction allowing only
Viktor's accounts (vbarzin@gmail.com + akadmin break-glass); everyone else,
incl. emo, is denied at the noVNC. emo's AGENT already can't reach the browser
(read-only RBAC blocks port-forward); this closes the human noVNC path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:09:27 +00:00
Viktor Barzin
a3cdc0d6d0 chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC view showed the browser in the top-left with the rest of the
framebuffer black. Cause: Chrome launched with no --window-size, and there's no
window manager, so it opened at its profile-persisted (smaller) size inside the
1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window
fills the screen on every launch (fresh pods/profiles too). Live windows were
already resized via CDP as a stopgap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:00:20 +00:00
Viktor Barzin
c7ead032ec chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.

Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
  so the cap applies deterministically on rollout even though the image is
  :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).

Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.

(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:34:03 +00:00
Viktor Barzin
20ca5ee624 tripit: REEL_PROVIDER=anonymous — actually fetch reels (was fake canned caption)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
REEL_PROVIDER was unset, so the reel pipeline used FakeReelExtractor, which returns
a CANNED caption — every pasted (tripit #120) or forwarded reel produced a DUMMY
Saved Place instead of reading the real reel. Set REEL_PROVIDER=anonymous in app_env
(covers the web Deployment + the ingest CronJob) so AnonymousReelExtractor does the
real anonymous read. Verified live from the cluster: yt-dlp fetched a real IG /p/
caption (no IG_GRAPHQL_DOC_ID needed — the internal-API path is an optional
optimisation; yt-dlp fallback works). LLM extraction + Nominatim POI geocoding were
already real (prior commits); this was the last fake link in the chain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:30:47 +00:00
Viktor Barzin
f46b69f372 tripit: enable real LLM + Nominatim on the web Deployment (in-app reel paste #120)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The web Deployment ran LLM_MODE=fake with no reel geocoder — only the ingest-plans
CronJob had real providers. The in-app reel-URL paste feature (tripit #120) runs
ingest_reel IN the web pod (BackgroundTask), so the Deployment now needs real
extraction: LLM_MODE=llamacpp (qwen3vl-8b; qwen3-8b segfaults on the current
llama-swap image) with the ADR-0033 claude-agent-service fallback, plus
REEL_GEOCODER_PROVIDER=nominatim for venue->city/country POI geocoding. Set in
app_env (feeds the Deployment; the CronJobs already had these via extra_env). Bonus:
this also un-fakes the in-app booking *share* import, which used the same fake LLM.
MAIL_INGEST_ENABLED stays false on the Deployment (only the CronJob polls mail).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 16:50:04 +00:00
Viktor Barzin
59f2070e56 tripit: switch mail-ingest LLM_MODEL qwen3-8b -> qwen3vl-8b (qwen3-8b segfaults)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The qwen3-8b GGUF segfaults on load on the current llama-swap :cuda image
("common_init_from_params: failed to create context"; llama-swap returns 502),
which broke ALL tripit mail ingest text extraction — booking emails AND forwarded
reels (status=failed, "no place could be read"). The GGUF isn't corrupt (valid
header, full size, worked for weeks) — it's a llama.cpp/image regression. Rather
than pin the SHARED llama-swap image (cross-user blast radius), repoint the
ingest-plans CronJob at qwen3vl-8b, an already-provisioned 8B model that loads
fine and extracts flight numbers + places reliably. Restores the auto-path
(reels resolve via the Nominatim geocoder; bookings parse again). The broken
qwen3-8b GGUF is a separate, non-urgent llama-cpp cleanup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:52:09 +00:00
Viktor Barzin
7dbbb74163 homelab v0.8.1: frame browser as escalation (default headless), match CLAUDE.md
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build infra CLI / build (push) Has been cancelled
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:44:43 +00:00
Viktor Barzin
f96cde35bd tripit: enable Nominatim POI geocoding for reel→Wishlist ingest
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Forwarded reels (tripit ADR-0031) geocode their venue to map a Saved Place to a
country + city, but the reel route was wired to the global geocoder, which here is
GEOCODER_PROVIDER=openmeteo (city-level, name-based). OpenMeteo returns nothing for
a venue query like "Time Out Market, Lisbon" so reels never resolved and no Saved
Place was created. The app fix (tripit 3c62d596) gave the reel route its own
geocoder behind REEL_GEOCODER_PROVIDER; set it to nominatim on the ingest-plans
CronJob (the only one running the reel route) so forwarded reels resolve to real
venue coords + city + country. Isolated from the global geocoder, which stays
openmeteo for weather/tours. Verified Nominatim resolves the venue from the cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 14:59:37 +00:00
Viktor Barzin
a6b52a5839 homelab v0.8.0: browser verbs for headful anti-bot web automation
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).

- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
  the :9222 NetworkPolicy), assert non-headless via /json/version,
  connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
  use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
  image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
  per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
  callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
  pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
  against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
  spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
  outside the cluster" section.

Closes: code-nepg

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 12:22:22 +00:00
Viktor Barzin
de163aa6af workstation: switch devvm OOM backstop from systemd-oomd to earlyoom
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:39:16 +00:00
Viktor Barzin
3a59f4a8bf workstation: per-user memory caps + systemd-oomd backstop on devvm
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:25:09 +00:00
Viktor Barzin
2169e0de5f workstation: harden memory hooks — prune dead plugin refs + homelab-CLI-only store
All checks were successful
ci/woodpecker/push/default Pipeline was successful
wire-memory-hooks.py now PRUNES any settings.json hook still pointing at the
retired claude-memory plugin (plugins/claude-memory/hooks/) before the additive
pass. install_memory() rm -rf's that dir, so those entries are dangling — and a
missing UserPromptSubmit hook exits 2, a BLOCKING error that erases the prompt
and froze emo's sessions (2026-06-22). The plugin shares basenames with the
homelab hooks, so the old additive-only logic saw the dead plugin path as
"already present" and skipped installing the real ~/.claude/hooks/ copy; pruning
first fixes that. Verified against emo's exact original config: yields the
correct 4-hook set, drops the dead PermissionRequest entry, idempotent on rerun.

auto-learn.py now stores via the `homelab memory` CLI only — dropped the direct
HTTP path and the local-SQLite fallback (memory is homelab-CLI-only per Viktor;
never local files). No-ops silently when no API key is in env (e.g. ancamilea).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:24:42 +00:00
Viktor Barzin
aeed461591 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 1595bddfc2.
2026-06-22 08:31:17 +00:00
Viktor Barzin
1595bddfc2 feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Re-land Phase 2 after the first attempt's two failure modes, both fixed:
- tempo.resources set under the correct single-binary chart key (was OOMKilled on
  the namespace LimitRange default when mis-placed at top level).
- atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install
  auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479).

Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp ->
redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo
derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:17:59 +00:00
Viktor Barzin
a0897de7c3 workstation: document homelab-memory hooks + provisioner self-deploy [ci skip]
multi-tenancy.md never mentioned the homelab-memory hooks rollout and still
listed claude_memory credential injection as purely "future". Document what is
actually true now: install_memory provisions the recall/auto-learn/compaction
hooks per user, the provisioner binary self-deploys from the repo (step 0), the
set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI
defaults the URL) — emo has a key, ancamilea is keyless until one is minted.
Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing
edits self-deploy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:04:38 +00:00
Viktor Barzin
92f35550f2 workstation: self-deploy t3-provision-users from the repo each reconcile [ci skip]
Root cause of emo's lost memory: nothing redeployed /usr/local/bin/t3-provision-users
except the manual setup-devvm.sh, so the homelab-memory rollout (44562535/9aa2438e,
Jun 21) sat committed-but-undeployed for a day — the hourly reconcile kept running the
pre-memory binary and never wired the new memory hooks for emo/anca.

Close the gap the same way the script already treats managed-settings.json and
start-claude.sh (sync_managed_config / deploy_user_launcher): the repo is the
authoring surface. At the top of the run, if the repo copy differs from the deployed
binary, install it and re-exec the fresh one. Guards: a re-exec env flag (no loop),
bash -n (never deploy a broken script), DRY_RUN (no mutation), cmp (no churn when
unchanged). Verified across all four paths in isolation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:02:31 +00:00
Viktor Barzin
0b11a28d66 workstation: stop install_memory aborting the reconcile under set -e
install_memory (added in 44562535) ended with `[[ -d <plugin-dir> ]] && rm && log`
and guarded a chmod with a bare `[[ -f settings ]] && chmod`. When the plugin dir
or settings file is absent — the normal case for users who never had the
claude-memory plugin — those return non-zero, and under `set -euo pipefail` the
function returns non-zero and kills the whole hourly reconcile after the FIRST
user, before the rest are processed.

It never fired before because the rollout was committed but the deployed
/usr/local/bin/t3-provision-users was never updated, so install_memory had never
run. On first real run it aborted right after ancamilea, so emo (and wizard)
never got their memory hooks wired — the reason emo's sessions lost memory. Wrap
the cleanup in an if-block, guard the chmod, and end the function with return 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 07:59:47 +00:00
Viktor Barzin
464e0bfb97 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 7513468a2d.
2026-06-22 06:46:56 +00:00
Viktor Barzin
72dcb125d5 Revert "fix(monitoring): tempo OOMKilled — move resources under tempo.resources"
This reverts commit a02782d11f.
2026-06-22 06:46:56 +00:00
Viktor Barzin
a02782d11f fix(monitoring): tempo OOMKilled — move resources under tempo.resources
Some checks failed
ci/woodpecker/push/default Pipeline failed
Pipeline #315 failed: tempo-0 CrashLoopBackOff / OOMKilled (exit 137). The
single-binary grafana/tempo chart (v1.24.4) takes container resources at
tempo.resources, not a top-level resources: — so my block was ignored and the pod
fell to the namespace LimitRange default and OOMed. Set tempo.resources explicitly
(req 256Mi / limit 2Gi). tripit + existing monitoring were unaffected throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:44:31 +00:00
Viktor Barzin
7513468a2d feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry
spans (Phase 1, already live in prod) export and correlate with logs:
- Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d)
- OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo)
- Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the
  Loki datasource (no uid change, so existing dashboards are unaffected)
- tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector

Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline
'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a
local plan as non-admin).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:31:11 +00:00
Viktor Barzin
1a32c07ffe docs(eso): Phase 1 done (0.16.2) + confirmed Phase 2 GC findings
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2
(both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces
a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret
ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets
+ empirically validate GC-survival on the first live ES + per-stack two-phase
-target apply (fallback: state rm + import). Corrected the doc's k8s assumption
(cluster is on 1.34; whole climb stays on 1.34, no interleave).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:44:50 +00:00
Viktor Barzin
ac27e41fde Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 20:41:35 +00:00
Viktor Barzin
296deda3b4 eso: Phase 1 — climb chart 0.12.1 -> 0.16.2 (transition version) + atomic
First half of the ESO 0.12->2.6 migration (docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md),
clearing the LAST k8s-1.35 compat-gate blocker. Stepped one minor at a time on
k8s 1.34 (no k8s interleave — cluster already on 1.34, ESO bands are conservative
tested ranges not hard limits): 0.12.1 -> 0.13.0 -> 0.14.4 -> 0.15.1 -> 0.16.2.
Each hop applied + verified: controller healthy, all 108 live ExternalSecrets
stayed SecretSynced (2 pre-existing dead — instagram-poster, payslip-ingest —
missing Vault data, untouched). Added atomic=true + timeout=600 (ESO had no
rollback safety net). 0.16.2 serves BOTH v1beta1 AND v1 (storedVersions now
["v1beta1","v1"]) — the safe window to rewrite all 104 CRs to v1 (Phase 2) before
0.17 removes v1beta1. State auto-committed per hop by scripts/tg (Tier-0 SOPS).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:41:30 +00:00
Viktor Barzin
0cd59d2c55 state(external-secrets): update encrypted state 2026-06-21 20:41:10 +00:00
Viktor Barzin
b8612e788d state(external-secrets): update encrypted state 2026-06-21 20:39:45 +00:00
Viktor Barzin
877e5c73b2 state(external-secrets): update encrypted state 2026-06-21 20:38:34 +00:00
Viktor Barzin
de2250f667 immich-frame: set photo date format to dd/MM/yyyy
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The photo date overlay was showing US-style MM/dd/yyyy — ImmichFrame's built-in default when PhotoDateFormat is unset. Viktor wants UK day/month/year ordering instead. Pin PhotoDateFormat to the date-fns pattern "dd/MM/yyyy" (uppercase MM = month; lowercase mm would render minutes). The config map carries reloader.stakater.com/match, so Reloader restarts the immich-frame pod automatically on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:36:43 +00:00
Viktor Barzin
8e6eff03dd state(external-secrets): update encrypted state 2026-06-21 20:36:37 +00:00
Viktor Barzin
0bae025b9b wealth dashboard: spend-down figures in today's money (inflation-adjusted)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked whether the spend-down numbers were inflation-adjusted —
they were not (all nominal). He chose to switch the card to today's
money, so every row now shows constant purchasing power for life.

Each row is a die-with-zero annuity at the REAL rate (1+g)/1.03−1
(3% inflation), spending a constant inflation-adjusted amount (the
actual pounds withdrawn rise with inflation) until net worth hits £0
at age 100:
  • No growth (0%)  → £12/day, £370/mo,   £4,446/yr   (negative real: loses to inflation)
  • Inflation (3%)  → £43/day, £1,315/mo, £15,776/yr  (0% real: holds value)
  • Market (7%)     → £130/day, £3,942/mo, £47,300/yr (~3.9% real)

Title now flags "(today's £)". Same panel/layout; only the SQL, title,
and tooltip changed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:13:59 +00:00
Viktor Barzin
3fb6284e2b immich-frame: use 24-hour clock (ClockFormat HH:mm)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to switch the Immich photo-frame shown on the Portal
kitchen appliance to a 24-hour clock. immichFrame defaults ClockFormat
to 'hh:mm' (12-hour) and we never overrode it, so the frame was showing
12-hour time. Set ClockFormat: "HH:mm" (date-fns 24h token) in the
frame Settings.yml ConfigMap; Reloader restarts the pod on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:10:51 +00:00
Viktor Barzin
e89de86af0 wealth dashboard: spend-down table → three growth scenarios
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the spend-down card to compare three portfolio-growth
scenarios rather than the previous floor-vs-4%-real pair.

The table now has three rows, each a die-with-zero annuity (drain net
worth to £0 by age 100) spending a constant number of ACTUAL (nominal)
pounds, differing only by the assumed nominal growth rate:
  • No growth (0%)      → £43/day,  £1,315/mo, £15,776/yr  (= NW ÷ years)
  • Inflation (3%)      → £106/day, £3,233/mo, £38,792/yr  (NEW)
  • Avg market (7%)     → £220/day, £6,703/mo, £80,435/yr

This keeps the £43 no-growth floor he anchored on. The old third row
was "4% real" (£133) expressed in today's money; it's replaced by the
7%-nominal market row (£220, actual pounds) so all three rows share one
basis (nominal pounds) and are directly comparable. 3%/7% are hardcoded
(one-line SQL edit). Table height 4→5 for the extra row; panels below
shifted down 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:06:29 +00:00
Viktor Barzin
85d42f2c13 wealth dashboard: merge spend-down tiles into one compact table
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the six separate spend-down stat tiles consolidated into a
single, more compact card with the figures laid out as rows.

Replaces stat panels 9220-9225 with one table panel (id 9220) in the
Overview row: 2 rows (Floor / 4% real) × 3 columns (per day / month /
year). Same underlying math and live values (£43/£1,315/£15,776 floor;
£133/£4,039/£48,463 at 4% real). w=9 instead of the full-width tile row,
so it takes ~a third of the width.

Note: this intentionally overrides the "table panels live at the bottom"
layout convention — Viktor chose to keep this headline KPI glanceable at
the top of the dashboard rather than scroll for it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:55:57 +00:00
Viktor Barzin
63add2a126 feat(tripit): finalize ADR-0028 auth env — AUTH_MODE=normal, trips@ sender, trust XFF
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Now that the native-auth rollout is complete: (1) AUTH_MODE hybrid->normal — the legacy Authentik OIDC-bearer + forward-auth arms were removed in #96, and 'hybrid' already resolved to 'normal' via backward-compat parsing; this makes it explicit and corrects the now-false comment. (2) SMTP_FROM plans@->trips@ — the dedicated native-auth sender; the trips@->spam@ send-as alias is live + verified (RCPT 250). (3) TRUST_FORWARDED_FOR=true — so #95's per-IP signup rate-limit keys on the real client behind Traefik, not the shared ingress pod IP. Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this does NOT touch the running image. Reloader restarts the pod to pick up the new env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:50:20 +00:00
Viktor Barzin
166a2bcab4 wealth dashboard: add "spend-down to £0 at 100" stat tiles
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted a glanceable number on the Wealth dashboard for how much
he can spend for the rest of his life — spending the whole net worth
down to zero by age 100.

Adds a third line of six stat tiles to the Overview section, two
equations × three cadences (per day / month / year):

  • FLOOR  — net worth ÷ time remaining to age 100. Treats the money as
    cash (no growth, no inflation): a conservative lower bound.
    ≈ £43/day, £1.3k/mo, £15.8k/yr.
  • 4% REAL — die-with-zero annuity: the constant, inflation-adjusted
    spend that drains the balance to £0 at 100 while it keeps earning
    4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr.

Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04),
computed live so the figures tick as net worth and the horizon move.
Net worth reuses the existing latest-per-account dav_corrected math, so
the tiles always agree with the "Net worth (current)" stat (pension
included; target £0). The 4% real rate is hard-coded per his "keep it
simple, just a number" steer — a one-line SQL edit to change later.

Layout: tiles inserted at y=9; all sections below shifted down 4 rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:48:30 +00:00
c830f9f462 Merge pull request 'workstation: wire-memory-hooks as root (fix non-admin wiring)' (#14) from wizard/mem-fix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:45:39 +00:00
Viktor Barzin
9aa2438e75 workstation: run wire-memory-hooks as root, not runuser (fix non-admin wiring)
install_memory ran the JSON-merge helper via 'runuser -u $user', but the helper
lives under the admin's mode-700 home ($WORKSTATION_DIR) which non-admin users
can't traverse -> wiring silently failed for emo/anca (hooks copied but never
wired into settings.json). Run the helper as root (it reads both the repo helper
and the user's home) and chown the result back to the user. Verified by the live
all-users rollout: emo + anca now wired correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:45:36 +00:00
f318773cb0 Merge pull request 'workstation: homelab-memory for all users (retire claude-memory MCP)' (#13) from wizard/memory-allusers into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:42:51 +00:00
Viktor Barzin
44562535a2 workstation: provision homelab-memory hooks for all users (retire claude-memory MCP)
Roll the wizard MCP->homelab-CLI memory migration out to every devvm user. Adds
install_memory() to t3-provision-users.sh (mirrors install_playwright: per-user,
idempotent, if-absent, as-the-user): installs the 4 memory hook scripts into
~/.claude/hooks, wires them into settings.json additively (wire-memory-hooks.py
never touches env / the per-user MEMORY_API_KEY), and removes ONLY the
claude_memory MCP + plugin if present. Reuses each user's existing key (no
minting; per-user isolation stays deferred per the 2026-06-07 design). The
homelab CLI hits the same remote HTTP API the MCP used; recall runs via the
homelab-memory-recall.py UserPromptSubmit hook. Shared instructions (rules/skills
symlinked from base; root+infra CLAUDE.md) already cover all users.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:42:42 +00:00
Viktor Barzin
79749d7324 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:27:42 +00:00
Viktor Barzin
5e3fe2e8e2 docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker
Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now
the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared
to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all
104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten
to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE
crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at
a time (no skipping); chart==app version; downstream Secrets survive. 5-phase
ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target
gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:27:37 +00:00
3f81b20fa6 Merge pull request 'docs: memory via homelab CLI (retire memory-tool/MCP refs)' (#12) from wizard/memory-cli-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:24:10 +00:00
Viktor Barzin
e2018f9b6c docs: memory via homelab CLI, not the retired memory-tool/MCP
The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the
homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the
`homelab memory` CLI, which hits the same remote HTTP API). Updates the
.claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool
CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo
CLAUDE.md + ~/.claude/rules/execution.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:24:00 +00:00
Viktor Barzin
51838a4ec7 kyverno: 3.6.1 -> 3.8.1 (app 1.16 -> 1.18.1) — clears the k8s-1.35 compat-gate block
All checks were successful
ci/woodpecker/push/default Pipeline was successful
kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the
autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35.

Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes):
3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified
supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore
(admissions stay open mid-roll) kept it safe. Values schema confirmed compatible
across 3.6->3.8 (forceFailurePolicyIgnore still under features:).

Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller
2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook
live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs
1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x
(v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:21:38 +00:00
Viktor Barzin
ead876ec65 k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00
Viktor Barzin
7270e2be3b monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Some checks failed
ci/woodpecker/push/default Pipeline failed
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:35:35 +00:00
Viktor Barzin
b0ccaf1c65 state(vault): update encrypted state 2026-06-21 15:07:01 +00:00
Viktor Barzin
f84e6818b2 state(vault): update encrypted state 2026-06-21 15:07:01 +00:00
Viktor Barzin
cc4bb8ffe8 wealth dashboard: show price freshness for all 3 holdings, not just worst
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor wanted the freshness tile to cover all three main holdings
(META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so
the stat renders one value per held position (worst-first), switched the
tile to horizontal orientation so the three values sit side-by-side, and
updated the description. Each value is coloured by its own age threshold
(META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource
change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 14:49:33 +00:00
6c2c56ab3b Merge pull request 'docs: CrowdSec enforcement = firewall-bouncer + CF WAF (plugin removed)' (#11) from wizard/crowdsec-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:40:41 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
4df741f6de Merge pull request 'traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)' (#10) from wizard/cs-deplugin-crd into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:36:03 +00:00
Viktor Barzin
c23b03864e traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)
Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a
cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware
CRD and the broken Yaegi bouncer plugin can be removed without orphaning any
router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static
config + initContainer download + state.json entry), the captcha template
ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts,
and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the
catch-all error-pages IngressRoute chain (the one remaining CRD-level reference,
which an Ingress-annotation grep does not surface) so that router is not orphaned
when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled
out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) +
Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is
deliberately preserved (still used by paperless-mcp).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:35:13 +00:00
df86075c3d Merge pull request 'cleanup: fully remove orphaned council-complaints app' (#9) from wizard/council-cleanup into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:33:23 +00:00
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
6dc3ce139f wealth dashboard: expand all rows by default + inline the freshness stat
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two follow-ups Viktor asked for on the Price freshness panel:

- Expand every section by default. Grafana's collapsed rows hide their
  child panels; just flipping collapsed=false leaves a non-canonical shape
  (confirmed via the Grafana API that it keeps the panels nested rather
  than hoisting them), so each row is now collapsed=false + panels=[] with
  its children hoisted to top-level -- the exact form Grafana writes when
  you expand-and-save. Row headers revert to their original y (the child
  y-coords were already expanded-layout coordinates).

- Stop the freshness stat from taking its own line. It's now the 6th tile
  in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4
  at y=5; the collapsed-row y-shift from the previous commit is undone.

No query or threshold changes. The large diff is mechanical: 12 child
panels re-indent from nested to top-level.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:29:25 +00:00
Viktor Barzin
92ff0b92f1 Merge remote-tracking branch 'forgejo/master' into wizard/t3-idle-migrate
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 12:41:33 +00:00
Viktor Barzin
5a136c7d53 docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:40:46 +00:00
Viktor Barzin
334d8fee5d setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:36:13 +00:00
Viktor Barzin
3cf09a0fe3 t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:35:19 +00:00
Viktor Barzin
af9f7be297 t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:44 +00:00
Viktor Barzin
06e400522f t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:11 +00:00
Viktor Barzin
de97696ff0 t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:32:57 +00:00
Viktor Barzin
2ab5b94748 t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:28:53 +00:00
Viktor Barzin
0cebeeb0ee t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:26:05 +00:00
Viktor Barzin
ddbdbca7e9 wealth dashboard: add "Price freshness" stat for stalest held quote
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor was worried about stale prices silently distorting net worth.
Confirmed it's real: META's quote has been frozen at 2026-04-17 (65 days
old) while the dashboard keeps valuing the ~55-share position at that
stale close; the Vanguard ETFs are current. Nothing flagged it.

Adds one compact stat to the Overview row showing the most out-of-date
HELD position's quote age (symbol + humanised age), colour-coded: green
<=4d (weekend/bank-holiday tolerant), amber 5-9d, red >=10d. Pure read of
the quote_latest mirror via the wealth-pg datasource, held positions
only, LEFT JOIN so a held symbol with no quote at all sorts as max-stale.
The six collapsed rows below shift down 4 grid units to make room; no
other panel touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:23:45 +00:00
Viktor Barzin
9503bed589 t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.

This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:04:22 +00:00
Viktor Barzin
b1bbe42821 homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).

Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.

- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:45:32 +00:00
a091689603 Merge pull request 'traefik/crowdsec: remove dead plugin middleware reference (PR1/2)' (#8) from wizard/cs-deplugin-refs into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-21 00:17:51 +00:00
Viktor Barzin
71d0af084e traefik/crowdsec: remove 6 hard-coded middleware refs the variable sweep missed (PR1/2)
The first PR1 commit only dropped the ingress_factory reference + the 8
exclude_crowdsec call sites. But the crowdsec middleware is ALSO hard-coded
(not via the variable) in 6 more ingresses that build their middleware chain by
hand: owntracks, the monitoring Helm values (grafana + prometheus +
alertmanager), and the reverse-proxy module + its own separate ingress factory.
Remove all 6 so that after the full-cluster apply NO live ingress references
traefik-crowdsec@kubernetescrd — the precondition for PR2 deleting the CRD.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:17:40 +00:00
Viktor Barzin
7bd4612edf ci: scripts/tg waits out a contended state lock (-lock-timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't
succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".

Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.

Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).

Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:39 +00:00
Viktor Barzin
84a18a5529 traefik/crowdsec: remove dead Yaegi-plugin middleware reference (PR1/2)
The Traefik CrowdSec (Yaegi) bouncer plugin enforces nothing on Traefik 3.7.5
(handler never invoked) and is fully superseded by the cs-firewall-bouncer
(in-kernel nftables drop on direct hosts) + the Cloudflare IP-List/WAF rule
(proxied hosts). Drop the `traefik-crowdsec@kubernetescrd` middleware from the
ingress_factory chain and the 8 explicit `exclude_crowdsec = true` call sites,
and delete the now-unused `exclude_crowdsec` variable.

This is PR1 of a 2-phase removal: the reference is removed FIRST (a shared-module
change → full-cluster apply re-renders every ingress without the middleware) so
that PR2 can delete the `crowdsec` Middleware CRD + the plugin itself WITHOUT
leaving any ingress pointing at a missing middleware (which would error those
routers). PR2 MUST NOT land until this has fully applied and zero live ingresses
reference traefik-crowdsec@kubernetescrd.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:12 +00:00
9774ae3d19 Merge pull request 'crowdsec: firewall-bouncer cluster-wide (remove node2 pin)' (#7) from wizard/cs-fw-allnodes into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:08:15 +00:00
Viktor Barzin
c92590ae85 crowdsec: roll firewall-bouncer cluster-wide (remove node2 validation pin)
One-node validation on k8s-node2 passed: kernel nftables sets created in both
input and forward chains (policy accept), ~31k decisions loaded, a known banned
scanner confirmed in the drop set, pod stable 4h+ with no collateral. Remove the
nodeSelector so the DaemonSet runs on every node — direct-host enforcement now
survives a MetalLB VIP failover to any worker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:07:45 +00:00
4f1c998468 Merge pull request 'rybbit sync: exclude CAPI + per_page=500 fix' (#6) from wizard/crowdsec-syncfix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:05:50 +00:00
Viktor Barzin
f55bb6c422 rybbit: sync excludes CAPI blocklist + fix CF items per_page (500)
The edge CF IP List can't hold the ~31k CAPI community blocklist (already
enforced in-kernel by the firewall-bouncer), so the sync now skips origin=CAPI
and carries only high-signal local/curated decisions (+ a 9000 safety cap).
Also fixes the list-items GET: per_page=1000 returned a misleading CF 400
'invalid or expired cursor' (10027); the endpoint max is 500. Verified live:
crowdsec_ban populates (4 IPs) and the sync exits 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:05:05 +00:00
Viktor Barzin
6d5d3726d6 Merge remote-tracking branch 'origin/master' into wizard/ha-cli-verbs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-20 23:46:29 +00:00
Viktor Barzin
48225f2dea homelab CLI v0.7: add ha token + ha ssh for Home Assistant
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.

Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
  live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
  pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
  non-interactive ssh to the HA host using the invoking user's key.

Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:46:09 +00:00
Viktor Barzin
46166c63b2 fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:40:22 +00:00
Viktor Barzin
600f1f933c Create Claude auth state directories
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The first live renewal run showed systemd could not create state beneath a read-only home sandbox. Provision each user's writable state directory before enabling the timer so automatic renewal can run.
2026-06-20 20:25:55 +00:00
Viktor Barzin
7f1788a106 Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-20 20:22:20 +00:00
Viktor Barzin
ff67e9d422 Fix workstation package manifest parsing
The approved Claude token renewal deployment could not run because setup-devvm passed inline package comments to apt as package names. Strip inline comments so the persisted all-user setup remains reproducible.
2026-06-20 20:22:05 +00:00
Viktor Barzin
524b874036 state(vault): update encrypted state
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
2026-06-20 20:14:53 +00:00
Viktor Barzin
7050b0441e Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 20:11:09 +00:00
Viktor Barzin
bc2fbc712c Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew 2026-06-20 20:10:48 +00:00
Viktor Barzin
02d14796cc feat(mailserver): add trips@ send-as alias for TripIt native auth email (ADR-0028)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
TripIt's native signup-verification + account-recovery mail (ADR-0028) sends From: trips@viktorbarzin.me while authenticating SMTP as spam@. With SPOOF_PROTECTION on, Postfix smtpd_sender_login_maps requires an EXPLICIT alias (the @domain catch-all doesn't satisfy it) — mirrors the existing plans@->spam@ grant. Must be applied + verified before TripIt flips SMTP_FROM to trips@, else every verification/recovery send is rejected 550.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:10:47 +00:00
Viktor Barzin
5549fc3672 Add per-user Claude auth renewal
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
2026-06-20 20:10:40 +00:00
Viktor Barzin
3278588325 chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:04:24 +00:00
834c5e6a2a Merge pull request 'CrowdSec proxied: single CF list (block-only) + firewall-bouncer re-apply' (#5) from wizard/crowdsec-1list into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:31:01 +00:00
Viktor Barzin
7cf93a0587 crowdsec+rybbit: proxied edge to single CF list (block-only) + retrigger firewall-bouncer apply
CF account hard-limits to 1 Rules List, so proxied enforcement uses one crowdsec_ban
list + one WAF block rule; the sync writes both ban and captcha decisions into it
(captcha downgraded to block at the edge). Drops the second list + managed_challenge
rule. Trivial touch to firewall_bouncer.tf to make CI re-apply crowdsec and recreate
the DaemonSet (tar fix already in master; stale orphan was cleared).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:29:43 +00:00
1406d8a391 Merge pull request 'Fix CF ruleset import id + depends_on' (#4) from wizard/crowdsec-fix2 into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:13:03 +00:00
Viktor Barzin
f2b089e267 rybbit: fix cloudflare_ruleset import id (zone/ 3-part form) + depends_on lists
v4.52.7 import id must be zone/<zone_id>/<ruleset_id>; add depends_on so the
crowdsec_ban/captcha lists exist before the WAF rules reference them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:12:29 +00:00
58fc6d5061 Merge pull request 'Fix CrowdSec firewall-bouncer tar + CF WAF ruleset import' (#3) from wizard/crowdsec-fixes into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:06:15 +00:00
Viktor Barzin
a351a66843 crowdsec+rybbit: fix firewall-bouncer tar extraction (busybox) + import existing CF WAF ruleset
- initContainer used GNU tar --wildcards which fails on the busybox curl image (pod Init:Error); switch to extract-all + cp via shell glob.
- cloudflare_ruleset hit the per-zone singleton conflict; import the existing 'default' http_request_firewall_custom ruleset and manage all rules — CrowdSec ban/captcha first, the pre-existing disabled skip rule preserved verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:04:30 +00:00
70e8ce1021 Merge pull request 'CrowdSec real enforcement: edge WAF (proxied) + firewall-bouncer (direct)' (#2) from wizard/crowdsec-enforcement into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 09:42:41 +00:00
Viktor Barzin
ca8d617e72 rybbit: use 'Account Rule Lists' permission group for the CF sync token (v4)
tg plan verified the agent's guess 'Account Filter Lists Edit/Read' is not a key in the v4.52.7 permission-group map; the live CF API lists the correct account-scoped groups as 'Account Rule Lists Read'/'Write'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:41:41 +00:00
Viktor Barzin
0c56290af0 chore(forgejo): re-trigger apply of git.timeout/gc.auto (changed-stack skip)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
910d5892 landed the [git.timeout] + [git.config] env in master, but the CI apply
skipped stacks/forgejo (the changed-stack-diff race after a sync-merge), so the
Forgejo deployment never picked it up. A trivial comment touch to force a clean
apply of the stack so the durable push-mirror fix actually takes effect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:19:53 +00:00
Viktor Barzin
cc4bfb593b rybbit: proxied CrowdSec enforcement via Cloudflare IP Lists + WAF rule
Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a
zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists
(crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks
`(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`.
No per-request Worker, no cookie machinery — the rybbit Worker stays
analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI
(fail-safe: a LAPI blip skips the run and freezes the last-known-good block set;
serializes CF bulk ops since CF allows one pending op per account). A
least-privilege CF API token (Account Filter Lists Edit) is minted in TF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:18:33 +00:00
Viktor Barzin
7e646e1c7c crowdsec: add cs-firewall-bouncer DaemonSet (direct-host nftables enforcement)
Drops banned source IPs in-kernel via nftables (hooks input+forward, so DNAT'd
LoadBalancer traffic is caught before reaching Traefik) for DIRECT hosts — the
direct-side replacement for the dead Traefik plugin, zero per-request hop.

No published image exists, so an initContainer fetches the pinned official
static binary (v0.0.34) onto a stock debian-slim base (nftables backend uses
netlink directly, no nft CLI needed). hostNetwork + NET_ADMIN/NET_RAW (not
privileged). Config (with api_key) in a Secret, Reloader-annotated. crowdsec ns
is already in the Kyverno wave-1 exclude list, so the privileged/hostNetwork pod
is admitted. Pinned to k8s-node2 (runs a Traefik pod) for one-node validation
before the nodeSelector is removed to roll cluster-wide. Fail-open by element
timeout if the bouncer stops.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:11:08 +00:00
Viktor Barzin
53117b193a portal-realtime: deploy the v2 full-duplex voice agent (Pipecat)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
New stack for the realtime voice agent — v2 of the portal-assistant brain
path. One persistent WebSocket per conversation: continuous mic audio ->
Silero VAD turn-taking -> Whisper STT (portal-stt) -> streaming Claude brain
(claude-agent-service) -> edge-tts (portal-tts) -> audio out, with barge-in.
Reuses all three upstream cluster services; nothing new is spun up.

Public Cloudflare ingress (proxied, WebSocket) at portal-realtime.viktorbarzin.me
with the app's own DEVICE_TOKEN as the edge gate (auth="app" — Authentik would
break the native Portal client). No buffering middleware: it would break the
streaming WebSocket. Image ghcr.io/viktorbarzin/portal-assistant-realtime
(private ghcr, pulled with ghcr_pull_token). Sibling to the v1 portal-assistant
gateway, which stays live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:23:17 +00:00
Viktor Barzin
44cac6f4e2 gitignore: ignore Python test artifacts (__pycache__, *.pyc, .pytest_cache)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Introduced the first pytest file in the tree
(stacks/k8s-version-upgrade/scripts/test_compat_gate.py); running it leaves an
untracked __pycache__/ dir. Ignore the standard Python build artifacts so test
runs don't show up as working-tree noise or get committed by accident.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:17:03 +00:00
Viktor Barzin
b58fe8cb1a docs(k8s-upgrade): record detector Packages-probe -L fix + compat-gate patch scope
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Two corrections to the runbook matching today's code fixes:
- The next-minor *patch* probe (GET .../Packages) also needs `-L`; it lacked it
  until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes
  now follow the 302.
- The compat gate's addon check is scoped to minor jumps — patches within the
  running minor are never addon-blocked (target_minor <= running_minor returns
  early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks
  a 1.34.x patch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:16:20 +00:00
Viktor Barzin
e5250f417e k8s-version-upgrade: compat gate must not false-block patch upgrades
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The compat gate compared every addon's matrix ceiling against the target
k8s minor unconditionally. That is correct for a minor JUMP, but it also
blocked patch upgrades within the minor the cluster is ALREADY running:
ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of
1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31;
target 1.34 exceeds it" — even though the running cluster is itself proof ESO
0.12 works on 1.34. That silently defeats autonomous patching (it would have
bitten the moment a 1.34.10 was published).

Fix: a target at or below the running minor crosses into no new k8s minor, so
every installed addon is already empirically proven on it — check_addons now
returns no reasons when target_minor <= running_minor. Added running_minor()
(oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override
for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks
on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally
inert for patches (no API removal / containerd floor inside a minor) and keep
running as defence. Added test_compat_gate.py (8 cases) covering both paths.

Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe),
target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:14:50 +00:00
Viktor Barzin
38675b7922 crowdsec: register kvsync + firewall bouncer keys in LAPI
Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall)
from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring.
These are the two halves of the real enforcement that replaces the dead Yaegi
plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker),
firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:12:38 +00:00
Viktor Barzin
a9384a4067 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 08:09:16 +00:00
Viktor Barzin
44a98d408e k8s-version-upgrade: detector next-minor probe must follow 302 (curl -sfL)
The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io
302-redirects every request to a backing host, so without -L curl returned
an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell
through to "No upgrade needed". That is exactly why last night's 23:00 chain
no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing
it to the compat gate. `curl -sfL` follows the redirect and returns the
Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same
-L fix already applied to the Release availability probe (-sILo) above.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:09:08 +00:00
Viktor Barzin
910d589205 fix(forgejo): raise git-op timeouts + lower gc.auto to stop push-mirror timeouts
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file
--batch-all-objects` over the NFS-backed repo exceeded the default git deadline
once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so
pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout]
(DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set
[git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos
packed (the real fix). A one-off forced gc already unblocked tripit; this prevents
recurrence across all repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:08:50 +00:00
Viktor Barzin
45bed1c133 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-20 08:07:23 +00:00
Viktor Barzin
e1736d2e5c calico: hop 3.28.5->3.30.7 (operator v1.38.13) — restores a SUPPORTED Calico/k8s-1.34 pairing. Disabled new-in-3.30 Goldmane/Whisker (their CRs render before crds/ install on helm upgrade; we use Prometheus/Loki). calico-node 7/7 on quay/v3.30.7, tigerastatus green. Applied manually + verified overnight. 2026-06-20 08:07:08 +00:00
Viktor Barzin
4d9fdbc7f7 rybbit: add CrowdSec LAPI -> Cloudflare KV sync script (proxied edge control plane)
Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that
projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace
the edge Worker reads on each proxied request. Full-reconcile per run so an
un-ban clears from the edge within one interval; fail-safe (a LAPI read error
skips the run and leaves existing bans to expire by TTL = fail-open, never a
stale all-block). TF wiring (KV namespace + CronJob + key registration) follows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:05:11 +00:00
Viktor Barzin
0ac176da01 crowdsec: whitelist internal/LAN/tailnet CIDRs at the decision layer
Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied
hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the
real source IP, so if an internal/RFC1918 address ever ended up in a ban
decision it could blackhole legitimate internal traffic. Whitelisting the
cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the
CrowdSec parser layer makes that structurally impossible — a trusted source
can never produce a decision in the first place. Public IP already whitelisted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:03:46 +00:00
Viktor Barzin
3e3fdb34f0 homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).

- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
  "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
  secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
  swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
  verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
  sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
  Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
  answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
  auth. ADR docs/adr/0011.

Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 22:29:01 +00:00
Viktor Barzin
666fefd22b calico: hop 3.26->3.28.5 (operator v1.34.13); calico-node 7/7 healthy, tigerastatus green, kube-controller-manager restarted (3.28 UID change). Applied manually + verified.
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-19 22:09:23 +00:00
Viktor Barzin
8ed5368be9 calico: bring tigera-operator under Terraform via Helm (adopt at 3.26.1)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is
already unsupported on k8s 1.34). Manage ONLY the operator via the official
tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false
keeps the live Installation CR operator-managed so Helm never touches calico-node.
Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding
pre-stamped with Helm ownership metadata (transient migration step), then the
release imported via a plan-verified create (1 to add, 0 destroy). Verified clean:
calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption
(code-3ad) for the operator without adopting the Installation CR.
2026-06-19 21:50:34 +00:00
Viktor Barzin
dd029ca7fb traefik/crowdsec: switch bouncer to live mode (stream cache doesn't enforce under Yaegi)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory
cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce:
a ban present in the LAPI stream AND pulled by the plugin still let the banned IP
through. Stream-mode decision matching is unreliable under Traefik's Yaegi
interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI
synchronously per request (result cached per-IP for defaultDecisionSeconds), which
enforces reliably and picks up new decisions immediately. LAPI is 3-replica +
in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
Viktor Barzin
0cc48d83ac traefik/crowdsec: disable bouncer redis cache (broken under Yaegi → in-memory)
With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output
revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true
cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client
cannot reach the cache under Traefik's Yaegi interpreter — even though
redis-master.redis.svc is reachable AND writable from the traefik namespace
(SET/GET verified via busybox; no NetworkPolicies; no auth). Same interpreter
-incompat class as the stream goroutine itself. With redisCacheUnreachableBlock
=false the bouncer then failed open and enforced nothing.

Disable the redis cache so the plugin uses its in-memory decision store (works
under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off:
captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst
an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
Viktor Barzin
531efb218d traefik: bump crowdsec-bouncer plugin v1.4.2 -> v1.6.0 (fix stream not pulling)
The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but
its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI
stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration),
and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable
from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies),
so CrowdSec enforced nothing despite the bouncer now being registered. This is
the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body
here. v1.4.2 predates Nov 2025; latest is v1.6.0.

Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins
version). Config-verified compatible: every key we use survives (crowdsecMode,
crowdsecLapiKey/Host, updateMaxFailure, redisCache*, clientTrustedIPs, all
captcha* incl. turnstile); v1.6.0 also moves logging to slog/trace for future
diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and
plugin bumps must be tested against the running Traefik/Yaegi).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
78095aa273 docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub
auto-registration (zero-click sign-up) is on. Document why (global auto-reg +
Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks
account-linking) and how to re-enable Authentik later.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:37:46 +00:00
7d99203fc6 forgejo: re-enable ENABLE_AUTO_REGISTRATION for zero-click GitHub sign-up
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Per Viktor: GitHub sign-up must work zero-click (account created on first login,
no form). This global [oauth2_client] setting enables it. It conflicts with
Authentik (preferred_username is an email → invalid Forgejo username → 500 on
auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his
Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the
Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed,
out-of-band) per his directive. Forgejo sign-in is now GitHub + native login.

Committed via API to land on origin without pushing a concurrent agent's unpushed
local commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:34:17 +00:00
ef530b7d38 forgejo: drop ENABLE_AUTO_REGISTRATION — it broke Authentik sign-in
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources).
On Authentik sign-in, Forgejo auto-created an account and derived the username
from Authentik's preferred_username claim — which is the user's email
(vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed
→ 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik
broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up
still works via a one-step register form. Committed via API to touch only main.tf
(forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:24:29 +00:00
Viktor Barzin
a5bb4db9c5 crowdsec: register the Traefik bouncer with LAPI (fix fail-open)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.

Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).

- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
  the proxied-app coverage caveat.

Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:08:28 +00:00
Viktor Barzin
56dadda453 traefik: pin helm chart to 40.2.0 (deployed version)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The traefik helm_release had no chart version pin, so a refreshed helm repo
index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema
rejects this stack's `logs` block ("Additional property logs is not allowed") —
an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the
deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic;
chart bumps must be deliberate with a values migration.

Follow-up to fd0c7493 (Turnstile captcha), which was applied with this pin
already in live TF state — this lands the pin in git to remove the drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:58:33 +00:00
Viktor Barzin
4a66377425 forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.

- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
  --provider github` (name "github", matching the callback registered on
  the GitHub OAuth App). Like the existing Authentik source, it lives in
  Forgejo's DB rather than Terraform — there's no clean TF resource for
  login sources. Client id/secret mirrored to Vault secret/viktor
  (forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
  [oauth2_client], so a first GitHub sign-in creates the account directly
  ("sign up with GitHub") instead of a link-to-existing detour. The
  GitHub identity is the trust gate for this path; Turnstile + email
  confirmation still gate the native form.

Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.

Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:41:49 +00:00
Viktor Barzin
fd0c7493c3 traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.

Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:

- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
  viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
  Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
  passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
  captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
  /captcha via the chart `volumes` value — the pulled Yaegi plugin does not
  expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
  placeholder reCAPTCHA keys; referenced by zero .tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:38:38 +00:00
Viktor Barzin
963e4fcdde forgejo: open native self-signups, gated by Turnstile + email confirmation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:

  - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
    is managed in Terraform (turnstile.tf) via the CF Global API key, so
    the sitekey/secret are IaC, not a dashboard artifact.
  - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
    Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
    (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
    credential Authentik uses (email-secret.tf ESO -> secret/authentik
    smtp_password).

Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.

Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.

Runbook: docs/runbooks/forgejo-open-signups.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:05:07 +00:00
Viktor Barzin
21dbd79ae4 Merge remote-tracking branch 'origin/master' into wizard/homelab-obs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-19 11:27:44 +00:00
Viktor Barzin
e91e1612dd homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution)
The remaining verbs that pass the "saves reasoning, not just typing" test the
user posed mid-session: each encodes the non-obvious which-endpoint-reached-how
resolution otherwise re-derived every time. (Same test deprioritized node-ssh
and secret-get aliasing — thin wrappers over commands already known.)

- net check <host> [path]: two-legged reachability — external (public DNS→CF)
  vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
  works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
  (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
  since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.

All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster-
only endpoints (Alertmanager v2) deliberately out of scope. Verified live before
building; all five smoke-tested green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:27:31 +00:00
Viktor Barzin
6cb823e431 k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
  in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
  Slack for why" signal. (Until monitoring is applied, a block still surfaces via
  the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
  apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
  core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
  Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
  downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
  matrix maintenance, and the detector minor-probe fix.

After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
2026-06-19 11:27:17 +00:00
Viktor Barzin
cecd9fe247 k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not
Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain
attempts every upgrade but refuses unless it can prove the target is safe. A
refusal is a BLOCK (not a crash) — it halts the chain and signals for attention.

- compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's
  running version doesn't support the target k8s minor, (b) an in-use deprecated
  API (apiserver_requested_deprecated_apis) is removed at/before the target, or
  (c) a node's containerd is below the target's floor. Validated against the live
  cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno
  1.16 (all behind), which is exactly the auto-halt we want until they're bumped.
- addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO,
  kyverno, gpu-operator + containerd floor), sourced from each project's compat
  docs (2026-06-19). The keystone data the gate reads; keep current.
- upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation);
  block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts.
- main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io
  resolves to 200 — minors were never being detected). Gated behind the compat
  gate above, so enabling minor detection can't roll an unsafe minor.

Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight +
runbook (next commit) so the detector fix only goes live with the full net.
2026-06-19 11:23:30 +00:00
Viktor Barzin
9189560ac3 homelab: v0.4.0 — ci/deploy verbs (watch what you trigger)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Adds the verb-group that kills the single biggest reasoning sink in agent
sessions — watching a build/deploy to completion (proven the session that built
it: hours hand-rolling Woodpecker polling + DB-schema spelunking for one CI
incident).

- ci status/watch: Woodpecker REST API (version-stable, not its DB schema),
  reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me
  so the cert verifies — the Go form of the house `curl --resolve` pattern),
  token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with
  retries that ride Woodpecker's intermittent empty responses. watch matches the
  HEAD/given commit (avoids the post-push race) and exits non-zero on failure.
- deploy wait: image-sha match THEN rollout status (rollout status alone returns
  success on the old ReplicaSet); kubectl-based.
- work land now auto-watches CI to green on the landed commit (--no-ci-watch to
  skip), closing the v0.1 gap.
- ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least
  reliable; status/watch use the working list endpoint).

Live-verified ci status/watch against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:59:14 +00:00
Viktor Barzin
787ce4edfa homelab: v0.3.1 — fix k8s db PG target (resolve CNPG primary pod, not the Service)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`k8s db <app>` (Postgres path) execed `pg-cluster-rw`, which is the CNPG
read-write SERVICE, not a pod — so kubectl exec failed with
`pods "pg-cluster-rw" not found`. The unit test only checked the plan; the verb
was never fired at live state (the gap flagged in v0.2), so it shipped broken.

Fix: the PG plan now carries a label selector (cnpg.io/instanceRole=primary)
instead of a pod name, and k8s db resolves the actual primary POD via
`kubectl get pod -l <selector>` before exec. MySQL path (real pod
mysql-standalone-0) unchanged. Live-verified both paths (psql + mysql).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 09:09:34 +00:00
Viktor Barzin
90c944a265 woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Infra pipelines were failing intermittently across all authors (e.g. #241-244,
#247) with the git clone step exiting 128:

  git fetch --depth=1 --filter=tree:0 ...   (partial/treeless clone)
  git reset --hard <sha>
  fatal: could not fetch <tree-sha> from promisor remote
  remote: 404 page not found

The plugin-git clone defaulted to a partial (treeless) clone. The initial ref
fetch carries credentials, but the lazy *promisor* object fetch triggered by
`git reset --hard` hits the PRIVATE Forgejo repo without creds -> 404 -> exit
128. Whether it fired was luck-of-the-draw, hence the ~50% intermittent failures
fleet-wide (not specific to any commit).

Fix: set `partial: false` on every clone block so all objects for the (still
shallow) commit are fetched upfront with creds — no fragile lazy promisor fetch.

Diagnosed against the woodpecker Postgres DB (steps/log_entries) since the
Woodpecker HTTP API was itself flapping. Earlier "permission for ViktorBarzin"
log lines were an unrelated cross-forge red herring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 09:06:44 +00:00
Viktor Barzin
fd77c0dc4f monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot
Some checks failed
ci/woodpecker/push/default Pipeline failed
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a
few of these lately" — and it disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.

Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.

Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:45:39 +00:00
Viktor Barzin
fbf6f11038 feat(tripit): #96 cutover — /api self-authenticates (remove forward-auth, add strip-auth-headers)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0028 #96 (website half): /api drops Authentik forward-auth so the browser can carry a TripIt session cookie (the outpost 302'd cookie-only requests). The app self-authenticates (TripIt-session-first in get_current_user); no session -> 401 -> SPA landing. strip-auth-headers is REQUIRED now: with forward-auth gone, the hybrid forward-auth arm would otherwise trust a client-injected X-authentik-email — stripping inbound X-authentik-* closes that. /metrics split into its own still-gated ingress. Shell keeps Authentik bearers on tripit-api.* until #94; full AUTH_MODE collapse follows then. Verified live: no-session->401, valid TripIt cookie->200, injected header->401, Shell->200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:27:39 +00:00
Viktor Barzin
8559c4574a fix(tripit): pin Authentik invalidation_flow literal (data source flakes null in CI under provider skew)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 244 failed: data.authentik_flow.default_provider_invalidation resolved null in CI (goauthentik 2024.x provider vs 2026.2 server), silently blocking every tripit-stack apply incl. the ADR-0028 #90 signing-key + redirect-URI delivery. Pin the literal UUID (what the slug resolves to) — matches the data-source-skew workaround used for the Vault binding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:10:25 +00:00
Viktor Barzin
e5bb16e02a feat(tripit): activate TripIt-native session auth — signing key + Authentik web redirect (ADR-0028 #90)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Adds SESSION_SIGNING_KEY (Vault secret/tripit -> tripit-secrets ExternalSecret -> env_from) so TripIt's own session JWTs are signed with a real key (the app fails closed under the dev default until this lands), and adds the website OIDC redirect URI https://tripit.viktorbarzin.me/api/auth/callback/authentik to the public tripit-app provider so 'Log in with Authentik' works. Reuses the Shell's existing public OAuth2 app.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:06:43 +00:00
Viktor Barzin
077ac97df5 k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps
Some checks failed
ci/woodpecker/push/default Pipeline failed
kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops
the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the
k8s dashboard) until someone manually re-applied the rbac stack. That manual step
ran after every control-plane upgrade — the one thing keeping autonomous patch
upgrades from being truly hands-off (it bit us this cycle: an earlier master bump
left SSO broken until we noticed).

Automate it: the rbac stack now publishes its existing OIDC restore script (the
same one its null_resource runs) to a kube-system/apiserver-oidc-restore
ConfigMap, and the upgrade chain's phase_master re-runs it on master right after
the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add
apiserver restart can't crashloop it. The script is idempotent and health-gates
/livez with auto-rollback; the step is non-fatal (a failure only lags SSO until
the next rbac apply, it won't abort the upgrade). phase_master already self-skips
when master is at target, so this only fires when master was actually upgraded.

The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the
manual restore is now a documented fallback (command corrected — it needs
-replace, since the null_resource trigger hash never changes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:04:30 +00:00
Viktor Barzin
48b63ffa6f homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline failed
Lets agents search/navigate memory via the CLI, as the first step toward
deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just
one frontend); homelab memory is a thin Bearer-auth HTTP client over the same
API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works
even when the MCP frontend is down — the recurring disconnect that took the MCP
offline for this whole session.

Verbs: recall (server-side semantic search), list, categories, tags, stats,
secret (read); store, update, delete (write). Validated against the live API
including a store→recall→delete round-trip — full data-plane parity with the MCP.

The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to
the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after
the CLI is proven in the hooks — see docs/adr/0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 05:56:25 +00:00
Viktor Barzin
3594485f77 homelab: v0.2.0 — docs + version for the k8s verb-group
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver
note), add docs/adr/0007 (resolver, read/write split, config-mutation stays
raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the
Kubernetes surface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:30:41 +00:00
Viktor Barzin
1f7438bb18 homelab: add k8s verb-group (v0.2) — the biggest remaining surface
Mining the post-v0.1 corpus showed kubectl is the dominant remaining domain by
far: 11,291 commands across 243 sessions (more than everything else combined).
This adds the full k8s verb-group built on an app→namespace→pod resolver (most
namespaces hold one app, so <app> defaults to the namespace and the target
defaults to deploy/<app>, letting kubectl resolve the pod; -n/--pod/-c/-l/--tty
override).

Read: status (pods + non-Normal events), get, logs, describe, debug (one-shot
triage), pf, rollout-status. Write/operational: db (the dbaas psql/mysql exec
pattern — PG via pg-cluster-rw -c postgres, MySQL via mysql-standalone-0 with the
env-password bash wrapper, never inline), exec, rm-pod (pods/jobs ONLY), restart.
Config-mutation verbs (apply/edit/patch/scale/create) are deliberately NOT
exposed — they stay raw per the Terraform-only policy.

Smoke-verified read verbs against the live cluster (get/logs/rollout-status);
write verbs are unit-tested (resolver, db-plan, shell-quoting) but not fired at
live state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:29:51 +00:00
Viktor Barzin
66caa0bf7f homelab: v0.1 docs, distribution wiring, and version
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Completes v0.1: documentation, build/install path, and version stamping.

- cli/VERSION (v0.1.0) stamped into the binary via ldflags.
- cli/README.md rewritten as the homelab overview (verbs + tiers, manifest,
  build, the preserved legacy webhook use-cases).
- docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a
  separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the
  work/tf behaviour (native worktree entry, verification-gated auto-land,
  presence-coupled apply).
- setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run
  (t3-dispatch pattern), so every devvm user gets the current binary.
- AGENTS.md: discovery pointer under Common Operations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:25:51 +00:00
Viktor Barzin
087b415f73 homelab: add work verbs (start/land/clean) with a land verification gate
Completes the infra-loop verb surface. work start creates .worktrees/<topic>
on <user>/<topic> off <remote>/master (git-crypt-aware, ensures .worktrees is
ignored) and prints the path for native EnterWorktree entry. work land fetches,
merges master in, verifies, pushes HEAD:master with non-fast-forward retry, and
falls back to pushing the feature branch for a PR when the direct push is
rejected (branch protection). work clean removes the worktree + branch.

Safety: work land REFUSES to push when it cannot verify (no --verify-cmd and no
auto-detected suite) unless --no-verify is passed. This was added after an
accidental smoke-test invocation pushed unverified WIP to master (benign — the
infra CI applied 0 stacks since the diff was cli/-only — but the gate makes an
unverified land a deliberate choice, not the default).

Known v0.1 limitation: land does not yet block on CI to green; that arrives with
the ci/deploy watch verbs. It prints a reminder to follow the pipeline manually.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:24:08 +00:00
Viktor Barzin
36d562c15c homelab: add tf verbs + stack/git-crypt substrate
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Adds the tf verb-group and the resolver substrate beneath it, continuing the
v0.1 infra-loop build.

- substrate: findInfraRoot (walk up to terragrunt.hcl + stacks/), stack→dir
  resolver, and repo/remote/git-crypt detection (preferRemote forgejo>origin,
  hasGitCryptAttr, gitCryptFlags) — the last is for `work` next.
- tf plan/validate/fmt/force-unlock/apply, resolving the stack from cwd and
  delegating to scripts/tg (which owns state decrypt/encrypt, the Vault lock,
  and the ingress auth-comment check) rather than calling terragrunt directly.
- tf apply is presence-coupled: claims stack:<name>, ALWAYS releases on exit
  (normal, error, or SIGINT/SIGTERM via sync.Once + signal handler) — fixing
  the documented ~200-claim leak — and prints an out-of-band reminder since CI
  applies canonically on push.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:16:33 +00:00
Viktor Barzin
ed6f22fd53 homelab: scaffold unified CLI (registry, manifest, claim/release) in infra/cli
Begin evolving the existing infra/cli into the agent-facing "homelab" CLI
decided in the design/grilling session: one composable, JSON-capable surface
for the operations agents run over and over (mined from 51k commands across
2,225 past sessions; the infra inner-loop is ~29% of them). v0.1 targets that
loop — work/tf/claim — and ships here, in place, in infra/cli.

This first slice:
- command registry + dispatcher (longest-prefix verb matching) and a
  `manifest`/`manifest --json` progressive-discovery entrypoint; every verb
  declares a read|write tier so write-gating can be added later (everything is
  allowed for now).
- claim/release verbs wrapping the existing presence script (not reimplemented),
  with label-taxonomy validation.
- main() front-dispatches the homelab verb surface but falls through to the
  legacy webhook -use-case path verbatim, so the in-cluster infra-cli image is
  unaffected.
- fix a pre-existing vet error (glog.Infof missing format directive) that
  blocked `go test`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:12:57 +00:00
Viktor Barzin
70e217db24 k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The autonomous 1.34.9 version-upgrade chain has been failing its preflight every
night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on
1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an
already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line,
so the parsed target came back empty and the `!= requested` check aborted the
whole chain before any worker was touched. Deterministic — it self-cleaned and
re-failed identically each night, so it would have failed again tonight, leaving
node2-6 stuck on the old patch.

Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION
— the same at-target self-skip that phase_master and phase_worker already do.
The remaining workers are still validated by their own per-node phases, and the
detector already confirmed the target is installable via apt-cache. This lets
tonight's unattended chain resume and finish node2-6 -> 1.34.9.

Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents
writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:17:46 +00:00
Viktor Barzin
8787d361dc claude-memory: HA (replicas 2 + PDB) to stop recurring MCP disconnects
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The claude-memory MCP backend ran as a single replica with no PDB, so every
voluntary disruption took it to zero for ~30-90s — which surfaced as the
memory MCP "keeps getting disconnected" problem. Disruption sources hitting
the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization —
caught evicting it live), Keel image bumps, Reloader restarts on the 7-day
DB-password rotation, node drains, and CI deploys.

The local stdio MCP subprocess itself was proven healthy (fast non-blocking
startup, stderr suppressed, graceful degradation), so the fault was purely
backend availability, not the MCP plumbing.

Fix: run 2 replicas (the backend is stateless FastAPI over shared CNPG
Postgres and already has hostname anti-affinity) + restore the PDB at
minAvailable=1 (safe now — the drain deadlock that justified removing it
only existed at 1 replica) + descheduler evict=false to stop the needless
5-min churn. All five disruption sources become zero-downtime rolling events.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:13:36 +00:00
Viktor Barzin
48b7be3b14 feat(tripit): live lodging-price scrape — LODGING_PROVIDER=playwright
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to turn lodging prices on and stop using the fake provider.
Mirrors the existing FARE_PROVIDER wiring: point the Booking.com/Airbnb lodging
scraper at the shared chrome-service browser over CDP (the namespace is already
admitted through chrome-service's NetworkPolicy for the fare scrape). The lodging
code (ADR-0025, tripit #78) is live in tripit 03973b5, so the env lands after
that rollout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:53:19 +00:00
Viktor Barzin
d709d338c6 service-catalog: add paperless-ai (RAG semantic search + auto-tagging)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Document the new paperless-ai service and the two non-obvious operational
facts: runtime config lives in the PVC .env (not TF env, which would shadow
it), and Qwen3 needs /no_think for parseable tagging output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:44:00 +00:00
Viktor Barzin
4977153dfb paperless-ai: make the PVC .env the single source of config truth
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Auto-tagging silently no-op'd: the container env vars set in the deployment
shadowed the app's own /app/data/.env, because paperless-ai's dotenv loader
does not override process.env. A stale PROCESS_PREDEFINED_DOCUMENTS=yes (with
no TAGS) made the scan select zero documents.

Strip the wizard-owned behavioural config (Paperless URL, AI provider, model,
scan interval, tagging flags) from the container env, keeping only
infrastructural env (PUID/PGID/port/RAG/HF cache) and the Vault-sourced
secret refs. The app's setup-written .env on the PVC is now authoritative,
so processing runs and tags all documents. Qwen3 thinking is disabled via
SYSTEM_PROMPT=/no_think in that .env to keep the model's JSON output parseable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:41:29 +00:00
Viktor Barzin
aeee0d02e2 paperless-ai: deploy clusterzx/paperless-ai for semantic doc search + AI tagging
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor wanted real semantic search over his ~300 Paperless documents and
preferred a ready-made solution over building one. paperless-ai provides
local-embedding RAG (ChromaDB + sentence-transformers, GPU-free) plus
LLM-driven auto-analysis/tagging.

Wiring:
- LLM (chat answers + tagging) -> in-cluster llama-swap qwen3-8b
  (OpenAI-compatible); embeddings + vector store are local on the PVC.
- Reads Paperless over the internal service via a dedicated `paperless-ai`
  superuser token (Vault secret/paperless-ai); app-admin creds also in Vault.
- Encrypted PVC for /app/data (SQLite + ChromaDB + model cache).
- Ingress paperless-ai.viktorbarzin.me behind Authentik (auth=required).
- Third-party image pinned (docker.io/clusterzx/paperless-ai:3.0.9), no Keel.

Runtime config persists to the PVC .env via the app's one-time setup; the
deployment env vars are pre-fill/documentation only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:23:00 +00:00
Viktor Barzin
605cf99a1b portal-tts: docker.io/ prefix on edge-tts image (Kyverno trusted-registries)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The edge-tts apply was blocked by the require-trusted-registries Kyverno policy —
a bare `travisvn/openai-edge-tts` isn't in the allowlist. The policy blanket-
trusts `docker.io/*`, so prefixing the image with `docker.io/` passes admission
with no policy change. Verified live: bg synth round-trips through Whisper
verbatim and a full gateway /v1/talk bg turn returns a coherent spoken Bulgarian
reply ("Добър ден! Добре съм, благодаря!...").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 21:24:34 +00:00
Viktor Barzin
ab55cb5dcd portal-stt: drop setup_tls_secret module (ClusterIP-only, no fullchain.pem)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The landed portal-stt source still declared the setup_tls_secret module +
tls_secret_name variable, which file()-reads secrets/fullchain.pem — a file this
stack does not ship. portal-stt is ClusterIP-only (no ingress; the Gateway is the
sole externally-exposed component, ADR-0001), so it needs no TLS secret. The live
deployment never had it (removed during the original apply); this aligns the
source with reality so CI applies cleanly. Fixes the pipeline-229 portal-stt
apply failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 20:29:31 +00:00
Viktor Barzin
e7b9a74756 portal-assistant: land voice stacks + switch TTS to edge-tts (intelligible Bulgarian)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The portal-assistant voice-assistant stacks (portal-tts, portal-stt,
portal-assistant) were applied to the live cluster from feature branches but
never landed on master — the GitOps source of truth. This lands all three and,
in portal-tts, fixes Bulgarian speech.

Bulgarian was unintelligible: the local Piper voice (bg_BG-dimitar-medium via
espeak-ng) mangles Bulgarian consonants — a synth->Whisper round-trip turned
"Добър ден" into "Обърден", and a user heard pure gibberish. English was fine.

portal-tts now runs openai-edge-tts (Microsoft edge-tts neural voices) for BOTH
languages instead of Piper — ADR-0003 always named edge-tts as the online
Bulgarian-quality fallback. Validated before landing: edge bg round-trips
through Whisper verbatim ("Добър ден! Как сте днес? ..."). The gateway maps
detected language bg/en to the edge voice names via new TTS_VOICE_BG /
TTS_VOICE_EN env (bg-BG-KalinaNeural / en-US-AvaNeural). No GPU, no NFS model
store, no secrets — edge fetches voices from Microsoft per request (egress
verified). The assistant already needs the internet for the Claude brain, so an
online TTS adds no new failure mode.

The brain stays Sonnet with no extended thinking (already the default — a live
turn answers directly in ~3.4s), per the latency-over-smartness ask.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 20:25:29 +00:00
Viktor Barzin
677a181d49 reverse-proxy: dedicated rate limit for ha-london; bump ha-sofia (cold-client 429s)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
New, empty-cache clients (the repurposed Meta Portal running the HA companion
app) cold-load the whole HA frontend at once - dozens of frontend_latest/*.js +
MDI icon chunks. ha-london had no per-service rate limit, so it fell back to the
global 10/s burst 50 and 429'd those chunks, leaving every dashboard blank
(Settings, which loads less, worked). Give ha-london its own 200/500 middleware
(skip_global_rate_limit, mirroring ha-sofia, with depends_on to avoid the
dangling-middleware 404 window) and bump ha-sofia 100/200 -> 200/500 so a cold
Portal load of Sofia doesn't hit the same wall.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 19:53:47 +00:00
420 changed files with 42390 additions and 10698 deletions

File diff suppressed because one or more lines are too long

View file

@ -7,6 +7,7 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
import argparse
import json
import os
import subprocess
import sys
from urllib.parse import urljoin
@ -17,13 +18,29 @@ except ImportError:
print(" pip install requests")
sys.exit(1)
# Configuration from environment variables (ha-sofia specific)
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
if not HA_URL or not HA_TOKEN:
print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
print("These should be set when activating the Claude venv (~/.venvs/claude)")
def _token_from_homelab():
"""Resolve the token via the homelab CLI when the env var isn't set, so the
script works from any directory / unprovisioned session (see ADR-0012)."""
try:
out = subprocess.run(
["homelab", "ha", "token", "--instance", "sofia"],
capture_output=True, text=True, timeout=30)
if out.returncode == 0 and out.stdout.strip():
return out.stdout.strip()
except Exception:
pass
return None
# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
if not HA_TOKEN:
print("ERROR: no ha-sofia API token available.")
print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
sys.exit(1)
HEADERS = {

View file

@ -166,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
@ -177,6 +178,13 @@ Notes:
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
## WebAuthn / Passkeys (2026-06-20)
- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
- **Passkey login path itself is intact:** the identification stage's `passwordless_flow``webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes``tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.

File diff suppressed because one or more lines are too long

View file

@ -11,8 +11,8 @@ description: |
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control.
author: Claude Code
version: 2.0.0
date: 2026-02-07
version: 2.1.0
date: 2026-06-24
---
# Home Assistant Control
@ -44,6 +44,12 @@ There are **two** Home Assistant instances:
- Environment variables for each instance:
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
- If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
## homelab CLI (preferred — works from any directory)
- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.
## API Control
@ -389,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map
### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/` (requires `sudo` for file access)
- **Platform**: Raspberry Pi 4, HA OS
- **Access from the Sofia devvm**: london is **remote**`homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed``detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@ -418,10 +437,15 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike
- `sensor.bike_state_of_charge`: Battery %
- `sensor.bike_total_distance`: Total km
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@ -440,12 +464,17 @@ Named plugs with power/energy tracking:
- Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components
- **cowboy**: Cowboy e-bike integration (HACS)
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
### Custom Components (HACS integrations)
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@ -460,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
- Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle
### Docker Setup
```bash
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### Platform (HAOS — ignore any legacy `docker run` snippet)
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~12 min and resets `sensor.uptime` (use that as the "back up" marker).
### SSH Access
```bash

View file

@ -0,0 +1,203 @@
export const meta = {
name: 'memory-overcommit-node-removal',
description: 'Read-only: assess PVE host + k8s memory overcommit, right-size deployment REQUESTS (scheduling) and LIMITS (OOM) separately from 30d usage, then test whether one worker node can be removed while preserving N-1 by BOTH a physical-usage and a scheduling-request model. Emits a gated plan.',
phases: [
{ title: 'Gather' },
{ title: 'Model' },
{ title: 'Verify' },
],
}
// ---------- confirmed read-only access paths ----------
const SSH = "ssh -o BatchMode=yes -o ConnectTimeout=8 root@192.168.1.127";
const PROM = "https://prometheus-query.viktorbarzin.lan/api/v1/query";
const G = (mib) => (mib == null ? "?" : (mib / 1024).toFixed(1) + "Gi");
// ---------- schema helpers ----------
const num = { type: "number" }, str = { type: "string" }, bool = { type: "boolean" };
const arr = (items) => ({ type: "array", items });
const obj = (props) => ({ type: "object", additionalProperties: false, required: Object.keys(props), properties: props });
const HOST = obj({
host_total_mib: num, host_used_mib: num, host_free_mib: num, host_available_mib: num,
swap_total_mib: num, swap_used_mib: num, ksm_saved_mib: num,
vms: arr(obj({ vmid: num, name: str, configured_mib: num, balloon_mib: num, rss_mib: num, is_k8s_node: bool })),
sum_vm_configured_mib: num, sum_vm_rss_mib: num, notes: str,
});
const K8S = obj({
nodes: arr(obj({
name: str, role: str, is_gpu: bool, is_control_plane: bool, gpu_tainted: bool, schedulable: bool,
capacity_mib: num, allocatable_mib: num, requests_mib: num, ds_requests_mib: num, limits_mib: num, usage_now_mib: num, peak_30d_mib: num, pod_count: num,
})),
cluster_allocatable_mib: num, cluster_requests_mib: num, cluster_usage_now_mib: num, cluster_peak_30d_mib: num, notes: str,
});
// NOTE the v2 split: requests are sized for SCHEDULING (cover normal load, can shrink below current),
// limits are sized for OOM SAFETY (cover peak). They are DIFFERENT knobs and must not be conflated.
const USAGE = obj({
totals: obj({
sum_current_requests_mib: num, sum_recommended_requests_mib: num, net_request_reclaim_mib: num,
reschedulable_request_recommended_mib: num, ds_request_recommended_per_node_mib: num, gpu_request_recommended_mib: num,
largest_single_request_mib: num, count_request_shrink: num, count_limit_raise_oom: num,
}),
request_shrinks: arr(obj({ namespace: str, name: str, kind: str, replicas: num, current_request_mib: num, p95_30d_mib: num, recommended_request_mib: num, delta_mib: num, rationale: str })),
limit_raises_oom: arr(obj({ namespace: str, name: str, container: str, current_limit_mib: num, peak_max_30d_mib: num, recommended_limit_mib: num, risk: str })),
spiky_periodic: arr(obj({ namespace: str, name: str, note: str })),
method_notes: str,
});
const TOPO = obj({
nodes: arr(obj({ name: str, sticky_pods: arr(str), local_pv_count: num, volumeattachments: num, cnpg_primary: bool, gpu_workloads: bool, evac_difficulty: str, evac_notes: str })),
spofs: arr(obj({ namespace: str, name: str, replicas: num, has_pdb: bool, issue: str })),
antiaffinity_risks: arr(str),
csi_pinning_note: str,
priority_classes_note: str,
notes: str,
});
const VERDICT = obj({ refuted: bool, confidence: str, reasoning: str, corrections: arr(str) });
// ---------- prompts ----------
const HOST_PROMPT = `Read-only PVE host memory audit. SSH (key-based): ${SSH} '<cmd>' (host 'pve', the Proxmox r730 at 192.168.1.127). Read-only ONLY; NEVER a state-changing qm/pvesh/ha-manager command.
- 'free -m' -> host_total/used/free/available_mib + swap_total/swap_used_mib.
- KSM: cat /sys/kernel/mm/ksm/pages_sharing ; ksm_saved_mib = pages_sharing*4096/1048576.
- 'qm list'; for each running VM 'qm config <vmid>' -> memory (configured_mib), balloon (balloon_mib; if balloon==memory or balloon==0 ballooning is effectively OFF -> host RSS pins near configured = the headroom RATCHET).
- Per-VM host RSS: read /var/run/qemu-server/<vmid>.pid then 'ps -o rss= -p <pid>' (KiB->MiB).
- is_k8s_node = VMs named k8s-*.
Return per-VM rows + sum_vm_configured_mib + sum_vm_rss_mib over ALL RUNNING VMs. notes: overcommit ratio, swap pressure, ballooning state.`;
const K8S_PROMPT = `Read-only Kubernetes node-capacity audit. kubectl read access confirmed. For every node (k8s-master + k8s-node1..6):
- capacity_mib & allocatable_mib from 'kubectl get node <n> -o json' (Ki->MiB).
- is_control_plane (node-role.kubernetes.io/control-plane), is_gpu (k8s-node1; nvidia.com/gpu in capacity), gpu_tainted (a NoSchedule taint general pods would NOT tolerate), schedulable.
- requests_mib, limits_mib, ds_requests_mib (DaemonSet-owned pods only), usage_now_mib, pod_count.
Prefer Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=<q>'):
sum by (node)(kube_pod_container_resource_requests{resource="memory"}) [these metrics HAVE a node label]
usage_now: cAdvisor container_memory_working_set_bytes has NO node label - join: sum by (node)(container_memory_working_set_bytes{container!="",container!="POD"} * on(namespace,pod) group_left(node) kube_pod_info)
- peak_30d_mib per node: max_over_time of that joined per-node sum over [30d:5m] (best effort; if the join is flaky leave 0 and rely on cluster figure).
ALSO return cluster-wide:
- cluster_allocatable_mib, cluster_requests_mib, cluster_usage_now_mib.
- cluster_peak_30d_mib = max_over_time(sum(container_memory_working_set_bytes{container!="",container!="POD"})[30d:5m]) /1024/1024 (this is the PHYSICAL reliability bedrock - the highest the whole cluster ever simultaneously used in 30d).
notes: host-vs-k8s overcommit contrast (requests vs allocatable vs actual usage).`;
const USAGE_PROMPT = `Read-only memory RIGHT-SIZING from 30-day usage. CRITICAL: requests and limits are DIFFERENT knobs - size them separately. Do NOT set requests to peak (that is what a flawed earlier run did; it manufactured a false capacity shortfall).
- REQUEST (scheduling reservation, drives bin-packing & node-removal feasibility): size to cover NORMAL operation = recommended_request_mib = ceil(max(p95_30d * 1.15, 64)). This SHRINKS the many over-provisioned requests toward real usage. requests should sit BELOW limits (Burstable). Be moderately conservative for stateful/db/critical infra (mysql, postgres/CNPG, redis, vault, prometheus, mailserver): use p99 instead of p95.
- LIMIT (OOM ceiling): recommended_limit_mib = ceil(peak_max_30d * 1.25). FLAG any container whose peak_max_30d >= 95% of current limit as an OOM risk (limit_raises_oom) - these are real reliability bugs to fix REGARDLESS of node removal.
Sources: kubectl (current requests/limits/replicas for Deployments/StatefulSets/DaemonSets, all namespaces); Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=<q>'):
p95: quantile_over_time(0.95, container_memory_working_set_bytes{container!="",container!="POD"}[30d])
p99: quantile_over_time(0.99, ...[30d])
peak: max_over_time(...[30d])
Aggregate by (namespace,pod,container), map pod->workload (strip hash suffixes), take MAX across a workload's pods as per-replica value.
Splits for the N-1 model (use the REQUEST recommendation; multiply per-replica by replicas):
- reschedulable_request_recommended_mib = SUM recommended_request of Deployment+StatefulSet pods that are NON-GPU and schedulable on general workers (everything that must reschedule if a worker is removed).
- ds_request_recommended_per_node_mib = SUM recommended_request of DaemonSet containers (one set per node).
- gpu_request_recommended_mib = SUM recommended_request of workloads pinned to GPU node k8s-node1 (REAL value; do not inflate).
- largest_single_request_mib = largest single recommended per-replica request among reschedulable.
Return totals (sum_current_requests_mib, sum_recommended_requests_mib, net_request_reclaim_mib = sum of POSITIVE request deltas i.e. shrinks, the splits, count_request_shrink, count_limit_raise_oom), request_shrinks (top ~30 by delta), limit_raises_oom (every OOM-tight container), spiky_periodic (mailserver/immich-ml/backups/dumps/postiz). NEVER mutate.`;
const TOPO_PROMPT = `Read-only reliability-topology audit: which worker is safest to remove? Candidates: k8s-node2..node6 (NOT master, NOT GPU node1). For each worker (k8s-node1..6): sticky_pods (StatefulSet members; pods with local/hostPath PVCs; single-replica critical), local_pv_count, volumeattachments, cnpg_primary (CNPG 'pg-cluster' PRIMARY here? check pod role labels), gpu_workloads, evac_difficulty (easy|medium|hard)+evac_notes.
Cluster-wide: spofs (1 replica AND no PDB); antiaffinity_risks (hard podAntiAffinity / topologySpread DoNotSchedule that becomes UNSATISFIABLE at one fewer worker - check replica counts vs surviving distinct hosts); csi_pinning_note (do Proxmox-CSI PVs pin to a node, or share one host-level topology so they reattach anywhere? check volumeHandle / topology zone/region on the PVs - this decides whether removal STRANDS data); priority_classes_note. NEVER mutate.`;
// ============================================================
phase('Gather');
log('Gather (read-only): PVE host memory, k8s capacity + cluster 30d peak, request/limit right-sizing, reliability topology');
const [host, k8s, usage, topo] = await parallel([
() => agent(HOST_PROMPT, { label: 'gather:pve-host', phase: 'Gather', schema: HOST }),
() => agent(K8S_PROMPT, { label: 'gather:k8s-capacity', phase: 'Gather', schema: K8S }),
() => agent(USAGE_PROMPT, { label: 'gather:rightsize', phase: 'Gather', schema: USAGE }),
() => agent(TOPO_PROMPT, { label: 'gather:reliability', phase: 'Gather', schema: TOPO }),
]);
if (!k8s || !usage) return { error: 'Critical gather agent failed (k8s/usage).', host, k8s, usage, topo };
// ============================================================
phase('Model');
const T = usage.totals;
const workers = k8s.nodes.filter((n) => !n.is_control_plane);
const generalPool = workers.filter((n) => !n.gpu_tainted); // general pods can land here (incl. GPU node if not tainted)
const candidates = workers.filter((n) => !n.is_gpu && !n.is_control_plane); // node2..node6
const clusterPeak = k8s.cluster_peak_30d_mib || 0;
const freeGeneral = (n) => n.allocatable_mib - (T.ds_request_recommended_per_node_mib || 0) - (n.is_gpu ? (T.gpu_request_recommended_mib || 0) : 0);
function evalRemove(removeName) {
const pool = generalPool.filter((n) => n.name !== removeName);
// --- scheduling N-1 (realistic requests): fit reschedulable load even if the largest survivor then fails ---
const frees = pool.map(freeGeneral);
const schedCap = frees.reduce((a, b) => a + b, 0) - (frees.length ? Math.max(...frees) : 0);
const schedNeed = T.reschedulable_request_recommended_mib;
const schedMargin = schedCap - schedNeed;
// --- physical N-1 (actual peak usage): cluster 30d peak must fit on survivors after losing the largest too ---
const survAlloc = pool.map((n) => n.allocatable_mib);
const physCap = survAlloc.reduce((a, b) => a + b, 0) - (survAlloc.length ? Math.max(...survAlloc) : 0);
const physMargin = physCap - clusterPeak;
const t = topo && topo.nodes ? topo.nodes.find((n) => n.name === removeName) : null;
return {
removeName, pool: pool.map((n) => n.name),
sched_capacityN1_mib: Math.round(schedCap), sched_need_mib: Math.round(schedNeed), sched_margin_mib: Math.round(schedMargin), sched_pass: schedMargin >= 0,
phys_capacityN1_mib: Math.round(physCap), cluster_peak_mib: Math.round(clusterPeak), phys_margin_mib: Math.round(physMargin), phys_pass: physMargin >= 0,
pass: schedMargin >= 0 && physMargin >= 0,
host_freed_mib: hostFreedFor(removeName),
evac_difficulty: t ? t.evac_difficulty : 'unknown', cnpg_primary: t ? t.cnpg_primary : false, sticky_pods: t ? t.sticky_pods : [],
};
}
function hostFreedFor(nodeName) {
if (host && host.vms) {
const s = nodeName.replace('k8s-', '');
const vm = host.vms.find((v) => v.name === nodeName || (v.name && v.name.includes(s)));
if (vm) return vm.configured_mib;
}
const n = k8s.nodes.find((x) => x.name === nodeName);
return n ? n.capacity_mib : 0;
}
const evalCandidates = candidates.map((c) => evalRemove(c.name));
const diffRank = { easy: 0, medium: 1, hard: 2, unknown: 3 };
const passing = evalCandidates.filter((c) => c.pass && !c.cnpg_primary)
.sort((a, b) => (diffRank[a.evac_difficulty] - diffRank[b.evac_difficulty]) || (b.phys_margin_mib - a.phys_margin_mib));
const best = passing[0] || null;
const hostOvercommit = host ? { sum_vm_configured_mib: host.sum_vm_configured_mib, host_total_mib: host.host_total_mib, ratio: +(host.sum_vm_configured_mib / host.host_total_mib).toFixed(3), free_mib: host.host_free_mib, available_mib: host.host_available_mib, swap_used_mib: host.swap_used_mib, swap_total_mib: host.swap_total_mib, ksm_saved_mib: host.ksm_saved_mib } : null;
const k8sOvercommit = { cluster_requests_mib: k8s.cluster_requests_mib, cluster_allocatable_mib: k8s.cluster_allocatable_mib, cluster_usage_now_mib: k8s.cluster_usage_now_mib, cluster_peak_30d_mib: clusterPeak, request_ratio: +(k8s.cluster_requests_mib / k8s.cluster_allocatable_mib).toFixed(3), usage_ratio: +(clusterPeak / k8s.cluster_allocatable_mib).toFixed(3) };
log(`Host overcommit ${hostOvercommit ? hostOvercommit.ratio : '?'}x (${G(hostOvercommit && hostOvercommit.free_mib)} free, swap ${G(hostOvercommit && hostOvercommit.swap_used_mib)}/${G(hostOvercommit && hostOvercommit.swap_total_mib)})`);
log(`K8s: requests ${G(k8s.cluster_requests_mib)} / 30d-peak-usage ${G(clusterPeak)} / allocatable ${G(k8s.cluster_allocatable_mib)} -> requests are ${(k8s.cluster_requests_mib / clusterPeak).toFixed(2)}x real peak`);
log(`Request right-sizing: ${G(T.net_request_reclaim_mib)} of over-provisioned requests can be trimmed (${T.count_request_shrink} workloads); ${T.count_limit_raise_oom} workloads are OOM-tight on LIMITS (raise regardless).`);
for (const c of evalCandidates) log(` remove ${c.removeName}: phys-N1 ${c.phys_pass ? 'PASS' : 'FAIL'} (${G(c.phys_margin_mib)}) | sched-N1 ${c.sched_pass ? 'PASS' : 'FAIL'} (${G(c.sched_margin_mib)}) | frees ~${G(c.host_freed_mib)} host | evac ${c.evac_difficulty}${c.cnpg_primary ? ' CNPG-PRIMARY' : ''}`);
log(best ? `Best candidate: ${best.removeName} (phys margin ${G(best.phys_margin_mib)}, frees ~${G(best.host_freed_mib)})` : 'No candidate passes both N-1 tests.');
// ============================================================
phase('Verify');
const headline = best
? `${best.removeName} can be removed while preserving N-1: cluster 30d peak usage ${G(clusterPeak)} fits on survivors-minus-one (${G(best.phys_capacityN1_mib)}); after trimming over-provisioned requests, scheduling also fits (${G(best.sched_margin_mib)} margin). Frees ~${G(best.host_freed_mib)} to the PVE host.`
: `No worker can be removed while preserving N-1 by BOTH physical-usage and scheduling-request models.`;
const verifyData = JSON.stringify({ hostOvercommit, k8sOvercommit, k8s_nodes: k8s.nodes, usage_totals: T, evalCandidates, best, csi_pinning_note: topo ? topo.csi_pinning_note : null, generalPool: generalPool.map((n) => n.name) }, null, 2);
const lenses = [
{ key: 'math', ask: 'Recompute BOTH N-1 models independently. Physical: cluster 30d peak vs (sum survivor allocatable - largest survivor). Scheduling: reschedulable recommended REQUESTS (not limits, not peak) vs (sum survivor freeGeneral - largest). Verify GPU node reserve uses REAL gpu requests, allocatable not capacity, DaemonSets are per-node fixed load. Are pool selection and numbers right?' },
{ key: 'temporal', ask: 'Challenge the 30-DAY peak window and the request shrinks. Could a monthly/quarterly peak exceed cluster_peak_30d (compare a 90d peak)? Are the shrunk REQUESTS safe given each workload keeps a limit above its peak (Burstable)? Name any shrink or any still-tight limit that is reckless.' },
{ key: 'stateful', ask: 'Check the chosen candidate for STRANDED state and drain blockers: CSI PV pinning (do volumes reattach anywhere?), CNPG primary, VolumeAttachment caps, anti-affinity/topologySpread unsatisfiable at one fewer worker, PDBs that block drain (disruptionsAllowed=0). Is removal actually safe, and what drain ORDERING is required?' },
];
const verdicts = (await parallel(lenses.map((l) => () =>
agent(`Adversarial reviewer. Try to REFUTE:\n"${headline}"\n\nLens: ${l.ask}\n\nData (read-only). Verify LIVE: kubectl, Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=...'), ${SSH} '<cmd>'.\n\n${verifyData}\n\nDefault refuted=true if evidence does not clearly hold. Give concrete corrections.`,
{ label: `verify:${l.key}`, phase: 'Verify', schema: VERDICT }))
)).filter(Boolean);
return {
headline,
hostOvercommit, k8sOvercommit,
rightsizing: T,
request_shrinks: usage.request_shrinks,
limit_raises_oom: usage.limit_raises_oom,
spiky_periodic: usage.spiky_periodic,
candidates: evalCandidates,
recommendation: best,
k8s_nodes: k8s.nodes,
host_vms: host ? host.vms : null,
topo_spofs: topo ? topo.spofs : [],
topo_nodes: topo ? topo.nodes : [],
csi_pinning_note: topo ? topo.csi_pinning_note : null,
antiaffinity_risks: topo ? topo.antiaffinity_risks : [],
verdicts,
verdict_summary: `${verdicts.filter((v) => v.refuted).length}/${verdicts.length} reviewers refuted the headline`,
};

9
.gitattributes vendored
View file

@ -4,3 +4,12 @@
*.tfvars filter=git-crypt diff=git-crypt
secrets/** filter=git-crypt diff=git-crypt
stacks/**/secrets/** filter=git-crypt diff=git-crypt
# Kubeconfigs / cluster credentials — encrypt at rest so a force-added or renamed
# commit can't push plaintext to the public GitHub mirror. Belt-and-suspenders to
# the .gitignore rules above; `.config` is explicit because that is exactly the
# name an admin kubeconfig once leaked under (GitGuardian, 2026-07-02).
.config filter=git-crypt diff=git-crypt
kubeconfig filter=git-crypt diff=git-crypt
*.kubeconfig filter=git-crypt diff=git-crypt
admin.conf filter=git-crypt diff=git-crypt

39
.github/workflows/build-authentik.yml vendored Normal file
View file

@ -0,0 +1,39 @@
name: Build Custom Authentik Image
# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
# Thin SLOW-1a overlay over the official authentik server (narrows the login
# identification stage's select_subclasses() to the login-capable source subtypes;
# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on
# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
# in modules/authentik/values.yaml together.
on:
push:
branches: [master]
paths:
- 'stacks/authentik/Dockerfile'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/authentik
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
ghcr.io/viktorbarzin/authentik-server:latest

View file

@ -0,0 +1,39 @@
name: Build chrome-service-browser
# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
# the pod pulls it without credentials.
on:
push:
branches: [master]
paths:
- 'stacks/chrome-service/files/chrome/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/chrome-service/files/chrome
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/chrome-service-browser:latest
ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}

42
.github/workflows/build-excalidraw.yml vendored Normal file
View file

@ -0,0 +1,42 @@
name: Build excalidraw-library
# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind
# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls
# ghcr:latest and rolls the deployment. Replaces the manual DockerHub pushes
# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image).
on:
push:
branches: [master]
paths:
- 'stacks/excalidraw/project/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version: '1.21'
- run: go test ./...
working-directory: stacks/excalidraw/project
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/excalidraw/project
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/excalidraw-library:latest
ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }}

View file

@ -0,0 +1,39 @@
name: Build valia-sites-sync
# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public).
# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob.
# Rebuilds are rare (tool pins only change deliberately) → dispatch + path.
# Security note: no untrusted event inputs are interpolated anywhere (only
# github.actor / github.sha / GITHUB_TOKEN — same shape as the other
# build-*.yml workflows in this repo).
on:
push:
branches: [master]
paths:
- 'stacks/valia-sites/sync-image/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/valia-sites/sync-image
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/valia-sites-sync:latest
ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }}

15
.gitignore vendored
View file

@ -71,8 +71,15 @@ stacks/*/cloudflare_provider.tf
stacks/*/tiers.tf
stacks/*/terragrunt_rendered.json
# Kubernetes config (sensitive)
# Kubernetes config / cluster credentials (sensitive) — never commit in plaintext.
# `config` alone missed the dotfile form: an admin kubeconfig once leaked to the
# public mirror as `.config` (GitGuardian, 2026-07-02). Cover the common names.
config
.config
kubeconfig
*.kubeconfig
admin.conf
.kube/
# Node.js (not part of infra)
node_modules/
@ -110,3 +117,9 @@ terraform.tfstate.backup
# Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
# secrets; created by terraform state ops. The patterns above miss the timestamped form.
terraform.tfstate.*.backup
# Python test artifacts (pytest bytecode cache) — e.g. from
# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
__pycache__/
*.pyc
.pytest_cache/

View file

@ -19,6 +19,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 2
attempts: 5
backoff: 10s
@ -64,6 +65,21 @@ steps:
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands:
# ── Forge guard: apply ONLY on the canonical Forgejo forge ──
# infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
# the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
# guard both run `terragrunt apply` on every push and race each other for
# the per-stack PG state lock — the dominant cause of the "Error acquiring
# the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
# registration keeps running the CRONS (drift-detection, renew-tls, …) — only
# its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
# env var set) still applies, preserving prior behaviour.
- |
if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
exit 0
fi
# ── Skip CI commits ──
- |
if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
@ -212,23 +228,40 @@ steps:
if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
# lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
# apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
# (so the app-stack detector still excludes it) but skipped here.
# (2026-06-27 — see docs/architecture/ci-cd.md)
if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
echo "[$stack] Starting apply..."
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
fi
# Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
# ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
# ("Error acquiring the state lock" / "already locked"). The PG case
# was previously counted as a failure — the #1 source of false reds.
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
fi
# Transient: provider-registry download timeout / Vault 5xx → bounded
# retry. Deliberately NOT helm atomic-timeouts or config errors
# (missing arg, invalid index) — those must fail fast, retry can't fix
# them and can worsen a stuck helm release.
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
done
done < .platform_apply
fi
# Deferred until after app stacks so both lists get a chance to run.
@ -241,22 +274,27 @@ steps:
echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do
echo "[$stack] Starting apply..."
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
fi
# Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
fi
# Transient provider-download / Vault 5xx → bounded retry (see platform loop).
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
done
done < .app_apply
fi
# Fail the step loudly so the pipeline `default` workflow state
@ -286,13 +324,8 @@ steps:
fi
GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master
# ── Slack notification ──
- |
PLATFORM_COUNT=$(wc -l < .platform_apply 2>/dev/null | tr -d ' ')
APP_COUNT=$(wc -l < .app_apply 2>/dev/null | tr -d ' ')
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS} (platform:${PLATFORM_COUNT}, apps:${APP_COUNT})\"}" \
"$SLACK_WEBHOOK" || true
# (No Slack post on success — Viktor 2026-07-02: CI notifies on FAILED
# runs only; the notify-failure step below covers those.)
# Slack on failure (runs even if apply step fails)
- name: notify-failure

View file

@ -9,6 +9,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
@ -84,6 +85,13 @@ steps:
stack=$(basename "$stack_dir")
[ -f "$stack_dir/terragrunt.hcl" ] || continue
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
# Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
# on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
# run. Skip it — drift on Tier-0 vault is caught at human apply time.
# (2026-06-27)
[ "$stack" = "vault" ] && continue
echo -n "[$stack] planning... "
OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
EXIT=$?
@ -139,13 +147,30 @@ steps:
echo "Drift: ${DRIFTED:-none}"
echo "Errors: ${ERRORS:-none}"
# ── Slack alert if drift found ──
# ── Slack only when something is WRONG (drift or errors) ──
# All-clean runs are silent (Viktor 2026-07-02: CI notifies on
# failed/actionable runs only; clean is the daily normal).
if [ -n "$DRIFTED" ]; then
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":warning: Drift detected in:${DRIFTED}\nClean: ${CLEAN} stacks. Errors:${ERRORS:-none}\"}" \
"$SLACK_WEBHOOK" || true
else
elif [ -n "$ERRORS" ]; then
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":white_check_mark: Drift detection: all ${CLEAN} stacks clean${ERRORS:+. Errors: $ERRORS}\"}" \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Drift detection had errors: ${ERRORS} (clean: ${CLEAN})\"}" \
"$SLACK_WEBHOOK" || true
fi
# Hard-failure catch: the in-script posts above never run if the step
# itself crashes early — this step is the only signal for that case.
- name: notify-failure
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Drift-detection pipeline FAILED (crashed before reporting)\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
status: [failure]

View file

@ -5,6 +5,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 2
steps:

View file

@ -11,6 +11,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 5
steps:
@ -27,6 +28,7 @@ steps:
from_secret: slack_webhook
commands:
- apk add --no-cache curl
- "curl -sf -X POST https://hooks.slack.com/services/$SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{\"text\": \"Post-mortem TODO pipeline completed\"}' || true"
- "curl -sf -X POST https://hooks.slack.com/services/$SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{\"text\": \":red_circle: Post-mortem TODO pipeline FAILED\"}' || true"
when:
- status: [success, failure]
# Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
- status: [failure]

View file

@ -5,6 +5,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
attempts: 5
backoff: 10s

View file

@ -23,6 +23,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
@ -57,7 +58,8 @@ steps:
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"PVE /etc/exports sync: ${CI_PIPELINE_STATUS}\"}" \
--data "{\"channel\":\"general\",\"text\":\":red_circle: PVE /etc/exports sync FAILED\"}" \
"$SLACK_WEBHOOK" || true
when:
status: [success, failure]
# Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
status: [failure]

View file

@ -38,6 +38,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
@ -150,7 +151,8 @@ steps:
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"Registry config sync on 10.0.20.10: ${CI_PIPELINE_STATUS}\"}" \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Registry config sync on 10.0.20.10 FAILED\"}" \
"$SLACK_WEBHOOK" || true
when:
status: [success, failure]
# Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
status: [failure]

View file

@ -6,6 +6,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
attempts: 5
backoff: 10s
@ -70,10 +71,11 @@ steps:
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: TLS certificate renewal ${CI_PIPELINE_STATUS}\"}" \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Woodpecker CI: TLS certificate renewal FAILED\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
status: [success, failure]
# Failure-only (Viktor 2026-07-02): successful renewals are routine.
status: [failure]

View file

@ -9,7 +9,7 @@
- **Ask before `git push`** — always confirm with the user first
## Execution
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
@ -273,8 +273,11 @@ To land a finished change from such a clone:
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
4. Leave the clone on clean `master` so auto-refresh keeps working.
5. Tell the user in plain language what happened. Stack changes are
auto-applied by CI — verify the live result with the user's read-only
kubectl before saying "it's live".
auto-applied by CI on push — or, with apply access, applied locally yourself
(`scripts/tg apply`, from the main checkout, not a worktree); either path is
fine, but the change must always be committed here, never applied
uncommitted. Verify the live result with the user's read-only kubectl before
saying "it's live".
If a push to `master` is rejected by branch protection (user not on the
whitelist — e.g. new users before Viktor grants it), fall back to a
@ -289,6 +292,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
```
## Common Operations
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.

View file

@ -56,6 +56,28 @@ _Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
_Avoid_: bare "user", "tenant".
### GPU sharing
**GPU slice**:
One unit of `nvidia.com/gpu` on the time-sliced Tesla T4 — a **scheduling turn, NOT a memory allocation**. The device plugin advertises the card ×100; a pod requesting `nvidia.com/gpu: 1` gets GPU *access*, with zero guarantee about how much of the 16 GB VRAM it may use. "Overallocate GPU memory" is a real failure precisely because a slice carries no memory accounting.
_Avoid_: reading a GPU slice as a memory reservation or a fraction of the card; "vGPU" (we run no vGPU/MIG/MPS — see ADR-0016).
**GPU memory budget**:
The custom node-level extended resource **`viktorbarzin.me/gpumem`** (integer MiB) that makes the scheduler VRAM-aware (ADR-0016). The GPU node advertises a total (~14000 MiB = physical minus driver/context slack); each GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"`; being non-overcommittable, the scheduler refuses to co-schedule past the card (overflow → `Pending`). A *schedule-time* reservation, **not** a runtime cap — it stops pile-on, not a single tenant's runaway.
_Avoid_: treating it as a hard CUDA cap (it isn't — that's what the **GPU watchdog** is for); confusing it with the `nvidia.com/gpu` slice (orthogonal axes: access vs memory accounting).
**GPU watchdog**:
The `gpu-vram-watchdog` CronJob (nvidia ns) that supplies the runtime teeth the **GPU memory budget** lacks: when *actual* free VRAM (`gpu_pod_memory_used_bytes`) drops below a floor, it recycles the biggest tenant that is **over its declared budget**. Enforces the budget as a contract, acts only under pressure (so bursting into genuine slack is fine), and is what bounds the 2026-06-02 immich-ml runaway class.
_Avoid_: expecting it to act on priority (it enforces the *budget*, since co-tenants often share one PriorityClass); expecting instant prevention (it corrects with a detection lag — soft, by design).
**GPU demand-gate**:
The scale-0↔1 admission CronJobs (`stacks/tts`) that bring a best-effort *batch* GPU tenant (chatterbox-tts) up only when free VRAM ≥ a floor and idle it back down — letting on-demand tenants fill real slack without holding a reserved **GPU memory budget** seat.
_Avoid_: using it for interactive tenants (cold-load lag — portal-stt is warm-resident instead); conflating it with the **GPU watchdog** (gate = admit on free VRAM; watchdog = recycle on over-budget pressure).
**gpu-workload priority**:
The `gpu-workload` PriorityClass (1,200,000) auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority` policy — the exclude list (`tts`) drops to `tier-2-gpu` (600,000) so it loses node-pressure eviction first. Governs *Kubernetes node* eviction order, **not** VRAM (VRAM is the budget + watchdog's job).
_Avoid_: assuming it protects VRAM; it is a scheduling/eviction priority on node memory/CPU pressure.
### Workstation (multi-user devvm)
**devvm**:
@ -96,6 +118,14 @@ _Avoid_: "external", "outside".
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
_Avoid_: bare "lan", "private", "intranet".
**Segment**:
One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q.
_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept).
**CCTV segment**:
The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017).
_Avoid_: "camera VLAN", "CCTV LAN".
**Ingress auth**:
The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
_Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
@ -117,9 +147,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
_Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.
**Calico**:
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
_Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.
**Service identity**:
How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage
**proxmox-lvm-encrypted**:
@ -199,6 +237,20 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**.
**Anubis**:
A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
### Externally-authored sites
**Valia site**:
A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `<name>.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`.
_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**.
**Content folder**:
The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site.
_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root).
**Entry file**:
The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring.
_Avoid_: asking Valia to rename her files to fit hosting conventions.
## Relationships
- A **Service** is defined by exactly one **Stack****flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
@ -210,6 +262,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
- Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra.
## Example dialogue

View file

@ -1,2 +1,287 @@
# What is this?
This is a CLI to manipulate files in the terraform repo and commit and push them
# homelab
`homelab` is the unified, agent-facing CLI for operating this homelab — one
composable, JSON-capable surface for the operations agents run over and over,
discovered progressively at runtime. It is grown **in place** from this
directory (the former `infra-cli`), and the legacy webhook use-cases still work
(see below).
It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
## Usage
```
homelab <command> [args]
homelab manifest [--json] # list every verb + its read/write tier (discovery entrypoint)
homelab version
```
### v0.1 verbs — the infra inner-loop
| Command | Tier | What it does |
|---|---|---|
| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
| `release <kind>:<name>` | write | release a presence claim |
| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
| `tf validate <stack>` | read | `scripts/tg validate` |
| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
### v0.2 verbs — Kubernetes
Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
ambient kubeconfig.
| Command | Tier | What it does |
|---|---|---|
| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
`tf` resolves the stack dir by walking up from cwd to the infra root and
delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
the ingress auth-comment check). git-crypt filter flags are auto-injected on git
operations in the encrypted infra repo.
**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
auto-detected suite) unless you pass `--no-verify` — landing to master unverified
must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
reads / prompt writes; v0.1 allows everything and relies on existing gates
(permission mode, presence claims, plan approval).
### v0.3 verbs — memory
A thin HTTP client over the **claude-memory** service (the same backend the
memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
ingress). Because it hits the HTTP API directly, it **works even when the MCP
frontend is down**.
| Command | Tier | What it does |
|---|---|---|
| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
| `memory list [--category --tag --limit]` | read | recent memories |
| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
| `memory secret <id>` | read | reveal a sensitive memory's content |
| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
| `memory update <id> [--content --tags --importance]` | write | edit a memory |
| `memory delete <id>` | write | delete a memory |
All read/write paths are validated against the live API (incl. a
store→recall→delete round-trip). This gives full data-plane parity with the MCP;
the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up**
see `docs/adr/0008`.
### v0.4 verbs — ci / deploy
Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
remote, with retries that ride Woodpecker's intermittent empty responses.
| Command | Tier | What it does |
|---|---|---|
| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
`work land` now calls `ci watch` on the landed commit automatically (skip with
`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
the least reliable; `status`/`watch` use the list endpoint that works.
### v0.5 verbs — net / dns / metrics / logs
Reachability + observability probes. Their value is *endpoint resolution* — the
non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
otherwise re-derive every time — not the HTTP call itself. All reach internal
ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
| Command | Tier | What it does |
|---|---|---|
| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
firing set is reachable via `ALERTS` instead.)
### v0.6 — usage telemetry (`usage top`)
Makes "which verbs are actually used, by everyone" a query instead of a guess —
so adding the *next* verb is evidence-driven, not shaped by one person's habits.
Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
the shared Loki, aggregate usage is queryable **without reading anyone's home**
the privacy-preserving answer to "what does the team use."
| Command | Tier | What it does |
|---|---|---|
| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
### v0.7 verbs — Home Assistant
Cover exactly the two things the `ha` **MCP server can't**: resolving the
long-lived API token out of the cluster, and SSH to the HA host for host-level
work (config files, docker, add-ons). Entity state and control (`turn_on`,
`get_state`, services) stay with the MCP — *actions an MCP already encodes are
out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
the non-obvious *which secret, which host, which key, which flags* you'd
otherwise re-derive every session — agents were hand-rolling a
`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
every run because the existing `home-assistant-sofia.py` needs an env var set
and a cwd-relative path, neither of which holds in an arbitrary session.
| Command | Tier | What it does |
|---|---|---|
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
not tied to whoever first wrote the workflow (the user's key must be enrolled on
the HA host).
### v0.8 verbs — browser (headful anti-bot automation)
Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
from the devvm over CDP, for sites that detect and block headless automation. The
headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
the gated action (submit/login) silently fails — the motivating case was the
Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
injects the same `stealth.js` the in-cluster callers use, and submits first try.
The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
agent supplies the Playwright script — judgment stays out of the CLI.
| Command | Tier | What it does |
|---|---|---|
| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
Default context is a **fresh incognito** one (closed on exit) — safe for the
shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
reuses the warmed persistent profile when a pre-logged-in session is needed.
`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
that gates in-cluster callers — no namespace label needed. The node CDP client is
pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
(Chromium 130; protocol changes between minors) and is installed once, lazily,
into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
runs on the devvm, `setInputFiles` streams local files to the remote browser over
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
and `docs/adr/0013`.
### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
filters render to a single safe `SELECT` (namespace values validated to the k8s
name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
| Command | Tier | What it does |
| --- | --- | --- |
| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
### v0.10 — `vault get --all` (browse every field)
`vault get <name> --all` returns the **whole item** as a normalized JSON object,
so an agent can discover and read fields the single-field `--field` allowlist
can't reach — notably arbitrary **custom fields**.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
Shape notes: present standard fields only (empty ones omitted); `fields` is a
custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
The TOTP **seed is never emitted**`totp` is a presence flag (`true`), so the
only seed-derived path stays the specially-audited `vault code`. Like
`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
it (`homelab vault get <name> --all | jq`).
### v0.10.1 — reads `bw sync` first (always fresh)
Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
sync` when opening its session, so it reflects the latest server-side values.
`bw unlock` only decrypts the *local* cache, so without this a persisted
(already-logged-in) session served stale data — a password changed in the web
vault wouldn't show up until the next login. The sync is **best-effort**: a
transient failure warns on stderr and falls back to the cached vault rather than
failing the read.
### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
(`vault login -method=oidc``~/.vault-token`, or `$VAULT_TOKEN`) — the kv
handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
its own path). Access is whatever your policy grants. Writes are merge-only;
`put` (replace) / `delete` are out of scope — use the raw `vault` CLI.
## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning
(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
stamped from `cli/VERSION` via ldflags. Manual build:
```
cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
go test ./...
```
## Legacy webhook use-cases (preserved)
This binary is also the in-cluster `infra-cli` image. Invocations starting with
`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
original flag-based path unchanged, so the webhook handler is unaffected.
## Design
See `infra/docs/adr/0004``0013` for the architecture decisions.

1
cli/VERSION Normal file
View file

@ -0,0 +1 @@
v0.11.0

388
cli/browser.go Normal file
View file

@ -0,0 +1,388 @@
package main
import (
_ "embed"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"os/signal"
"path/filepath"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
// playwrightVersion pins the node CDP client to the chrome-service image minor
// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
// speaks the browser's CDP, so the client minor must track the server minor;
// see docs/architecture/chrome-service.md "Image pin".
const playwrightVersion = "1.48.2"
// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
// endpoint to become ready before giving up.
const defaultBrowserTimeout = 60
const (
chromeServiceNamespace = "chrome-service"
chromeServiceName = "chrome-service"
chromeServiceCDPPort = 9222
)
// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
// guards against drift.
//
//go:embed browser_stealth.js
var stealthJS string
// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
// installs the stealth init script, and runs the user's Playwright script.
//
//go:embed browser_runner.js
var runnerJS string
// browserOpts is the parsed form of `homelab browser run|open` arguments.
type browserOpts struct {
mode string // "run" | "open"
script string // path to the user Playwright script (run mode)
url string // initial URL (run: optional; open: required positional)
sharedCtx bool // use the warmed persistent profile instead of a fresh context
keepOpen bool // leave the created context/pages open on exit
port int // explicit local port for the forward (0 = auto)
timeout int // CDP readiness timeout, seconds
help bool
}
// parseBrowserArgs parses the args after `browser run` / `browser open`.
func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
var positionals []string
atoi := func(s, flag string) (int, error) {
n, err := strconv.Atoi(s)
if err != nil {
return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
}
return n, nil
}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-h" || a == "--help":
o.help = true
case a == "--shared-context":
o.sharedCtx = true
case a == "--keep-open":
o.keepOpen = true
case a == "--url":
if i+1 < len(args) {
o.url = args[i+1]
i++
}
case strings.HasPrefix(a, "--url="):
o.url = strings.TrimPrefix(a, "--url=")
case a == "--port":
if i+1 < len(args) {
n, err := atoi(args[i+1], "--port")
if err != nil {
return o, err
}
o.port = n
i++
}
case strings.HasPrefix(a, "--port="):
n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
if err != nil {
return o, err
}
o.port = n
case a == "--timeout":
if i+1 < len(args) {
n, err := atoi(args[i+1], "--timeout")
if err != nil {
return o, err
}
o.timeout = n
i++
}
case strings.HasPrefix(a, "--timeout="):
n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
if err != nil {
return o, err
}
o.timeout = n
case strings.HasPrefix(a, "-"):
return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
default:
positionals = append(positionals, a)
}
}
if o.help {
return o, nil
}
switch mode {
case "run":
if len(positionals) == 0 {
return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
}
o.script = positionals[0]
case "open":
if len(positionals) == 0 {
return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
}
o.url = positionals[0]
}
return o, nil
}
// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
// a real (non-headless) Chrome — the entire reason chrome-service exists.
func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
var v struct {
Browser string `json:"Browser"`
UserAgent string `json:"User-Agent"`
}
if e := json.Unmarshal(jsonBody, &v); e != nil {
return "", false, fmt.Errorf("parse /json/version: %w", e)
}
if v.Browser == "" {
return "", false, fmt.Errorf("/json/version had no Browser field")
}
healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
!strings.Contains(v.Browser, "Headless") &&
!strings.Contains(v.UserAgent, "Headless")
return v.Browser, healthy, nil
}
// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
// NetworkPolicy that gates in-cluster callers.
func buildPortForwardArgs(localPort int) []string {
return []string{"-n", chromeServiceNamespace, "port-forward",
"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
}
// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
// client kept under the user cache dir.
func browserClientPackageJSON() string {
return fmt.Sprintf(`{
"name": "homelab-browser-client",
"private": true,
"description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
"dependencies": {
"playwright-core": "%s"
}
}
`, playwrightVersion)
}
// freePort asks the kernel for an unused ephemeral TCP port.
func freePort() (int, error) {
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
return 0, err
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port, nil
}
// browserClientDir is where the pinned node client + managed runner files live.
func browserClientDir() (string, error) {
cache, err := os.UserCacheDir()
if err != nil || cache == "" {
home, herr := os.UserHomeDir()
if herr != nil {
return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
}
cache = filepath.Join(home, ".cache")
}
return filepath.Join(cache, "homelab", "browser-client"), nil
}
// installedPlaywrightVersion reads the version of the playwright-core already
// installed in dir, or "" if absent/unreadable.
func installedPlaywrightVersion(dir string) string {
b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
if err != nil {
return ""
}
var v struct {
Version string `json:"version"`
}
if json.Unmarshal(b, &v) != nil {
return ""
}
return v.Version
}
// ensureBrowserClient writes the managed runner/stealth/package files into dir
// and lazily installs the pinned playwright-core (only when missing/mismatched),
// so no per-user setup is needed and the client tracks the binary version.
func ensureBrowserClient(dir string) error {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
files := map[string]string{
"package.json": browserClientPackageJSON(),
"browser_runner.js": runnerJS,
"stealth.js": stealthJS,
}
for name, content := range files {
if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
return err
}
}
if installedPlaywrightVersion(dir) == playwrightVersion {
return nil
}
fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, ~a few seconds)…\n", playwrightVersion)
cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
cmd.Dir = dir
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
}
if got := installedPlaywrightVersion(dir); got != playwrightVersion {
return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
}
return nil
}
// waitForCDP polls the local CDP endpoint until it answers as a healthy
// (non-headless) Chrome, or the timeout elapses.
func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
deadline := time.Now().Add(timeout)
client := &http.Client{Timeout: 3 * time.Second}
var lastErr error
for time.Now().Before(deadline) {
resp, err := client.Get(cdpURL + "/json/version")
if err != nil {
lastErr = err
time.Sleep(300 * time.Millisecond)
continue
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
browser, healthy, herr := cdpHealthy(body)
if herr != nil {
lastErr = herr
time.Sleep(300 * time.Millisecond)
continue
}
if !healthy {
return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
}
return browser, nil
}
if lastErr == nil {
lastErr = fmt.Errorf("timed out after %s", timeout)
}
return "", lastErr
}
// runBrowser is the orchestration: pick a port, ensure the pinned client, start
// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
func runBrowser(o browserOpts) error {
port := o.port
if port == 0 {
p, err := freePort()
if err != nil {
return fmt.Errorf("pick local port: %w", err)
}
port = p
}
dir, err := browserClientDir()
if err != nil {
return err
}
if err := ensureBrowserClient(dir); err != nil {
return err
}
// Start the forward in its own process group so the whole tree dies on cleanup.
pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
var pfLog strings.Builder
pf.Stdout = &pfLog
pf.Stderr = &pfLog
if err := pf.Start(); err != nil {
return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
}
var once sync.Once
teardown := func() {
once.Do(func() {
if pf.Process != nil {
_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
}
_ = pf.Wait()
})
}
defer teardown()
// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
defer signal.Stop(sigCh)
go func() {
if _, ok := <-sigCh; ok {
teardown()
os.Exit(130)
}
}()
cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
if err != nil {
return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
}
fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
return runBrowserNode(dir, cdpURL, o)
}
// runBrowserNode invokes the managed node runner with inputs passed via env.
func runBrowserNode(dir, cdpURL string, o browserOpts) error {
env := append(os.Environ(),
"HOMELAB_CDP_URL="+cdpURL,
"HOMELAB_BROWSER_MODE="+o.mode,
"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
"NODE_PATH="+filepath.Join(dir, "node_modules"),
)
if o.url != "" {
env = append(env, "HOMELAB_BROWSER_URL="+o.url)
}
if o.script != "" {
abs, err := filepath.Abs(o.script)
if err != nil {
return err
}
if _, err := os.Stat(abs); err != nil {
return fmt.Errorf("script %s: %w", o.script, err)
}
env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
}
if o.sharedCtx {
env = append(env, "HOMELAB_BROWSER_SHARED=1")
}
if o.keepOpen {
env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
}
if o.mode == "open" {
shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
}
cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
cmd.Env = env
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

106
cli/browser_runner.js Normal file
View file

@ -0,0 +1,106 @@
// homelab browser — node CDP runner (auto-managed; regenerated each run from the
// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
// chrome-service CDP endpoint, installs the stealth init script, then runs the
// user's Playwright script (run mode) or opens a URL (open mode). All inputs
// arrive via HOMELAB_* env vars set by the Go CLI.
'use strict';
const fs = require('fs');
const { chromium } = require('playwright-core');
async function main() {
const cdpURL = process.env.HOMELAB_CDP_URL;
if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
const initURL = process.env.HOMELAB_BROWSER_URL || '';
const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
const browser = await chromium.connectOverCDP(cdpURL);
// Fresh isolated context by default (safe for the shared browser + concurrent
// callers); --shared-context reuses the warmed persistent profile.
let context;
let createdContext = false;
if (shared) {
const existing = browser.contexts();
if (existing.length) {
context = existing[0];
} else {
context = await browser.newContext();
createdContext = true;
}
} else {
context = await browser.newContext();
createdContext = true;
}
if (stealthPath) {
const stealth = fs.readFileSync(stealthPath, 'utf8');
if (stealth.trim()) await context.addInitScript(stealth);
}
const page = await context.newPage();
const log = (...a) => console.error('[browser]', ...a);
let exitCode = 0;
try {
if (initURL) {
await page.goto(initURL, { waitUntil: 'domcontentloaded' });
}
if (mode === 'open') {
console.log('url: ' + page.url());
console.log('title: ' + (await page.title()));
const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
console.log('--- visible text (truncated to 4000 chars) ---');
console.log(text.slice(0, 4000));
if (screenshotPath) {
await page.screenshot({ path: screenshotPath, fullPage: true });
console.log('screenshot: ' + screenshotPath);
}
} else {
if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
const src = fs.readFileSync(scriptPath, 'utf8');
// Run the user's source with page/context/browser/log in lexical scope.
// AsyncFunction body permits top-level await.
const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
const result = await fn(page, context, browser, log);
if (result !== undefined) {
let out;
try {
out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
} catch (_) {
out = String(result);
}
console.log(out);
}
}
} catch (e) {
console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
exitCode = 1;
} finally {
if (!keepOpen) {
try {
// Close only what we created; never tear down the shared persistent context.
if (createdContext) {
await context.close();
} else {
await page.close();
}
} catch (_) { /* ignore */ }
}
// Disconnect from the CDP endpoint; this does NOT kill the remote browser.
try {
await browser.close();
} catch (_) { /* ignore */ }
}
process.exit(exitCode);
}
main().catch((e) => {
console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
process.exit(1);
});

54
cli/browser_stealth.js Normal file
View file

@ -0,0 +1,54 @@
// Minimal stealth init script for Playwright-driven Chromium.
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
// Run via context.add_init_script() so it executes before any page script.
(() => {
// navigator.webdriver — most common detection, removed entirely.
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
// window.chrome.runtime — many sites check that real Chrome exposes this.
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
// navigator.languages — headless returns empty array.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
if (origQuery) {
window.navigator.permissions.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(parameters);
}
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
// tag with `disable-devtool-auto`. Its Performance detector trips under
// Playwright (CDP adds console.log latency vs console.table) and the
// redirect URL is hard-coded — for hmembeds that's google.com.
// Hide the auto-init marker so the library's IIFE exits early.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();

117
cli/cmd_browser.go Normal file
View file

@ -0,0 +1,117 @@
package main
import "fmt"
// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
// from outside the cluster, for sites that detect/block headless automation.
// The headless @playwright/mcp browser can load such sites but their gated
// actions (submit/login) silently fail; this path submits first try. Mechanics
// only — the agent supplies the Playwright script. See docs/adr/0013.
func browserCommands() []Command {
return []Command{
{Path: []string{"browser"}, Tier: TierRead,
Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
{Path: []string{"browser", "run"}, Tier: TierWrite,
Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
{Path: []string{"browser", "open"}, Tier: TierWrite,
Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
}
}
func browserTopHelp([]string) error {
fmt.Print(browserHelp())
return nil
}
func browserRun(args []string) error {
o, err := parseBrowserArgs("run", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
func browserOpen(args []string) error {
o, err := parseBrowserArgs("open", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
// browserHelp carries the discoverability payload: WHEN to reach for this, and
// the diagnostic cheat-sheet that lets the agent self-correct instead of
// retrying a deterministic form blind (the failure mode that motivated this).
func browserHelp() string {
return `homelab browser drive the cluster's HEADFUL Chrome (anti-bot) over CDP
The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
Xvfb. This connects to it via a port-forward + Playwright connect_over_cdp,
injects the same stealth.js the in-cluster callers use, and runs your script.
USAGE
homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
homelab browser open <url> [--shared-context] [--timeout S]
WHEN TO USE THIS escalation only; DEFAULT to the headless/MCP browser
Default to the Playwright MCP / headless browser for ALL routine browsing and
automation it's interactive (snapshot per step), fast to start, isolated.
Reach for THIS command ONLY when headless is demonstrably blocked: a site
LOADS fine but a gated action FAILS or HANGS a submit/login/checkout spins
forever, or ONE request errors while its siblings 200. That is the signature
of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
disable-devtool traps). It presents as a real Chrome and usually succeeds
first try but it's the shared cluster browser (slower startup, one batch
run, no per-step feedback), so it's the escalation path, never the default.
ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
ERR_FILE_NOT_FOUND (-6) request intercepted/resolved locally by the
automation layer NOT a network/egress problem.
(This is what silently broke the headless submit.)
ERR_CONNECTION_REFUSED / real egress failure (DNS/route/firewall). These also
ERR_TIMED_OUT / break the initial page load if the page loaded,
ERR_NAME_NOT_RESOLVED egress is fine and the cause is elsewhere.
one endpoint 500s while server-side bot rejection of the automation, not
its siblings 200 your payload.
HABITS
- Inspect the network panel BEFORE retrying a deterministic form; a blind
retry just repeats the same silent failure.
- Don't park a half-filled multi-step form across a user pause the session
can expire; re-run the whole flow from this command in one shot.
- Uploads stream over CDP via setInputFiles from THIS host no chmod/staging
of $HOME needed; just point setInputFiles at a local path.
CONTEXT
Default: a FRESH incognito context, closed on exit safe for the shared
browser and concurrent callers (e.g. tripit). Your script does its own login.
--shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.
SCRIPT CONTRACT (run mode)
Your file's body runs with page, context, browser and log() already in scope
(top-level await allowed). Return a value to print it. Example flow.js:
await page.goto('https://portal.example.com/login');
await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
await page.click('button[type=submit]');
await page.waitForURL('**/dashboard');
return 'logged in: ' + page.url();
Run it: homelab browser run flow.js
NOTES
- The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
chrome-service image (Chrome 130); installed once into ~/.cache/homelab/.
- The port-forward is always torn down, on success and on error.
`
}

172
cli/cmd_browser_test.go Normal file
View file

@ -0,0 +1,172 @@
package main
import (
"os"
"reflect"
"strings"
"testing"
)
func TestParseBrowserArgsRun(t *testing.T) {
got, err := parseBrowserArgs("run", []string{
"flow.js", "--url", "https://example.com", "--shared-context",
"--port", "19999", "--timeout", "45", "--keep-open",
})
if err != nil {
t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
}
want := browserOpts{
mode: "run", script: "flow.js", url: "https://example.com",
sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
}
}
func TestParseBrowserArgsRunDefaults(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
t.Fatalf("defaults wrong: %+v", got)
}
if got.timeout != defaultBrowserTimeout {
t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
}
}
func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
t.Fatalf("run without a script path should error")
}
}
func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
got, err := parseBrowserArgs("open", []string{"https://example.com"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://example.com" || got.mode != "open" {
t.Fatalf("open parse wrong: %+v", got)
}
if _, err := parseBrowserArgs("open", []string{}); err == nil {
t.Fatalf("open without a URL should error")
}
}
func TestParseBrowserArgsHelp(t *testing.T) {
for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
got, err := parseBrowserArgs("run", a)
if err != nil {
t.Fatalf("help parse %v: %v", a, err)
}
if !got.help {
t.Fatalf("args %v should set help", a)
}
}
}
func TestParseBrowserArgsEqualsForm(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
t.Fatalf("--flag=value form not parsed: %+v", got)
}
}
func TestCDPHealthy(t *testing.T) {
real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
browser, ok, err := cdpHealthy(real)
if err != nil || !ok {
t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
}
if !strings.HasPrefix(browser, "Chrome/") {
t.Fatalf("browser = %q, want Chrome/ prefix", browser)
}
headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
if _, ok, _ := cdpHealthy(headless); ok {
t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
}
if _, _, err := cdpHealthy([]byte("not json")); err == nil {
t.Fatalf("malformed /json/version body should error")
}
}
func TestBuildPortForwardArgs(t *testing.T) {
got := buildPortForwardArgs(18080)
want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
}
}
func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
pj := browserClientPackageJSON()
if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
}
}
func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
// client minor MUST match (protocol changes between minors).
if !strings.HasPrefix(playwrightVersion, "1.48.") {
t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
}
}
func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
h := browserHelp()
for _, want := range []string{
"homelab browser run",
"ERR_FILE_NOT_FOUND",
"ERR_CONNECTION_REFUSED",
"network panel",
"headless",
"--shared-context",
} {
if !strings.Contains(h, want) {
t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
}
}
}
func TestBrowserHelpIsTiered(t *testing.T) {
// --help must frame this as the ESCALATION path (default to headless first),
// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
// instructions. Guard against a regression to "co-equal choice" wording.
h := browserHelp()
for _, want := range []string{"Default to the", "escalation"} {
if !strings.Contains(h, want) {
t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
}
}
}
func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
// The embedded copy must never drift from the source of truth that the
// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
if err != nil {
t.Fatalf("read canonical stealth.js: %v", err)
}
if stealthJS != string(canonical) {
t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
}
}
func TestFreePortReturnsUsablePort(t *testing.T) {
p, err := freePort()
if err != nil {
t.Fatalf("freePort: %v", err)
}
if p <= 1024 || p > 65535 {
t.Fatalf("freePort returned %d, want an ephemeral port", p)
}
}

99
cli/cmd_ci.go Normal file
View file

@ -0,0 +1,99 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func ciCommands() []Command {
return []Command{
{Path: []string{"ci", "status"}, Tier: TierRead,
Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
{Path: []string{"ci", "watch"}, Tier: TierRead,
Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
}
}
func short(s string) string {
if len(s) > 8 {
return s[:8]
}
return s
}
func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }
// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
func currentHEAD() string {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return ""
}
sha, _ := gitOutput(root, "rev-parse", "HEAD")
return sha
}
func ciStatus(args []string) error {
commit, _ := firstPositional(args)
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
p, err := c.findPipeline(id, commit)
if err != nil {
return err
}
fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
return nil
}
func ciWatch(args []string) error {
commit, _ := firstPositional(args)
if commit == "" {
commit = currentHEAD()
}
if commit == "" {
return fmt.Errorf("no commit given and not in a git repo")
}
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
timeout := 20 * time.Minute
deadline := time.Now().Add(timeout)
last := ""
for time.Now().Before(deadline) {
p, err := c.findPipeline(id, commit)
if err != nil {
if last != "waiting" {
fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
last = "waiting"
}
} else {
if p.Status != last {
fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
last = p.Status
}
if isTerminalStatus(p.Status) {
fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
if isFailureStatus(p.Status) {
return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
}
return nil
}
}
time.Sleep(15 * time.Second)
}
return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
}

56
cli/cmd_claim.go Normal file
View file

@ -0,0 +1,56 @@
package main
import (
"fmt"
"strings"
)
func claimCommands() []Command {
return []Command{
{Path: []string{"claim"}, Tier: TierWrite,
Summary: "claim a shared infra resource on the presence board",
Run: runClaim},
{Path: []string{"release"}, Tier: TierWrite,
Summary: "release a presence claim",
Run: runRelease},
}
}
// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
// script takes the label first, so we can't rely on Go's flag package which
// stops at the first positional).
func runClaim(args []string) error {
var label, purpose string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--purpose" || a == "-purpose":
if i+1 < len(args) {
purpose = args[i+1]
i++
}
case strings.HasPrefix(a, "--purpose="):
purpose = strings.TrimPrefix(a, "--purpose=")
case !strings.HasPrefix(a, "-") && label == "":
label = a
}
}
if label == "" {
return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
}
return presenceClaim(label, purpose)
}
func runRelease(args []string) error {
var label string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
label = a
break
}
}
if label == "" {
return fmt.Errorf("usage: homelab release <kind>:<name>")
}
return presenceRelease(label)
}

51
cli/cmd_deploy.go Normal file
View file

@ -0,0 +1,51 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func deployCommands() []Command {
return []Command{
{Path: []string{"deploy", "wait"}, Tier: TierRead,
Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
}
}
// deployWait closes the "did the NEW code land" gap: rollout status alone returns
// success on the OLD ReplicaSet, so we first wait for the deployment image to
// reference the expected sha, THEN block on rollout status.
func deployWait(args []string) error {
target, _ := firstPositional(args)
if target == "" || !strings.Contains(target, "/") {
return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA] [--timeout 10m]")
}
parts := strings.SplitN(target, "/", 2)
ns, deploy := parts[0], parts[1]
sha := flagValue(args, "--sha")
if sha == "" {
sha = short(currentHEAD())
}
deadline := time.Now().Add(10 * time.Minute)
if sha != "" {
fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
matched := false
for time.Now().Before(deadline) {
img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
if strings.Contains(img, sha) {
matched = true
break
}
time.Sleep(10 * time.Second)
}
if !matched {
return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
}
}
fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
}

69
cli/cmd_edges.go Normal file
View file

@ -0,0 +1,69 @@
package main
import "fmt"
func edgesCommands() []Command {
return []Command{
{Path: []string{"edges"}, Tier: TierRead,
Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
Run: edgesRun},
}
}
// edgesRun renders the filter flags to SQL and runs it read-only against the
// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
func edgesRun(args []string) error {
for _, a := range args {
if a == "-h" || a == "--help" {
fmt.Print(edgesUsage())
return nil
}
}
o, err := parseEdgesArgs(args)
if err != nil {
return fmt.Errorf("%w\n\n%s", err, edgesUsage())
}
sql, err := buildEdgesQuery(o)
if err != nil {
return err
}
// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
"-o", "jsonpath={.items[0].metadata.name}")
if err != nil || pod == "" {
return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
}
exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
if o.asJSON {
exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
} else {
exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
}
return kubectlStream("dbaas", exec...)
}
func edgesUsage() string {
return `homelab edges query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
Usage: homelab edges [filters]
Filters (AND-combined; namespace values are validated to the k8s name charset):
--ns NAME edges touching NAME (either direction)
--src NAME edges where source namespace = NAME
--dst NAME edges where destination namespace = NAME
--peers-of NAME distinct peer namespaces of NAME (both directions)
--new-since SPEC first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
--denied only denied (action='deny') edges blocked / lateral-movement attempts
--json output a JSON array (for agents/pipelines)
--limit N cap rows (default 200)
Examples:
homelab edges --ns immich # everything immich talks to / is talked to by
homelab edges --peers-of authentik # authentik's peer namespaces
homelab edges --src recruiter-responder # that namespace's egress peers
homelab edges --new-since 24h # edges first seen in the last day
homelab edges --denied --json # blocked flows, machine-readable
Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
`
}

172
cli/cmd_ha.go Normal file
View file

@ -0,0 +1,172 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"strings"
)
// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
// the long-lived API token out of the cluster, and SSH to the HA host for
// host-level work (config files, docker, add-ons). Entity state/control stays
// with the MCP — see docs/adr/0012.
//
// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
// `ha token` resolves it on demand via the ambient kubeconfig, so it never
// depends on a pre-set env var (the gap that made agents re-derive the
// kubectl|base64|jq pipeline every session).
type haInstance struct {
name string // sofia | london
sshUser string // SSH login on the HA host
sshHost string // host reachable from the devvm (Sofia LAN)
secretKey string // key inside the openclaw/ha-tokens Secret holding this token
}
const (
haDefaultInstance = "sofia"
haSecretNamespace = "openclaw"
haSecretName = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
)
// haInstances maps instance name → connection/secret facts. sofia is the default
// because the devvm is on the Sofia LAN; london is documented but its host
// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
// generally won't connect from here (token resolution still works).
var haInstances = map[string]haInstance{
"sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
}
func haCommands() []Command {
return []Command{
{Path: []string{"ha", "token"}, Tier: TierRead,
Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
{Path: []string{"ha", "ssh"}, Tier: TierWrite,
Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
}
}
// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
func resolveHAInstance(name string) (haInstance, error) {
if name == "" {
name = haDefaultInstance
}
inst, ok := haInstances[name]
if !ok {
return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
}
return inst, nil
}
// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
// by kubectl jsonpath (trailing whitespace tolerated).
func decodeSecretValue(b64 string) (string, error) {
raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
if err != nil {
return "", fmt.Errorf("base64-decode secret value: %w", err)
}
return string(raw), nil
}
func haToken(args []string) error {
name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
for i := 0; i < len(args); i++ {
if args[i] == "--instance" && i+1 < len(args) {
name = args[i+1]
} else if strings.HasPrefix(args[i], "--instance=") {
name = strings.TrimPrefix(args[i], "--instance=")
}
}
inst, err := resolveHAInstance(name)
if err != nil {
return err
}
b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
"-o", "jsonpath={.data."+inst.secretKey+"}")
if err != nil {
return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
}
if b64 == "" {
return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
}
tok, err := decodeSecretValue(b64)
if err != nil {
return err
}
fmt.Println(tok)
return nil
}
// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
// rather than tied to whoever first wrote the workflow.
func defaultHAKeyPath() string {
if home, err := os.UserHomeDir(); err == nil && home != "" {
return filepath.Join(home, ".ssh", "id_ed25519")
}
return filepath.Join("~", ".ssh", "id_ed25519")
}
// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
// `--` are taken verbatim; bare tokens before it are also the remote command.
func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
name := haDefaultInstance
keyPath = defaultHAKeyPath()
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
remote = append(remote, args[i+1:]...)
i = len(args)
case a == "--instance":
if i+1 < len(args) {
name = args[i+1]
i++
}
case strings.HasPrefix(a, "--instance="):
name = strings.TrimPrefix(a, "--instance=")
case a == "--key" || a == "-i":
if i+1 < len(args) {
keyPath = args[i+1]
i++
}
case strings.HasPrefix(a, "--key="):
keyPath = strings.TrimPrefix(a, "--key=")
default:
remote = append(remote, a)
}
}
inst, err = resolveHAInstance(name)
return inst, keyPath, remote, err
}
// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
// key, no user ssh config, and no known_hosts prompt/record — so it runs
// unattended in an agent session without hanging on a host-key prompt.
func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
args := []string{
"-F", "/dev/null",
"-o", "IdentityFile=" + keyPath,
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
inst.sshUser + "@" + inst.sshHost,
}
return append(args, remote...)
}
func haSSH(args []string) error {
inst, keyPath, remote, err := parseHASSH(args)
if err != nil {
return err
}
if len(remote) == 0 {
return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
}
return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
}

92
cli/cmd_ha_test.go Normal file
View file

@ -0,0 +1,92 @@
package main
import (
"encoding/base64"
"reflect"
"strings"
"testing"
)
func TestResolveHAInstance(t *testing.T) {
// empty defaults to sofia (the devvm sits on the Sofia LAN)
if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
}
if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
}
if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
t.Fatalf("london = %+v, %v", got, err)
}
if _, err := resolveHAInstance("paris"); err == nil {
t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
}
}
func TestDecodeSecretValue(t *testing.T) {
// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
// returns that base64, which decodeSecretValue turns back into the raw token.
enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
}
// trailing whitespace/newline from jsonpath output must be tolerated
if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
}
if _, err := decodeSecretValue("not-base64!!"); err == nil {
t.Fatalf("decodeSecretValue should error on undecodable base64")
}
}
func TestBuildHASSHArgs(t *testing.T) {
inst, _ := resolveHAInstance("sofia")
got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
want := []string{
"-F", "/dev/null",
"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
"vbarzin@192.168.1.8",
"cat", "/config/configuration.yaml",
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
}
}
func TestParseHASSH(t *testing.T) {
// instance flag + everything after `--` is the verbatim remote command
inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if inst.name != "sofia" {
t.Errorf("instance = %q, want sofia", inst.name)
}
if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
}
if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
t.Errorf("remote = %v, want [docker ps -a]", remote)
}
// bare args (no `--`) are also taken as the remote command; -i overrides the key
_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if key2 != "/tmp/k" {
t.Errorf("key = %q, want /tmp/k", key2)
}
if !reflect.DeepEqual(remote2, []string{"uptime"}) {
t.Errorf("remote = %v, want [uptime]", remote2)
}
// unknown instance surfaces as an error
if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
t.Errorf("parseHASSH should error on unknown instance")
}
}

288
cli/cmd_k8s.go Normal file
View file

@ -0,0 +1,288 @@
package main
import (
"fmt"
"os"
"strings"
)
func k8sCommands() []Command {
return []Command{
{Path: []string{"k8s", "status"}, Tier: TierRead,
Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
{Path: []string{"k8s", "get"}, Tier: TierRead,
Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
{Path: []string{"k8s", "logs"}, Tier: TierRead,
Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
{Path: []string{"k8s", "describe"}, Tier: TierRead,
Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
{Path: []string{"k8s", "debug"}, Tier: TierRead,
Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
{Path: []string{"k8s", "pf"}, Tier: TierRead,
Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
{Path: []string{"k8s", "db"}, Tier: TierWrite,
Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
{Path: []string{"k8s", "exec"}, Tier: TierWrite,
Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
{Path: []string{"k8s", "restart"}, Tier: TierWrite,
Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
{Path: []string{"k8s", "probe"}, Tier: TierRead,
Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
}
}
func k8sStatus(args []string) error {
t := parseK8sTarget(args)
ns := t.namespace() // "" when no app/ns given → cluster-wide
get := []string{"get", "pods", "-o", "wide"}
ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
if ns == "" {
get = append(get, "-A")
ev = append(ev, "-A")
}
if err := kubectlStream(ns, get...); err != nil {
return err
}
fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
_ = kubectlStream(ns, ev...) // best-effort
return nil
}
func k8sGet(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
}
return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
}
func k8sLogs(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
}
a := []string{"logs"}
if t.selector != "" {
a = append(a, "-l", t.selector)
} else {
a = append(a, t.objectRef())
}
if t.container != "" {
a = append(a, "-c", t.container)
}
if !containsPrefix(t.rest, "--tail") {
a = append(a, "--tail=200")
}
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sDescribe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
}
if len(t.rest) > 0 {
return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
}
return kubectlStream(t.namespace(), "describe", t.objectRef())
}
func k8sDebug(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s debug <app>")
}
ns := t.namespace()
sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
sec("pods")
_ = kubectlStream(ns, "get", "pods", "-o", "wide")
sec("workloads")
_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
sec("describe "+t.objectRef())
_ = kubectlStream(ns, "describe", t.objectRef())
sec("recent logs (--tail=50)")
_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
sec("events (type!=Normal)")
_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
return nil
}
func k8sPortForward(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
}
ports := t.rest[0]
target := "svc/" + t.app
if len(t.rest) > 1 {
target = t.rest[1]
}
return kubectlStream(t.namespace(), "port-forward", target, ports)
}
func k8sDB(args []string) error {
var app, dbName, sql string
mysql := false
for i := 0; i < len(args); i++ {
a := args[i]
if a == "--" {
sql = strings.Join(args[i+1:], " ")
break
}
switch {
case a == "--mysql":
mysql = true
case a == "--db":
if i+1 < len(args) {
dbName = args[i+1]
i++
}
case strings.HasPrefix(a, "--db="):
dbName = strings.TrimPrefix(a, "--db=")
case !strings.HasPrefix(a, "-") && app == "":
app = a
}
}
if app == "" {
return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
}
p := planDBExec(app, dbName, sql, mysql)
pod := p.pod
if pod == "" && p.selector != "" {
resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
if err != nil || resolved == "" {
return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
}
pod = resolved
}
exec := []string{"exec"}
if sql == "" {
exec = append(exec, "-it") // interactive client when no SQL given
}
exec = append(exec, pod)
if p.container != "" {
exec = append(exec, "-c", p.container)
}
exec = append(exec, "--")
exec = append(exec, p.argv...)
return kubectlStream(p.ns, exec...)
}
func k8sExec(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
}
if len(t.rest) == 0 {
return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
}
a := []string{"exec"}
if t.tty {
a = append(a, "-it")
}
a = append(a, t.objectRef())
if t.container != "" {
a = append(a, "-c", t.container)
}
a = append(a, "--")
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sRmPod(args []string) error {
var pod, ns, grace string
force, job := false, false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-n" || a == "--namespace":
if i+1 < len(args) {
ns = args[i+1]
i++
}
case a == "--force":
force = true
case a == "--job":
job = true
case a == "--grace":
if i+1 < len(args) {
grace = args[i+1]
i++
}
case !strings.HasPrefix(a, "-") && pod == "":
pod = a
}
}
if pod == "" || ns == "" {
return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
}
kind := "pod"
if job {
kind = "job"
}
a := []string{"delete", kind, pod}
if grace != "" {
a = append(a, "--grace-period="+grace)
}
if force {
a = append(a, "--force")
}
return kubectlStream(ns, a...)
}
func k8sRolloutStatus(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s rollout-status <app>")
}
return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
}
func k8sRestart(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s restart <app>")
}
ns := t.namespace()
if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
return err
}
return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
}
func k8sProbe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
}
ns := t.namespace()
url := "http://" + t.app + "." + ns + ".svc.cluster.local"
if port := flagValue(args, "--port"); port != "" {
url += ":" + port
}
if len(t.rest) > 0 {
p := t.rest[0]
if !strings.HasPrefix(p, "/") {
p = "/" + p
}
url += p
}
return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
"--image=curlimages/curl:latest", "--",
"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
}
// containsPrefix reports whether any arg starts with prefix.
func containsPrefix(args []string, prefix string) bool {
for _, a := range args {
if strings.HasPrefix(a, prefix) {
return true
}
}
return false
}

314
cli/cmd_memory.go Normal file
View file

@ -0,0 +1,314 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"strings"
)
func memoryCommands() []Command {
return []Command{
{Path: []string{"memory", "recall"}, Tier: TierRead,
Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
{Path: []string{"memory", "list"}, Tier: TierRead,
Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
{Path: []string{"memory", "categories"}, Tier: TierRead,
Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
{Path: []string{"memory", "tags"}, Tier: TierRead,
Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
{Path: []string{"memory", "stats"}, Tier: TierRead,
Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
{Path: []string{"memory", "secret"}, Tier: TierRead,
Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
{Path: []string{"memory", "store"}, Tier: TierWrite,
Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
{Path: []string{"memory", "update"}, Tier: TierWrite,
Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
{Path: []string{"memory", "delete"}, Tier: TierWrite,
Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
}
}
// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
func printMemories(raw []byte, jsonOut bool) error {
if jsonOut {
fmt.Println(string(raw))
return nil
}
var r struct {
Memories []struct {
ID int `json:"id"`
Content string `json:"content"`
Category string `json:"category"`
Tags string `json:"tags"`
Importance float64 `json:"importance"`
} `json:"memories"`
}
if err := json.Unmarshal(raw, &r); err != nil {
fmt.Println(string(raw))
return nil
}
if len(r.Memories) == 0 {
fmt.Println("(no memories)")
return nil
}
for _, m := range r.Memories {
c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
if m.Tags != "" {
fmt.Printf(" tags: %s\n", m.Tags)
}
}
return nil
}
// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
// hook error" for Cyrillic-language users.
func truncatePreview(s string, maxRunes int) string {
r := []rune(s)
if len(r) <= maxRunes {
return s
}
return string(r[:maxRunes]) + "…"
}
func memoryRecall(args []string) error {
req := memRecallReq{}
jsonOut := false
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--query":
if i+1 < len(args) {
req.ExpandedQuery = args[i+1]
i++
}
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--sort":
if i+1 < len(args) {
req.SortBy = args[i+1]
i++
}
case a == "--limit":
if i+1 < len(args) {
fmt.Sscanf(args[i+1], "%d", &req.Limit)
i++
}
case a == "--json":
jsonOut = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Context = strings.Join(pos, " ")
if req.Context == "" {
return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/recall", req)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memoryList(args []string) error {
q := url.Values{}
jsonOut := false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
q.Set("category", args[i+1])
i++
}
case a == "--tag":
if i+1 < len(args) {
q.Set("tag", args[i+1])
i++
}
case a == "--limit":
if i+1 < len(args) {
q.Set("limit", args[i+1])
i++
}
case a == "--json":
jsonOut = true
}
}
c, err := newMemoryClient()
if err != nil {
return err
}
path := "/api/memories"
if len(q) > 0 {
path += "?" + q.Encode()
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memorySimpleGet(path string) func([]string) error {
return func(args []string) error {
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
}
func memorySecret(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory secret <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/"+id+"/secret", nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryStore(args []string) error {
req := memStoreReq{Category: "facts", Importance: 0.5}
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--tags":
if i+1 < len(args) {
req.Tags = args[i+1]
i++
}
case a == "--keywords":
if i+1 < len(args) {
req.ExpandedKeywords = args[i+1]
i++
}
case a == "--importance":
if i+1 < len(args) {
fmt.Sscanf(args[i+1], "%f", &req.Importance)
i++
}
case a == "--sensitive":
req.ForceSensitive = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Content = strings.Join(pos, " ")
if req.Content == "" {
return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories", req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryUpdate(args []string) error {
var id string
req := memUpdateReq{}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--content":
if i+1 < len(args) {
v := args[i+1]
req.Content = &v
i++
}
case a == "--tags":
if i+1 < len(args) {
v := args[i+1]
req.Tags = &v
i++
}
case a == "--keywords":
if i+1 < len(args) {
v := args[i+1]
req.ExpandedKeywords = &v
i++
}
case a == "--importance":
if i+1 < len(args) {
var f float64
fmt.Sscanf(args[i+1], "%f", &f)
req.Importance = &f
i++
}
case !strings.HasPrefix(a, "-") && id == "":
id = a
}
}
if id == "" {
return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("PUT", "/api/memories/"+id, req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryDelete(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory delete <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("DELETE", "/api/memories/"+id, nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}

83
cli/cmd_net.go Normal file
View file

@ -0,0 +1,83 @@
package main
import (
"fmt"
"strings"
"time"
)
func netCommands() []Command {
return []Command{
{Path: []string{"net", "check"}, Tier: TierRead,
Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
{Path: []string{"dns", "lookup"}, Tier: TierRead,
Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
}
}
func fmtProbe(code int, d time.Duration, err error) string {
if err != nil {
return "ERR " + err.Error()
}
return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds())
}
func netCheck(args []string) error {
host, rest := firstPositional(args)
if host == "" {
return fmt.Errorf("usage: homelab net check <host> [path]")
}
path := "/"
if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
path = rest[0]
if !strings.HasPrefix(path, "/") {
path = "/" + path
}
}
u := "https://" + host + path
fmt.Printf("%s\n", u)
// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
if pubIP := firstLine(pubOut); pubIP != "" {
c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
} else {
fmt.Println(" external (public) no public A record")
}
// internal leg: dial the Traefik LB directly
c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e))
return nil
}
func dnsLookup(args []string) error {
name, rest := firstPositional(args)
if name == "" {
return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
}
rr := ""
if len(rest) > 0 {
rr = rest[0]
}
tech, _ := dig(name, "10.0.20.201", rr)
pub, _ := dig(name, "1.1.1.1", rr)
fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub))
if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
}
return nil
}
func hostOnly(h string) string { // strip any path accidentally included
return strings.SplitN(h, "/", 2)[0]
}
func oneLineList(s string) string {
s = strings.TrimSpace(s)
if s == "" {
return "(none)"
}
return strings.ReplaceAll(s, "\n", ", ")
}

197
cli/cmd_obs.go Normal file
View file

@ -0,0 +1,197 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
"strings"
"time"
)
const (
promHost = "prometheus-query.viktorbarzin.lan"
lokiHost = "loki.viktorbarzin.lan"
)
func obsCommands() []Command {
return []Command{
{Path: []string{"metrics", "query"}, Tier: TierRead,
Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
{Path: []string{"metrics", "alerts"}, Tier: TierRead,
Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
{Path: []string{"logs", "query"}, Tier: TierRead,
Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
}
}
// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
// passed as a single quoted argument; this also tolerates unquoted multi-token).
func queryArg(args []string, valueFlags map[string]bool) string {
var parts []string
for i := 0; i < len(args); i++ {
a := args[i]
if valueFlags[a] {
i++
continue
}
if strings.HasPrefix(a, "-") {
continue
}
parts = append(parts, a)
}
return strings.Join(parts, " ")
}
func labelStr(m map[string]string) string {
name := m["__name__"]
var kv []string
for k, v := range m {
if k != "__name__" {
kv = append(kv, k+"="+v)
}
}
sort.Strings(kv)
return name + "{" + strings.Join(kv, ",") + "}"
}
func metricsQuery(args []string) error {
q := queryArg(args, nil)
if q == "" {
return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
}
v := url.Values{}
v.Set("query", q)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no series)")
return nil
}
for _, s := range r.Data.Result {
val := ""
if len(s.Value) == 2 {
val = fmt.Sprint(s.Value[1])
}
fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
}
return nil
}
func metricsAlerts(args []string) error {
// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
// set is exposed as the synthetic ALERTS series, queryable the normal way.
v := url.Values{}
v.Set("query", `ALERTS{alertstate="firing"}`)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no firing alerts)")
return nil
}
for _, a := range r.Data.Result {
m := a.Metric
scope := ""
for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
if v := m[k]; v != "" {
scope = k + "=" + v
break
}
}
fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
}
return nil
}
func logsQuery(args []string) error {
q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
if q == "" {
return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
}
since := flagValue(args, "--since")
if since == "" {
since = "1h"
}
dur, err := time.ParseDuration(since)
if err != nil {
return fmt.Errorf("bad --since %q: %w", since, err)
}
limit := flagValue(args, "--limit")
if limit == "" {
limit = "100"
}
end := time.Now()
v := url.Values{}
v.Set("query", q)
v.Set("limit", limit)
v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Values [][]string `json:"values"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
n := 0
for _, s := range r.Data.Result {
for _, val := range s.Values {
if len(val) == 2 {
fmt.Println(val[1])
n++
}
}
}
if n == 0 {
fmt.Println("(no log lines)")
}
return nil
}

122
cli/cmd_tf.go Normal file
View file

@ -0,0 +1,122 @@
package main
import (
"fmt"
"os"
"os/signal"
"path/filepath"
"strings"
"sync"
"syscall"
)
func tfCommands() []Command {
return []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead,
Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
{Path: []string{"tf", "validate"}, Tier: TierRead,
Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
{Path: []string{"tf", "fmt"}, Tier: TierRead,
Summary: "terraform fmt a stack's files", Run: tfFmt},
{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
{Path: []string{"tf", "apply"}, Tier: TierWrite,
Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
}
}
// firstPositional returns the first non-flag arg and the remaining args with it removed.
func firstPositional(args []string) (string, []string) {
for i, a := range args {
if !strings.HasPrefix(a, "-") {
rest := append(append([]string{}, args[:i]...), args[i+1:]...)
return a, rest
}
}
return "", args
}
// resolveTfStack finds the infra root (from cwd) and the stack directory named
// by the first positional arg, returning the remaining args.
func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
stackName, rest = firstPositional(args)
if stackName == "" {
err = fmt.Errorf("missing <stack> argument")
return
}
cwd, e := os.Getwd()
if e != nil {
err = e
return
}
infraRoot, err = findInfraRoot(cwd)
if err != nil {
return
}
stackDir, err = resolveStack(infraRoot, stackName)
return
}
func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
func tfPassthrough(verb string) func([]string) error {
return func(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
}
}
func tfFmt(args []string) error {
_, _, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
}
func tfForceUnlock(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
if len(rest) < 1 {
return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
}
return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
}
// tfApply applies a stack out-of-band: claim the stack on the presence board,
// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
// and warn that CI applies canonically on push.
func tfApply(args []string) error {
infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
label := "stack:" + stackName
fmt.Fprintf(os.Stderr,
"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
}
// Release exactly once, whether we exit normally, on error, or on signal —
// sync.Once makes the defer and the signal goroutine safe to both call it.
var once sync.Once
release := func() { once.Do(func() { _ = presenceRelease(label) }) }
defer release()
sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
go func() {
<-sig
release()
os.Exit(130)
}()
return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
}

27
cli/cmd_tf_test.go Normal file
View file

@ -0,0 +1,27 @@
package main
import (
"reflect"
"testing"
)
func TestFirstPositional(t *testing.T) {
cases := []struct {
args []string
wantName string
wantRest []string
}{
{[]string{"vault"}, "vault", []string{}},
{[]string{"--json", "vault"}, "vault", []string{"--json"}},
{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
{[]string{"--only-flags"}, "", []string{"--only-flags"}},
}
for _, c := range cases {
gotName, gotRest := firstPositional(c.args)
if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
c.args, gotName, gotRest, c.wantName, c.wantRest)
}
}
}

77
cli/cmd_usage.go Normal file
View file

@ -0,0 +1,77 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
)
func usageCommands() []Command {
return []Command{
{Path: []string{"usage", "top"}, Tier: TierRead,
Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
}
}
// usageQuery builds the LogQL metric query that counts invocations per verb.
func usageQuery(since, user string) string {
sel := `job="` + usageJob + `"`
if user != "" {
sel += `, user="` + user + `"`
}
return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}
func usageTop(args []string) error {
since := flagValue(args, "--since")
if since == "" {
since = "30d"
}
v := url.Values{}
v.Set("query", usageQuery(since, flagValue(args, "--user")))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
type row struct {
verb string
n int
}
var rows []row
for _, s := range r.Data.Result {
n := 0
if len(s.Value) == 2 {
if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
n = int(f)
}
}
rows = append(rows, row{s.Metric["verb"], n})
}
if len(rows) == 0 {
fmt.Println("(no usage recorded yet)")
return nil
}
sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
for _, r := range rows {
fmt.Printf("%6d %s\n", r.n, r.verb)
}
return nil
}

944
cli/cmd_vault.go Normal file
View file

@ -0,0 +1,944 @@
package main
import (
"bufio"
"encoding/base64"
"encoding/json"
"errors"
"fmt"
"os"
"os/exec"
"strings"
"syscall"
)
// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
// decryption is done by the official `bw` CLI. See
// docs/runbooks/homelab-vault-onboarding.md.
func vaultCommands() []Command {
cmds := []Command{
// Vaultwarden — your personal password manager (logins/passwords/TOTP).
{Path: []string{"vault", "setup"}, Tier: TierWrite,
Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
{Path: []string{"vault", "status"}, Tier: TierRead,
Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
{Path: []string{"vault", "list"}, Tier: TierRead,
Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
{Path: []string{"vault", "get"}, Tier: TierRead,
Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
{Path: []string{"vault", "search"}, Tier: TierRead,
Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
{Path: []string{"vault", "code"}, Tier: TierRead,
Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
{Path: []string{"vault", "lock"}, Tier: TierWrite,
Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
{Path: []string{"vault"}, Tier: TierRead,
Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
}
// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
return append(cmds, vaultKVCommands()...)
}
// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
// between the two unrelated "vaults" this command fronts, because the name
// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
// infra secrets store).
func vaultHelp() string {
return `homelab vault two different secret stores under one command:
Vaultwarden your personal PASSWORD MANAGER (logins / passwords / TOTP)
HashiCorp Vault / OpenBao homelab INFRA secrets (the secret/ KV store) 'vault kv '
Vaultwarden (reads YOUR OWN vault; no-HITL after one-time setup)
homelab vault setup one-time: store your master password + API key in your Vault path
homelab vault status configured / unlocked / reachable (no secrets)
homelab vault list [--search Q] list your item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
TTY clipboard (auto-clears); piped stdout
homelab vault get <name> --all all fields (incl. custom) as JSON; piped only.
TOTP shown as presence flag use 'vault code' for a code.
homelab vault code <name> current TOTP code
homelab vault lock lock / log out the local bw session
HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC vault token)
homelab vault kv get <path> [--field K] read an infra KV secret
homelab vault kv list <path> list sub-paths
homelab vault kv put <path> <key> write one key (value via stdin)
Vaultwarden creds live only in your own Vault path; the admin never sees them.
Security model: docs/runbooks/homelab-vault-onboarding.md
(note: anything running as your user can decrypt your vault the accepted no-HITL trade).
`
}
const vwUserPathPrefix = "secret/workstation/claude-users/"
// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
type vwCreds struct {
Email string
MasterPassword string
ClientID string
ClientSecret string
}
// cmdRunner shells out to an external command with an explicit environment and
// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
// a fake; realRunner is the production implementation.
type cmdRunner func(name string, argv, envv []string) (string, error)
func realRunner(name string, argv, envv []string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
out, err := cmd.Output()
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
// fetched secret with significant leading/trailing spaces is preserved.
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
}
// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
// write the actionable message there — "connection refused", "permission
// denied" — which the caller would otherwise never see behind a bare
// "exit status N".
func exitStderr(err error) []byte {
var ee *exec.ExitError
if errors.As(err, &ee) {
return ee.Stderr
}
return nil
}
// augmentErr appends captured stderr to an error so failures are diagnosable
// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
// when there's no stderr; preserves the wrapped error for errors.Is/As.
func augmentErr(err error, stderr []byte) error {
if err == nil {
return nil
}
if s := strings.TrimSpace(string(stderr)); s != "" {
return fmt.Errorf("%w: %s", err, s)
}
return err
}
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
// processes). Used by setup to write the master password / client_secret.
func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
cmd.Stdin = strings.NewReader(stdin)
out, err := cmd.Output()
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
}
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
// readVaultField returns one field from a KV-v2 path, "" if absent/error.
func readVaultField(run cmdRunner, field, path string) string {
out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
if err != nil {
return ""
}
return out
}
// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
// A missing master password means the user hasn't onboarded.
func loadCreds(run cmdRunner, user string) (vwCreds, error) {
p := vwCredsPath(user)
c := vwCreds{
Email: readVaultField(run, "vaultwarden_email", p),
MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
ClientID: readVaultField(run, "vaultwarden_client_id", p),
ClientSecret: readVaultField(run, "vaultwarden_client_secret", p),
}
if c.MasterPassword == "" {
return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
}
return c, nil
}
// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
var vaultCurrentUser = func() string { return os.Getenv("USER") }
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
func scopedTokenPath(home string) string {
return home + "/.config/claude-auth-sync/vault-token"
}
// vaultTokenSource decides which Vault token the `vault` child processes should
// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
// (policy workstation-claude-<user>, which grants exactly the create/read/update
// this tool needs on the user's own path), then a native ~/.vault-token.
//
// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
// caller's own secret/workstation/claude-users/<user> path, and a power-user who
// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
// capability on that path is `deny` — letting it win shadows the scoped token
// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
// right credential when there is no scoped token (admins). Returns the token to
// export — "" when the vault CLI should read the ambient/native credential —
// plus a source tag for tests/logging.
func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
switch {
case envToken != "":
return "", "env"
case strings.TrimSpace(scopedToken) != "":
return strings.TrimSpace(scopedToken), "scoped"
case haveVaultTokenFile:
return "", "file"
default:
return "", "none"
}
}
// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
// is likewise hardcoded (openSession), so a sane default here is consistent.
const vaultAddrDefault = "https://vault.viktorbarzin.me"
// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
// doesn't already set one, else "". homelab vault is invoked by AFK agent
// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
func vaultAddrToSet(envAddr string) string {
if strings.TrimSpace(envAddr) == "" {
return vaultAddrDefault
}
return ""
}
// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
// child processes reach the cluster Vault regardless of the caller's shell. An
// explicit VAULT_ADDR (admins, CI) is left untouched.
func ensureVaultAddr() {
if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
os.Setenv("VAULT_ADDR", a)
}
}
// fileNonEmpty reports whether path exists and has content.
func fileNonEmpty(path string) bool {
fi, err := os.Stat(path)
return err == nil && fi.Size() > 0
}
// ensureVaultToken wires vaultTokenSource to the real environment: when the user
// has no ambient Vault credential, it exports the claude-auth-sync scoped token
// so the `vault` child processes authenticate as workstation-claude-<user>. It
// is idempotent and safe for admins, whose explicit $VAULT_TOKEN / ~/.vault-token
// take precedence and are left untouched.
func ensureVaultToken() {
// Every vault verb funnels through here, so this is the one place that also
// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
// assumed from the caller's shell).
ensureVaultAddr()
home := os.Getenv("HOME")
scoped, _ := os.ReadFile(scopedTokenPath(home))
tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
if src == "scoped" {
os.Setenv("VAULT_TOKEN", tok)
}
}
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
// do NOT inherit the full parent env (keeps stray secrets out of the child).
func bwBaseEnv(appdata string) []string {
path := os.Getenv("PATH")
if path == "" {
path = "/usr/local/bin:/usr/bin:/bin"
}
return []string{
"PATH=" + path,
"HOME=" + os.Getenv("HOME"),
"BITWARDENCLI_APPDATA_DIR=" + appdata,
"BW_NOINTERACTION=true",
}
}
// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
func bwSecretEnv(appdata string, c vwCreds, session string) []string {
env := bwBaseEnv(appdata)
env = append(env,
"BW_CLIENTID="+c.ClientID,
"BW_CLIENTSECRET="+c.ClientSecret,
"BW_PASSWORD="+c.MasterPassword,
)
if session != "" {
env = append(env, "BW_SESSION="+session)
}
return env
}
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
func bwItemArgs(name string) []string { return []string{"get", "item", name} }
func bwStatusArgs() []string { return []string{"status"} }
func bwSyncArgs() []string { return []string{"sync"} }
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
// required. Unparseable/empty output → true (safer to attempt login).
func bwNeedsLogin(statusJSON string) bool {
var s struct {
Status string `json:"status"`
}
if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
return true
}
return s.Status == "unauthenticated" || s.Status == ""
}
func bwListArgs(search string) []string {
a := []string{"list", "items"}
if search != "" {
a = append(a, "--search", search)
}
return a
}
// bwUnlock runs `bw unlock` and returns the raw session key.
func bwUnlock(run cmdRunner, env []string) (string, error) {
out, err := run("bw", bwUnlockArgs(), env)
if err != nil {
return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
}
return out, nil
}
// bwGet fetches one field of one item; session must be present in env.
func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
return run("bw", bwGetArgs(field, name), env)
}
func returnMode(isTTY bool) string {
if isTTY {
return "clipboard"
}
return "stdout"
}
// stdoutIsTTY reports whether stdout is a character device (a terminal).
func stdoutIsTTY() bool {
fi, err := os.Stdout.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
// to stderr, so the clipboard path is only viable when stderr is a terminal).
func stderrIsTTY() bool {
fi, err := os.Stderr.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
// the system clipboard (works over SSH; no X11). osc52clear copies empty.
func osc52(payload string) string {
return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}
func osc52clear() string { return "\x1b]52;c;\a" }
// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
// else we'd dump the secret's base64 into scrollback on unsupported terminals.
func terminalAllowed(term, termProgram string) bool {
t := strings.ToLower(term)
p := strings.ToLower(termProgram)
for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
if strings.Contains(t, ok) || strings.Contains(p, ok) {
return true
}
}
// xterm proper supports it only when the program is a known-good emulator.
return false
}
// opRecord is one CLI operation. ItemName is accepted for the caller's
// convenience but is INTENTIONALLY never rendered into the log line — auditing
// which of your own logins you opened is itself sensitive, and per-item reads
// are invisible server-side anyway (spec §9a).
type opRecord struct {
User string
Verb string
PID int
PPID int
ParentComm string
ItemName string // never logged
}
func opLogLine(r opRecord) string {
return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
}
// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
func parentComm(ppid int) string {
b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
if err != nil {
return ""
}
return strings.TrimSpace(string(b))
}
// writeOpLog appends one privacy-aware line to the user's op-log (best-effort;
// never blocks or fails the command). Goes to syslog so it ships to Loki.
func writeOpLog(r opRecord) {
exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
}
func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
// password to a core file. Best-effort.
func hardenProcess() {
_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
}
// withUserLock serializes bw mutations for this user (concurrent Claude sessions
// as the same user otherwise race bw's appdata). Returns an unlock func.
func withUserLock(uid string) (func(), error) {
f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
if err != nil {
return nil, err
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
f.Close()
return nil, err
}
return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
}
// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
type session struct {
env []string
}
// openSession resolves creds, ensures login, unlocks, and returns a ready env.
// Caller must hold the user lock. appdata is created on tmpfs (0700).
func openSession(run cmdRunner, user, uid string) (session, error) {
creds, err := loadCreds(run, user)
if err != nil {
return session{}, err
}
appdata := bwAppDataDir(uid)
if err := os.MkdirAll(appdata, 0700); err != nil {
return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
}
loginEnv := bwSecretEnv(appdata, creds, "")
// Ensure server is set and we're logged in (idempotent; ignore "already").
_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
st, _ := run("bw", bwStatusArgs(), loginEnv)
if bwNeedsLogin(st) {
if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
}
}
sess, err := bwUnlock(run, loginEnv)
if err != nil {
return session{}, err
}
sessEnv := bwSecretEnv(appdata, creds, sess)
// Pull the latest server-side state so reads reflect current values. `bw
// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
// session would otherwise serve stale data until the next login. Best-effort:
// a transient sync failure must not break a read — fall back to the cached
// vault and warn (status reports reachability separately).
if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
}
return session{env: sessEnv}, nil
}
type getOpts struct {
name string
field string
json bool
all bool // dump every field (incl. custom) as normalized JSON
}
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
func parseGetArgs(args []string) (getOpts, error) {
o := getOpts{field: "password"}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--json":
o.json = true
case a == "--all":
o.all = true
case a == "--field" && i+1 < len(args):
o.field = args[i+1]
i++
case strings.HasPrefix(a, "--field="):
o.field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && o.name == "":
o.name = a
}
}
if o.name == "" {
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
}
// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
if !o.all && !validGetFields[o.field] {
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
}
return o, nil
}
// getValue opens a session and fetches one field. Pure of I/O side effects
// besides the runner, so it is unit-tested with a fake runner.
func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return bwGet(run, s.env, o.field, o.name)
}
// getItem opens a session and returns the whole item as raw `bw get item` JSON.
// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
func getItem(run cmdRunner, user, uid, name string) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return run("bw", bwItemArgs(name), s.env)
}
// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
// standard login fields that are present, notes, and a flat map of custom field
// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
// stays the specially-audited `vault code` (see the design §10/§16).
type normalizedItem struct {
Name string `json:"name"`
Username string `json:"username,omitempty"`
Password string `json:"password,omitempty"`
URIs []string `json:"uris,omitempty"`
TOTP bool `json:"totp,omitempty"` // presence only, never the seed
Notes string `json:"notes,omitempty"`
Fields map[string]string `json:"fields,omitempty"` // custom field name→value
}
// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
// references another field and carries a null value, so it is not real data.
const bwFieldLinked = 3
// normalizeItem parses a `bw get item` payload into the browse projection. It is
// pure (no I/O), so it is the unit-tested heart of `get --all`.
func normalizeItem(raw string) (normalizedItem, error) {
var it struct {
Name string `json:"name"`
Notes string `json:"notes"`
Login *struct {
Username string `json:"username"`
Password string `json:"password"`
Totp string `json:"totp"`
URIs []struct {
URI string `json:"uri"`
} `json:"uris"`
} `json:"login"`
Fields []struct {
Name string `json:"name"`
Value string `json:"value"`
Type int `json:"type"`
} `json:"fields"`
}
if err := json.Unmarshal([]byte(raw), &it); err != nil {
return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
}
n := normalizedItem{Name: it.Name, Notes: it.Notes}
if it.Login != nil {
n.Username = it.Login.Username
n.Password = it.Login.Password
n.TOTP = it.Login.Totp != ""
for _, u := range it.Login.URIs {
if u.URI != "" {
n.URIs = append(n.URIs, u.URI)
}
}
}
for _, f := range it.Fields {
if f.Type == bwFieldLinked {
continue // references another field, no value of its own
}
if n.Fields == nil {
n.Fields = map[string]string{}
}
n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
}
return n, nil
}
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
// base64 into scrollback, or silently fail because the OSC52 escape goes to a
// non-terminal stderr).
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
if !stdoutTTY {
return "stdout"
}
if terminalAllowed(term, termProgram) && stderrTTY {
return "clipboard"
}
return "refuse"
}
// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
// when stdout is NOT a terminal (i.e. piped to a machine consumer).
func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
// secret to a terminal's stdout/scrollback.
func emitSecret(value string) {
switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
case "stdout":
fmt.Println(value)
case "clipboard":
fmt.Fprint(os.Stderr, osc52(value))
fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
clearClipboardAfter(30)
default: // refuse
fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
}
}
// clearClipboardAfter spawns a detached background clear so the secret doesn't
// linger in the clipboard. Best-effort.
func clearClipboardAfter(seconds int) {
exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s'", seconds, osc52clear())).Start()
}
// listNames extracts "name (id)" from `bw list items` JSON; never values.
func listNames(jsonOut string) []string {
var items []struct {
ID string `json:"id"`
Name string `json:"name"`
}
if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
return nil
}
out := make([]string, 0, len(items))
for _, it := range items {
out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
}
return out
}
func runList(run cmdRunner, user, uid, search string) ([]string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return nil, err
}
out, err := run("bw", bwListArgs(search), s.env)
if err != nil {
return nil, err
}
return listNames(out), nil
}
func vaultList(args []string) error {
hardenProcess()
ensureVaultToken()
search := ""
for i := 0; i < len(args); i++ {
if args[i] == "--search" && i+1 < len(args) {
search = args[i+1]
i++
} else if strings.HasPrefix(args[i], "--search=") {
search = strings.TrimPrefix(args[i], "--search=")
}
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
names, err := runList(realRunner, vaultCurrentUser(), uid, search)
if err != nil {
return err
}
for _, n := range names {
fmt.Println(n)
}
return nil
}
func vaultSearch(args []string) error {
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault search <query>")
}
return vaultList([]string{"--search", strings.Join(args, " ")})
}
func vaultCode(args []string) error {
hardenProcess()
ensureVaultToken()
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault code <name>")
}
name := args[0]
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
if err != nil {
return err
}
// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
emitSecret(val)
return nil
}
// statusSummary reports config/reachability without revealing secrets.
func statusSummary(run cmdRunner, user, uid string) string {
if _, err := loadCreds(run, user); err != nil {
return "vault: not configured — run `homelab vault setup`"
}
s, err := openSession(run, user, uid)
if err != nil {
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
}
// openSession already did a best-effort sync; status re-runs it explicitly so
// a reachability failure surfaces in this report rather than only on stderr.
if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
}
return "vault: configured, unlocked, reachable ✓"
}
func vaultStatus(args []string) error {
hardenProcess()
ensureVaultToken()
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
return nil
}
func vaultLock(args []string) error {
uid := vaultCurrentUID()
unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
if err != nil {
return err
}
defer unlock()
appdata := bwAppDataDir(uid)
_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
if logoutErr == nil {
fmt.Println("locked")
}
return nil // lock/logout best-effort; never error the caller
}
// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
// (read-modify-write: needs only read+update, NOT the `patch` capability the
// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
// (creates the path on first use, before any sibling keys exist).
func kvWriteVerb(merge bool) []string {
if merge {
return []string{"kv", "patch", "-method=rw"}
}
return []string{"kv", "put"}
}
// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
// email nor the API client_id is a usable credential on its own.
func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
return append(kvWriteVerb(merge), vwCredsPath(user),
"vaultwarden_email="+email,
"vaultwarden_client_id="+clientID,
)
}
// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
// realRunnerStdin.
func vaultWriteSecretArgs(merge bool, user, key string) []string {
return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
}
// credsPathExists reports whether the user's KV path already holds data. Used to
// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
// user could run `homelab vault setup` before that ever happens.
func credsPathExists(run cmdRunner, user string) bool {
_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
return err == nil
}
// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
// writeCreds stores all four fields in the user's Vault path using only the
// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
// first (public) write creates the path when absent; the two real secrets then
// merge in via read-modify-write so the public keys — and any claude-auth-sync
// keys already present — survive. Secret values travel on stdin, never argv.
func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
merge := credsPathExists(run, user)
if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
return err
}
// The path now exists regardless of the branch above → merge the secrets in.
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
return err
}
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
return err
}
return nil
}
// promptNoEcho reads one line without terminal echo (for the master password).
func promptNoEcho(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
exec.Command("stty", "-echo").Run()
defer func() { exec.Command("stty", "echo").Run(); fmt.Fprintln(os.Stderr) }()
r := bufio.NewReader(os.Stdin)
line, err := r.ReadString('\n')
// Trim only the line terminator — a master password / API secret may
// legitimately contain leading/trailing spaces.
return strings.TrimRight(line, "\r\n"), err
}
func promptLine(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
line, err := bufio.NewReader(os.Stdin).ReadString('\n')
return strings.TrimSpace(line), err
}
func vaultSetup(args []string) error {
hardenProcess()
ensureVaultToken()
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
email, err := promptLine("Vaultwarden email: ")
if err != nil {
return err
}
clientID, err := promptLine("API key client_id (user.xxxx): ")
if err != nil {
return err
}
clientSecret, err := promptNoEcho("API key client_secret: ")
if err != nil {
return err
}
master, err := promptNoEcho("Master password: ")
if err != nil {
return err
}
if master == "" || clientID == "" || clientSecret == "" {
return fmt.Errorf("all fields are required")
}
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
}
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
}
fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
return nil
}
func vaultGet(args []string) error {
hardenProcess()
ensureVaultToken()
o, err := parseGetArgs(args)
if err != nil {
return err
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
if o.all {
return getAllFields(user, uid, o.name)
}
val, err := getValue(realRunner, user, uid, o)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
if o.json {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
}
fmt.Printf("{%q:%q}\n", o.field, val)
return nil
}
emitSecret(val)
return nil
}
// getAllFields prints every field of one item as normalized JSON. Like
// `get --json`, the payload is all secret values, so it refuses a terminal
// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
// distinguishable from a single-field get (the item name is still never logged).
func getAllFields(user, uid, name string) error {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
}
raw, err := getItem(realRunner, user, uid, name)
if err != nil {
return err
}
item, err := normalizeItem(raw)
if err != nil {
return err
}
out, err := json.Marshal(item)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
fmt.Println(string(out))
return nil
}

248
cli/cmd_vault_kv.go Normal file
View file

@ -0,0 +1,248 @@
package main
import (
"encoding/json"
"fmt"
"io"
"os"
"strings"
)
// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
//
// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
// token (bound only to secret/workstation/claude-users/<user>). A general kv read
// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
// injects the scoped token). Access is then whatever the caller's policy grants.
func vaultKVCommands() []Command {
return []Command{
{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
{Path: []string{"vault", "kv"}, Tier: TierRead,
Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
Run: func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
}
}
func vaultKVHelp() string {
return `homelab vault kv HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/ KV store)
homelab vault kv get <path> [--field K] read a secret
--field K one value (TTY clipboard; piped stdout)
no --field all fields as JSON (piped only)
homelab vault kv list <path> list sub-paths under <path> (no values)
homelab vault kv put <path> <key> write one key; value read from stdin
(piped, or no-echo prompt); merges never clobbers siblings
Uses YOUR Vault token (vault login -method=oidc ~/.vault-token); access is
whatever your policy grants. This is NOT Vaultwarden for your personal logins
use 'homelab vault get' (see 'homelab vault').
`
}
// --- arg builders (pure; values never travel via argv) --------------------
func vaultKVGetFieldArgs(path, field string) []string {
return []string{"kv", "get", "-field=" + field, path}
}
func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
func vaultKVListArgs(path string) []string { return []string{"kv", "list", "-format=json", path} }
// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
// (read-modify-write: merges, needs only read+update — not the `patch` capability
// — and preserves sibling keys); merge=false → `kv put` (creates the path on
// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
func vaultKVPutArgs(merge bool, path, key string) []string {
return append(kvWriteVerb(merge), path, key+"=-")
}
// --- pure parsers ----------------------------------------------------------
// extractKVData returns the inner secret object from a `vault kv get -format=json`
// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
// wrapper so only the secret's own key→value data is emitted.
func extractKVData(jsonOut string) (string, error) {
var env struct {
Data struct {
Data json.RawMessage `json:"data"`
} `json:"data"`
}
if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
return "", fmt.Errorf("parse vault kv json: %w", err)
}
if len(env.Data.Data) == 0 {
return "", fmt.Errorf("no secret data at that path")
}
return string(env.Data.Data), nil
}
// parseKVList parses the JSON array `vault kv list -format=json` prints.
func parseKVList(jsonOut string) ([]string, error) {
var keys []string
if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
return nil, fmt.Errorf("parse vault kv list json: %w", err)
}
return keys, nil
}
// --- testable cores (injected cmdRunner) -----------------------------------
func kvGetField(run cmdRunner, path, field string) (string, error) {
return run("vault", vaultKVGetFieldArgs(path, field), nil)
}
func kvGetJSON(run cmdRunner, path string) (string, error) {
out, err := run("vault", vaultKVGetJSONArgs(path), nil)
if err != nil {
return "", err
}
return extractKVData(out)
}
func kvList(run cmdRunner, path string) ([]string, error) {
out, err := run("vault", vaultKVListArgs(path), nil)
if err != nil {
return nil, err
}
return parseKVList(out)
}
// kvPathExists reports whether the KV path already holds data, to pick create
// (`kv put`) vs merge (`kv patch -method=rw`) — so a write never clobbers
// sibling keys on an existing path.
func kvPathExists(run cmdRunner, path string) bool {
_, err := run("vault", vaultKVGetJSONArgs(path), nil)
return err == nil
}
// kvPut writes one key, creating the path when absent and merging when present.
// The value travels on stdin only (never argv).
func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
merge := kvPathExists(run, path)
_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
return err
}
// --- handlers --------------------------------------------------------------
func vaultKVGet(args []string) error {
hardenProcess()
ensureVaultAddr() // own token, NOT the scoped one (see file header)
var path, field string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--field" && i+1 < len(args):
field = args[i+1]
i++
case strings.HasPrefix(a, "--field="):
field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && path == "":
path = a
}
}
if path == "" {
return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
}
if field != "" {
val, err := kvGetField(realRunner, path, field)
if err != nil {
return err
}
emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
return nil
}
// No --field → the whole secret. All values, so refuse a bare TTY (like
// `vault get --json`): pick a --field for the clipboard path, or pipe it.
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
}
out, err := kvGetJSON(realRunner, path)
if err != nil {
return err
}
fmt.Println(out)
return nil
}
func vaultKVList(args []string) error {
ensureVaultAddr()
var path string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
path = a
break
}
}
if path == "" {
return fmt.Errorf("usage: homelab vault kv list <path>")
}
keys, err := kvList(realRunner, path)
if err != nil {
return err
}
for _, k := range keys {
fmt.Println(k)
}
return nil
}
func vaultKVPut(args []string) error {
hardenProcess()
ensureVaultAddr()
var path, key string
for _, a := range args {
if strings.HasPrefix(a, "-") {
continue
}
switch {
case path == "":
path = a
case key == "":
key = a
}
}
if path == "" || key == "" {
return fmt.Errorf("usage: homelab vault kv put <path> <key> (value read from stdin)")
}
value, err := readSecretValue("Value for " + key + ": ")
if err != nil {
return err
}
if value == "" {
return fmt.Errorf("empty value; aborting (nothing written)")
}
if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
}
fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
return nil
}
// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
// is read verbatim (trailing newline trimmed, internal newlines preserved so
// multi-line values like PEM keys survive); an interactive TTY is prompted
// without echo.
func readSecretValue(prompt string) (string, error) {
fi, err := os.Stdin.Stat()
if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
b, rerr := io.ReadAll(os.Stdin)
if rerr != nil {
return "", rerr
}
return strings.TrimRight(string(b), "\r\n"), nil
}
return promptNoEcho(prompt)
}

1057
cli/cmd_vault_test.go Normal file

File diff suppressed because it is too large Load diff

212
cli/cmd_work.go Normal file
View file

@ -0,0 +1,212 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
func workCommands() []Command {
return []Command{
{Path: []string{"work", "start"}, Tier: TierWrite,
Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
{Path: []string{"work", "land"}, Tier: TierWrite,
Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
{Path: []string{"work", "clean"}, Tier: TierWrite,
Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
}
}
// flagValue extracts `--name value` or `--name=value` from args.
func flagValue(args []string, name string) string {
for i, a := range args {
if a == name && i+1 < len(args) {
return args[i+1]
}
if strings.HasPrefix(a, name+"=") {
return strings.TrimPrefix(a, name+"=")
}
}
return ""
}
func remotesOrEmpty(repoRoot string) []string {
r, _ := gitRemotes(repoRoot)
return r
}
// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
func workStart(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work start <topic>")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
branch := currentUser() + "/" + topic
wtRel := filepath.Join(".worktrees", topic)
ensureWorktreesIgnored(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch %s failed: %w", remote, err)
}
if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
return fmt.Errorf("worktree add failed: %w", err)
}
wtPath := filepath.Join(repoRoot, wtRel)
fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
return nil
}
// workLand integrates the current branch into master: fetch, merge master in,
// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
// fallback when the direct push is rejected (e.g. branch protection).
func workLand(args []string) error {
verifyCmd := flagValue(args, "--verify-cmd")
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
if err != nil {
return err
}
if branch == "master" || branch == "main" {
return fmt.Errorf("refusing to land: already on %s", branch)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch failed: %w", err)
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
}
if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
return fmt.Errorf("not landing: %w", err)
}
if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
return landFallback(repoRoot, flags, remote, branch, err)
}
fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
if containsArg(args, "--no-ci-watch") {
fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
return nil
}
landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD")
fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
if err := ciWatch([]string{landed}); err != nil {
return fmt.Errorf("landed, but CI did not go green: %w", err)
}
return nil
}
// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
// neither is available it REFUSES (returns an error) unless allowSkip is set —
// landing to master unverified must be a deliberate choice (--no-verify).
func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
if verifyCmd != "" {
fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
}
if isFile(filepath.Join(repoRoot, "go.mod")) {
fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
return runStreamingIn(repoRoot, "go", "test", "./...")
}
if allowSkip {
fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
return nil
}
return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
}
// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
// by fetching + merging master and retrying.
func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
var lastErr error
for i := 0; i < attempts; i++ {
if err := gitStream(repoRoot, flags, "push", remote, "HEAD:master"); err == nil {
return nil
} else {
lastErr = err
}
if i < attempts-1 {
fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return err
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return err
}
}
}
return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
}
// landFallback pushes the feature branch when the direct master push is rejected
// (e.g. branch protection), so the work isn't lost and a PR can be opened.
func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
return fmt.Errorf("fallback branch push also failed: %w", err)
}
fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
return nil
}
// workClean removes a task's worktree and branch. Run from the main checkout.
func workClean(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work clean <topic> (run from the main checkout)")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
flags := cryptFlagsFor(repoRoot)
wtRel := filepath.Join(".worktrees", topic)
branch := currentUser() + "/" + topic
if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
}
if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
}
fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
return nil
}
// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
func ensureWorktreesIgnored(repoRoot string) {
if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
return
}
gi := filepath.Join(repoRoot, ".gitignore")
f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
if err != nil {
return
}
defer f.Close()
if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
}
}

32
cli/cmd_work_test.go Normal file
View file

@ -0,0 +1,32 @@
package main
import "testing"
func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
dir := t.TempDir() // no go.mod, no verify cmd
if err := runVerify(dir, "", false); err == nil {
t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
}
if err := runVerify(dir, "", true); err != nil {
t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
}
}
func TestFlagValue(t *testing.T) {
cases := []struct {
args []string
name string
want string
}{
{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
{[]string{"topic"}, "--verify-cmd", ""},
{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
}
for _, c := range cases {
if got := flagValue(c.args, c.name); got != c.want {
t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
}
}
}

104
cli/command.go Normal file
View file

@ -0,0 +1,104 @@
package main
import (
"encoding/json"
"fmt"
"sort"
"strings"
)
// Tier classifies whether a command observes (read) or mutates (write) state.
// v0.1 allows everything; the tier is recorded so a classifier hook can gate
// writes later without restructuring (see docs/adr/0005).
type Tier string
const (
TierRead Tier = "read"
TierWrite Tier = "write"
)
// Command is one homelab verb. Path is the token sequence that selects it,
// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
type Command struct {
Path []string
Tier Tier
Summary string
Run func(args []string) error
}
// dispatch routes args to the command whose Path is the longest matching prefix
// of args, passing the remaining args to its Run.
func dispatch(reg []Command, args []string) error {
best := -1
bestLen := 0
for i, c := range reg {
if len(c.Path) > len(args) {
continue
}
match := true
for j, p := range c.Path {
if args[j] != p {
match = false
break
}
}
if match && len(c.Path) >= bestLen {
best = i
bestLen = len(c.Path)
}
}
if best < 0 {
return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
}
matched := reg[best]
runErr := matched.Run(args[bestLen:])
emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
return runErr
}
// name is the space-joined verb path, e.g. "tf plan".
func (c Command) name() string { return strings.Join(c.Path, " ") }
// sortedByName returns a copy of reg ordered by verb path for stable output.
func sortedByName(reg []Command) []Command {
out := make([]Command, len(reg))
copy(out, reg)
sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
return out
}
// manifestText renders one aligned line per command: "<path> <tier> <summary>".
// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
func manifestText(reg []Command) string {
cmds := sortedByName(reg)
width := 0
for _, c := range cmds {
if n := len(c.name()); n > width {
width = n
}
}
var b strings.Builder
for _, c := range cmds {
fmt.Fprintf(&b, "%-*s %-5s %s\n", width, c.name(), c.Tier, c.Summary)
}
return b.String()
}
// manifestJSON renders the registry as a JSON array of {command, tier, summary}
// so agents can parse the full surface in one call.
func manifestJSON(reg []Command) (string, error) {
type entry struct {
Command string `json:"command"`
Tier string `json:"tier"`
Summary string `json:"summary"`
}
entries := make([]entry, 0, len(reg))
for _, c := range sortedByName(reg) {
entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
}
b, err := json.MarshalIndent(entries, "", " ")
if err != nil {
return "", err
}
return string(b), nil
}

73
cli/command_test.go Normal file
View file

@ -0,0 +1,73 @@
package main
import (
"encoding/json"
"reflect"
"strings"
"testing"
)
// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
// command whose Path is the longest matching prefix of the input tokens, and
// hand the command the remaining args.
func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
var gotArgs []string
ran := ""
reg := []Command{
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
}
if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
t.Fatalf("dispatch returned error: %v", err)
}
if ran != "tf plan" {
t.Fatalf("routed to %q, want %q", ran, "tf plan")
}
if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
t.Fatalf("command got args %v, want %v", gotArgs, want)
}
}
func TestDispatchUnknownCommandErrors(t *testing.T) {
reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
if err := dispatch(reg, []string{"bogus"}); err == nil {
t.Fatal("expected error for unknown command, got nil")
}
}
// The manifest is the progressive-discovery entrypoint: one line per command
// showing the full verb path, its tier, and summary, sorted for stable output.
func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
reg := []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
}
out := manifestText(reg)
for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
if !strings.Contains(out, want) {
t.Errorf("manifest text missing %q\n---\n%s", want, out)
}
}
// sorted: claim (c) must appear before tf plan (t)
if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
t.Errorf("manifest not sorted by path:\n%s", out)
}
}
func TestManifestJSONIsParsableAndTagged(t *testing.T) {
reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
out, err := manifestJSON(reg)
if err != nil {
t.Fatalf("manifestJSON error: %v", err)
}
var got []map[string]string
if err := json.Unmarshal([]byte(out), &got); err != nil {
t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
}
if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
t.Fatalf("unexpected manifest JSON: %v", got)
}
}

164
cli/edges.go Normal file
View file

@ -0,0 +1,164 @@
package main
import (
"fmt"
"regexp"
"strconv"
"strings"
)
// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
// investigation helper over the goldmane_edges trail; see ADR-0014).
type edgesOpts struct {
ns string // edges touching this namespace (either direction)
src string // edges where src_ns = this
dst string // edges where dst_ns = this
peersOf string // distinct peers of this namespace (both directions)
newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
denied bool // action = 'deny' only
asJSON bool // wrap result as a JSON array
limit int // row cap (default 200)
}
// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
// typo surfaces instead of silently dumping the whole table.
func parseEdgesArgs(args []string) (edgesOpts, error) {
o := edgesOpts{limit: 200}
i := 0
for i < len(args) {
a := args[i]
key, inline, hasInline := a, "", false
if eq := strings.IndexByte(a, '='); eq >= 0 {
key, inline, hasInline = a[:eq], a[eq+1:], true
}
needVal := func() (string, error) {
if hasInline {
return inline, nil
}
if i+1 < len(args) {
i++
return args[i], nil
}
return "", fmt.Errorf("flag %s needs a value", key)
}
var err error
switch key {
case "--ns":
o.ns, err = needVal()
case "--src":
o.src, err = needVal()
case "--dst":
o.dst, err = needVal()
case "--peers-of":
o.peersOf, err = needVal()
case "--new-since":
o.newSince, err = needVal()
case "--denied":
o.denied = true
case "--json":
o.asJSON = true
case "--limit":
var v string
if v, err = needVal(); err == nil {
if o.limit, err = strconv.Atoi(v); err != nil {
err = fmt.Errorf("--limit must be an integer: %q", v)
}
}
default:
return o, fmt.Errorf("unknown flag: %s", a)
}
if err != nil {
return o, err
}
i++
}
return o, nil
}
// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
// injection guard — anything else is rejected rather than quoted-and-hoped.
var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
func validateNS(s string) error {
if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
return fmt.Errorf("invalid namespace name: %q", s)
}
return nil
}
// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
var (
durRE = regexp.MustCompile(`^(\d+)([smhd])$`)
dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
)
// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM])
// into a first_seen predicate.
func newSinceCond(v string) (string, error) {
if m := durRE.FindStringSubmatch(v); m != nil {
unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
}
if dateRE.MatchString(v) {
return "first_seen >= " + sqlStr(v), nil
}
return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
}
// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
func buildEdgesQuery(o edgesOpts) (string, error) {
limit := o.limit
if limit <= 0 {
limit = 200
}
// peers-of is a distinct-peer summary, a different shape from the row list.
if o.peersOf != "" {
if err := validateNS(o.peersOf); err != nil {
return "", err
}
p := sqlStr(o.peersOf)
return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
") t ORDER BY peer LIMIT %d", p, p, limit), nil
}
var conds []string
for _, f := range []struct{ val, tmpl string }{
{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
{o.src, "src_ns = %s"},
{o.dst, "dst_ns = %s"},
} {
if f.val == "" {
continue
}
if err := validateNS(f.val); err != nil {
return "", err
}
conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
}
if o.denied {
conds = append(conds, "action = 'deny'")
}
if o.newSince != "" {
c, err := newSinceCond(o.newSince)
if err != nil {
return "", err
}
conds = append(conds, c)
}
q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
if len(conds) > 0 {
q += " WHERE " + strings.Join(conds, " AND ")
}
q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
if o.asJSON {
q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
}
return q, nil
}

163
cli/edges_test.go Normal file
View file

@ -0,0 +1,163 @@
package main
import (
"strings"
"testing"
)
func TestParseEdgesArgs(t *testing.T) {
cases := []struct {
name string
args []string
want edgesOpts
}{
{"defaults", nil, edgesOpts{limit: 200}},
{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
got, err := parseEdgesArgs(c.args)
if err != nil {
t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
}
if got != c.want {
t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
}
})
}
}
func TestParseEdgesArgsErrors(t *testing.T) {
for _, args := range [][]string{
{"--limit", "abc"},
{"--bogus"},
} {
if _, err := parseEdgesArgs(args); err == nil {
t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
}
}
}
func TestBuildEdgesQueryDefaults(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{limit: 200})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
if !strings.Contains(q, want) {
t.Errorf("query %q missing %q", q, want)
}
}
if strings.Contains(q, "WHERE") {
t.Errorf("no-filter query should have no WHERE: %q", q)
}
}
func TestBuildEdgesQueryFilters(t *testing.T) {
cases := []struct {
name string
o edgesOpts
want string
}{
{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
q, err := buildEdgesQuery(c.o)
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
t.Errorf("query %q missing WHERE/%q", q, c.want)
}
})
}
}
func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
t.Errorf("combined filters not AND'd: %q", q)
}
}
func TestBuildEdgesQueryPeersOf(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
if !strings.Contains(q, want) {
t.Errorf("peers-of query %q missing %q", q, want)
}
}
}
func TestBuildEdgesQueryJSON(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
t.Errorf("json query missing json_agg wrapper: %q", q)
}
}
func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
}
}
}
func TestNewSinceCond(t *testing.T) {
cases := []struct {
in string
want string
}{
{"24h", "first_seen >= now() - interval '24 hours'"},
{"7d", "first_seen >= now() - interval '7 days'"},
{"30m", "first_seen >= now() - interval '30 minutes'"},
{"2026-06-28", "first_seen >= '2026-06-28'"},
}
for _, c := range cases {
got, err := newSinceCond(c.in)
if err != nil {
t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
}
if got != c.want {
t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
}
}
for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
if _, err := newSinceCond(bad); err == nil {
t.Errorf("newSinceCond(%q) expected error, got nil", bad)
}
}
}
func TestValidateNS(t *testing.T) {
for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
if err := validateNS(ok); err != nil {
t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
}
}
for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
if err := validateNS(bad); err == nil {
t.Errorf("validateNS(%q) expected error, got nil", bad)
}
}
}

99
cli/homelab.go Normal file
View file

@ -0,0 +1,99 @@
package main
import (
"fmt"
"strings"
)
// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
var version = "dev"
// buildRegistry returns every homelab verb. New verb-groups append here.
func buildRegistry() []Command {
var reg []Command
reg = append(reg, claimCommands()...)
reg = append(reg, tfCommands()...)
reg = append(reg, workCommands()...)
reg = append(reg, k8sCommands()...)
reg = append(reg, memoryCommands()...)
reg = append(reg, ciCommands()...)
reg = append(reg, deployCommands()...)
reg = append(reg, netCommands()...)
reg = append(reg, obsCommands()...)
reg = append(reg, edgesCommands()...)
reg = append(reg, usageCommands()...)
reg = append(reg, haCommands()...)
reg = append(reg, browserCommands()...)
reg = append(reg, vaultCommands()...)
return reg
}
// dispatchTop handles the homelab verb surface. handled=false means the args are
// not a homelab verb, so main() falls back to the legacy -use-case path.
func dispatchTop(args []string) (handled bool, err error) {
if len(args) == 0 {
fmt.Print(usage())
return true, nil
}
switch args[0] {
case "help", "-h", "--help":
fmt.Print(usage())
return true, nil
case "version", "--version":
fmt.Println("homelab " + version)
return true, nil
case "manifest":
reg := buildRegistry()
if containsArg(args[1:], "--json") {
out, err := manifestJSON(reg)
if err != nil {
return true, err
}
fmt.Println(out)
return true, nil
}
fmt.Print(manifestText(reg))
return true, nil
}
if strings.HasPrefix(args[0], "-") {
return false, nil
}
reg := buildRegistry()
if !isCommandGroup(reg, args[0]) {
return false, nil
}
return true, dispatch(reg, args)
}
func isCommandGroup(reg []Command, group string) bool {
for _, c := range reg {
if len(c.Path) > 0 && c.Path[0] == group {
return true
}
}
return false
}
func containsArg(args []string, want string) bool {
for _, a := range args {
if a == want {
return true
}
}
return false
}
func usage() string {
var b strings.Builder
fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
b.WriteString("Usage:\n homelab <command> [args]\n\nCommands:\n")
for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
if line != "" {
b.WriteString(" " + line + "\n")
}
}
b.WriteString("\n manifest [--json] list all commands (machine-readable with --json)\n")
b.WriteString(" version print version\n")
b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
return b.String()
}

138
cli/k8s.go Normal file
View file

@ -0,0 +1,138 @@
package main
import (
"fmt"
"os/exec"
"strings"
)
// kubectl helpers use the ambient kubeconfig (no per-call auth flags).
func kubectlBase(ns string, args ...string) []string {
var full []string
if ns != "" {
full = append(full, "-n", ns)
}
return append(full, args...)
}
func kubectlStream(ns string, args ...string) error {
return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
}
// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
func kubectlCapture(ns string, args ...string) (string, error) {
out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
return strings.TrimSpace(string(out)), err
}
// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
type k8sTarget struct {
app string
ns string
pod string
container string
selector string
tty bool
rest []string // passthrough flags and, after `--`, the exec command
}
// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
// The first bare token is the app; unknown flags pass through in rest.
func parseK8sTarget(args []string) k8sTarget {
t := k8sTarget{}
i := 0
take := func() string {
if i+1 < len(args) {
i++
return args[i]
}
return ""
}
for i = 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
t.rest = append(t.rest, args[i+1:]...)
return t
case a == "-n" || a == "--namespace":
t.ns = take()
case strings.HasPrefix(a, "--namespace="):
t.ns = strings.TrimPrefix(a, "--namespace=")
case a == "--pod":
t.pod = take()
case strings.HasPrefix(a, "--pod="):
t.pod = strings.TrimPrefix(a, "--pod=")
case a == "-c" || a == "--container":
t.container = take()
case strings.HasPrefix(a, "--container="):
t.container = strings.TrimPrefix(a, "--container=")
case a == "-l" || a == "--selector":
t.selector = take()
case strings.HasPrefix(a, "--selector="):
t.selector = strings.TrimPrefix(a, "--selector=")
case a == "--tty" || a == "-it" || a == "-ti":
t.tty = true
case !strings.HasPrefix(a, "-") && t.app == "":
t.app = a
default:
t.rest = append(t.rest, a)
}
}
return t
}
// namespace defaults to the app name (most namespaces hold exactly one app).
func (t k8sTarget) namespace() string {
if t.ns != "" {
return t.ns
}
return t.app
}
// objectRef is the kubectl object for logs/exec: an explicit pod, else
// deploy/<app> (kubectl resolves a pod from the Deployment).
func (t k8sTarget) objectRef() string {
if t.pod != "" {
return "pod/" + t.pod
}
return "deploy/" + t.app
}
// --- database access (the dbaas exec pattern) ---
type dbPlan struct {
ns string
pod string // explicit pod (e.g. mysql-standalone-0)
selector string // resolve the pod by this label when pod == "" (CNPG primary)
container string // "" = default container
argv []string // command + args to run inside the pod
}
// planDBExec builds the in-pod command to run sql against app's database.
// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
// Service, not an exec target), psql -U postgres -d <db>.
// MySQL: mysql-standalone-0, password from env (never on the command line).
// dbName defaults to app. sql empty => interactive client.
func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
if dbName == "" {
dbName = app
}
if mysql {
inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
if sql != "" {
inner += " -e " + shellQuote(sql)
}
return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
}
argv := []string{"psql", "-U", "postgres", "-d", dbName}
if sql != "" {
argv = append(argv, "-tAc", sql)
}
return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
}
// shellQuote single-quotes s for safe embedding in a bash -c string.
func shellQuote(s string) string {
return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

65
cli/k8s_test.go Normal file
View file

@ -0,0 +1,65 @@
package main
import (
"reflect"
"strings"
"testing"
)
func TestParseK8sTarget(t *testing.T) {
got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
}
}
func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
t.Errorf("namespace() = %q, want immich", ns)
}
if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
t.Errorf("namespace() = %q, want dbaas", ns)
}
}
func TestK8sTargetObjectRef(t *testing.T) {
if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
t.Errorf("objectRef() = %q, want deploy/tripit", r)
}
if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
}
}
func TestPlanDBExecPostgresDefault(t *testing.T) {
p := planDBExec("fire-planner", "", "SELECT 1", false)
// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
// label rather than naming an (un-exec-able) Service.
if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
t.Fatalf("unexpected pg target: %+v", p)
}
// db name defaults to the app; SQL passed via -tAc
joined := strings.Join(p.argv, " ")
if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
t.Fatalf("pg argv missing db/sql: %v", p.argv)
}
}
func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
if p.pod != "mysql-standalone-0" {
t.Fatalf("unexpected mysql pod: %+v", p)
}
inner := strings.Join(p.argv, " ")
// password must come from the env var, never inline
if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
t.Fatalf("mysql must use env password wrapper: %v", p.argv)
}
}
func TestShellQuoteEscapes(t *testing.T) {
if got := shellQuote("a'b"); got != `'a'\''b'` {
t.Fatalf("shellQuote = %q", got)
}
}

View file

@ -26,8 +26,16 @@ var (
)
func main() {
err := run()
if err != nil {
// homelab verb surface (work/tf/claim/...) is tried first; if the args are
// not a homelab verb, fall through to the legacy webhook -use-case path.
if handled, err := dispatchTop(os.Args[1:]); handled {
if err != nil {
fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
os.Exit(1)
}
return
}
if err := run(); err != nil {
glog.Errorf("run failed: %s", err.Error())
os.Exit(255)
}

103
cli/memory.go Normal file
View file

@ -0,0 +1,103 @@
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"strings"
"time"
)
// defaultMemoryURL is used when no env override is present (agents normally have
// CLAUDE_MEMORY_API_URL set by the memory hooks).
const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
type memoryClient struct {
base string
key string
http *http.Client
}
func firstEnv(keys ...string) string {
for _, k := range keys {
if v := os.Getenv(k); v != "" {
return v
}
}
return ""
}
func resolveMemoryBase() string {
if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
return strings.TrimRight(b, "/")
}
return defaultMemoryURL
}
// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
// the MCP wraps), so it works even when the MCP frontend is down.
func newMemoryClient() (*memoryClient, error) {
key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
if key == "" {
return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
}
return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
}
func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
var r io.Reader
if body != nil {
b, err := json.Marshal(body)
if err != nil {
return nil, err
}
r = bytes.NewReader(b)
}
req, err := http.NewRequest(method, c.base+path, r)
if err != nil {
return nil, err
}
req.Header.Set("Authorization", "Bearer "+c.key)
if body != nil {
req.Header.Set("Content-Type", "application/json")
}
resp, err := c.http.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
out, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
}
return out, nil
}
// Request bodies mirror src/claude_memory/api/models.py.
type memRecallReq struct {
Context string `json:"context"`
ExpandedQuery string `json:"expanded_query,omitempty"`
Category string `json:"category,omitempty"`
SortBy string `json:"sort_by,omitempty"`
Limit int `json:"limit,omitempty"`
}
type memStoreReq struct {
Content string `json:"content"`
Category string `json:"category,omitempty"`
Tags string `json:"tags,omitempty"`
ExpandedKeywords string `json:"expanded_keywords,omitempty"`
Importance float64 `json:"importance"`
ForceSensitive bool `json:"force_sensitive,omitempty"`
}
type memUpdateReq struct {
Content *string `json:"content,omitempty"`
Tags *string `json:"tags,omitempty"`
Importance *float64 `json:"importance,omitempty"`
ExpandedKeywords *string `json:"expanded_keywords,omitempty"`
}

74
cli/memory_test.go Normal file
View file

@ -0,0 +1,74 @@
package main
import (
"encoding/json"
"os"
"strings"
"testing"
"unicode/utf8"
)
func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
// Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits
// invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must
// cut on a rune boundary and always stay valid UTF-8.
long := strings.Repeat("я", 300) // 300 runes / 600 bytes
got := truncatePreview(long, 240)
if !utf8.ValidString(got) {
t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
}
if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
}
// Short multibyte strings pass through untouched (no ellipsis).
if got := truncatePreview("кратко", 240); got != "кратко" {
t.Fatalf("short string altered: %q", got)
}
// ASCII boundary still works.
if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
t.Fatalf("ascii truncation wrong: %q", got)
}
}
func TestResolveMemoryBase(t *testing.T) {
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
os.Unsetenv("CLAUDE_MEMORY_API_URL")
os.Unsetenv("MEMORY_API_URL")
if got := resolveMemoryBase(); got != defaultMemoryURL {
t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
}
os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
if got := resolveMemoryBase(); got != "https://m.example" {
t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
}
}
func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5})
s := string(b)
if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) {
t.Fatalf("memStoreReq JSON missing fields: %s", s)
}
}
func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
tags := "a,b"
b, _ := json.Marshal(memUpdateReq{Tags: &tags})
s := string(b)
if strings.Contains(s, "content") || strings.Contains(s, "importance") {
t.Fatalf("unset update fields must be omitted: %s", s)
}
if !strings.Contains(s, `"tags":"a,b"`) {
t.Fatalf("set field missing: %s", s)
}
}
func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
b, _ := json.Marshal(memRecallReq{Context: "hi"})
s := string(b)
if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
t.Fatalf("empty optionals must be omitted: %s", s)
}
}

58
cli/presence.go Normal file
View file

@ -0,0 +1,58 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
// presenceScript locates the presence CLI — homelab WRAPS it, it does not
// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
func presenceScript() string {
if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
return p
}
home, err := os.UserHomeDir()
if err != nil {
return "presence"
}
return filepath.Join(home, "code", "scripts", "presence")
}
// validateLabel checks a presence label is <kind>:<name> with a known kind.
func validateLabel(label string) error {
parts := strings.SplitN(label, ":", 2)
if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
}
for _, k := range validPresenceKinds {
if parts[0] == k {
return nil
}
}
return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
}
// presenceClaim claims label on the board with a purpose note.
func presenceClaim(label, purpose string) error {
if err := validateLabel(label); err != nil {
return err
}
args := []string{"claim", label}
if purpose != "" {
args = append(args, "--purpose", purpose)
}
return runStreaming(presenceScript(), args...)
}
// presenceRelease releases a prior claim on label.
func presenceRelease(label string) error {
if err := validateLabel(label); err != nil {
return err
}
return runStreaming(presenceScript(), "release", label)
}

24
cli/presence_test.go Normal file
View file

@ -0,0 +1,24 @@
package main
import "testing"
func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
good := []string{
"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
}
for _, l := range good {
if err := validateLabel(l); err != nil {
t.Errorf("validateLabel(%q) = %v, want nil", l, err)
}
}
}
func TestValidateLabelRejectsBadLabels(t *testing.T) {
bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
for _, l := range bad {
if err := validateLabel(l); err == nil {
t.Errorf("validateLabel(%q) = nil, want error", l)
}
}
}

76
cli/probe.go Normal file
View file

@ -0,0 +1,76 @@
package main
import (
"context"
"crypto/tls"
"fmt"
"io"
"net"
"net/http"
"net/url"
"os/exec"
"strings"
"time"
)
// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
const internalLBIP = "10.0.20.203"
// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve
// host:443:ip`. TLS verification is skipped (these are reachability/observability
// probes, not security checks; internal .lan vhosts may serve a non-matching cert).
func clientDialingIP(ip string, timeout time.Duration) *http.Client {
d := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if i := strings.LastIndex(addr, ":"); i >= 0 {
addr = ip + addr[i:]
}
return d.DialContext(ctx, network, addr)
},
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}
return &http.Client{Timeout: timeout, Transport: tr}
}
// probeURL issues a GET and returns status code + elapsed time.
func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
start := time.Now()
resp, err := c.Get(rawurl)
dur := time.Since(start)
if err != nil {
return 0, dur, err
}
resp.Body.Close()
return resp.StatusCode, dur, nil
}
// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
func lbGetBody(host, path string, q url.Values) ([]byte, error) {
u := "https://" + host + path
if len(q) > 0 {
u += "?" + q.Encode()
}
resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
if err != nil {
return nil, err
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return body, nil
}
// dig runs `dig +short` against a resolver, optionally for a record type.
func dig(name, server, rrtype string) (string, error) {
args := []string{"+short", "+time=3", "+tries=1"}
if rrtype != "" {
args = append(args, rrtype)
}
args = append(args, name, "@"+server)
out, err := exec.Command("dig", args...).Output()
return strings.TrimSpace(string(out)), err
}

49
cli/probe_test.go Normal file
View file

@ -0,0 +1,49 @@
package main
import "testing"
func TestQueryArg(t *testing.T) {
if got := queryArg([]string{"up"}, nil); got != "up" {
t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
}
if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
t.Errorf(`--json should be dropped, got %q`, got)
}
// single quoted PromQL arrives as one token
if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
t.Errorf(`quoted query mangled: %q`, got)
}
// value-flags and their values are skipped, query survives
vf := map[string]bool{"--since": true, "--limit": true}
if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
t.Errorf(`value-flag skipping failed: %q`, got)
}
}
func TestLabelStr(t *testing.T) {
got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
t.Errorf("labelStr = %q", got)
}
if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
t.Errorf("labelStr (no __name__) = %q", got)
}
}
func TestOneLineList(t *testing.T) {
if got := oneLineList(" "); got != "(none)" {
t.Errorf("empty = %q, want (none)", got)
}
if got := oneLineList("a\nb"); got != "a, b" {
t.Errorf("multi = %q, want 'a, b'", got)
}
}
func TestHostOnly(t *testing.T) {
if got := hostOnly("foo.me/path"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
if got := hostOnly("foo.me"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
}

101
cli/repo.go Normal file
View file

@ -0,0 +1,101 @@
package main
import (
"os"
"os/exec"
"os/user"
"path/filepath"
"strings"
)
// preferRemote picks the canonical remote: forgejo if present, else origin,
// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
func preferRemote(remotes []string) string {
has := map[string]bool{}
for _, r := range remotes {
has[r] = true
}
switch {
case has["forgejo"]:
return "forgejo"
case has["origin"]:
return "origin"
case len(remotes) > 0:
return remotes[0]
default:
return ""
}
}
// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
func hasGitCryptAttr(gitattributes string) bool {
return strings.Contains(gitattributes, "filter=git-crypt")
}
// gitCryptFlags are the per-command flags that disable smudge/clean so git
// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
func gitCryptFlags() []string {
return []string{
"-c", "filter.git-crypt.smudge=cat",
"-c", "filter.git-crypt.clean=cat",
"-c", "filter.git-crypt.required=false",
}
}
// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
func gitOutput(dir string, args ...string) (string, error) {
cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
out, err := cmd.Output()
return strings.TrimSpace(string(out)), err
}
func gitRepoRoot(dir string) (string, error) {
return gitOutput(dir, "rev-parse", "--show-toplevel")
}
// gitRemotes lists configured remote names for the repo at dir.
func gitRemotes(dir string) ([]string, error) {
out, err := gitOutput(dir, "remote")
if err != nil {
return nil, err
}
if out == "" {
return nil, nil
}
return strings.Split(out, "\n"), nil
}
// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
func isGitCryptRepo(repoRoot string) bool {
b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
if err != nil {
return false
}
return hasGitCryptAttr(string(b))
}
// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
// else nil. These are injected per-command and never persisted.
func cryptFlagsFor(repoRoot string) []string {
if isGitCryptRepo(repoRoot) {
return gitCryptFlags()
}
return nil
}
// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
return runStreamingIn("", "git", full...)
}
// currentUser returns the OS username for branch naming (<user>/<topic>).
func currentUser() string {
if u := os.Getenv("USER"); u != "" {
return u
}
if u, err := user.Current(); err == nil && u.Username != "" {
return u.Username
}
return "user"
}

37
cli/repo_test.go Normal file
View file

@ -0,0 +1,37 @@
package main
import "testing"
func TestPreferRemote(t *testing.T) {
cases := []struct {
in []string
want string
}{
{[]string{"origin", "forgejo"}, "forgejo"},
{[]string{"forgejo"}, "forgejo"},
{[]string{"origin"}, "origin"},
{[]string{"upstream"}, "upstream"},
{nil, ""},
}
for _, c := range cases {
if got := preferRemote(c.in); got != c.want {
t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
}
}
}
func TestHasGitCryptAttr(t *testing.T) {
if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
t.Error("expected git-crypt detected")
}
if hasGitCryptAttr("*.md text\n*.png binary") {
t.Error("expected no git-crypt")
}
}
func TestGitCryptFlagsShape(t *testing.T) {
f := gitCryptFlags()
if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
t.Fatalf("unexpected git-crypt flags: %v", f)
}
}

23
cli/run.go Normal file
View file

@ -0,0 +1,23 @@
package main
import (
"os"
"os/exec"
)
// runStreaming executes name with args, wiring std streams to this process so
// the caller sees live output, and returns the command's error (non-nil on
// non-zero exit — preserved so homelab's own exit code reflects the child's).
func runStreaming(name string, args ...string) error {
return runStreamingIn("", name, args...)
}
// runStreamingIn is runStreaming with a working directory (empty = inherit).
func runStreamingIn(dir, name string, args ...string) error {
cmd := exec.Command(name, args...)
cmd.Dir = dir
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

54
cli/stack.go Normal file
View file

@ -0,0 +1,54 @@
package main
import (
"fmt"
"os"
"path/filepath"
"sort"
"strings"
)
// findInfraRoot walks up from start to the infra repo root — the directory
// holding both terragrunt.hcl and a stacks/ directory.
func findInfraRoot(start string) (string, error) {
dir := start
for {
if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
return dir, nil
}
parent := filepath.Dir(dir)
if parent == dir {
return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
}
dir = parent
}
}
// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
func resolveStack(infraRoot, name string) (string, error) {
dir := filepath.Join(infraRoot, "stacks", name)
if isDir(dir) {
return dir, nil
}
avail := listStacks(infraRoot)
return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
}
// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
func listStacks(infraRoot string) []string {
entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
if err != nil {
return nil
}
var out []string
for _, e := range entries {
if e.IsDir() {
out = append(out, e.Name())
}
}
sort.Strings(out)
return out
}
func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
func isDir(p string) bool { fi, err := os.Stat(p); return err == nil && fi.IsDir() }

52
cli/stack_test.go Normal file
View file

@ -0,0 +1,52 @@
package main
import (
"os"
"path/filepath"
"testing"
)
func newInfraTree(t *testing.T, stacks ...string) string {
t.Helper()
root := t.TempDir()
if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
t.Fatal(err)
}
for _, s := range stacks {
if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
t.Fatal(err)
}
}
return root
}
func TestFindInfraRootWalksUp(t *testing.T) {
root := newInfraTree(t, "vault")
got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
if err != nil {
t.Fatalf("findInfraRoot error: %v", err)
}
if got != root {
t.Fatalf("findInfraRoot = %q, want %q", got, root)
}
}
func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
if _, err := findInfraRoot(t.TempDir()); err == nil {
t.Fatal("expected error outside an infra checkout")
}
}
func TestResolveStack(t *testing.T) {
root := newInfraTree(t, "vault", "monitoring")
dir, err := resolveStack(root, "vault")
if err != nil {
t.Fatalf("resolveStack error: %v", err)
}
if want := filepath.Join(root, "stacks", "vault"); dir != want {
t.Fatalf("resolveStack = %q, want %q", dir, want)
}
if _, err := resolveStack(root, "nonesuch"); err == nil {
t.Fatal("expected error for unknown stack")
}
}

62
cli/telemetry.go Normal file
View file

@ -0,0 +1,62 @@
package main
import (
"bytes"
"encoding/json"
"net/http"
"os"
"strconv"
"strings"
"time"
)
// usageJob is the Loki stream job label for homelab usage telemetry.
const usageJob = "homelab-usage"
// emitUsage best-effort records one verb invocation to Loki for cross-user
// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
// never affect the command: all errors are swallowed and a tight timeout bounds
// the cost. Opt out with HOMELAB_TELEMETRY=0.
func emitUsage(verb string, runErr error) {
switch os.Getenv("HOMELAB_TELEMETRY") {
case "0", "off", "false", "no":
return
}
if verb == "" || strings.HasPrefix(verb, "usage") {
return // don't self-record the analytics reader
}
exit := 0
if runErr != nil {
exit = 1
}
body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
Values: [][2]string{{
strconv.FormatInt(time.Now().UnixNano(), 10),
"exit=" + strconv.Itoa(exit) + " ver=" + version,
}},
}}})
if err != nil {
return
}
req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
if err != nil {
return
}
req.Header.Set("Content-Type", "application/json")
resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
if err != nil {
return
}
resp.Body.Close()
}
type lokiPush struct {
Streams []lokiStream `json:"streams"`
}
type lokiStream struct {
Stream map[string]string `json:"stream"`
Values [][2]string `json:"values"`
}

View file

@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
if err != nil {
return errors.Wrapf(err, "Error reading response")
}
glog.Infof("Response:", string(responseBody))
glog.Infof("Response: %s", string(responseBody))
return nil
}

18
cli/usage_test.go Normal file
View file

@ -0,0 +1,18 @@
package main
import (
"strings"
"testing"
)
func TestUsageQuery(t *testing.T) {
got := usageQuery("30d", "")
want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
if got != want {
t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
}
withUser := usageQuery("7d", "emo")
if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
t.Errorf("usageQuery with user missing filter/range: %q", withUser)
}
}

191
cli/woodpecker.go Normal file
View file

@ -0,0 +1,191 @@
package main
import (
"context"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"strings"
"time"
)
// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
const (
wpHost = "ci.viktorbarzin.me"
wpLBIP = "10.0.20.203"
)
type wpClient struct {
base string
token string
http *http.Client
}
// wpToken reads WOODPECKER_TOKEN, else the canonical Vault path.
func wpToken() string {
if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
return t
}
out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
if err != nil {
return ""
}
return strings.TrimSpace(string(out))
}
func newWPClient() (*wpClient, error) {
tok := wpToken()
if tok == "" {
return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
}
ip := firstEnv("HOMELAB_WP_IP")
if ip == "" {
ip = wpLBIP
}
dialer := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if strings.HasPrefix(addr, wpHost+":") {
addr = ip + addr[strings.LastIndex(addr, ":"):]
}
return dialer.DialContext(ctx, network, addr)
},
}
return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
}
// getJSON GETs path into v, retrying the transient empty/5xx responses the
// Woodpecker API intermittently returns under load.
func (c *wpClient) getJSON(path string, v interface{}) error {
var lastErr error
for attempt := 0; attempt < 5; attempt++ {
if attempt > 0 {
time.Sleep(2 * time.Second)
}
req, _ := http.NewRequest("GET", c.base+path, nil)
req.Header.Set("Authorization", "Bearer "+c.token)
resp, err := c.http.Do(req)
if err != nil {
lastErr = err
continue
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
continue
}
if resp.StatusCode >= 300 {
return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return json.Unmarshal(body, v)
}
return lastErr
}
type wpPipeline struct {
Number int `json:"number"`
Status string `json:"status"`
Event string `json:"event"`
Commit string `json:"commit"`
Message string `json:"message"`
}
func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
var ps []wpPipeline
err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
return ps, err
}
// findPipeline returns the pipeline for commit (prefix match), or the latest when
// commit is empty.
func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
ps, err := c.recentPipelines(repoID, 25)
if err != nil {
return wpPipeline{}, err
}
if len(ps) == 0 {
return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
}
if commit == "" {
return ps[0], nil
}
for _, p := range ps {
if strings.HasPrefix(p.Commit, commit) {
return p, nil
}
}
return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
}
func (c *wpClient) repoID() (int, error) {
owner, repo, err := repoOwnerName()
if err != nil {
return 0, err
}
var r struct {
ID int `json:"id"`
}
if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
return 0, err
}
if r.ID == 0 {
return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
}
return r.ID, nil
}
// repoOwnerName derives <owner>/<repo> from the cwd git remote.
func repoOwnerName() (string, string, error) {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return "", "", fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(root))
url, err := gitOutput(root, "remote", "get-url", remote)
if err != nil {
return "", "", err
}
return parseOwnerRepo(url)
}
// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
func parseOwnerRepo(url string) (string, string, error) {
u := strings.TrimSuffix(strings.TrimSpace(url), ".git")
u = strings.TrimSuffix(u, "/")
if i := strings.Index(u, "://"); i >= 0 {
u = u[i+3:]
}
u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
parts := strings.Split(u, "/")
if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
}
return parts[len(parts)-2], parts[len(parts)-1], nil
}
func isTerminalStatus(s string) bool {
switch s {
case "success", "failure", "error", "killed", "declined", "blocked":
return true
}
return false
}
func isFailureStatus(s string) bool {
return s == "failure" || s == "error" || s == "killed" || s == "declined"
}
func min(a, b int) int {
if a < b {
return a
}
return b
}

40
cli/woodpecker_test.go Normal file
View file

@ -0,0 +1,40 @@
package main
import "testing"
func TestParseOwnerRepo(t *testing.T) {
cases := []struct{ in, owner, repo string }{
{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
}
for _, c := range cases {
o, r, err := parseOwnerRepo(c.in)
if err != nil || o != c.owner || r != c.repo {
t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
}
}
if _, _, err := parseOwnerRepo("nonsense"); err == nil {
t.Error("expected error for unparseable remote")
}
}
func TestStatusClassification(t *testing.T) {
for _, s := range []string{"success", "failure", "error", "killed"} {
if !isTerminalStatus(s) {
t.Errorf("%q should be terminal", s)
}
}
for _, s := range []string{"running", "pending"} {
if isTerminalStatus(s) {
t.Errorf("%q should not be terminal", s)
}
}
if !isFailureStatus("failure") || !isFailureStatus("error") {
t.Error("failure/error should classify as failure")
}
if isFailureStatus("success") {
t.Error("success must not classify as failure")
}
}

Binary file not shown.

View file

@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."

View file

@ -0,0 +1,30 @@
# homelab: a unified infra-ops CLI grown in place from infra/cli
Agents re-derive the same operational command boilerplate every session — mining
51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
the deterministic, repeated **actions** (not judgment) agents run — composable in
bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION`
file (the infra repo deploys continuously and does not cut semver tags).
## Considered options
- **Its own top-level repo** (the original plan) — rejected in favour of keeping
it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
GitOps continuous-deploy.
- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
webhook use-cases.
- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
recurring action surface (methodology skills; third-party/owned MCP such as
phpIPAM, which homelab does NOT duplicate).
## Consequences
- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
and falls through to the legacy `-use-case` path verbatim.
- Distribution: built from source to `/usr/local/bin/homelab` during devvm
provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.

View file

@ -0,0 +1,23 @@
# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
commands and where agents lose the most time and leak the most presence claims.
v0.1 enforces **no** homelab-level permission gating: everything is allowed,
relying on existing gates (harness permission mode, presence claims, plan
approval). But every verb records a `read|write` tier (visible in `manifest`), so
a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
later with zero restructuring.
## Considered options
- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
value, but defers the toil that motivated the project.
- **One domain deep (k8s)** — cleanest template, narrow day-one value.
We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
the extra complexity (worktree lifecycle, git-crypt flag injection, presence
coupling, branch-protection PR fallback) for the biggest immediate toil
reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.

View file

@ -0,0 +1,29 @@
# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
Four behaviours of the infra-loop verbs are surprising enough to record:
1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
native harness worktree tool.** A CLI is a child process and cannot change the
agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
prints the path — the agent enters it with native `EnterWorktree({path})`.
2. **`work land` is auto-land, but gated on verification.** It merges master in →
runs verification → pushes `HEAD:master` (fetch+merge+retry on
non-fast-forward) → falls back to pushing the feature branch for a PR when the
direct push is rejected (branch protection). It **refuses to push when it
cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
`--no-verify` is passed — added after an accidental smoke-test land pushed
unverified WIP to master (benign: the infra CI applied 0 stacks because the
diff was `cli/`-only, but an unverified land must be deliberate, not default).
3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
Local applies are out-of-band (CI applies canonically on push) but happen
constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
delegates to `scripts/tg apply --non-interactive`, and **always releases on
exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
documented ~200-claim leak — and prints an out-of-band reminder.
4. **Known v0.1 limitation:** `work land` does not yet block on CI to green; that
arrives with the ci/deploy watch verb-group. It prints a reminder to follow
the pipeline manually.

View file

@ -0,0 +1,30 @@
# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
than every other domain combined).
It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
one app, so `<app>` defaults to the namespace, and the target defaults to
`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.
## Decisions worth recording
- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
`scale`/`create`). They stay raw `kubectl`, by design, per the repo's
Terraform-only policy — the corpus confirms they're low-frequency, and a
friendly verb would normalise a policy violation.
- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
config mutation and forbidden; the verb cannot target them.
- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
`psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
`bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
the pod env and never appears on the command line.
- Read verbs were smoke-tested against the live cluster; write verbs are
unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.

View file

@ -0,0 +1,30 @@
# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path
v0.3 adds the memory verb-group so agents can search and navigate memory from the
CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
frontend over it**. `homelab memory` is a thin HTTP client over the same API,
using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
API directly, it **works even when the MCP frontend is down** — the recurring
MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
offline for the entire session this was built in).
Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
the live API including a store→recall→delete round-trip — full data-plane parity
with the MCP.
## Deprecation path (deliberate follow-up — NOT done in v0.3)
The MCP is more than tools: the **per-prompt auto-recall hook** and the
**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
a separate, sequenced change:
1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
to `homelab memory store`.
2. Update the CLAUDE.md memory policy to point at the CLI.
3. Uninstall the MCP.
Done CLI-first (verbs proven before touching the every-prompt path) so a
regression can't silently break auto-recall/auto-learn fleet-wide.

View file

@ -0,0 +1,29 @@
# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration
v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching
a build/deploy to completion), proven during the session that built it (hours
spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and
retrigger logic for a single CI incident).
## Decisions
- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
not its Postgres schema (which drifts across upgrades — column renames bit us
mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
equivalent of the house `curl --resolve` pattern). Token from
`WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
git remote via `/api/repos/lookup/<owner>/<repo>`.
- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
under load (it flapped through the whole build session); `getJSON` retries
empties with backoff so `ci watch` is reliable exactly when it's needed.
- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
on the landed commit and fails if the pipeline does — closing the gap ADR-0005
deferred. `--no-ci-watch` opts out.
- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
the deployment image to reference the expected sha, *then* blocks on rollout
status (kubectl-based; reuses the k8s helpers).
- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
endpoints were the least reliable this session (often empty); `status`/`watch`
rely on the list endpoint that works. A DB-backed `ci logs` is a possible
follow-up if the API path stays flaky.

View file

@ -0,0 +1,37 @@
# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
test the user posed mid-build: *does the verb save reasoning, or only typing?* A
wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
keystrokes but not thought. These four save thought — the reasoning they encode
is **which endpoint, reached how, with what auth/URL shape** — re-derived every
time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
get`, which are thin wrappers; see the session discussion.)
## Decisions
- **Internal ingresses, reached via the LB.** Everything routes through the
Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
Go form of the house `curl --resolve host:443:10.0.20.203` pattern
(`probe.go: clientDialingIP`). Verified live before building: Prometheus
(`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
answer JSON over the LB with **no auth gate and no port-forward** — so these
stay clean HTTP clients, not kubectl wrappers.
- **`net check` is two-legged on purpose.** It resolves the host via public DNS
(→ Cloudflare) AND dials the internal LB, reporting both — because the useful
question is *where* a break is (CF edge vs the app vs the LB path), which a
single curl can't answer. The external leg forces public resolution (the devvm
resolver is split-horizon and would otherwise hit the LB for both).
- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
`prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
queryable through the working endpoint — so no new dependency.
- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
raw `*.svc` services) that would force port-forward/`kubectl run`. The
reasoning-savings there don't beat the added moving parts; kept out of scope.
- **No `node`/`secret` group.** Same test: their high-volume parts are
command-wrappers (low savings); only compound node ops (serial console, VM
wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
unless a concrete pain surfaces — the high-value deterministic surface
(tf/work/ci/k8s/memory + these probes) is now covered.

View file

@ -0,0 +1,42 @@
# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
exists to answer the question that drove the whole CLI — *which verbs are worth
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
> owner in-session") no longer holds: the managed-settings policy now **defers
> to OS/sudo authorization**. The `usage top` telemetry design itself is
> unchanged and still current — only the "never read homes" framing in the
> third decision below is overtaken.
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
`dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
the analytics reader doesn't pollute its own data.
- **Payload is deliberately minimal: verb path + exit code only.** Labels
`{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
**No args, paths, flags, hostnames, or secrets** ever leave the process — the
emit sees only the matched verb name, not the arguments. This is what makes
cross-user aggregation safe.
- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
CLI writes its own invocations (attributed to its OS user) to the shared Loki
push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
back with a LogQL metric query. This is the privacy-preserving resolution to
"what does everyone (e.g. another user) use" — it never touches anyone's
`~/.claude`, which the org per-user policy bars (see the per-user red-line in
managed-settings; reading another user's home is off-limits even for an owner
in-session — a fresh session under changed MDM policy is the only legitimate
path, and even then this telemetry is the better answer).
- **Best-effort, never affects the command.** All errors swallowed; an 800ms
client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
must never slow or break the tool it measures.
- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
path (same host, same LB dial). Presence MySQL was the alternative (queryable
SQL) but would add a write dependency and creds; Loki needs neither.

View file

@ -0,0 +1,54 @@
# homelab Home Assistant verbs: token resolution + host SSH, not entity control
v0.7 adds `ha token` and `ha ssh`. They were chosen by mining a heavy HA
operator's sessions: across ~1,900 shell commands the single most-repeated line
(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline,
and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as
a shell function ~30× — both re-derived from scratch every session. The existing
`home-assistant-sofia.py` already covers the *API*, but it goes unused from an
arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a
cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that
gap for every user in every directory.
## Decisions
- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already
does entity state and control (`get_state`, `call_service`, history, logs).
Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004)
— we do **not** reimplement `on`/`off`/`list`/`state`. We add only token
*resolution* and host *SSH*, neither of which an API-only MCP can provide. The
value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010).
- **`ha token` resolves live from the cluster, not from an env var.** It reads
the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` /
`london`) via the ambient kubeconfig. This is robust to env drift — the precise
failure that made agents re-derive the pipeline. Read-tier, prints the bare
token to stdout so it composes in `$(…)`, mirroring `memory secret`.
- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`).
It was originally read from `openclaw-secrets``skill_secrets` (a JSON blob
also holding `slack_webhook` + `uptime_kuma_password`), which only cluster
admins can read — so the verb hung/failed for the non-admin operator it was
built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose
OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only
the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to
the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence
the separate object). openclaw's own deployment keeps reading `openclaw-secrets`
— this is purely additive.
- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended
use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` +
`UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no
TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key
is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to
whoever first wrote the workflow; that user's key must be enrolled on the HA
host. Write-tier (runs an arbitrary remote command).
- **sofia is the default; london is structural.** The devvm sits on the Sofia
LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london
(`hassio@192.168.8.103`) is in the instance map so `ha token --instance london`
works (a pure secret read), but `ha ssh --instance london` generally won't
connect from here — london is remote. We model it correctly rather than
pretend it's reachable.
- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for
the endpoints the MCP/script don't cover — `/api/template`, `/reload`,
`check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is
already unblocked, and a generic passthrough overlaps the MCP. Re-measure via
`usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are
still hand-rolled often.

View file

@ -0,0 +1,75 @@
# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
capability that already existed but was undiscoverable: driving the cluster's
**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
`svc/chrome-service:9222`) from the devvm, for sites that detect and block
headless automation.
## Motivating incident (2026-06-22)
Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
portal: the headless `@playwright/mcp` browser loaded the site and filled the
entire multi-step form, but the **final submit silently failed** — Fixflo's
pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
spinner hung, no issue was created. Root cause = headless-Chrome detection. The
fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
submitted first try (Fixflo ref IS22657587). That capability was documented
(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
it took ~40 min, three redundant full form re-runs, and a user hint. The agent
also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
of inspecting the network panel.
## Decisions
- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
rejected: the CLI is run every session (so the verb is *discoverable*), is
versioned, multi-user, and test-covered. A private, untested skill is none of
those. The command owns only the deterministic *mechanics* (port-forward,
stealth injection, lifecycle) — the agent supplies the Playwright script, so
*judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
- **The failure was judgment, not setup friction**, so the CLI is paired with a
one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
payload in `browser --help`: the *when-to-use* signature (a site loads but a
gated action fails/hangs, or one request 500s/aborts while siblings 200 →
suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
= request resolved/intercepted by the automation layer, **not** egress;
egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
and would break the page load too). A command the agent doesn't think to run is
useless; the cheat-sheet is the actual fix for the misdiagnosis.
- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
label. Readiness is asserted against `/json/version`: the endpoint must report
a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
**always** torn down (process-group kill + signal handler), on success and on
error — an acceptance requirement.
- **Default to a fresh incognito context; `--shared-context` opts into the warmed
profile.** chrome-service is a single shared browser with a persistent profile.
A fresh, always-closed context is safe for concurrent callers (tripit's fare
scrape connects per-quote) and is what production already does. The warmed
persistent profile (cookies from a manual noVNC login) is opt-in for flows that
need a pre-logged-in session.
- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
Chromium 130). `connect_over_cdp` speaks the browser's CDP, and protocol
changes between Playwright minors — the devvm's ambient Python Playwright was
1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
regardless of local drift. `playwright-core` (not `playwright`) because no
browser binary is needed — we connect to the remote one.
- **Self-provision the client lazily, no per-user setup.** The pinned client is
installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
guarded) on first use, alongside the embedded runner + stealth files. node is
already fleet-wide; this avoids coupling the feature to a provisioner change
and keeps it self-contained and self-healing. The client runs on the devvm, so
`setInputFiles` streams local files to the remote browser over CDP — no
`chmod`/staging-dir workaround on the CDP path.
- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
`go:embed` can't reach outside the package dir, hence the vendored copy rather
than a path reference.
- **Scope held at two action verbs + help.** `run` (arbitrary script — the
workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
via `usage top` (ADR-0011) before adding more.

View file

@ -0,0 +1,35 @@
---
status: accepted
date: 2026-06-24
---
# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
## Considered options
- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
## Consequences
- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
## As-built (2026-06-25)
Implemented across infra issues #57#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.

View file

@ -0,0 +1,57 @@
# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
carried and that ADR-0011 leaned on ("never read another user's home /
`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
subject — `usage top` telemetry and its emit design — is unchanged and still
current; only the privacy prohibition it referenced is superseded here.
## Context
The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
"you are not the admin, do not escalate privileges" and "never read another
user's home directory, credentials, tokens, or `~/.claude`." The OS told a
different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
The kernel had already granted total read access; the policy was layering an
artificial refusal on top of an authorization the OS already permits, and the
"not the admin" framing was factually wrong for a NOPASSWD-root user.
Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
for analytics/debugging across the shared box.
## Decision
- **Authorization follows the OS, not this policy.** Agents may access whatever
their OS user can access — directly or via `sudo` where they hold sudo rights
— and must not impose restrictions stricter than the OS. On this box that
includes other users' home directories and `~/.claude` for users who hold
broad sudo.
- **No separate prompt or carve-out** for OS-authorized access. The Unix
permission model + sudoers is the single source of truth for who may read
what. Other homes are `0750`-owned, so a cross-home read necessarily transits
`sudo` and is therefore captured in the sudo/auth audit log.
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
file access, not a licence to exceed cluster RBAC.
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
managed-settings, so every user's agents defer to that user's own sudo grant.
Any user with broad sudo gets the same cross-home read capability over other
users' files. Accepted by the owner with that understanding; emo's and
ancamilea's `~/.claude` is now agent-readable by sudo-holders.
- **Takes effect in a fresh session.** managed-settings loads at session start;
the session that made the change keeps running under the old policy.
## Consequences
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
"cross-user analytics without reading homes" answer) remains useful but is no
longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
- Larger blast radius: if an agent session running as a sudo-holder is
prompt-injected or otherwise compromised, it can now read every user's secrets
with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
is the remaining accountability control.
- Reversible: restore the prior `claudeMd` bullets (backup kept at
`/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
session.

View file

@ -0,0 +1,107 @@
# GPU VRAM protection via a scheduler extended-resource budget + a runtime watchdog (HAMi/MPS rejected)
The single Tesla T4 (16 GB, ~15360 MiB usable) on `k8s-node1` is **time-sliced**
(`nvidia.com/gpu` advertised ×100, `migStrategy: none`) and shared by ~9 tenants
(immich-ml, immich-server, frigate, llama-swap, portal-stt, tts,
ebook2audiobook, ytdlp, android-emulator). Time-slicing grants a *scheduling
turn, not memory* — the scheduler is blind to VRAM, so the tenants can
collectively overallocate the card. On 2026-06-02 immich-ml's unbounded
onnxruntime OCR arena grew from ~2 GB to **10.7 GB**, starved llama-swap's
qwen3-8b, and silently broke recruiter-responder triage for ~5 h
(`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). The
post-mortem's #1 follow-up — alert/guard on GPU VRAM — was never built.
## Context
- **MIG is impossible.** The T4 is Turing; hardware memory partitioning (MIG)
only exists on Ampere+. So per-tenant *hardware* isolation is off the table.
- **The card is busy but not steadily oversubscribed.** Measured steady residents
(2026-06-17, `gpu_pod_memory_used_bytes`): immich-ml ~2.1 GiB, frigate ~1.9 GiB,
llama-swap ~4.35 GiB peak (one model at a time — it already swaps), immich-server
~1.2 GiB, portal-stt ~1.5 GiB, android-emulator ~0.15 GiB → ~11 GiB used, ~4 GiB
free. **The failure mode is a single tenant's runtime runaway, not a
scheduling-time pile-on.**
- **Prior art already exists (soft):** a `gpu-workload` PriorityClass (1,200,000)
is auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority`
policy (tts excluded → `tier-2-gpu`, evicted first); tts runs behind a
free-VRAM demand-gate (`stacks/tts`, scales 0↔1 on `sum(gpu_pod_memory_used_bytes)`
vs a floor); immich-ml is soft-bounded by `MACHINE_LEARNING_MODEL_TTL=600`. What
was missing is anything that bounds a tenant's VRAM *during active use*.
### Alternatives considered and rejected
- **NVIDIA MPS** (device-plugin `sharing.mps`, hard `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`):
caps are **uniform** — slice = `total ÷ replicas`, tenants get integer multiples.
Nine heterogeneous tenants spanning 0.15→6 GB do not fit uniform slices without
large rounding waste on a card that has none to spare. Rejected.
- **HAMi vGPU** (per-container `nvidia.com/gpumem` MiB caps, libvgpu CUDA hook):
the *correct* hard-cap primitive and T4-supported, but it **replaces the
operator's device plugin** (the operator owns/reconciles it), enforces via an
`LD_PRELOAD` CUDA hook that is **unproven for our NVENC transcode path**
(open codec bug), **cannot cap the android-emulator** (QEMU bypasses the CUDA
hook — KubeVirt/Kata explicitly unsupported), carries a **restart-triggered
false-OOM bug** (#1181) directly in our blast radius (kured reboots node1
regularly), and its reservation-based scheduling would **supersede the working
demand-gate** and **strand the ~4 GB of steady headroom**. Too much risk and
behavioral change for the single proven failure mode. Rejected for now; this
ADR is the record of *why*, so a future "let's just use HAMi" re-opens with the
trade-offs already on the table.
## Decision
Make the scheduler VRAM-aware and add runtime teeth — entirely with repo-native
pieces, **no device-plugin/driver change, time-slicing untouched**:
1. **Budget (schedule-time).** Advertise a custom node-level **extended resource
`viktorbarzin.me/gpumem`** on the GPU node (= ~14000 MiB; ~15.4 GB physical
minus ~1.4 GB driver/CUDA-context/exporter slack), via a reconcile Job +
CronJob that `kubectl patch node --subresource=status` (dynamic over
`nvidia.com/gpu.present=true` nodes; re-asserts after node re-register).
Every GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"` (immich-ml
3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500 — sum
≤ advertised). Extended resources are **non-overcommittable** (request==limit,
integer), so the scheduler refuses to co-schedule past the card → overflow
`Pending`. On-demand batch tenants (tts/ebook2audiobook/ytdlp) keep the
free-VRAM demand-gate and fill the real slack rather than holding a reserved seat.
2. **Watchdog (runtime).** A `gpu-vram-watchdog` CronJob (every minute, nvidia ns)
reads per-pod `gpu_pod_memory_used_bytes` (the host-PID exporter) and each GPU
pod's *declared* `gpumem`, and **only when actual free VRAM < floor (~1536 MiB)**
recycles the biggest **over-budget** offender (used > declared). Contract
enforcement, not priority (immich-ml and llama-swap share `gpu-workload`, so
priority can't distinguish them). Acting only under pressure lets a tenant burst
into genuine slack; the recycle clears its arena (exactly what the TTL=600
Recreate does for immich-ml when idle). This is what would have caught 2026-06-02.
3. **Alerting** (the never-built follow-up): GPU free-VRAM below floor, GPU pod
`Pending` on `gpumem`, and pod-over-budget → the `#alerts` digest.
This is **soft enforcement**: the scheduler reserves on paper and the watchdog
corrects at runtime with a detection lag (secondsminute), so a brief physical
overshoot is possible before a recycle. Accepted, given the failure mode is a
slow arena drift, not an instantaneous spike, and the alternative (HAMi) carries
disproportionate risk for this hardware.
## Consequences
- **The 2026-06-02 class is bounded** without touching the pinned driver, the GPU
operator, or time-slicing. immich-ml can no longer silently grow into
llama-swap's VRAM: it either schedules within its budget or, on a true runaway
under pressure, gets recycled (its heavy library job is the intended loser).
- **The card has a seating chart now.** Sum of declared budgets ≤ ~14 GB, so a new
always-on GPU tenant requires re-budgeting; an over-budget on-demand tenant sits
`Pending`. This is the intended, legible back-pressure.
- **Small/on-demand tenants (android-emulator, ytdlp, tts, ebook2audiobook) are
NOT budgeted in v1** — they fill *actual* slack rather than holding a scheduler
seat (tts via its existing free-VRAM demand-gate), and are covered by the
~1.4 GiB physical reserve plus budget headroom (the five residents' budgets sum
to 13300 ≤ 14000 advertised). Give them budgets later if they grow; until then
the watchdog protects the budgeted five and counts everyone's usage toward free.
- **New RBAC:** the reconcile SA patches `nodes/status`; the watchdog SA lists pods
cluster-wide and deletes pods in GPU tenant namespaces. Far less privileged than
existing cluster-admin tooling (woodpecker-agent).
- **Apply order matters:** advertise `gpumem` (nvidia stack) **before** the
consumer stacks declare it, or a pod requesting an unadvertised extended
resource is unschedulable. The reconcile runs as a Job (immediate) for this.
- **Fully reversible:** delete the CronJobs/Job + the `gpumem` stanzas, and
`kubectl patch node --subresource=status` to remove the capacity key. Nothing
structural; no driver/operator state to unwind.
- The `gpumem` numbers are first estimates; tune from `gpu_pod_memory_used_bytes`.

View file

@ -0,0 +1,126 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="820" viewBox="0 0 1600 820" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
<!-- ADR-0017: PHYSICAL cabling only — no VLANs, no flows. Solid = cable in
place today · dashed = camera-day work · ~~~ = radio. Palette: neutral
grays + blue for copper runs (reference dataviz palette text tokens). -->
<defs>
<marker id="dot" viewBox="0 0 8 8" refX="4" refY="4" markerWidth="5" markerHeight="5">
<circle cx="4" cy="4" r="3" fill="#52514e"/>
</marker>
</defs>
<rect width="1600" height="820" fill="#fcfcfb"/>
<text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — physical cabling (single-switch, rev 3)</text>
<text x="40" y="66" font-size="15" fill="#52514e">wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio</text>
<!-- ═════════ APARTMENT ═════════ -->
<rect x="40" y="100" width="330" height="330" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="56" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">APARTMENT</text>
<text x="70" y="158" font-size="13" fill="#52514e">☁ ISP (internet)</text>
<path d="M120,166 L120,196" fill="none" stroke="#52514e" stroke-width="2"/>
<rect x="64" y="198" width="220" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="80" y="222" font-size="14.5" font-weight="700" fill="#0b0b0b">AX6000 router</text>
<text x="80" y="242" font-size="12" fill="#52514e">192.168.1.1 · WAN←ISP · 8×LAN</text>
<rect x="64" y="290" width="220" height="52" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="80" y="312" font-size="14" font-weight="700" fill="#0b0b0b">Synology NAS · .13</text>
<text x="80" y="330" font-size="12" fill="#52514e">on an AX6000 LAN port</text>
<path d="M174,262 L174,290" fill="none" stroke="#2a78d6" stroke-width="2"/>
<text x="70" y="376" font-size="12.5" fill="#52514e">📶 wifi clients (phones, laptops)</text>
<path d="M110,262 C104,272 106,278 100,286 C106,294 104,300 100,308 C106,316 104,322 100,330 C106,338 104,344 100,352 C104,358 102,362 98,366" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<!-- in-wall run apartment -> garage -->
<path d="M284,230 C450,230 540,228 616,228" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<text x="330" y="218" font-size="12.5" font-weight="700" fill="#2a78d6">in-wall run → garage</text>
<!-- ═════════ GARAGE — RACK ═════════ -->
<rect x="560" y="100" width="640" height="680" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="576" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE — RACK</text>
<!-- switch -->
<rect x="600" y="150" width="560" height="150" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
<text x="616" y="176" font-size="14.5" font-weight="700" fill="#0b0b0b">TL-SG105PE · 5-port gigabit PoE switch</text>
<text x="616" y="194" font-size="12" fill="#52514e">mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare)</text>
<g font-size="11.5" text-anchor="middle">
<rect x="616" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="664" y="227" font-weight="700" fill="#0b0b0b">P1</text>
<text x="664" y="242" fill="#52514e">← apartment</text>
<rect x="722" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="770" y="227" font-weight="700" fill="#0b0b0b">P2</text>
<text x="770" y="242" fill="#52514e">← 4G router</text>
<rect x="828" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="876" y="227" font-weight="700" fill="#0b0b0b">P3</text>
<text x="876" y="242" fill="#52514e">← UPS mgmt</text>
<rect x="934" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="982" y="227" font-weight="700" fill="#0b0b0b">P4 ⚡PoE</text>
<text x="982" y="242" fill="#52514e">← camera</text>
<rect x="1040" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="1088" y="227" font-weight="700" fill="#0b0b0b">P5</text>
<text x="1088" y="242" fill="#52514e">← R730 eno1</text>
</g>
<text x="616" y="284" font-size="12" fill="#52514e">every cable below re-plugs old-switch → PE on camera day (≈3 min)</text>
<!-- 4G router -->
<rect x="600" y="360" width="250" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="616" y="384" font-size="14" font-weight="700" fill="#0b0b0b">4G router · 192.168.1.7</text>
<text x="616" y="403" font-size="12" fill="#52514e">~cellular uplink (out-of-band)</text>
<path d="M770,300 L770,360" fill="none" stroke="#2a78d6" stroke-width="2"/>
<path d="M856,392 C866,386 864,380 874,376 C866,370 868,364 876,360" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<text x="884" y="380" font-size="12" fill="#52514e">📡 cellular</text>
<!-- UPS -->
<rect x="600" y="452" width="250" height="56" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="616" y="476" font-size="14" font-weight="700" fill="#0b0b0b">UPS (Huawei)</text>
<text x="616" y="494" font-size="12" fill="#52514e">network mgmt card</text>
<path d="M876,300 C876,340 800,410 720,452" fill="none" stroke="#2a78d6" stroke-width="2"/>
<!-- R730 -->
<rect x="600" y="540" width="560" height="220" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
<text x="616" y="566" font-size="14.5" font-weight="700" fill="#0b0b0b">Dell R730 · PVE host · 192.168.1.127</text>
<g font-size="11.5">
<rect x="616" y="582" width="128" height="38" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="628" y="598" font-weight="700" fill="#0b0b0b">eno1 · LAN1</text>
<text x="628" y="613" fill="#52514e">← switch P5 · 1GbE</text>
<rect x="756" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="768" y="598" font-weight="700" fill="#52514e">eno2 · LAN2</text>
<text x="768" y="613" fill="#8a8984">dark · fallback leg</text>
<rect x="896" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
<text x="908" y="598" fill="#8a8984">eno3 / eno4</text>
<text x="908" y="613" fill="#8a8984">free, uncabled</text>
<rect x="1036" y="582" width="108" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
<text x="1048" y="598" fill="#8a8984">iDRAC · .4</text>
<text x="1048" y="613" fill="#8a8984">shared-LOM/eno1</text>
</g>
<text x="616" y="648" font-size="12" fill="#52514e">no other network cables — everything else on this host is VIRTUAL:</text>
<text x="616" y="668" font-size="12" fill="#52514e">pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM …</text>
<text x="616" y="696" font-size="12" fill="#8a8984">(power: host + switch fed from the UPS — power wiring not drawn)</text>
<path d="M1088,300 C1088,420 720,500 680,582" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<text x="1100" y="330" font-size="12.5" font-weight="700" fill="#2a78d6">LAN1 cable</text>
<!-- ═════════ GARAGE ENTRANCE ═════════ -->
<rect x="1280" y="100" width="280" height="200" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="1296" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
<rect x="1304" y="150" width="232" height="110" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="1320" y="176" font-size="14" font-weight="700" fill="#0b0b0b">vermont-garage camera</text>
<text x="1320" y="196" font-size="12" fill="#52514e">HiLook IPC-T241H-C · 10.0.30.70</text>
<text x="1320" y="214" font-size="12" fill="#52514e">powered over the data cable (PoE)</text>
<text x="1320" y="232" font-size="12" fill="#52514e">outdoor · armored conduit</text>
<path d="M982,210 C982,150 1140,140 1304,180" fill="none" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
<text x="1080" y="136" font-size="12.5" font-weight="700" fill="#52514e">single cat6 in conduit · data + PoE power (camera day)</text>
<!-- legend -->
<g transform="translate(40,780)" font-size="12.5">
<line x1="0" y1="-4" x2="44" y2="-4" stroke="#2a78d6" stroke-width="2.5"/>
<text x="52" y="0" fill="#0b0b0b">copper, in place</text>
<line x1="190" y1="-4" x2="234" y2="-4" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
<text x="242" y="0" fill="#0b0b0b">camera-day cable / dark port</text>
<path d="M450,-4 C456,-10 454,-14 460,-18" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<text x="470" y="0" fill="#0b0b0b">radio (wifi / cellular)</text>
<text x="650" y="0" fill="#52514e">total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3</text>
</g>
</svg>

After

Width:  |  Height:  |  Size: 9 KiB

View file

@ -0,0 +1,99 @@
# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable
Status: accepted (2026-07-02, rev 3 — single-switch)
![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg)
![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg)
The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook
IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is
physically exposed outside the apartment, so anything plugged into that cable
must land in a segment that can reach nothing. The original design doc
(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk
to pfSense" — but nothing in this network terminates dot1q on pfSense; the
site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean
untagged pfSense interface per segment.
**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old
garage TL-SG105E (Viktor prefers not running two switches; retired unit
becomes a cold spare, its 192.168.1.6 mgmt IP passes to the PE). Five ports,
all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged
VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1`
carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable.
pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site
idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged
vNIC; pfSense still terminates no dot1q itself). The earlier dedicated
`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving
net3 back to vmbr2 restores pure physical isolation in one `qm set`).
This narrows the earlier 802.1Q objection rather than contradicting it: the
rejection assumed *unmanaged* switches, where any LAN device could inject
tagged frames; with the managed PE as the only device on eno1, VLAN-30
membership is {camera port, trunk port} only, so tag-30 ingress from every
other port — and from the exposed camera cable — is dropped or contained.
Cameras are untrusted: default-deny on dCCTV with a single
NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8)
may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static
route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the
10.0.20.0/22 trusted source-IP allowlist.
## Traffic on the trunk — how one cable carries two networks
The LAN1 cable is shared, but the two networks on it diverge at `vmbr0`
(the vlan-aware bridge on the PVE host), and only ONE of them ever touches
pfSense:
- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it
between the trunk, the host's own IP (192.168.1.127) and pfSense `net0`
where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home
LAN's gateway is and remains the AX6000; home-LAN traffic never transits
pfSense. Consequently a pfSense (or R730 VM-level) outage does not affect
the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave
the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the
4G router survives the whole rack being down.
- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers
VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera
segment's gateway, firewall and sole exit. "Camera → AX6000 → internet"
is impossible by construction, not merely by firewall rule.
- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed
out of its WAN toward the AX6000. Load-wise the trunk gained only the
camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic.
![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg)
*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)*
## Considered options
- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan
read this way) — rejected: any LAN device could inject tagged frames into
vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is
undefined. Rev 3 adopts the tagged path ONLY because the managed PE now
polices VLAN-30 membership at the single entry point to eno1; no bridge
reconfiguration was needed (vmbr0 was already vlan-aware).
- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role**
(rev 1/2 as-built) — superseded by rev 3: it forced either a second switch
(6 connections vs 5 ports once the PE also replaced the old switch) or new
hardware. Strongest isolation of all options; kept dormant as the fallback.
- **AX6000 as the camera gateway** — rejected earlier in the design (consumer
router, no inter-VLAN firewall).
## Consequences
- The switch is now single-point and load-bearing for everything in the rack
(apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV) AND its VLAN
table + mgmt password are part of the isolation boundary — the Easy Smart
mgmt UI answers on every port, so the password is the gate between a
compromised camera and the switch config. All 5 ports are consumed: the
next camera forces an 8-port PoE upgrade (the wiring plan already fits it).
- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical
leg); eno3/eno4 remain free.
- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6
(Kea reservation by MAC).
- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a
port-VLAN split (conflated the two devices); rev 2 split into two switches
after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3
consolidated back to one switch — the PE replacing the SG105E — per
Viktor's preference, moving CCTV onto a managed tagged trunk.
- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra
NVDEC stream.

View file

@ -0,0 +1,178 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="880" viewBox="0 0 1600 880" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
<!-- ADR-0017 rev 3 dCCTV topology (single switch, VLAN-30 trunk on LAN1).
Colors: reference dataviz palette (light mode). blue #2a78d6 = home LAN ·
violet #4a3aa7 = dCCTV · aqua #1baf7a = dKubernetes ·
yellow #eda100 = dManagementsVms · green #008300 allow · red #e34948 deny -->
<defs>
<marker id="arrGreen" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#008300"/>
</marker>
<marker id="arrRed" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#e34948"/>
</marker>
<marker id="arrGray" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#52514e"/>
</marker>
</defs>
<rect width="1600" height="880" fill="#fcfcfb"/>
<text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable</text>
<text x="40" y="66" font-size="15" fill="#52514e">Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1</text>
<!-- camera -> everything else (denied) -->
<path d="M240,168 C520,104 900,104 1148,140" fill="none" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
<g transform="translate(560,111)">
<circle r="11" fill="#fcfcfb" stroke="#e34948" stroke-width="2.5"/>
<path d="M-5,-5 L5,5 M5,-5 L-5,5" stroke="#e34948" stroke-width="2.5"/>
</g>
<text x="588" y="100" font-size="13.5" font-weight="700" fill="#e34948">DENY · camera → LAN / other segments / internet (default deny on dCCTV)</text>
<!-- GARAGE ENTRANCE -->
<rect x="40" y="128" width="240" height="180" rx="10" fill="#4a3aa7" fill-opacity="0.06" stroke="#4a3aa7" stroke-opacity="0.35"/>
<text x="56" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
<rect x="64" y="170" width="192" height="112" rx="8" fill="#ffffff" stroke="#4a3aa7" stroke-width="2"/>
<text x="80" y="196" font-size="15" font-weight="700" fill="#0b0b0b">vermont-garage</text>
<text x="80" y="216" font-size="12.5" fill="#52514e">HiLook IPC-T241H-C · pure IR</text>
<text x="80" y="234" font-size="12.5" fill="#52514e">10.0.30.70 (Kea reservation)</text>
<text x="80" y="252" font-size="12.5" fill="#52514e">DNS: garage-cam.viktorbarzin.lan</text>
<text x="80" y="270" font-size="12.5" fill="#52514e">PoE from switch · cloud/P2P off</text>
<path d="M256,284 C330,330 412,368 417,430" fill="none" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5" marker-end="url(#arrGray)"/>
<text x="330" y="322" font-size="12" fill="#52514e">cat6 in conduit · PoE → P4</text>
<!-- RACK zone: single switch -->
<rect x="40" y="360" width="560" height="265" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="56" y="384" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">RACK — GARAGE · ONE SWITCH</text>
<rect x="64" y="396" width="512" height="176" rx="8" fill="#4a3aa7" fill-opacity="0.04" stroke="#4a3aa7" stroke-width="2"/>
<text x="80" y="420" font-size="15" font-weight="700" fill="#0b0b0b">TL-SG105PE <tspan font-size="12.5" font-weight="400" fill="#52514e">replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used</tspan></text>
<g font-size="11.5" text-anchor="middle">
<rect x="80" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="124" y="454" font-weight="700" fill="#0b0b0b">P1 · V1</text>
<text x="124" y="470" fill="#52514e">apartment</text>
<text x="124" y="484" fill="#52514e">uplink</text>
<rect x="178" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="222" y="454" font-weight="700" fill="#0b0b0b">P2 · V1</text>
<text x="222" y="470" fill="#52514e">4G router</text>
<text x="222" y="484" fill="#52514e">192.168.1.7</text>
<rect x="276" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="320" y="454" font-weight="700" fill="#0b0b0b">P3 · V1</text>
<text x="320" y="470" fill="#52514e">UPS mgmt</text>
<rect x="374" y="436" width="88" height="56" rx="6" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="418" y="454" font-weight="700" fill="#0b0b0b">P4 · V30</text>
<text x="418" y="470" fill="#52514e">camera</text>
<text x="418" y="484" fill="#52514e">PoE ON</text>
<rect x="472" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.10" stroke="#4a3aa7" stroke-width="2" stroke-dasharray="0"/>
<text x="516" y="454" font-weight="700" fill="#0b0b0b">P5 · trunk</text>
<text x="516" y="470" fill="#52514e">V1 untagged</text>
<text x="516" y="484" fill="#4a3aa7">+ V30 tagged</text>
</g>
<text x="80" y="516" font-size="12" fill="#52514e">802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged}</text>
<text x="80" y="534" font-size="12" fill="#52514e">tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path</text>
<text x="80" y="558" font-size="12" fill="#8a8984">old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports</text>
<!-- trunk: two parallel lines to eno1 -->
<path d="M560,458 C630,458 640,428 692,420" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<path d="M560,466 C632,466 644,436 692,428" fill="none" stroke="#4a3aa7" stroke-width="2.5"/>
<text x="588" y="404" font-size="12" font-weight="700" fill="#0b0b0b">LAN1 cable</text>
<!-- R730 / PVE zone -->
<rect x="680" y="330" width="880" height="440" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="696" y="356" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK)</text>
<g font-size="12">
<rect x="700" y="400" width="150" height="46" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="712" y="419" font-weight="700" fill="#0b0b0b">eno1 → vmbr0</text>
<text x="712" y="436" fill="#52514e">untag V1 + tag 30</text>
<rect x="700" y="471" width="150" height="46" rx="6" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="712" y="490" font-weight="700" fill="#52514e">eno2 → vmbr2</text>
<text x="712" y="507" fill="#8a8984">dormant fallback leg</text>
<rect x="700" y="542" width="150" height="46" rx="6" fill="#0b0b0b" fill-opacity="0.04" stroke="#8a8984"/>
<text x="712" y="561" font-weight="700" fill="#0b0b0b">vmbr1</text>
<text x="712" y="578" fill="#52514e">internal · tags 10/20</text>
</g>
<!-- pfSense VM -->
<rect x="890" y="388" width="300" height="230" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="906" y="414" font-size="15" font-weight="700" fill="#0b0b0b">pfSense (VM 101)</text>
<text x="906" y="432" font-size="12" fill="#52514e">gateway + firewall for every segment</text>
<g font-size="12">
<rect x="906" y="444" width="268" height="34" rx="5" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="916" y="465" fill="#0b0b0b">net0 · WAN <tspan fill="#52514e">192.168.1.2 · vmbr0 untagged</tspan></text>
<rect x="906" y="484" width="268" height="34" rx="5" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
<text x="916" y="505" fill="#0b0b0b">net1 · dManagementsVms <tspan fill="#52514e">10.0.10.1</tspan></text>
<rect x="906" y="524" width="268" height="34" rx="5" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
<text x="916" y="545" fill="#0b0b0b">net2 · dKubernetes <tspan fill="#52514e">10.0.20.1</tspan></text>
<rect x="906" y="564" width="268" height="34" rx="5" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="916" y="585" fill="#0b0b0b">net3 · dCCTV <tspan fill="#52514e">10.0.30.1/24 · vmbr0 tag 30</tspan></text>
</g>
<path d="M850,415 L890,458" fill="none" stroke="#2a78d6" stroke-width="1.6" opacity="0.6"/>
<path d="M850,430 L890,581" fill="none" stroke="#4a3aa7" stroke-width="2"/>
<path d="M850,565 L890,501" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
<path d="M850,565 L890,541" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
<!-- k8s VMs -->
<rect x="1240" y="388" width="290" height="230" rx="8" fill="#1baf7a" fill-opacity="0.07" stroke="#1baf7a"/>
<text x="1256" y="414" font-size="15" font-weight="700" fill="#0b0b0b">k8s VMs · 10.0.20.0/24</text>
<text x="1256" y="434" font-size="12.5" fill="#52514e">vmbr1 tag 20 · pod egress SNATs</text>
<text x="1256" y="450" font-size="12.5" fill="#52514e">to node IPs</text>
<rect x="1256" y="464" width="258" height="66" rx="6" fill="#ffffff" stroke="#1baf7a"/>
<text x="1268" y="486" font-size="13.5" font-weight="700" fill="#0b0b0b">Frigate · k8s-node1 (T4)</text>
<text x="1268" y="504" font-size="12" fill="#52514e">detect sub / record main</text>
<text x="1268" y="520" font-size="12" fill="#52514e">gpumem budget 2300 MiB</text>
<rect x="1256" y="540" width="258" height="52" rx="6" fill="#ffffff" stroke="#1baf7a"/>
<text x="1268" y="562" font-size="13.5" font-weight="700" fill="#0b0b0b">go2rtc LB 10.0.20.204</text>
<text x="1268" y="580" font-size="12" fill="#52514e">restream → HA live view (MSE/HLS)</text>
<!-- HOME LAN zone -->
<rect x="1148" y="128" width="412" height="180" rx="10" fill="#2a78d6" fill-opacity="0.06" stroke="#2a78d6" stroke-opacity="0.4"/>
<text x="1164" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">HOME LAN 192.168.1.0/24</text>
<rect x="1164" y="168" width="180" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1176" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">AX6000 · .1</text>
<text x="1176" y="208" font-size="11.5" fill="#52514e">+ route 10.0.30.0/24 → .2</text>
<rect x="1164" y="236" width="180" height="52" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1176" y="258" font-size="13.5" font-weight="700" fill="#0b0b0b">ha-sofia · .8</text>
<text x="1176" y="275" font-size="11.5" fill="#52514e">Frigate card + hikvision_next</text>
<rect x="1360" y="168" width="184" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1372" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">apartment clients</text>
<text x="1372" y="208" font-size="11.5" fill="#52514e">laptops, phones</text>
<rect x="1360" y="236" width="184" height="52" rx="6" fill="#ffffff" stroke="#52514e" stroke-dasharray="5,4"/>
<text x="1372" y="256" font-size="11.5" font-weight="700" fill="#52514e">CAMERA DAY: static route</text>
<text x="1372" y="272" font-size="11.5" fill="#52514e">10.0.30.0/24 via 192.168.1.2</text>
<path d="M1254,308 C1150,352 950,372 790,400" fill="none" stroke="#2a78d6" stroke-width="2" opacity="0.6"/>
<text x="1010" y="374" font-size="12" fill="#2a78d6">apartment uplink · switch P1 · trunk · eno1</text>
<!-- FLOWS -->
<path d="M1256,497 C1010,690 330,730 120,650 C40,618 40,380 96,286" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="620" y="700" font-size="13.5" font-weight="700" fill="#008300">ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all)</text>
<path d="M1164,262 C820,282 470,268 302,176 C286,167 278,166 270,172" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="484" y="216" font-size="13.5" font-weight="700" fill="#008300">ALLOW · ha-sofia → camera :80 ISAPI + :554</text>
<text x="484" y="234" font-size="12" fill="#52514e">enters pfSense WAN · reply-to off · needs the AX6000 route</text>
<path d="M280,232 C660,200 860,320 936,386" fill="none" stroke="#008300" stroke-width="2" opacity="0.85" marker-end="url(#arrGreen)"/>
<text x="740" y="322" font-size="12.5" font-weight="700" fill="#008300">ALLOW · camera → 10.0.30.1:123 (NTP)</text>
<!-- LEGEND -->
<g transform="translate(40,800)" font-size="12.5">
<rect x="0" y="0" width="18" height="18" rx="4" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="26" y="14" fill="#0b0b0b">home LAN / VLAN 1</text>
<rect x="200" y="0" width="18" height="18" rx="4" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="226" y="14" fill="#0b0b0b">CCTV / VLAN 30 / dCCTV 10.0.30.0/24</text>
<rect x="500" y="0" width="18" height="18" rx="4" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
<text x="526" y="14" fill="#0b0b0b">dKubernetes</text>
<rect x="640" y="0" width="18" height="18" rx="4" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
<text x="666" y="14" fill="#0b0b0b">dManagementsVms</text>
<line x1="820" y1="9" x2="860" y2="9" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="870" y="14" fill="#0b0b0b">allowed flow</text>
<line x1="980" y1="9" x2="1020" y2="9" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
<text x="1030" y="14" fill="#0b0b0b">denied</text>
<line x1="1100" y1="9" x2="1140" y2="9" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5"/>
<text x="1150" y="14" fill="#0b0b0b">camera-day step</text>
<text x="1320" y="14" fill="#52514e">ADR-0017 · rev 3</text>
</g>
</svg>

After

Width:  |  Height:  |  Size: 13 KiB

File diff suppressed because it is too large Load diff

File diff suppressed because one or more lines are too long

After

Width:  |  Height:  |  Size: 23 KiB

View file

@ -0,0 +1,47 @@
# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster
Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she
shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`)
and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare
Pages** under `<english-name>.viktorbarzin.me`, kept fresh by **one shared in-cluster
CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes
(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The
existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync)
migrates onto this and is retired.
Why off-infra serving: these are her sites, shown to teachers/parents — they must survive
homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster
site down). With Pages, a homelab outage degrades to "content frozen until we're back",
never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/
Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA
secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never
wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The
deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an
accident.
## Considered options
- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no
Cloudflare Pages dependency — but her sites share the homelab's fate and each site
spends cluster resources to serve static files a free CDN serves better.
- **Pages for new sites only**: less work now, two patterns and two runbooks forever.
- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but
Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault.
## Consequences
- Registration is one entry in the `sites` map (name, Content folder, optional Entry
file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config
together. Names are English, picked by Viktor (most → bridge set the precedent).
- The internal split-horizon zone learns Valia sites from a ConfigMap the
`technitium-ingress-dns-sync` script consumes — declaratively, including **removal**
(the previous static-CNAME approach was add-only; a retired site left a stale record).
- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on
the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs
deployed.
- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no
per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't
update" reports, consistent with the alert-noise-reduction posture. Revisit if a
silent stall actually bites.
- If the homelab is down, content updates pause; the sites keep serving last-deployed
content. Accepted degradation.

View file

@ -0,0 +1,97 @@
# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
inbound overhaul, with sender-MTA retry (15 days, sender-dependent) as the only
outage protection — a documented "No Backup MX" decision made after ForwardEmail's
forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
Routing proved pass-through-only. Viktor now wants inbound mail to survive
homelab outages **without loss** (2026-07-04): delayed delivery is fine,
mid-outage reading is not required, and the budget is **$0** — a hard
constraint that eliminated every managed option (see below).
We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
public IP, MX preference 20; primary untouched at 1). It accepts everything
for the domain (catch-all — every RCPT is valid; reputation may only ever
4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
deliver a DSN, its only egress is the drain), and drains to the primary over
**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
mid-outage break-glass since headscale itself lives in the cluster); TLS via
certbot HTTP-01 (port 80 permanently open — LE validation is
multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
On the primary, the drain stream (one /32) is enabled at the layers that
actually bite — `check_client_access` permits past
`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
exception, and rspamd `external_relay` (score against the *original* sender
IP) with the reject action capped to tag/fold so drained spam can never force
the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
reachability (recurring probe — Oracle publishes no commitment), drain
end-to-end, and a live failover test that includes a high-spam-score and a
>10 MB message. Two independent adversarial reviews (2026-07-04) shaped this
final form. Design:
[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
## Considered options
- **Roller Network free Secondary MX** — v1 of this decision, killed at the
validation gates the same day: free tier caps at 200 relayed messages or
10 MB per rolling 7 days, and overage suspends the domain for 48 h
answering **SMTP 5xx** (permanent bounces) — since spammers target backup
MXes even while the primary is up, background spam alone can hold it
suspended, making it *worse than no backup MX*. Free accounts are also
being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
the documented fallback if the OCI route sours.)
- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
1224 h, barely beating sender retry); filtering black-box; not free.
- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
plan is a 6-month credit; Azure has no always-free VM and blocks 25;
Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
is the only standing free option.
- **Harden-only** (5xx-misconfig guards + paging) — does not address
multi-day outages or short-retry senders; deferred as a complementary
track.
## Consequences
- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
Terraform + cloud-init, patched by unattended-upgrades, scraped by the
cluster's Prometheus (exporters on the reserved public IP, allowlisted to
the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
besides). Never a backup target itself.
- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
free allowance in June 2026 and terminated over-limit instances, and
publishes no commitment that inbound 25 stays open. Mitigations:
**Pay-As-You-Go conversion is a required prerequisite** (exempts idle
reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
the queue being empty outside outages (a surprise reclamation loses
coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
once.
- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
the original IP via `external_relay`), and content scoring stay on — spam
arriving via the backup is tagged and folded to Junk, never bounced. The VM
is deliberately NOT in the primary's `mynetworks` (a compromised VM must
not relay through us).
- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
VM. Stated and accepted (6× better than the status quo).
- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
off-premises; accepted (same class as Brevo holding outbound today).
- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
host found dangling during design — inert today; must list `mx2` when
fixed) needs 12 more → schedule the next record purge proactively.
- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
`vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
failure semantics change (a "failing" probe may now mean "delayed via mx2,
drains shortly" — noted in alert description).

View file

@ -86,10 +86,56 @@ Signin latency is dominated by screen count and round trips, not server time
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections.
15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
hardening — decorrelates the 9 workers' recycles from PG blips). **No
`CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
1:1 and saturate the session-mode pool (reverted 2026-06-10).
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
`authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
burst 429'd the tail and a failed ES-module import left a blank login screen.
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
(~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
+ cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
option), so request-serving is coupled to PG — this survives a short transient,
not a total CNPG outage.
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
(the repo's old `strategy:` key was silently inert → live ran the chart-default
25%/25% and dropped a server pod out of rotation on every roll). Now
`maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares
the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay
image patches `flows/views/interface.py::compat_needs_sfe()` to also serve
authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari
**and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3,
so those clients get the *real* authentik login (password + MFA + reputation —
no auth downgrade). The SFE can't render Identification-stage **sources**
(authentik limitation), so the patch also injects static social-login `<a>`
links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
required for password-less accounts (e.g. Google-only users). A Traefik
basic-auth fallback was rejected: it would have put a single spoofable-UA
password in front of `vbarzin→wizard` (passwordless root on the devvm). See
`stacks/authentik/patch-compat-sfe.py`.
- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
**cannot render WebAuthn** (enrol *or* validate), so that user gets
`unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
downgrade**: (1) **social login** — sources run `default-source-authentication`
(UserLoginStage only, **no MFA stage**), so the SFE's "Continue with <provider>"
button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
runtime data (not Terraform): enrol via `ak shell`
(`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.
@ -108,31 +154,6 @@ All new users must use an invitation link to register. The invitation-enrollment
Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.
### TripIt External self-signup (open enrollment, fenced)
Unlike every other app, **TripIt allows open public self-signup** for people
outside the homelab (ADR-0020 in the tripit repo; runbook
`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment`
flow (email + passkey, no password) creates the account and stamps it into the
parentless **`TripIt External`** group. Containment is two-layered:
- **Forward-auth apps**: a branch prepended to the `admin-services-restriction`
catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and
denies every other `auth="required"` host.
- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth).
External users are contained because every sensitive OIDC app already requires a
trusted group they do not hold — audited 2026-06-15:
Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo →
`Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove →
`Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless
`default`-policy token) and is bound to **`Allow Login Users`** as part of this
change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC).
**Invariants**: keep `TripIt External` parentless (never under `Allow Login
Users`); keep the catch-all branch first; never co-assign `TripIt External` to a
trusted/internal user; the `tripit-enrollment` user_write "Create users group"
setting is the keystone that tags every signup.
### OIDC Applications
Authentik provides OIDC for 10 applications:

View file

@ -128,7 +128,7 @@ The agent handles all three version patterns in Terraform:
- **Slack**: All upgrade events reported (start, success, failure, rollback)
- **Git**: Detailed commit messages with changelog summaries, risk level, backup status
- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)
- **DIUN Slack**: REMOVED 2026-07-02 (per-tag @channel pings in #image-updates; human cadence is the weekly upgrade report). The n8n webhook feed to the upgrade agent is unchanged.
## Bulk Upgrades
@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes.
- `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing``k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
- `K8sUpgradeStalled``k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
- `K8sUpgradeChainJobFailed``kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured).
- `K8sUpgradeChainJobFailed``(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)

View file

@ -112,17 +112,32 @@ External caller (dev box):
@playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
```
## Browser binary — real Google Chrome (for proprietary codecs)
The chrome-service container runs **real Google Chrome**, not the bundled
Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
the lib stripped) and Chrome-for-Testing is also codec-less — only
`google-chrome-stable` carries them.
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`stacks/chrome-service/main.tf`) and the Python client
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
minor-versions**. Bump in lockstep — Playwright protocol changes between
minors and the client cannot connect to a mismatched server.
The harvester + snapshot-server sidecar use
`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
minor, with Python-side bindings pre-installed.
The Playwright base + the Python client (`playwright==1.48.0` in callers'
`requirements.txt`) and the snapshot sidecars
(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
minor-versions. The chrome-service browser is now real Google Chrome (a newer
milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
version-tolerant — verified working against this Chrome. If a future Chrome
milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.
## Storage
@ -167,7 +182,66 @@ minor, with Python-side bindings pre-installed.
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated.
Authentik-gated. The bare host serves `vnc.html` (image symlinks
`index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
to skip the Connect button. The view is **black when no browser window is
open** (idle) — that is normal, not a failed connection. Chrome is launched
with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
(no window manager runs, so without it Chrome opens at its profile-persisted
size and the rest of the framebuffer shows as a black cut-off).
### noVNC fd-sweep gotcha (stuck "Connecting")
If the noVNC client hangs on **"Connecting" forever then times out**, the cause
is almost always x11vnc's fd-table sweep: containerd grants pods
`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
every client connection, so the RFB handshake never completes (websockify
accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"`
healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**
— done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
wrapper in `main.tf` (so it applies deterministically even though the image is
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
as the android-emulator stack.
### noVNC black after a browser-container restart (x11vnc supervision)
A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
but the view is **black**, and the novnc container logs spew
`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
container's Xvfb over `localhost:6099` (shared pod network). When the browser
container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
Xvfb vanishes and x11vnc loses its X connection and exits.
`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
background children and `wait -n`s on them, exiting non-zero if **either** dies, so
the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
`<defunct>` zombie — and the view black until a manual pod restart. Same
supervision pattern as the android-emulator stack's entrypoint.)
**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
recovery** (no image change): restart just the novnc container with `kubectl exec
-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
> (`keel.sh/policy=never`, because the browser container's playwright image is
> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
> rebuilt `:latest` will **not** redeploy on its own. After the
> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
> and rollout (the novnc image is TF-managed — not in the deployment's
> `lifecycle.ignore_changes`).
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -180,6 +254,87 @@ minor, with Python-side bindings pre-installed.
See `stacks/chrome-service/README.md` for the recipe (label namespace,
inject `CHROME_CDP_URL`, vendor `stealth.js`).
## Driving from OUTSIDE the cluster (`homelab browser`)
Agents on the devvm reach this browser through the **`homelab browser`** CLI
(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
`connect_over_cdp` recipe. It is the **escalation path, not the default**:
agents default to the Playwright MCP / headless browser for all routine
automation, and reach for `homelab browser` ONLY when headless is blocked — a
site loads but a gated action (submit/login) silently fails or hangs, the
signature of headless / anti-bot detection. (Same tiered rule lives in
`~/code/CLAUDE.md` and `homelab browser --help`.)
```text
devvm: homelab browser run flow.js
│ kubectl port-forward svc/chrome-service :9222 (random local port)
http://127.0.0.1:<port> ──► chrome-service pod :9222 (CDP)
│ assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
│ node + playwright-core@1.48.2 → connectOverCDP
│ context.addInitScript(stealth.js) ← same vendored file as in-cluster
│ run the user's Playwright script with page/context/browser in scope
└─ port-forward always torn down (success or error)
```
Key facts:
- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
label — unlike in-cluster callers.
- **Client pinned to the image minor.** The node client is
`playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed
lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the
server image bumps (same rule as the in-cluster Python clients — see "Image
pin" above).
- **Default context is a fresh incognito one** (closed on exit), safe for the
shared browser; `--shared-context` reuses the warmed persistent profile.
- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
CLI's stealth never diverges from the in-cluster callers'.
## Multi-user access (sharing the browser)
There is ONE chrome-service browser with ONE persistent profile, warmed with
**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
sessions. Access is gated accordingly, per user.
**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
Viktor's browser for form-filling + captcha solving, rather than getting an
isolated instance. The session-exposure trade-off above was explicitly accepted.
Two independent grants make up "browser access" for a user:
1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
`admin-services-restriction` policy: the `CHROME_ALLOWED` set
(`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
username OR email. Add the user there. No kubeconfig/RBAC needed.
2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
session). Provided by a per-user **ServiceAccount** with a long-lived token
(`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
resolve the Service and doesn't regress the user's normal read). The devvm
provisioner (`scripts/t3-provision-users.sh``install_browser_kubeconfig`)
reads that token and installs it as the user's DEFAULT kubeconfig context
(`<user>-browser@homelab`), keeping their personal OIDC login as the
`oidc@homelab` named context. The SA's existence is the source of truth for who
gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
a token by deleting its `<user>-browser-token` Secret).
Because the SA is the user's DEFAULT kubectl credential, other per-namespace
port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf`
grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's
agent can upload drawings via the port-forward + `X-Authentik-Username` recipe
in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too.
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM

View file

@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin.
| Visibility | Packages | Pull mechanism |
|------------|----------|----------------|
| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson |
Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
@ -115,8 +115,66 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
audiobook-search, council-complaints) now also land on ghcr.
k8s-portal, apple-health-data, audiblez-web, insta2spotify,
audiobook-search) now also land on ghcr.
**plotting-book** is a special case (a GitHub-first repo owned by Anca,
ADR-0003): the build runs in *her* GitHub repo
(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
unchanged. Flow:
```text
DEVELOP ───────────────────────────────────────────────────────────────────────
Anca (Codex / t3 web agent)
│ git push → main
┌──────────────────────────────────────────────────────────────┐
│ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│ ← canonical
│ .github/workflows/build-and-deploy.yml on: push → main │
└───────────────────────────┬──────────────────────────────────┘
│ GitHub Actions runner (off-infra build · ADR-0002)
┌────────────────────┴─────────────────────────────────┐
▼ ▼
┌─────────────────────────────────────────────┐ ╔═══════════════════════════════════════╗
│ build job │ push ║ GHCR · PRIVATE package ║
│ • svu next --always → tag vX.Y.Z (→ repo) │═════▶║ ghcr.io/passionprojectsanca/ ║
│ • buildx linux/amd64, provenance:false │ tags ║ book-plotter :vX.Y.Z :latest ║
│ • login ghcr (GITHUB_TOKEN, packages:write)│ ╚═══════════════════╤═══════════════════╝
│ • delete-package-versions (keep newest 10) │ │
└───────────────────────┬─────────────────────┘ │ pull (private,
▼ deploy job [gate: repo var DEPLOY_ENABLED ≠ "false"] via secret)
POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME} │
▼ │
┌─────────────────────────────────────────────────────────────┐ │
│ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual) │ │
│ kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │ │
│ kubectl rollout status │ │
└───────────────────────────┬─────────────────────────────────┘ │
▼ │
═══════════════ Kubernetes · ns: plotting-book ════════════════════════════ │
┌─────────────────────────────────────────────────────────────┐ │
│ Deployment plotting-book (Recreate · image = ignore_changes)│ │
│ imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
│ Pod → Express :3001 + SQLite on PVC (proxmox-lvm) │
└─────────────────────────────────────────────────────────────┘
guards / supporting:
• Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED (admission)
• Keel policy=patch @1h → watches GHCR via ghcr-credentials (backstop)
• ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
═══════════════ Serving path (unchanged) ══════════════════════════════════
Browser ─▶ plotting-book.viktorbarzin.me (non-proxied DNS → Traefik .203)
─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
```
Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
### Infra-owned images (issues #29 / #30)
@ -130,6 +188,8 @@ reconciled — the workflows were added to the GitHub lineage via PR):
| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) |
| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) |
**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
@ -163,9 +223,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
| Pipeline | File | Purpose |
|----------|------|---------|
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*``10.0.20.10` on change |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports``/etc/exports` on PVE |
@ -176,6 +236,38 @@ Woodpecker is **deploy + cluster-touching steps only**:
**No build/test pipeline exists on any repo.** Do not (re)introduce one.
### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
push**. Left unguarded, two `terragrunt apply` runs race each other for the
per-stack PG state lock — historically the #1 source of `Error acquiring the
state lock` failures and push-supersede "killed" runs.
- **Forge guard** (first command in the `apply` step): the push-apply runs **only
on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com`
skip. Fail-open (unknown forge still applies). The mirror keeps running the
**crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
have killed them.)
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
locked`) — the PG case was previously miscounted as a hard failure.
- **Transient retry** (bounded, 3 attempts): only provider-registry download
timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
are NOT retried — they fail fast.
A pre-apply off-infra validate gate was evaluated and rejected: `terraform
validate` runs without state but catches ~0 of the observed failures (they are
provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
lock contention — all invisible to static validate), and `plan` cannot run
off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
phase without mutating on config errors, so a separate in-pipeline plan-gate was
also dropped as redundant.
### Woodpecker API
Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
@ -203,7 +295,9 @@ The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
Slack audit step. Operational facts (2026-06-10):
Slack audit step. **Slack policy (2026-07-02): every infra pipeline posts only
on FAILURE** (plus the non-admin audit post and drift/error findings) — routine
successful runs are silent. Operational facts (2026-06-10):
- **Webhook URL is the IN-CLUSTER service**:
`http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
@ -285,7 +379,8 @@ steps:
notify:
image: plugins/slack
when:
status: [success, failure]
# Failure-only (2026-07-02 policy): CI notifies about failed runs only.
status: [failure]
```
### CI/CD secrets sync

View file

@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).
## NodeLocal DNSCache
@ -368,6 +368,7 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
| TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
| TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
| A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
| CNAME (CF Pages) | 2 | `<project>.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` |
### Proxied vs Non-Proxied
@ -513,6 +514,7 @@ For external `.viktorbarzin.me` records:
1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)
## Incident History

View file

@ -161,6 +161,17 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail
DB: MySQL (mysql.dbaas.svc.cluster.local)
```
### Paperless ingest mailbox (docs@)
`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in
`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that
paperless-ngx polls over IMAP; family members forward document emails to it
and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve
(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap,
mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`)
discards mail from non-allowlisted senders at delivery. Full flow, sender map,
and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md).
## DNS Records
All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`.
@ -300,6 +311,21 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External
## Troubleshooting
### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin)
Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`:
`postfix/cleanup: warning: tcp:localhost:10001 lookup error` +
`sender_canonical_maps map lookup problem ... message not accepted, try again later`.
Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`)
came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it
`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. Postfix then
tempfails every message (inbound AND submission); senders retry so nothing is
lost, and the roundtrip probe alerts within the hour.
Fix: `supervisorctl restart postsrsd` inside the container; if the fresh
process spins again (it did once), `kubectl -n mailserver delete pod` for a
full re-init — that healed it. Root cause not pinned down (one-off bad init;
postsrsd 1.10).
### Inbound mail not arriving
1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside

View file

@ -146,7 +146,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
**Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
**Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.
@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
| # | Source | Event | Severity |
|---|---|---|---|
@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security`**`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security`posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup

Some files were not shown because too many files have changed in this diff Show more