infra

Author	SHA1	Message	Date
Viktor Barzin	936e6592e0	home-lans-only: add London guest net 192.168.9.0/24 — the Portal Plus lives there Post-rollout discovery during wrap-up: the London Portal Plus leases on the GUEST network (Portal-75AE8F9C2A8A = 192.168.9.198), not the main LAN, so the allowlist shipped in `8bac9914` would have 403'd it once it woke. Verified the forwarded path end-to-end on the Flint 2 (read-only): VPN_PREROUTING_HOOK hooks BOTH br-lan and br-guest into ROUTE_POLICY -> TUNNEL10_ROUTE_POLICY, which marks all dst_net10 (10/8) traffic onto the WG tunnel — so the Portal reaches 10.0.20.203 with source 192.168.9.198 once on-screen. (Side finding, router-originated only: the firewall.user LOCAL_POLICY dst_net10 injection from vpn.md has rotted — admin curls from the router itself don't tunnel; clients unaffected. Not fixed here — live-device change, needs Viktor's OK.) Middleware already applied live via targeted tg apply (20:11 UTC). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 20:15:31 +00:00
Viktor Barzin	8bac9914ec	immich-frame: LAN-only access via home-lans-only allowlist + dns_type=internal Some checks failed ci/woodpecker/push/default Pipeline failed Details Viktor asked to tighten who can see the immich-frame deployments: make them not public while keeping the two Meta Portals working as frames. The Portal app bakes the URL into the APK, so the same hostnames must keep loading from the home networks with zero device or router changes. - New shared Traefik middleware home-lans-only (Sofia/London/Valchedrym LANs + 10/8 + internal v6) — separate from local-only so the remote LANs don't inherit access to admin surfaces. - New ingress_factory dns_type="internal": publicly-resolvable A record carrying the internal Traefik LB IP (10.0.20.203). Outsiders resolve but can't route; WG spokes policy-route 10/8 down the tunnel. Never combine the allowlist with proxied DNS (cloudflared pod IPs are in 10/8 and would bypass it). - Both frame ingresses: dns_type internal + allowlist attached + external_monitor=false (drop the doomed [External] monitors). - rybbit worker: highlights-immich route/site removed (off Cloudflare). - Docs: CLAUDE.md/AGENTS.md ingress tiers, networking.md DNS categories, design doc docs/plans/2026-07-04-immich-frame-lan-only-design.md. Pre-verified: London router DNS returns RFC1918 answers unfiltered; Technitium already CNAMEs both hosts to the LB; no public wildcard. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 14:21:01 +00:00
Viktor Barzin	114a7743ac	backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3 All checks were successful ci/woodpecker/push/default Pipeline was successful Details Rollernet's free tier failed the validation gates before any DNS change (200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces — worse than no backup MX; free accounts being discontinued). Viktor chose to stay free, so the backup MX becomes a Postfix store-and-forward relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20), draining via port 2526 through the existing pfSense HAProxy frontend since Oracle blocks egress 25. Two independent adversarial reviews then fixed the design: primary-side drain enablement moved to the layers that actually reject (unknown- client-hostname, spoof protection, anvil limits, rspamd reject tier -> external_relay + action cap, never backscatter), monitoring moved off the nonexistent cluster->tailnet path to allowlisted public-IP scrapes, bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only postscreen hygiene replaces the blanket no-filtering stance. ADR-0019 and the design doc renamed accordingly (rollernet -> oracle). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 13:38:39 +00:00
Viktor Barzin	c1ffed17a9	backup-mx design: credentials to Vaultwarden, not Vault KV [CI: ci/woodpecker/push/default pipeline successful] Viktor asked for the Rollernet account credentials to live in Vaultwarden (the personal password manager) rather than HashiCorp Vault. Item 'Rollernet (backup MX)' created; doc updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 12:55:43 +00:00
Viktor Barzin	c91fa881e6	docs: design + ADR-0019 — free backup MX via Roller Network secondary MX All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wants inbound email to survive homelab outages without loss; delayed delivery is acceptable and the budget is zero, which rules out the previously doc-flagged Dynu option. Design adopts Roller Network's free Secondary MX (3-week store-and-forward queue, no forced filtering, catch-all-compatible) with our-side postscreen/rspamd whitelisting, five validation gates before any DNS change, and a live failover test. Also records the dangling-MTA-STS finding (TXT published, policy host absent) as a follow-up. Implementation starts only after Viktor reviews these docs; account will use rollernet@viktorbarzin.me. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 09:59:16 +00:00
Viktor Barzin	50778d47d3	drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro (his fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog). Upstream ghcr image with Keel auto-upgrade, DuckDB data on an encrypted proxmox-lvm PVC (GPS traces = sensitive), NFS /sync-logs drop folder imported every 8h, daily backup CronJob to /srv/nfs/drone-logbook-backup (vaultwarden pattern), Authentik-gated ingress, PROFILE_CREATION_PASS from Vault via ESO. Design + plan in docs/plans/; service-catalog updated.	2026-07-04 08:42:53 +00:00
Viktor Barzin	d9717a53bf	vault-token-renew runbook: document the self-heal behavior [CI: ci/woodpecker/push/default pipeline successful] Drift guard section rewritten: admin-capable clobbers now self-heal at the nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure; manual re-mint is only the weak-clobber recovery now. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 20:20:44 +00:00
Viktor Barzin	a07a603b80	docs/plans: vault-token self-heal implementation plan Task-by-task TDD plan for the approved self-heal design: pure-function tests first, then the heal branch, runbook update, deploy + live clobber simulation, landing and memory updates. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 20:09:36 +00:00
Viktor Barzin	e2bfb20c84	docs/plans: vault-token self-heal design (devvm renewer) Viktor asked to make 'vault login -method=oidc' work seamlessly on devvm: today any OIDC login clobbers the permanent periodic token in ~/.vault-token, the drift guard only logs the drift, and his access effectively expires weekly. Approved design: the nightly renewer re-mints the periodic token from any admin-capable clobber (weak clobbers keep failing loudly) and revokes stale periodic tokens after each heal. Implementation follows on this branch. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 20:02:53 +00:00
Viktor Barzin	9dcd3b0d5d	Merge remote-tracking branch 'forgejo/master' into wizard/stem95su-cutover [CI: ci/woodpecker/push/default pipeline successful]	2026-07-03 15:27:04 +00:00
Viktor Barzin	5367d4a055	paperless-mail-ingest: rules process inline attachments (Apple Mail lesson) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor's first real forward carried the invoice PDF with Content-Disposition: inline (Apple Mail does this for real documents), and the attachments-only rules consumed nothing — recorded PROCESSED_WO_CONSUMPTION, which also blocks reprocessing. Flipped all 5 rules to attachment_type=2 (process inline) via the API and documented the trade-off + the ProcessedMail unblock step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 15:25:44 +00:00
Viktor Barzin	21c6e7112e	stem95su: retire the in-cluster serving stack — now a Valia site on Pages Completes the ADR-0018 cutover. The stack is emptied to a tombstone so CI destroys nginx, the NFS content volume, the ingress, the per-site gdrive-sync CronJob and the namespace; serving + sync are owned by stacks/valia-sites since the cutover commits. Catalog + runbook updated to the migrated state (incl. the one-time 42.9→21.4MB video compression Viktor approved). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 15:22:32 +00:00
Viktor Barzin	c1ee6863b3	mailserver docs: troubleshooting entry for the postsrsd 100%-CPU spin [CI: ci/woodpecker/push/default pipeline successful] Hit during the docs@ rollout: after a pod restart postsrsd came up spinning without binding its TCP ports, so postfix cleanup tempfailed every message with 451 queue file write error. Document the signature and the supervisorctl-restart / pod-recreate fix. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 14:39:13 +00:00
Viktor Barzin	68b9858eff	paperless-mail-ingest runbook: manual mail_fetcher must drop to the paperless user All checks were successful ci/woodpecker/push/default Pipeline was successful Details A root-run kubectl exec mail_fetcher downloads attachments root-owned into the scratch dir and the celery consumer (uid 1000) fails with PermissionError — found during the build E2E. Document s6-setuidgid usage and the recovery step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 14:26:12 +00:00
Viktor Barzin	77fcb08e8e	mailserver: add docs@ paperless ingest mailbox (sieve sender allowlist) [CI: ci/woodpecker/push/default pipeline failed] Viktor asked to forward arbitrary emails with PDF attachments into paperless-ngx, with the forwarding sender mapping 1:1 to the paperless account that owns the document. paperless-ngx's built-in IMAP consumer already does the sender->owner mapping, so the infra half is a dedicated real mailbox docs@viktorbarzin.me: an explicit self-alias (the @domain catch-all would otherwise divert it into the TripIt-swept spam@ mailbox, whose sweeper LLM-parses and auto-replies to mail from linked senders) plus a per-user Dovecot sieve that discards non-family senders at delivery (chosen behaviour for unmatched senders: ignore and delete; also keeps spam out of the guessable address). The mailbox credential was added to Vault secret/platform.mailserver_accounts. Paperless-side mail account + 5 per-sender rules are DB state, configured via the API per the new runbook docs/runbooks/paperless-mail-ingest.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 14:06:19 +00:00
Viktor Barzin	f5187806f9	ADR-0017: replace ASCII trunk diagram with excalidraw VLAN-tagging diagram All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wants the traffic-flow view as a colored excalidraw instead of the ASCII block (which was the only thing rendering after the earlier VLAN-tagging SVG commit failed to push — a locally-masked non-fast- forward this session, not a merge clobber). Ships both the editable .excalidraw scene and a hand-drawn-style SVG export embedded in the Traffic-on-the-trunk section: two lanes showing where the 802.1Q tag is added, carried (only P5<->vmbr0) and stripped, L2 membership drops vs L3 firewall verdicts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 13:21:59 +00:00
Viktor Barzin	316cdb7441	docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries [CI: ci/woodpecker/push/default pipeline successful] Runbook covers add/update/retire (one map entry; internal DNS now cleans up after itself), content rules for Valia's folders, and the failure modes incl. both token re-mint paths. dns.md superset-rule paragraph now describes the declarative ConfigMap reconcile instead of hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row notes its Pages cutover is parked on the 42.9MB stem_video.mp4 exceeding the 25MB Pages per-file cap. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:46:24 +00:00
Viktor Barzin	348f64d34d	ADR-0017: add physical-cabling diagram (wires only) [CI: ci/woodpecker/push/default pipeline successful] Viktor asked for one diagram showing just the physical connections between nodes, separate from the logical/VLAN topology: ISP->AX6000, the in-wall apartment->garage run into P1, 4G router (cellular OOB), UPS mgmt, the PoE cat6 to the camera, the LAN1 cable to eno1, dark eno2 fallback + free eno3/4, iDRAC on shared-LOM, and the note that everything else on the R730 is virtual. Referenced from the ADR next to the logical SVG. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:40:29 +00:00
Viktor Barzin	126cf4c88e	Merge origin/master into wizard/cctv-adr-trunk [CI: ci/woodpecker/push/default pipeline successful]	2026-07-03 12:32:00 +00:00
Viktor Barzin	5d16a18cf4	ADR-0017: document trunk traffic semantics + ASCII topology While reviewing the single-switch design Viktor asked whether both the home LAN and the camera VLAN 'go via pfSense which forwards upstream' - a natural misreading a future reader would repeat. Added a section spelling out the vmbr0 fork: untagged home LAN is L2-bridged past pfSense (gateway stays the AX6000, rack outage does not affect it, OOB via 4G survives), while tagged-30 can only land on the dCCTV interface, making a pfSense bypass impossible by construction. Includes a compact ASCII topology for terminal readers alongside the SVG. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:31:48 +00:00
Viktor Barzin	5c42155b81	docs: Valia-sites domain language + ADR-0018 (off-infra Pages, in-cluster sync) Grill session with Viktor: his mother Valia will keep asking for 1-page site hosting, so the pattern is being made repeatable. Decisions: all Valia sites serve off-infra on Cloudflare Pages (survive homelab outages); one shared in-cluster CronJob mirrors her Drive folders every 10 min and redeploys on change; English subdomain names picked by Viktor; failed-Job-only visibility; stem95su migrates onto the pattern. CONTEXT.md gains Valia site / Content folder / Entry file; full rationale and rejected options in ADR-0018. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:17:45 +00:00
Viktor Barzin	e1bd111562	rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me [CI: ci/woodpecker/push/default pipeline successful] Viktor asked to rename the 'мост' school static site to 'bridge'. New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already deployed and the custom domain attached; this renames the public CNAME (TF resource most_pages -> bridge_pages, destroy+create swaps the record) and the internal split-horizon static CNAME in the ingress-dns-sync CronJob. The old 'most' Pages project and the stale internal 'most' record are removed out-of-band after this applies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:52:30 +00:00
Viktor Barzin	7dd80b6c7c	technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The internal split-horizon zone is authoritative for viktorbarzin.me, so the new Cloudflare Pages site (most.viktorbarzin.me, added for Viktor's 'мост' school static site) NXDOMAINed for every internal client — LAN, VLANs and pods — while resolving fine externally. Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev) in the ingress-dns-sync CronJob next to the mail-auth records, and document the off-infra-site case in dns.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:10:46 +00:00
Viktor Barzin	217a54be9d	cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site [CI: ci/woodpecker/push/default pipeline successful] Viktor asked to host a static HTML site (the 'мост' school project, ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages with a custom domain, as a try-out of Pages hosting. The site content is deployed off-infra via wrangler to the Pages project 'most' (most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it. The custom domain is already attached to the Pages project and is waiting on this DNS record to validate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:06:33 +00:00
Viktor Barzin	be80ef23bb	ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable Viktor prefers not running two switches, so the TL-SG105PE takes over all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV segment moves onto a managed tagged trunk over the existing LAN1 cable: pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same MAC so vtnet3/dCCTV survived untouched). This is safe where the original 802.1Q rejection was not, because the managed switch is the only device on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the documented fallback. Old SG105E retires to cold spare; PE inherits 192.168.1.6. Glossary Segment term updated (all three segments are now bridge-tags feeding untagged pfSense vNICs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 09:15:52 +00:00
Viktor Barzin	e11bd6e893	ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere Viktor asked to verify free ports on the garage switch (192.168.1.6) before finalizing. Logging into it showed it is NOT the TL-SG105PE from the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use (apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch port-VLAN design written earlier today was based on conflating the two devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2 uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched, and no VLAN config exists anywhere. ADR, topology SVG and networking.md updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 08:37:15 +00:00
Viktor Barzin	b761701994	ADR-0017: add network topology diagram (SVG) next to the decision [CI: ci/woodpecker/push/default pipeline successful] Viktor asked for a reviewable network visualization committed alongside the CCTV-segment ADR. Hand-drawn SVG (renders on Forgejo, validated palette): physical path camera -> TL-SG105PE port-VLANs -> eno2/vmbr2 -> pfSense dCCTV, the firewall flows (Frigate RTSP, ha-sofia ISAPI/RTSP, NTP-only egress, default deny), and the dashed camera-day steps (patch cable, cat6 run, AX6000 static route). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 20:25:28 +00:00
Viktor Barzin	248e186dce	CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor and emo are adding the first owned camera at the Sofia site (HiLook IPC-T241H-C watching the garage / server rack). Viktor asked to finalize emo's plan; the grilling session resolved emo's five open decisions and replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24), port-based VLAN split on the shared TL-SG105PE, camera default-deny with NTP-only egress, Frigate + ha-sofia as the only consumers. The PVE bridge, pfSense interface, Kea subnet and firewall rules were applied live this session (hand-managed hosts, backed up). This commit records the decision (ADR-0017), the glossary terms (Segment / CCTV segment), the as-built architecture doc, and bumps Frigate's ADR-0016 VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 20:01:45 +00:00
Viktor Barzin	4c532dbf97	devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned 12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G) band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel throttled every task in the cgroup indefinitely (memory.pressure full ~80%, oom_kill never fired) - the t3 event loop starved, the accept queue rotted, and the terminal was dead until the hog was SIGKILLed by hand. The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog that stabilises between high and max never OOMs, so the throttle band is a livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh is now explicitly infinity on all three work cgroup definitions (t3-serve@ unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM- killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3 server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged. Applied live to the devvm the same day (daemon-reload + runtime set-property on running cgroups, no session restarts). Post-mortem addendum + runbook updated in the same commit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 16:59:38 +00:00
Viktor Barzin	21afae85c9	dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads) [CI: ci/woodpecker/push/default pipeline successful] Viktor saw dawarich throwing 429s through Traefik and asked to loosen the burst for it. The access log confirms the burst pattern: one page load fires the whole fingerprinted-asset tail (SVG store badges, favicons, webmanifest) from a single client IP and trips the default 10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429). Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and authentik: dedicated dawarich-rate-limit middleware (average 100 / burst 1000) + skip_default_rate_limit on the dawarich ingress. Also updates the networking.md middleware enumerations (adding the previously undocumented tripit/health limiters alongside dawarich). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 15:03:08 +00:00
Viktor Barzin	8fc657f431	excalidraw: migrate image build to GHA -> private ghcr (ADR-0002) The image was still built by hand and pushed to DockerHub (v1..v4), predating the all-builds-off-infra doctrine; Viktor chose to move it onto the standard pipeline while shipping the export/rename feature rather than keep the manual flow. Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml (go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns added to the Kyverno ghcr-credentials allowlist (package is PRIVATE), deployment now pins ghcr :latest with pullPolicy Always + pull secret, Keel force/match-tag/5m annotations seed the metadata (live values win via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image lists updated (also backfilled the missing k8s-portal rows in ci-cd.md). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:29:23 +00:00
Viktor Barzin	6f03ccd1aa	excalidraw: grant emo-browser SA port-forward for drawing uploads All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to fix emo's permission so his Claude can upload to the Excalidraw service. emo's recent sessions show the documented upload recipe (kubectl port-forward svc/draw + X-Authentik-Username header, from his ~/.claude/CLAUDE.md) failing with: pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser in namespace excalidraw because his default kubeconfig is the read-only emo-browser SA (its port-forward grant covers only chrome-service) and his old admin kubeconfig at /home/emo/code/config expired and was removed. Add a namespace-scoped Role (pods/portforward create) + RoleBinding for that SA in the excalidraw namespace, mirroring the 2026-06-28 chrome-service grant. Trade-off (any-user drawings via the trusted username header) documented in the file and accepted. Also record the grant in docs/architecture/chrome-service.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 11:08:28 +00:00
Viktor Barzin	88c86e2109	ci: Slack-notify failed pipeline runs only All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor doesn't want a Slack message for every CI run — only failures. The infra apply pipeline posted a status line to #general on every push, and the renew-tls / postmortem-todos / registry-config-sync / pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine messages a week). Now: the apply pipeline's success post is gone (notify-failure already covers failures), all cron notifies are status:[failure] with explicit FAILED texts, and drift-detection is silent when all stacks are clean (still posts drift findings and errors, and gains a hard-failure catch step it previously lacked). Kept: notify-nonadmin-push (org audit feed) and the actionable provision-user post. Per-app deploy template in ci-cd.md updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:27:43 +00:00
Viktor Barzin	a64d2ba2b9	upgrades: fix hourly gotenberg error + cap update notifications at weekly [CI: ci/woodpecker/push/default pipeline successful] Viktor was getting upgrade-error Slack messages every hour and wants update notifications at most weekly. Root cause of the errors: Keel kept trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's require-trusted-registries denied it: gotenberg/* (and apache/*, which tika will hit next) were never allowlisted, and Keel's Slack notifier at info level re-posted the identical failure to #general on every hourly poll since Jun 28. Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly; disable Keel's direct Slack notifier and replace failure visibility with a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification plus the daily digest, never an hourly drip); remove diun's Slack notifier whose default message @channel-pinged #image-updates for every new upstream tag every 6h (the n8n upgrade-agent webhook feed is untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC). Paperless-ngx itself stays paused (keel policy=never, user-managed) while the ingest runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:16:50 +00:00
Viktor Barzin	74819d4061	feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation The single time-sliced Tesla T4 has no per-tenant memory isolation, so its ~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory protection so we don't overallocate GPU memory, and chose to do it at the scheduling level (no device-plugin swap) after weighing HAMi and MPS. Make the scheduler VRAM-aware and add runtime teeth, all repo-native, time-slicing untouched: - Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob. - Each always-on GPU tenant declares a gpumem budget (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 <= advertised) so the scheduler refuses to co-schedule past the card (overflow -> Pending). - gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to false after a few cycles look right. - Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built free-VRAM follow-up. - Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment. HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:57:40 +00:00
Viktor Barzin	308a174ad6	docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation [CI: ci/woodpecker/push/default pipeline successful] PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP (10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc still listed only four IPs in use / three dedicated. Add the .204 row to the allocation table, bump the counts (five in use, four dedicated, 5-IP layout), and add a LB-IP renumber-checklist entry for the out-of-band consumers (the go2rtc WebRTC candidate on the frigate config PVC and the HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE candidates, so the Service annotation is the single source of truth. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:42:27 +00:00
Viktor Barzin	3398873a16	k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security) uptake now lags up to 7 days instead of <=1. - var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC) - var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h after the Sunday check, so nightly-report.py's ~25h staleness threshold stays valid AND still flags a missed weekly run; no STALE_SECONDS change needed) The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename = churn). Cadence wording updated across main.tf comments, nightly-report.py docstring, and the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 06:22:20 +00:00
Viktor Barzin	7fe2d9780e	monitoring: add pfSense WAN/egress alerting + probes Some checks failed ci/woodpecker/push/default Pipeline was canceled Details On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for ~20 min while internal routing + Unbound stayed up; recovery needed a manual reboot and NOTHING alerted — there was no egress probe and the cloudflared replica metric stayed green. Add first-class egress monitoring so the next occurrence pages in ~2 min instead of being noticed by a human. - blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW so ICMP can use raw sockets). - Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 + 1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers). - Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable, InternetEgressDown (both providers dead), ExternalDNSResolutionDown, EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's exact "external down while internal up" signature), PfSenseVMDown. - Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the cloudflared replica metric is blind to tunnel-connection loss. Threshold calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident). - Alertmanager inhibit: WAN/egress-down suppresses the downstream egress symptom alerts so one root alert pages, not a storm. - Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md. All metric names + the cloudflared threshold verified against live Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening (dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred and documented in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:46:30 +00:00
Viktor Barzin	279b88d2bc	docs: add MetalLB L2Status-immutable PG-VIP-flap post-mortem (code-aoxk) Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status CR (immutable status.node) flapped the PG load-balancer VIP and silently broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error "Cannot read PG creds" masked the real cause for ~25 days). Written when the incident closed (beads code-aoxk, 2026-05-26) but never committed; landing it so the RCA + stuck-CR cleanup procedure live in the repo. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:25:10 +00:00
Viktor Barzin	2e50c1235c	chrome-service: grant emo shared browser access (noVNC + homelab browser CLI) Viktor asked to give emo access to the cluster's headed Chrome so he can fill in forms and get past anti-bot / captcha pages. emo was deliberately locked out of chrome-service (noVNC Authentik allowlist was Viktor-only + his power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE his existing browser rather than stand up an isolated per-user instance, accepting that emo can therefore reach Viktor's warmed logged-in sessions (CDP has no per-context auth, so the single shared persistent profile is reachable by anyone who can drive the browser). emo's CLI use is hands-off (his agent can run it unattended). - authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED so the admin-services-restriction policy admits him to chrome.viktorbarzin.me (noVNC). Reverses the prior Viktor-only lock; comment updated to record why. - chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token (dashboard-sa.tf pattern), a chrome-service-portforward Role granting pods/portforward, and a cluster read-only binding (oidc-power-user-readonly) so the SA can resolve the Service and emo's normal read access doesn't regress. - t3-provision-users.sh: install_browser_kubeconfig installs a dual-context kubeconfig for any user with a <user>-browser SA — SA token as the default context (non-interactive, works headless), personal OIDC retained as the oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the headless agent session that homelab browser needs. - docs/architecture/chrome-service.md: document the shared-browser multi-user access model, the session-exposure trade-off, and how to grant/revoke a user. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:20:07 +00:00
Viktor Barzin	250d0fc334	docs(authentik): document SFE forced-WebAuthn escape hatches (TOTP + social) Old-browser users on the SFE who have a password but no MFA device hit the default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error). emo (Google-only, iPadOS 15) hit this on the password path. Document the two no-MFA-downgrade fixes: (1) social login, whose source flow (default-source-authentication) has no MFA stage, so the SFE's social button always completes; (2) enrolling TOTP, which the SFE can validate (unlike WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was enrolled for emo and stored in his Vaultwarden authentik item; verified end-to-end (a Bitwarden-generated code is accepted by authentik). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:24:40 +00:00
Viktor Barzin	e518ada3d4	authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets the SFE too, and the SFE login shows social-login buttons (emo is Google-only with no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md + authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:53:26 +00:00
Viktor Barzin	6ba60cbb2d	authentik: repoint to overlay patch2 (SFE for old Safari) + docs global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:39:29 +00:00
Viktor Barzin	ec681ba6e1	ci(infra): stop double-apply + stop counting PG lock-waits as failures The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going red ~20% of the time. Root causes (verified from the failure logs, not guessed): 1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82) AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every push. The two applies race each other for the per-stack PG state lock → "Error acquiring the state lock" failures + push-supersede "killed" runs. 2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state lock") fell through and was counted as a hard FAILURE. 3. Transient provider-registry download timeouts (and Vault 5xx) failed the whole pipeline with no retry. Fixes (all in default.yml): - Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons (they live on repo 1), so we de-dup the apply without deactivating the registration. Fail-open on unknown forge. - Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED. - Bounded retry (3x) ONLY on transient signatures (provider download timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast. Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock failures; reproduced `terraform validate` passing the exact stacks that fail at apply) and lock-reaping/force-unlock (PG advisory locks are session-scoped + auto-release; force-unlock can't free them and would corrupt a live concurrent apply). 
Shell logic + the classification regexes were unit-tested locally against the real decoded error strings (#359 PG lock, #353 provider timeout, #360 missing-arg, helm atomic timeout); `bash -n` clean; YAML parses. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:37:18 +00:00
Viktor Barzin	e03e4719ad	vault: distinguish Vaultwarden vs HashiCorp Vault, add `vault kv` `homelab vault` only spoke to Vaultwarden (the password manager), but the name reads as HashiCorp Vault (the infra secrets store — actually OpenBao here). Make the two unmistakable and support both. Distinction (no breakage — the existing Vaultwarden verbs are unchanged): - bare `homelab vault` help now LEADS with the two-stores split; - every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`; - HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group. New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store): - `kv get <path> [--field K]` — read; --field → one value (TTY-aware clipboard/stdout), no field → full secret JSON (refuses a bare TTY). - `kv list <path>` — list sub-paths (no values). - `kv put <path> <key>` — write one key; value via stdin (piped or no-echo prompt, never argv); creates the path or merges (never clobbers siblings; uses kv patch -method=rw so no `patch` cap needed). Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token / $VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR but never inject the scoped token. Access is whatever the policy grants. Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg builders/kvGet/List/Put; file header documents the credential split). CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2 envelope strip, help names both systems. Verified e2e against live Vault (read key-names-only + a scratch put/merge/cleanup). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:09:33 +00:00
Viktor Barzin	3d948c7033	Merge remote-tracking branch 'origin/master' into wizard/upgrade-gate-held	2026-06-28 10:09:42 +00:00
Viktor Barzin	2880fe1c29	docs: update k8s-version-upgrade runbook for actionable-vs-held gate Reflect the classification change in the operational runbook: the gate's three refusal classes (actionable/waiting/pinned), held wins on a mix, refusals now Complete cleanly (no Failed Job), k8s_upgrade_held gauge + the deliberate no-alert-for-held, the dropped K8sUpgradeChainJobFailed suppression clause, the nightly report ⏸️ HELD outcome, and the detector's silent nightly re-evaluation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:09:34 +00:00
Viktor Barzin	eebb6c8594	k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case The nightly upgrade chain detected 1.36, the preflight compat-gate refused it, and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY night — even though the block is unactionable (no kyverno/ESO release supports 1.36 yet, and gpu-operator is pinned to its current version because bumping it needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. compat-gate.py now classifies each blocker: - ACTIONABLE: a newer addon version in addon-compat.json supports the target -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the nightly report). - WAITING-on-upstream: no released version supports the target yet -> held. - PINNED: addon marked pinned in the matrix (gpu-operator) -> held. Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert. Tidy the block path (Viktor's scope choice): deliberate gate decisions now make the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete 'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge is pushed definitively once per run (no 1->0->1 flap that re-notifies). The detector re-spawns a refused-but-Complete preflight nightly (silently) so a standing hold still re-evaluates, and only announces genuine new/Failed spawns. nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class. gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason). Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned; Calico the lone actionable piece) — no nightly Failed Job, no alert, just the morning report's HELD line. 
Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:08:20 +00:00
Viktor Barzin	ccee443790	vault: add `get --all` to browse every field of an item `homelab vault get` could only fetch one of five allow-listed fields and had no way to see what fields an item even has — in particular it could not reach arbitrary user-defined custom fields. Add a `--all` flag that dumps the whole item as a normalized JSON object (`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a Claude session can discover and read every field, custom ones included, in a single call. Security model preserved: - Like `get --json`, the dump is all secret values, so it refuses a bare TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout. - The TOTP seed is reduced to a presence flag (`"totp": true`) and never emitted — the seed is more powerful than a one-time code, so the only seed-derived path stays the specially-audited `vault code`. Tests assert the seed and password-history never appear in the dump. - Op-log uses a distinct `get-all` verb (item name still never logged) so a bulk dump is distinguishable from a single-field read. `normalizeItem` is a pure, unit-tested core; `getItem` is the session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog, onboarding runbook, design spec §16. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:01:49 +00:00
Viktor Barzin	afcd463f39	k8s-upgrade: design doc for actionable-vs-held compat-gate classification The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked every night for the 1.36 target, even though the block is unactionable: no kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned (NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. This documents the design: classify each blocker as actionable / waiting- upstream / pinned, keep the alert only for actionable, quiet the held case to the nightly report, and make deliberate gate decisions Complete cleanly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:01:36 +00:00

465 commits