Compare commits

63 commits

Author SHA1 Message Date
Viktor Barzin
d7a4453f32 feat(f1-stream): wire optional REDDIT_* env for replays activation
Some checks failed
ci/woodpecker/push/default Pipeline failed
Adds REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET to the f1-stream deployment,
sourced from the f1-stream-secrets Secret with optional=true so the pod still
starts before the credentials exist. This activates the replays feature (app
repo ADR-0002) once reddit_client_id / reddit_client_secret are added to the
Vault "f1-stream" key (auto-synced via the ExternalSecret's dataFrom.extract)
and the pod is restarted. Dormant/no-op until then.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-04 20:57:43 +00:00
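The dormant-until-configured behaviour described in d7a4453f32 could look roughly like this on the application side. This is a minimal sketch: the helper name `replays_enabled` is hypothetical, and the app's actual check (per its ADR-0002) may differ. Only the two variable names come from the commit message.

```python
import os
from typing import Mapping

# Variable names taken from the commit message.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")


def replays_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when both credentials are present and non-empty.

    Hypothetical helper: with optional=true on the Secret reference the
    variables are simply absent until the keys exist, so the feature stays off.
    """
    return all(env.get(name, "").strip() for name in REQUIRED_VARS)
```

Checking both variables together means a half-populated Secret leaves the feature off instead of failing at request time.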
Viktor Barzin
37bdb3cb1e Merge branch 'master' of https://forgejo.viktorbarzin.me/viktor/infra
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-04 20:15:41 +00:00
Viktor Barzin
936e6592e0 home-lans-only: add London guest net 192.168.9.0/24 — the Portal Plus lives there
Post-rollout discovery during wrap-up: the London Portal Plus leases on the
GUEST network (Portal-75AE8F9C2A8A = 192.168.9.198), not the main LAN, so the
allowlist shipped in 8bac9914 would have 403'd it once it woke. Verified the
forwarded path end-to-end on the Flint 2 (read-only): VPN_PREROUTING_HOOK
hooks BOTH br-lan and br-guest into ROUTE_POLICY -> TUNNEL10_ROUTE_POLICY,
which marks all dst_net10 (10/8) traffic onto the WG tunnel — so the Portal
reaches 10.0.20.203 with source 192.168.9.198 once on-screen. (Side finding,
router-originated only: the firewall.user LOCAL_POLICY dst_net10 injection
from vpn.md has rotted — admin curls from the router itself don't tunnel;
clients unaffected. Not fixed here — live-device change, needs Viktor's OK.)

Middleware already applied live via targeted tg apply (20:11 UTC).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 20:15:31 +00:00
Viktor Barzin
13fb2a2d27 cli: memory recall/list print full content — drop 240-rune truncation
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Viktor asked to remove the truncation from memory outputs after multiple
agent sessions were misled by it: the 240-rune pretty preview cut memories
mid-sentence, made sessions wrongly conclude no full-content read-back
existed, and made a blind 'update --content' from the preview destroy the
stored tail. The server never truncated — this was purely client-side
display.

recall/list now print each memory's full content (still one line per
memory, newlines flattened). The per-turn recall hook pipes CLI output
through verbatim, so injected memories are complete too. printMemories is
split into a pure renderMemories for testability; truncatePreview and its
UTF-8-boundary test are deleted (nothing is sliced anymore, so the
invalid-UTF-8 class is gone by construction). v0.11.0 -> v0.12.0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 14:37:38 +00:00
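The rendering change in 13fb2a2d27 (full content, one line per memory, newlines flattened) can be illustrated with a pure function. This is a sketch only: it is written in Python rather than the CLI's own language, and the `id: content` layout and the `id`/`content` field names are assumptions, not the CLI's real output format.

```python
from typing import Dict, List


def render_memories(memories: List[Dict[str, str]]) -> str:
    """Render each memory on a single line without truncating its content.

    Assumed record shape: {"id": ..., "content": ...} (hypothetical).
    """
    lines = []
    for memory in memories:
        # Flatten line breaks so every memory stays on one output line.
        flat = (
            memory["content"]
            .replace("\r\n", " ")
            .replace("\n", " ")
            .replace("\r", " ")
        )
        lines.append(f"{memory['id']}: {flat}")
    return "\n".join(lines)
```

Because the function returns a string instead of printing, it can be tested without capturing stdout, which is the testability argument the commit makes for splitting out renderMemories.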
Viktor Barzin
8bac9914ec immich-frame: LAN-only access via home-lans-only allowlist + dns_type=internal
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor asked to tighten who can see the immich-frame deployments: make
them not public while keeping the two Meta Portals working as frames.
The Portal app bakes the URL into the APK, so the same hostnames must
keep loading from the home networks with zero device or router changes.

- New shared Traefik middleware home-lans-only (Sofia/London/Valchedrym
  LANs + 10/8 + internal v6) — separate from local-only so the remote
  LANs don't inherit access to admin surfaces.
- New ingress_factory dns_type="internal": publicly-resolvable A record
  carrying the internal Traefik LB IP (10.0.20.203). Outsiders resolve
  but can't route; WG spokes policy-route 10/8 down the tunnel. Never
  combine the allowlist with proxied DNS (cloudflared pod IPs are in
  10/8 and would bypass it).
- Both frame ingresses: dns_type internal + allowlist attached +
  external_monitor=false (drop the doomed [External] monitors).
- rybbit worker: highlights-immich route/site removed (off Cloudflare).
- Docs: CLAUDE.md/AGENTS.md ingress tiers, networking.md DNS categories,
  design doc docs/plans/2026-07-04-immich-frame-lan-only-design.md.

Pre-verified: London router DNS returns RFC1918 answers unfiltered;
Technitium already CNAMEs both hosts to the LB; no public wildcard.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 14:21:01 +00:00
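The source-IP decision that the home-lans-only allowlist in 8bac9914ec makes can be sketched as follows. This is an illustration, not the Traefik middleware itself: only 10.0.0.0/8 and the London guest net 192.168.9.0/24 (added later in 936e6592e0) are named in this log, so the other LAN ranges and the internal v6 range are left out, and `is_allowed` is a hypothetical name.

```python
from ipaddress import ip_address, ip_network

# Only the ranges named in this commit log; the real list is longer.
ALLOWED_NETWORKS = [
    ip_network("10.0.0.0/8"),
    ip_network("192.168.9.0/24"),  # London guest net, from 936e6592e0
]


def is_allowed(source_ip: str) -> bool:
    """Return True when the client address falls inside an allowed network."""
    address = ip_address(source_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("192.168.9.198"))  # the Portal Plus lease from 936e6592e0
print(is_allowed("203.0.113.7"))    # documentation-range address, outside the list
```

The same check shows why the commit warns against proxied DNS: any request arriving from a pod address inside 10/8 passes, regardless of where the original client sits.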
Viktor Barzin
114a7743ac backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Rollernet's free tier failed the validation gates before any DNS change
(200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces —
worse than no backup MX; free accounts being discontinued). Viktor
chose to stay free, so the backup MX becomes a Postfix store-and-forward
relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20),
draining via port 2526 through the existing pfSense HAProxy frontend
since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side
drain enablement moved to the layers that actually reject (unknown-
client-hostname, spoof protection, anvil limits, rspamd reject tier ->
external_relay + action cap, never backscatter), monitoring moved off
the nonexistent cluster->tailnet path to allowlisted public-IP scrapes,
bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level
iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only
postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:38:39 +00:00
Viktor Barzin
c1ffed17a9 backup-mx design: credentials to Vaultwarden, not Vault KV
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for the Rollernet account credentials to live in
Vaultwarden (the personal password manager) rather than HashiCorp
Vault. Item 'Rollernet (backup MX)' created; doc updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 12:55:43 +00:00
Viktor Barzin
c311a6a3c9 tasks: public ingress carve-out for PWA icons; adopt orphaned stack state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
macOS Safari's Add to Dock (and iOS/Android home-screen installs) fetch
the app icon and web manifest without any session cookies, so the
Authentik forward-auth 302 on tasks.viktorbarzin.me made Safari fall
back to a letter monogram instead of the real icon. Viktor asked for an
ingress carve-out so exactly these five static PWA assets are publicly
fetchable: /apple-touch-icon.png, /favicon.png, /pwa-192x192.png,
/pwa-512x512.png, /manifest.webmanifest.

A second ingress_factory instance (auth=none, dns_type=none, same host)
routes only those paths straight to the tasks service; the SPA shell and
/api stay behind Authentik exactly as before. The new carve-out is also
registered in the Authentik walling-off probe so a future regression
(anything 302-ing these paths to Authentik again) alarms, and the
service catalog entry records the exception.

stacks/tasks/imports.tf adopts the live tasks resources into Terraform
state first: the stack's first-ever apply (pipeline 477, 2026-07-03)
died mid-apply after creating the resources but before the pg state
write, leaving tasks.states empty — without the import blocks this (and
every future) tasks apply would create-fail with 'already exists'. Same
pattern as the monitoring alert-digest adoption.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 10:14:44 +00:00
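The routing split in c311a6a3c9 amounts to an exact-match test on five paths. A minimal sketch, assuming exact path matching; `requires_auth` is a hypothetical name, and the real decision is made by two ingress definitions rather than application code.

```python
# The five static PWA assets listed in the commit message.
PUBLIC_PWA_ASSETS = frozenset(
    {
        "/apple-touch-icon.png",
        "/favicon.png",
        "/pwa-192x192.png",
        "/pwa-512x512.png",
        "/manifest.webmanifest",
    }
)


def requires_auth(path: str) -> bool:
    """Everything except the carved-out assets stays behind forward-auth."""
    return path not in PUBLIC_PWA_ASSETS
```

Exact matching keeps the carve-out from widening by accident: a path such as `/favicon.png/../api` is not in the set and stays gated.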
Viktor Barzin
c91fa881e6 docs: design + ADR-0019 — free backup MX via Roller Network secondary MX
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants inbound email to survive homelab outages without loss;
delayed delivery is acceptable and the budget is zero, which rules out
the previously doc-flagged Dynu option. Design adopts Roller Network's
free Secondary MX (3-week store-and-forward queue, no forced filtering,
catch-all-compatible) with our-side postscreen/rspamd whitelisting,
five validation gates before any DNS change, and a live failover test.
Also records the dangling-MTA-STS finding (TXT published, policy host
absent) as a follow-up. Implementation starts only after Viktor reviews
these docs; account will use rollernet@viktorbarzin.me.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 09:59:16 +00:00
Viktor Barzin
1a63fee4e4 cloudflare: drop 6 dead legacy DNS names (zone at Free-plan 200-record cap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
authelia, immich-powertools, loki, mcaptcha, nfty, whiteboard removed from
cloudflare_proxied_names — all verified dead (no HTTP response, no cluster
route; authelia superseded by Authentik, nfty was a typo of ntfy, whiteboard
was excalidraw's old name). The cap blocked the new drone-logbook stack's
dronelog record (Cloudflare error 81045). Records already destroyed via
targeted local apply; Viktor approved the removal. Zone now at 195/200.
2026-07-04 09:31:32 +00:00
Viktor Barzin
7e49bf394d Merge remote-tracking branch 'forgejo/master' into wizard/drone-logbook
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-07-04 08:44:04 +00:00
Viktor Barzin
c52cdd1f68 Merge branch 'master' of https://forgejo.viktorbarzin.me/viktor/infra
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
2026-07-04 08:43:55 +00:00
Viktor Barzin
c868ef3332 nfs_directories: add drone-logbook sync-logs + backup dirs
Drop folder for the new drone-logbook stack's auto-import (SYNC_LOGS_PATH)
and its daily backup target. Both created on 192.168.1.127 (root:www-data,
2777 — root-squash-writable like vaultwarden-backup).
2026-07-04 08:43:38 +00:00
Viktor Barzin
50778d47d3 drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me
Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro
(his fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog).
Upstream ghcr image with Keel auto-upgrade, DuckDB data on an encrypted
proxmox-lvm PVC (GPS traces = sensitive), NFS /sync-logs drop folder imported
every 8h, daily backup CronJob to /srv/nfs/drone-logbook-backup (vaultwarden
pattern), Authentik-gated ingress, PROFILE_CREATION_PASS from Vault via ESO.
Design + plan in docs/plans/; service-catalog updated.
2026-07-04 08:42:53 +00:00
Viktor Barzin
d9717a53bf vault-token-renew runbook: document the self-heal behavior
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Drift guard section rewritten: admin-capable clobbers now self-heal at the
nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure;
manual re-mint is only the weak-clobber recovery now.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:20:44 +00:00
Viktor Barzin
4a7b6db806 vault-token-renew: self-heal the periodic token on admin-capable clobber
Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC
login the docs prescribe kept clobbering ~/.vault-token with a 7-day token,
and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry
loop, twice in June). On drift the renewer now re-mints the periodic token
with the clobbering token's own authority (Vault's 403 is the judge — no
policy guessing), sanity-checks it, replaces the file atomically, and
revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still
fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:20:00 +00:00
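The "replaces the file atomically" step in 4a7b6db806 is commonly done by writing a temporary file in the same directory and renaming it over the target. The sketch below shows that general pattern in Python; it is not the renewer's actual code, and `replace_token_file` is a hypothetical name.

```python
import os
import tempfile


def replace_token_file(path: str, token: str) -> None:
    """Write token to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".vault-token.")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(token)
            handle.flush()
            os.fsync(handle.fileno())
        os.chmod(tmp_path, 0o600)
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A failure before the rename leaves the previous token file untouched, which matters when the file being replaced is the only working credential.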
Viktor Barzin
8631709ca2 vault-token-renew: pure helpers for the self-heal revoke filter
vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic
decides which old token-devvm-wizard tokens a heal may revoke (never the
just-minted one, never foreign tokens, nothing when the keeper is unknown).
TDD red-green for the heal branch that lands next.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:19:09 +00:00
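The revoke filter described in 8631709ca2 has three stated rules: never the just-minted token, never foreign tokens, and nothing when the keeper is unknown. A sketch of those rules in Python follows. The function names mirror the commit's helpers but the implementation is illustrative, and the assumption that tokens are identified by a `display_name` of `token-devvm-wizard` and that the lookup JSON nests the accessor under `data` is mine, not the commit's.

```python
import json
from typing import Optional

PERIODIC_DISPLAY_NAME = "token-devvm-wizard"  # name taken from the commit log


def accessor_from_lookup(lookup_json: str) -> Optional[str]:
    """Extract the accessor from token-lookup JSON (assumed data.accessor layout)."""
    return json.loads(lookup_json).get("data", {}).get("accessor")


def is_stale_periodic(token: dict, keeper_accessor: Optional[str]) -> bool:
    """Decide whether a heal may revoke this token."""
    if not keeper_accessor:
        return False  # keeper unknown: revoke nothing
    if token.get("display_name") != PERIODIC_DISPLAY_NAME:
        return False  # foreign token: never touched
    return token.get("accessor") != keeper_accessor  # never the just-minted one
```

Ordering the checks so that an unknown keeper returns early makes the safe outcome (no revocation) the default whenever the lookup fails.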
Viktor Barzin
029b65ff93 state(vault): update encrypted state
2026-07-03 20:14:54 +00:00
Viktor Barzin
c48ce73c80 state(vault): update encrypted state
2026-07-03 20:14:35 +00:00
Viktor Barzin
b03a295397 state(vault): update encrypted state
2026-07-03 20:14:18 +00:00
Viktor Barzin
a07a603b80 docs/plans: vault-token self-heal implementation plan
Task-by-task TDD plan for the approved self-heal design: pure-function
tests first, then the heal branch, runbook update, deploy + live clobber
simulation, landing and memory updates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:09:36 +00:00
Viktor Barzin
e2bfb20c84 docs/plans: vault-token self-heal design (devvm renewer)
Viktor asked to make 'vault login -method=oidc' work seamlessly on devvm:
today any OIDC login clobbers the permanent periodic token in
~/.vault-token, the drift guard only logs the drift, and his access
effectively expires weekly. Approved design: the nightly renewer re-mints
the periodic token from any admin-capable clobber (weak clobbers keep
failing loudly) and revokes stale periodic tokens after each heal.
Implementation follows on this branch.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:02:53 +00:00
Viktor Barzin
6698018ab6 service-catalog: add tasks row + tasks to the proxied-domains list
Some checks failed
ci/woodpecker/push/default Pipeline failed
Docs-with-change convention: the new tasks stack (Reminders-style PWA over
Nextcloud CalDAV) gets its catalog entry — what it is, its CNPG db + Vault
static role, the auth=required/X-authentik-username trust model with the
SEC-1 NetworkPolicy, and the ADR-0002 CI/CD path — and tasks joins the
Cloudflare proxied hostname list.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:42 +00:00
Viktor Barzin
02640df620 stacks/tasks: new stack for the tasks PWA (Authentik-gated, CNPG-backed)
Deploys the Reminders-style tasks app at tasks.viktorbarzin.me: namespace,
ExternalSecrets (fernet_key from secret/tasks; TASKS_DB_DSN composed from
the pg-tasks static-creds password the tripit way), single-replica
Deployment of ghcr.io/viktorbarzin/tasks:latest (image ignore_changes per
the fleet set-image pattern; Reloader restarts it on the 7-day DB password
rotation; /healthz probes on 8000; Europe/Sofia local tz; DEV_USER
deliberately absent — security invariant), Service on 8000, and an
ingress_factory host with auth=required + dns_type=proxied since Authentik
forward-auth is the app's only gate. NetworkPolicy tasks-ingress (SEC-1)
limits pod ingress to the traefik namespace plus monitoring on 8000 for
/metrics, so the trusted X-authentik-username header cannot be spoofed by
other pods.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:27 +00:00
Viktor Barzin
e0db1054e7 dbaas+vault: provision tasks CNPG database, role and rotating password
The new tasks PWA (Reminders-style front-end over Nextcloud CalDAV, per
tasks/docs/2026-07-03-tasks-pwa-design.md) needs its own Postgres database
for Connected Accounts and sync state. Follows the tripit/job_hunter
pattern exactly: idempotent null_resource creates role+db on the CNPG
primary with a placeholder password, and the Vault database engine static
role pg-tasks (added to the postgresql connection allowed_roles) rotates
the real password every 7 days, consumed by the tasks stack via a
vault-database ExternalSecret.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:13 +00:00
Viktor Barzin
9dcd3b0d5d Merge remote-tracking branch 'forgejo/master' into wizard/stem95su-cutover
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 15:27:04 +00:00
Viktor Barzin
5367d4a055 paperless-mail-ingest: rules process inline attachments (Apple Mail lesson)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's first real forward carried the invoice PDF with
Content-Disposition: inline (Apple Mail does this for real documents),
and the attachments-only rules consumed nothing — recorded
PROCESSED_WO_CONSUMPTION, which also blocks reprocessing. Flipped all 5
rules to attachment_type=2 (process inline) via the API and documented
the trade-off + the ProcessedMail unblock step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:25:44 +00:00
Viktor Barzin
21c6e7112e stem95su: retire the in-cluster serving stack — now a Valia site on Pages
Completes the ADR-0018 cutover. The stack is emptied to a tombstone so
CI destroys nginx, the NFS content volume, the ingress, the per-site
gdrive-sync CronJob and the namespace; serving + sync are owned by
stacks/valia-sites since the cutover commits. Catalog + runbook updated
to the migrated state (incl. the one-time 42.9→21.4MB video compression
Viktor approved).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:22:32 +00:00
Viktor Barzin
974c9976e3 valia-sites: take over stem95su DNS (manage_dns=true) — cutover half 2
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Creates the public proxied CNAME stem95su -> stem95su.pages.dev and
adds the internal split-horizon entry via the valia-sites-dns
ConfigMap (the sync's update pass repoints the existing internal
record). Completes the ADR-0018 cutover; the old in-cluster serving
stack is retired in a follow-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:21:18 +00:00
Viktor Barzin
5c8e9daabd stem95su: release the public CNAME (dns_type=none) for the Pages cutover
All checks were successful
ci/woodpecker/push/default Pipeline was successful
First half of the ADR-0018 stem95su cutover: the tunnel-target CNAME is
destroyed so stacks/valia-sites can create the Pages-target record for
the same name (Cloudflare allows one CNAME per name; the follow-up
commit flips manage_dns=true there). stem_video.mp4 was compressed to
21.4MB with Viktor's explicit OK, clearing the 25MB Pages cap; content
is already deployed on the stem95su Pages project. Brief public
NXDOMAIN window between the two applies is accepted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:21:18 +00:00
Viktor Barzin
c1ee6863b3 mailserver docs: troubleshooting entry for the postsrsd 100%-CPU spin
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Hit during the docs@ rollout: after a pod restart postsrsd came up
spinning without binding its TCP ports, so postfix cleanup tempfailed
every message with 451 queue file write error. Document the signature
and the supervisorctl-restart / pod-recreate fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:39:13 +00:00
Viktor Barzin
4ee4d1927d mailserver: guard alias filter against short lines with a lazy ternary
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
CI pipeline 469 failed with 'Invalid index' on the postfix_virtual alias
filter: terraform only short-circuits &&/|| from v1.6, and the older
terraform in the infra-ci image still evaluated split(" ", line)[1] for
the blank and comment lines that have been in extra/aliases.txt since the
plans@ block. The devvm's newer terraform short-circuits, which is why the
local apply of the same commit passed. A conditional expression is lazy on
every terraform version, so move the length guard into a ternary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:38:30 +00:00
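The fix in 4ee4d1927d moves the length guard into a conditional expression so the index is only evaluated when it is safe. The same shape in Python, as an analogy rather than the Terraform code: `parse_aliases` is a hypothetical name, and skipping lines that start with `#` is an assumption about how comments are handled.

```python
from typing import List, Tuple


def parse_aliases(text: str) -> List[Tuple[str, str]]:
    """Return (alias, target) pairs, ignoring blank, comment and short lines."""
    pairs = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # assumption: comment lines are dropped up front
        parts = line.split(" ")
        # The index is evaluated only in the branch where the guard holds.
        target = parts[1] if len(parts) > 1 else None
        if target:
            pairs.append((parts[0], target))
    return pairs
```

With a blank line, `split(" ")` yields a single empty element, so an unguarded `parts[1]` would raise, which mirrors the 'Invalid index' failure the commit describes.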
Viktor Barzin
68b9858eff paperless-mail-ingest runbook: manual mail_fetcher must drop to the paperless user
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A root-run kubectl exec mail_fetcher downloads attachments root-owned into
the scratch dir and the celery consumer (uid 1000) fails with
PermissionError — found during the build E2E. Document s6-setuidgid usage
and the recovery step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:26:12 +00:00
Viktor Barzin
77fcb08e8e mailserver: add docs@ paperless ingest mailbox (sieve sender allowlist)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor asked to forward arbitrary emails with PDF attachments into
paperless-ngx, with the forwarding sender mapping 1:1 to the paperless
account that owns the document. paperless-ngx's built-in IMAP consumer
already does the sender->owner mapping, so the infra half is a dedicated
real mailbox docs@viktorbarzin.me: an explicit self-alias (the @domain
catch-all would otherwise divert it into the TripIt-swept spam@ mailbox,
whose sweeper LLM-parses and auto-replies to mail from linked senders)
plus a per-user Dovecot sieve that discards non-family senders at
delivery (chosen behaviour for unmatched senders: ignore and delete;
also keeps spam out of the guessable address). The mailbox credential
was added to Vault secret/platform.mailserver_accounts. Paperless-side
mail account + 5 per-sender rules are DB state, configured via the API
per the new runbook docs/runbooks/paperless-mail-ingest.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:06:19 +00:00
Viktor Barzin
f5187806f9 ADR-0017: replace ASCII trunk diagram with excalidraw VLAN-tagging diagram
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants the traffic-flow view as a colored excalidraw instead of
the ASCII block (which was the only thing rendering after the earlier
VLAN-tagging SVG commit failed to push — a locally-masked non-fast-
forward this session, not a merge clobber). Ships both the editable
.excalidraw scene and a hand-drawn-style SVG export embedded in the
Traffic-on-the-trunk section: two lanes showing where the 802.1Q tag
is added, carried (only P5<->vmbr0) and stripped, L2 membership drops
vs L3 firewall verdicts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 13:21:59 +00:00
Viktor Barzin
316cdb7441 docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Runbook covers add/update/retire (one map entry; internal DNS now
cleans up after itself), content rules for Valia's folders, and the
failure modes incl. both token re-mint paths. dns.md superset-rule
paragraph now describes the declarative ConfigMap reconcile instead of
hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row
notes its Pages cutover is parked on the 42.9MB stem_video.mp4
exceeding the 25MB Pages per-file cap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:46:24 +00:00
Viktor Barzin
4a3c8287c3 Merge remote-tracking branch 'forgejo/master' into wizard/valia-sites
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 12:43:28 +00:00
Viktor Barzin
e0991853e4 valia-sites: 25MB Pages-limit guard; cloudflared: drop removed{} (CI TF <1.7)
Two fixes from the first live runs. (1) The sync job now skips a whole
site when any file exceeds Cloudflare Pages' 25MB per-file cap, leaving
current serving untouched — stem95su's stem_board.html references a
42.9MB stem_video.mp4, which made every run fail; the guard turns that
into a loud skip so bridge keeps syncing. (2) The CI terraform is older
than 1.7 and rejects removed{} blocks anywhere (pipelines 461/464), so
the bridge record handoff was completed with a one-time manual
'tg state rm module.cloudflared.cloudflare_record.bridge_pages' from
the main checkout; the block is deleted and the module comment records
the manual step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:43:13 +00:00
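The size guard from e0991853e4 skips a whole site when any single file is over the Pages per-file cap. A minimal sketch of that rule, assuming the cap is counted as 25 MiB and that file sizes are already known; the function names are hypothetical and the real sync job may gather sizes differently.

```python
from typing import List, Mapping

# Assumption: the 25MB cap is treated as 25 MiB here.
PAGES_MAX_FILE_BYTES = 25 * 1024 * 1024


def oversized_files(sizes: Mapping[str, int]) -> List[str]:
    """Return the names of files that exceed the per-file cap, sorted."""
    return sorted(name for name, size in sizes.items() if size > PAGES_MAX_FILE_BYTES)


def should_deploy_site(sizes: Mapping[str, int]) -> bool:
    """Deploy only when no file breaches the cap; otherwise skip the whole site."""
    return not oversized_files(sizes)
```

Returning the offending names separately from the yes/no decision lets the job log a loud, specific skip message while other sites keep syncing.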
Viktor Barzin
348f64d34d ADR-0017: add physical-cabling diagram (wires only)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for one diagram showing just the physical connections
between nodes, separate from the logical/VLAN topology: ISP->AX6000,
the in-wall apartment->garage run into P1, 4G router (cellular OOB),
UPS mgmt, the PoE cat6 to the camera, the LAN1 cable to eno1, dark
eno2 fallback + free eno3/4, iDRAC on shared-LOM, and the note that
everything else on the R730 is virtual. Referenced from the ADR next
to the logical SVG.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:40:29 +00:00
Viktor Barzin
126cf4c88e Merge origin/master into wizard/cctv-adr-trunk
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 12:32:00 +00:00
Viktor Barzin
695e020111 cloudflared: move bridge removed{} to stack root — removed blocks are root-module-only
Some checks failed
ci/woodpecker/push/default Pipeline failed
Pipeline 461 failed terraform init: the removed{} handoff block sat in
the stack-local module, but Terraform only allows removed blocks in the
root module. Same intent, correct position (from =
module.cloudflared.cloudflare_record.bridge_pages, destroy=false).
Without this the stale state entry would make the next cloudflared
apply destroy the record valia-sites now owns.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:31:53 +00:00
Viktor Barzin
5d16a18cf4 ADR-0017: document trunk traffic semantics + ASCII topology
While reviewing the single-switch design Viktor asked whether both the
home LAN and the camera VLAN 'go via pfSense which forwards upstream' -
a natural misreading a future reader would repeat. Added a section
spelling out the vmbr0 fork: untagged home LAN is L2-bridged past
pfSense (gateway stays the AX6000, rack outage does not affect it, OOB
via 4G survives), while tagged-30 can only land on the dCCTV interface,
making a pfSense bypass impossible by construction. Includes a compact
ASCII topology for terminal readers alongside the SVG.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:31:48 +00:00
Viktor Barzin
8b80b4cc41 valia-sites: registry stack for Valia's Pages sites + declarative internal DNS (ADR-0018)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Build valia-sites-sync / build (push) Has been cancelled
Valia keeps asking Viktor to host 1-page sites from her Drive folders;
this makes it one map entry. New stacks/valia-sites: per site a CF Pages
project + custom domain + proxied CNAME (bridge adopted via import{}),
a ConfigMap feed (valia-sites-dns) the technitium ingress-dns-sync
script now reconciles internal CNAMEs from (add/update/REMOVE — fixes
the add-only stale-record gotcha), and one shared 10-min CronJob that
mirrors each Content folder (rclone, drive.readonly, stem95su's guards)
and wrangler-deploys ONLY on manifest change (free-tier deploy cap).
Scoped CF Pages token + shared rclone conf in secret/valia-sites; the
Global API Key never enters a pod. cloudflared forgets bridge's record
via removed{} (no destroy). stem95su is in the map dns-parked
(manage_dns=false) until its cutover commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:28:06 +00:00
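The add/update/remove reconcile that 8b80b4cc41 introduces for internal CNAMEs is a three-way diff between the ConfigMap feed and the records that currently exist. A sketch under stated assumptions: `plan_dns_changes` is a hypothetical name, and `actual` is assumed to contain only records this feed manages, since anything else in it would be scheduled for removal.

```python
from typing import Dict, List, Tuple


def plan_dns_changes(
    desired: Dict[str, str], actual: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str], List[str]]:
    """Compare name -> CNAME target maps and return (add, update, remove)."""
    add = {name: target for name, target in desired.items() if name not in actual}
    update = {
        name: target
        for name, target in desired.items()
        if name in actual and actual[name] != target
    }
    remove = sorted(name for name in actual if name not in desired)
    return add, update, remove
```

The `remove` list is what an add-only sync lacks: a name dropped from the feed stays in the zone forever unless the reconcile computes it explicitly.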
Viktor Barzin
5c42155b81 docs: Valia-sites domain language + ADR-0018 (off-infra Pages, in-cluster sync)
Grill session with Viktor: his mother Valia will keep asking for 1-page
site hosting, so the pattern is being made repeatable. Decisions: all
Valia sites serve off-infra on Cloudflare Pages (survive homelab
outages); one shared in-cluster CronJob mirrors her Drive folders every
10 min and redeploys on change; English subdomain names picked by
Viktor; failed-Job-only visibility; stem95su migrates onto the pattern.
CONTEXT.md gains Valia site / Content folder / Entry file; full
rationale and rejected options in ADR-0018.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:17:45 +00:00
Viktor Barzin
e1bd111562 rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to rename the 'мост' school static site to 'bridge'.
New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already
deployed and the custom domain attached; this renames the public CNAME
(TF resource most_pages -> bridge_pages, destroy+create swaps the
record) and the internal split-horizon static CNAME in the
ingress-dns-sync CronJob. The old 'most' Pages project and the stale
internal 'most' record are removed out-of-band after this applies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:52:30 +00:00
Viktor Barzin
7dd80b6c7c technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal split-horizon zone is authoritative for viktorbarzin.me,
so the new Cloudflare Pages site (most.viktorbarzin.me, added for
Viktor's 'мост' school static site) NXDOMAINed for every internal
client — LAN, VLANs and pods — while resolving fine externally.
Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev)
in the ingress-dns-sync CronJob next to the mail-auth records, and
document the off-infra-site case in dns.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:10:46 +00:00
Viktor Barzin
217a54be9d cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to host a static HTML site (the 'мост' school project,
ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages
with a custom domain, as a try-out of Pages hosting. The site content
is deployed off-infra via wrangler to the Pages project 'most'
(most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it.
The custom domain is already attached to the Pages project and is
waiting on this DNS record to validate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:06:33 +00:00
Viktor Barzin
be80ef23bb ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable
Viktor prefers not running two switches, so the TL-SG105PE takes over
all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV
segment moves onto a managed tagged trunk over the existing LAN1 cable:
pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same
MAC so vtnet3/dCCTV survived untouched). This is safe where the original
802.1Q rejection was not, because the managed switch is the only device
on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the
documented fallback. Old SG105E retires to cold spare; PE inherits
192.168.1.6. Glossary Segment term updated (all three segments are now
bridge-tags feeding untagged pfSense vNICs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 09:15:52 +00:00
Viktor Barzin
4082934bc1 Merge origin/master into wizard/cctv-two-switch
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 08:37:34 +00:00
Viktor Barzin
e11bd6e893 ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere
Viktor asked to verify free ports on the garage switch (192.168.1.6)
before finalizing. Logging into it showed it is NOT the TL-SG105PE from
the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use
(apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch
port-VLAN design written earlier today was based on conflating the two
devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2
uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched,
and no VLAN config exists anywhere. ADR, topology SVG and networking.md
updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 08:37:15 +00:00
Viktor Barzin
08fb65827c tripit: set PLACE_PHOTO_PROVIDER=wikipedia — real place preview photos
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for place photos on the tripit Trip board. The app-side
work (add-time photo fetch, board place cards) shipped in tripit
v0.106.0, but prod never set PLACE_PHOTO_PROVIDER, so the fake provider
would store placeholder PNGs for every hand-added place. Same class of
fake-default gap as PLACE_RESOLVER_MODE (set explicitly for the same
reason); the ADR-0035 rollout had left both the env flip and its
backfill cron undone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 21:57:21 +00:00
Viktor Barzin
b761701994 ADR-0017: add network topology diagram (SVG) next to the decision
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for a reviewable network visualization committed alongside
the CCTV-segment ADR. Hand-drawn SVG (renders on Forgejo, validated
palette): physical path camera -> TL-SG105PE port-VLANs -> eno2/vmbr2 ->
pfSense dCCTV, the firewall flows (Frigate RTSP, ha-sofia ISAPI/RTSP,
NTP-only egress, default deny), and the dashed camera-day steps (patch
cable, cat6 run, AX6000 static route).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:25:28 +00:00
Viktor Barzin
248e186dce CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor and emo are adding the first owned camera at the Sofia site (HiLook
IPC-T241H-C watching the garage / server rack). Viktor asked to finalize
emo's plan; the grilling session resolved emo's five open decisions and
replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated
physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24),
port-based VLAN split on the shared TL-SG105PE, camera default-deny with
NTP-only egress, Frigate + ha-sofia as the only consumers.

The PVE bridge, pfSense interface, Kea subnet and firewall rules were
applied live this session (hand-managed hosts, backed up). This commit
records the decision (ADR-0017), the glossary terms (Segment / CCTV
segment), the as-built architecture doc, and bumps Frigate's ADR-0016
VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:01:45 +00:00
3a5194c9d4 Merge pull request 'immich(frame-emo): show photos from the last 365 days (was 730)' (#18) from emo/frame-emo-1year into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reviewed-on: #18
2026-07-02 19:05:31 +00:00
9e253d409a immich(frame-emo): show photos from the last 365 days (was 730)
Emil asked his Sofia Portal Mini photo-frame to show only the past
year of photos rolling from today, instead of the last two years.
Changes ImagesFromDays 730 -> 365 in the frame-emo Settings.yml.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 19:05:31 +00:00
Viktor Barzin
4c532dbf97 devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned
12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G)
band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel
throttled every task in the cgroup indefinitely (memory.pressure full ~80%,
oom_kill never fired) - the t3 event loop starved, the accept queue rotted,
and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog
that stabilises between high and max never OOMs, so the throttle band is a
livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh
is now explicitly infinity on all three work cgroup definitions (t3-serve@
unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-
killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3
server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.
Applied live to the devvm the same day (daemon-reload + runtime set-property
on running cgroups, no session restarts). Post-mortem addendum + runbook
updated in the same commit.
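
The livelock condition described above can be sketched as a small predicate. This is a minimal illustration, not the devvm tooling: the `memState` type and `inThrottleBand` helper are hypothetical names, and a limit of 0 stands in for systemd's `infinity`. It uses the figures from this commit (12.3G hog, MemoryHigh 12G, MemoryMax 16G).

```go
package main

import "fmt"

const gib = 1 << 30

// memState is a hypothetical snapshot of one cgroup's memory settings, in bytes.
// In this sketch a limit of 0 means "infinity" (no limit configured).
type memState struct {
	usage, high, max uint64
}

// inThrottleBand reports whether usage sits at or above MemoryHigh but below
// MemoryMax. In that range the kernel throttles the cgroup instead of
// OOM-killing, so a hog that plateaus there is never killed.
func inThrottleBand(s memState) bool {
	if s.high == 0 { // MemoryHigh=infinity: the band does not exist
		return false
	}
	return s.usage >= s.high && (s.max == 0 || s.usage < s.max)
}

func main() {
	hog := uint64(123 * gib / 10) // the 12.3G ugrep from the incident

	before := memState{usage: hog, high: 12 * gib, max: 16 * gib}
	after := memState{usage: hog, high: 0, max: 16 * gib}

	fmt.Println(inThrottleBand(before)) // true
	fmt.Println(inThrottleBand(after))  // false
}
```

With MemoryHigh set to infinity the predicate can never be true, so the only remaining outcomes are "under the limit" or "cgroup-OOM-killed at MemoryMax", which is the behaviour this commit asks for.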

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 16:59:38 +00:00
Viktor Barzin
684ca4527c docs(CLAUDE.md): T4 now has a VRAM budget + watchdog (ADR-0016, dry-run); note llama-swap budget miscalibration
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Session wrap-up doc sync: the Immich note still claimed the shared T4 had no
VRAM isolation. Record the gpumem budget/watchdog shipped earlier today, that
the watchdog is observe-only, and that budgets need a retune (llama-swap's
real 16k-ctx resident is ~7GB, not 4.35) before arming.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 15:20:06 +00:00
Viktor Barzin
21afae85c9 dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor saw dawarich throwing 429s through Traefik and asked to loosen
the burst for it. The access log confirms the burst pattern: one page
load fires the whole fingerprinted-asset tail (SVG store badges,
favicons, webmanifest) from a single client IP and trips the default
10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429).
Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and
authentik: dedicated dawarich-rate-limit middleware (average 100 /
burst 1000) + skip_default_rate_limit on the dawarich ingress. Also
updates the networking.md middleware enumerations (adding the
previously undocumented tripit/health limiters alongside dawarich).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 15:03:08 +00:00
Viktor Barzin
91d0213d1a Merge remote-tracking branch 'forgejo/master' into wizard/excalidraw-export-rename
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build excalidraw-library / build (push) Has been cancelled
2026-07-02 14:29:34 +00:00
Viktor Barzin
8fc657f431 excalidraw: migrate image build to GHA -> private ghcr (ADR-0002)
The image was still built by hand and pushed to DockerHub (v1..v4),
predating the all-builds-off-infra doctrine; Viktor chose to move it
onto the standard pipeline while shipping the export/rename feature
rather than keep the manual flow.

Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml
(go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns
added to the Kyverno ghcr-credentials allowlist (package is PRIVATE),
deployment now pins ghcr :latest with pullPolicy Always + pull secret,
Keel force/match-tag/5m annotations seed the metadata (live values win
via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays
frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image
lists updated (also backfilled the missing k8s-portal rows in ci-cd.md).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:23 +00:00
Viktor Barzin
1cbc1e962b excalidraw: native export menu + drawing rename
Users couldn't see Excalidraw's built-in Save as / Export image options:
the app's custom toolbar was drawn exactly on top of the native hamburger
menu button, hiding it. Removed the overlay and integrated Back to
Library / Save now / Rename into the native menu, so the native export
formats (.excalidraw file, PNG, SVG, clipboard) are now reachable.
Viktor asked for exports to work via the native Excalidraw feature and
for drawings to be renameable by clicking their name.

Rename: new PATCH /api/drawings/{id} endpoint (server-side name
sanitization, 409 on conflict) + click-to-rename title pill in the
editor (updates URL in place) + Rename button/modal in the dashboard.
Existing GET/PUT/DELETE semantics unchanged for API compatibility
(emo's upload pipeline). Added main_test.go (httptest) covering rename
+ existing handler behavior; dashboard rows now DOM-built (XSS-safe).
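
The rename contract described above (sanitize server-side, answer 409 on a name collision) could look roughly like the following. This is a sketch under assumptions, not the code in this commit: the in-memory `store`, the `sanitize` rules, the JSON body shape `{"name": "..."}`, and the 204 success status are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// store is a hypothetical in-memory stand-in for drawing storage: id -> name.
type store struct {
	mu    sync.Mutex
	names map[string]string
}

// sanitize applies assumed rules: drop path separators and control
// characters, trim surrounding whitespace, cap the length at 100 runes.
func sanitize(name string) string {
	name = strings.Map(func(r rune) rune {
		if r == '/' || r == '\\' || r < 0x20 {
			return -1 // a negative value removes the rune
		}
		return r
	}, name)
	name = strings.TrimSpace(name)
	if r := []rune(name); len(r) > 100 {
		name = string(r[:100])
	}
	return name
}

// rename handles PATCH /api/drawings/{id}.
func (s *store) rename(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPatch {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	id := strings.TrimPrefix(r.URL.Path, "/api/drawings/")
	var body struct {
		Name string `json:"name"`
	}
	if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	name := sanitize(body.Name)
	if name == "" {
		http.Error(w, "empty name", http.StatusBadRequest)
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.names[id]; !ok {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	for other, n := range s.names {
		if other != id && n == name {
			http.Error(w, "name already in use", http.StatusConflict)
			return
		}
	}
	s.names[id] = name
	w.WriteHeader(http.StatusNoContent)
}

// patch sends one rename request through the handler and returns the status.
func patch(s *store, id, name string) int {
	payload, _ := json.Marshal(map[string]string{"name": name})
	req := httptest.NewRequest(http.MethodPatch, "/api/drawings/"+id, strings.NewReader(string(payload)))
	rec := httptest.NewRecorder()
	s.rename(rec, req)
	return rec.Code
}

func main() {
	s := &store{names: map[string]string{"a": "rack", "b": "garage"}}
	fmt.Println(patch(s, "a", " rack/v2 ")) // 204, stored as "rackv2"
	fmt.Println(patch(s, "b", "rackv2"))    // 409
}
```

Checking for a conflict only after sanitizing matters: two different raw inputs can collapse to the same stored name.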

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:10 +00:00
Viktor Barzin
d94f267c93 immich: upgrade v2.7.5 → v3.0.0 (postgres → vectorchord 0.4.3, frames → immich_v3 tag)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes,
migration guide and release discussion #29439 reviewed — no config-breaking
changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD
vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over
HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement).

The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream
v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2';
Immich upgrades the extension itself at startup). Both photo frames switch
to ImmichFrame's immich_v3 compatibility tag because every versioned
ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API
responses; repin to a versioned tag once upstream ships stable v3 support.

Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so
this commit is the source-of-truth record; the live rollout happens via
kubectl set image in the same session. Pre-upgrade pg_dumpall taken
(job postgresql-backup-pre-v3).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:18:22 +00:00
Viktor Barzin
6f03ccd1aa excalidraw: grant emo-browser SA port-forward for drawing uploads
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to fix emo's permission so his Claude can upload to the
Excalidraw service. emo's recent sessions show the documented upload
recipe (kubectl port-forward svc/draw + X-Authentik-Username header,
from his ~/.claude/CLAUDE.md) failing with:

  pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser
  in namespace excalidraw

because his default kubeconfig is the read-only emo-browser SA (its
port-forward grant covers only chrome-service) and his old admin
kubeconfig at /home/emo/code/config expired and was removed.

Add a namespace-scoped Role (pods/portforward create) + RoleBinding for
that SA in the excalidraw namespace, mirroring the 2026-06-28
chrome-service grant. Trade-off (any-user drawings via the trusted
username header) documented in the file and accepted.

Also record the grant in docs/architecture/chrome-service.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:08:28 +00:00
78 changed files with 9165 additions and 2816 deletions

File diff suppressed because one or more lines are too long

View file

@ -81,7 +81,7 @@
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server (may be merged into ebooks stack) | audiobookshelf |
-| paperless-ngx | Document management | paperless-ngx |
+| paperless-ngx | Document management. Mail ingest: forward document emails to `docs@viktorbarzin.me` — sender maps 1:1 to a paperless account (runbook `paperless-mail-ingest.md`) | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| aiostreams | Stremio stream aggregator (Real-Debrid + Torrentio/Comet/StremThru Torz/Knaben; **MediaFusion removed 2026-06-07** — broken upstream `500`). `auth=app` (own UUID+password); stream-probe tests **both series+movie paths** with per-source breakdown (`aiostreams_streams_{comet,torrentio,stremthru_torz,knaben}`) + `aiostreams_error_streams` + `aiostreams_movie_stream_count`, success gated on Comet (workhorse) being alive; weekly NFS config + Stremio-account-collection backups to `/srv/nfs/aiostreams-backup/`. PG-backed user config (Comet timeout bumped 5s→10s 2026-06-07). | servarr/aiostreams |
@ -99,6 +99,7 @@
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge. Open native self-signup (Turnstile captcha + email confirm) + Authentik & GitHub OAuth sign-in; see `docs/runbooks/forgejo-open-signups.md` | forgejo |
| freshrss | RSS reader | freshrss |
| drone-logbook | DJI flight-log analyzer (Open DroneLog, upstream image) — dronelog.viktorbarzin.me | drone-logbook |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
@ -120,7 +121,9 @@
| status-page | Status page | status-page |
| plotting-book | Book plotting/world-building app | plotting-book |
| tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). `auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit |
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su |
| tasks | Reminders-style tasks PWA over Nextcloud CalDAV (FastAPI + SvelteKit SPA same-origin, single container; code `~/code/tasks`, design `tasks/docs/2026-07-03-tasks-pwa-design.md`). Nextcloud stays the source of truth (VTODOs); the app is the front-end Apple Reminders stopped being. CNPG (`tasks` db, Vault static role `pg-tasks`) stores Connected Accounts — per-user Nextcloud app passwords Fernet-encrypted with `fernet_key` from `secret/tasks`. `auth=required` (Authentik forward-auth; identity = `X-authentik-username`, NO app-level login — `DEV_USER` must never be set in prod) at tasks.viktorbarzin.me (proxied). Exception: the five PWA icon/manifest files (`/apple-touch-icon.png`, `/favicon.png`, `/pwa-192x192.png`, `/pwa-512x512.png`, `/manifest.webmanifest`) are a path-scoped `auth=none` carve-out (`module.ingress_icons`) so cookie-less OS icon fetchers (macOS Safari Add-to-Dock, mobile home-screen installs) get the real icon instead of the Authentik 302; guarded by the `tasks-icons` walloff-probe target. NetworkPolicy `tasks-ingress` (SEC-1) restricts pod ingress to traefik + monitoring namespaces so the trusted header can't be spoofed pod-to-pod. GHA → public ghcr `tasks` → Woodpecker deploy (ADR-0002). | tasks |
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me — **a Valia site on Cloudflare Pages since 2026-07-03** (ADR-0018): registry entry in `stacks/valia-sites`, synced from Drive folder "claude" every 10 min, deploy-on-change. The old in-cluster stack (nginx off PVE NFS + per-site rclone CronJob) is RETIRED — stacks/stem95su is a tombstone; `secret/stem95su` superseded by `secret/valia-sites`; `stem_video.mp4` was compressed 42.9→21.4MB (25MB Pages cap) with Viktor's OK. See docs/runbooks/valia-sites.md. | — |
| valia-sites | **Valia-site registry + sync** (ADR-0018): all sites authored by Valia serve OFF-INFRA on Cloudflare Pages (`bridge` + `stem95su` live). One map entry in `stacks/valia-sites/main.tf` per site fans out Pages project + custom domain + public CNAME + internal split-horizon CNAME (ConfigMap `valia-sites-dns` → technitium sync, declarative incl. removal). CronJob `valia-sites-sync` (`*/10`, image ghcr `valia-sites-sync`) mirrors each Drive Content folder (rclone `drive.readonly`, stem95su-style guards + 25MB Pages-cap guard) and wrangler-deploys ONLY on manifest change (free-tier deploy cap). Secrets `secret/valia-sites` (shared rclone conf + SCOPED CF Pages token — Global API Key never in pods). Failed-Job-only visibility by choice. Runbook: docs/runbooks/valia-sites.md. | valia-sites |
| trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek |
## Cloudflare Domains
@ -130,7 +133,7 @@
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
-travel, netbox, phpipam, tripit, t3, stem95su
+travel, netbox, phpipam, tripit, t3, stem95su, tasks
```
### Non-Proxied (Direct DNS)

42
.github/workflows/build-excalidraw.yml vendored Normal file
View file

@ -0,0 +1,42 @@
name: Build excalidraw-library
# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind
# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls
# ghcr:latest and rolls the deployment. Replaces the manual DockerHub pushes
# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image).
on:
  push:
    branches: [master]
    paths:
      - 'stacks/excalidraw/project/**'
  workflow_dispatch: {}

permissions:
  contents: read
  packages: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.21'
      - run: go test ./...
        working-directory: stacks/excalidraw/project
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: stacks/excalidraw/project
          platforms: linux/amd64
          provenance: false
          push: true
          tags: |
            ghcr.io/viktorbarzin/excalidraw-library:latest
            ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }}

View file

@ -0,0 +1,39 @@
name: Build valia-sites-sync
# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public).
# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob.
# Rebuilds are rare (tool pins only change deliberately) → dispatch + path.
# Security note: no untrusted event inputs are interpolated anywhere (only
# github.actor / github.sha / GITHUB_TOKEN — same shape as the other
# build-*.yml workflows in this repo).
on:
  push:
    branches: [master]
    paths:
      - 'stacks/valia-sites/sync-image/**'
  workflow_dispatch: {}

permissions:
  contents: read
  packages: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: stacks/valia-sites/sync-image
          platforms: linux/amd64
          provenance: false
          push: true
          tags: |
            ghcr.io/viktorbarzin/valia-sites-sync:latest
            ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }}

View file

@ -95,7 +95,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
## Key Paths
- `stacks/<service>/main.tf` — service definition
- `stacks/platform/modules/<service>/` — core infra modules
-- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"` or `"non-proxied"`)
+- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"`, `"non-proxied"`, or `"internal"` — a public A record carrying the internal Traefik LB IP for household-only services; pair with the `home-lans-only` ipAllowList middleware, never with `"proxied"`)
- `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount)
- `config.tfvars` — non-secret configuration (plaintext)
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)

View file

@ -118,6 +118,14 @@ _Avoid_: "external", "outside".
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
_Avoid_: bare "lan", "private", "intranet".
**Segment**:
One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q.
_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept).
**CCTV segment**:
The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017).
_Avoid_: "camera VLAN", "CCTV LAN".
**Ingress auth**:
The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
_Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
@ -229,6 +237,20 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**.
**Anubis**:
A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
### Externally-authored sites
**Valia site**:
A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `<name>.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`.
_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**.
**Content folder**:
The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site.
_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root).
**Entry file**:
The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring.
_Avoid_: asking Valia to rename her files to fit hosting conventions.
## Relationships
- A **Service** is defined by exactly one **Stack** (**flat** or wrapping a **Stack-local module**), which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
@ -240,6 +262,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
- Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra.
## Example dialogue

View file

@ -1 +1 @@
-v0.11.0
+v0.12.0

View file

@ -30,11 +30,21 @@ func memoryCommands() []Command {
 	}
 }

-// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
+// printMemories renders a {memories:[…]} response as one line per memory, or raw JSON.
 func printMemories(raw []byte, jsonOut bool) error {
+	fmt.Print(renderMemories(raw, jsonOut))
+	return nil
+}
+
+// renderMemories formats each memory as a single line with its FULL content
+// (newlines flattened to spaces). Content is deliberately never truncated: the
+// old 240-rune preview cut memories mid-sentence, misled agents into believing
+// no full-content read-back existed, and made blind `update --content` from
+// the preview silently destroy the stored tail. Full passthrough also can't
+// produce invalid UTF-8 (the old mid-rune cut crashed the recall hook).
+func renderMemories(raw []byte, jsonOut bool) string {
 	if jsonOut {
-		fmt.Println(string(raw))
-		return nil
+		return string(raw) + "\n"
 	}
 	var r struct {
 		Memories []struct {
@ -46,36 +56,20 @@ func printMemories(raw []byte, jsonOut bool) error {
 		} `json:"memories"`
 	}
 	if err := json.Unmarshal(raw, &r); err != nil {
-		fmt.Println(string(raw))
-		return nil
+		return string(raw) + "\n"
 	}
 	if len(r.Memories) == 0 {
-		fmt.Println("(no memories)")
-		return nil
+		return "(no memories)\n"
 	}
+	var b strings.Builder
 	for _, m := range r.Memories {
-		c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
-		fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
+		c := strings.ReplaceAll(m.Content, "\n", " ")
+		fmt.Fprintf(&b, "#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
 		if m.Tags != "" {
-			fmt.Printf(" tags: %s\n", m.Tags)
+			fmt.Fprintf(&b, " tags: %s\n", m.Tags)
 		}
 	}
-	return nil
-}
-
-// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
-// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
-// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
-// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
-// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
-// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
-// hook error" for Cyrillic-language users.
-func truncatePreview(s string, maxRunes int) string {
-	r := []rune(s)
-	if len(r) <= maxRunes {
-		return s
-	}
-	return string(r[:maxRunes]) + "…"
+	return b.String()
 }

 func memoryRecall(args []string) error {

View file

@ -8,25 +8,53 @@ import (
"unicode/utf8"
)
func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
// Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits
// invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must
// cut on a rune boundary and always stay valid UTF-8.
long := strings.Repeat("я", 300) // 300 runes / 600 bytes
got := truncatePreview(long, 240)
func TestRenderMemoriesFullContent(t *testing.T) {
// The pretty view must NOT truncate content: the old 240-rune preview cut
// memories mid-sentence, misled agents into thinking no full-content
// read-back existed, and made blind `update --content` from the preview
// destroy the stored tail. Full passthrough also removes the mid-rune-cut
// invalid-UTF-8 class by construction — nothing is ever sliced.
long := strings.Repeat("я", 300) + strings.Repeat("a", 300)
raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
{"id": 7, "content": long, "category": "facts", "tags": "t1,t2", "importance": 0.7},
}})
got := renderMemories(raw, false)
if !strings.Contains(got, long) {
t.Fatalf("content was truncated: %q", got)
}
if strings.Contains(got, "…") {
t.Fatalf("ellipsis in output — truncation still active: %q", got)
}
if !utf8.ValidString(got) {
t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
t.Fatalf("invalid UTF-8 in output: %q", got)
}
if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
if !strings.Contains(got, "#7 [facts] (0.70) ") || !strings.Contains(got, "tags: t1,t2") {
t.Fatalf("line format broken: %q", got)
}
// Short multibyte strings pass through untouched (no ellipsis).
if got := truncatePreview("кратко", 240); got != "кратко" {
t.Fatalf("short string altered: %q", got)
}
func TestRenderMemoriesFlattensNewlinesToOneLine(t *testing.T) {
// Consumers (the recall hook, terminal skims) rely on one memory per line;
// multi-line content is flattened, never split across lines.
raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
{"id": 1, "content": "line one\nline two\nline three", "category": "facts", "importance": 0.5},
}})
got := renderMemories(raw, false)
if !strings.Contains(got, "line one line two line three") {
t.Fatalf("newlines not flattened: %q", got)
}
// ASCII boundary still works.
if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
t.Fatalf("ascii truncation wrong: %q", got)
}
func TestRenderMemoriesEdgeCases(t *testing.T) {
if got := renderMemories([]byte(`{"memories":[]}`), false); got != "(no memories)\n" {
t.Fatalf("empty list: %q", got)
}
// --json and unparseable responses pass through raw.
if got := renderMemories([]byte(`{"x":1}`), true); got != "{\"x\":1}\n" {
t.Fatalf("json passthrough: %q", got)
}
if got := renderMemories([]byte(`not json`), false); got != "not json\n" {
t.Fatalf("unparseable passthrough: %q", got)
}
}

Binary file not shown.


@ -0,0 +1,126 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="820" viewBox="0 0 1600 820" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
<!-- ADR-0017: PHYSICAL cabling only — no VLANs, no flows. Solid = cable in
place today · dashed = camera-day work · ~~~ = radio. Palette: neutral
grays + blue for copper runs (reference dataviz palette text tokens). -->
<defs>
<marker id="dot" viewBox="0 0 8 8" refX="4" refY="4" markerWidth="5" markerHeight="5">
<circle cx="4" cy="4" r="3" fill="#52514e"/>
</marker>
</defs>
<rect width="1600" height="820" fill="#fcfcfb"/>
<text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — physical cabling (single-switch, rev 3)</text>
<text x="40" y="66" font-size="15" fill="#52514e">wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio</text>
<!-- ═════════ APARTMENT ═════════ -->
<rect x="40" y="100" width="330" height="330" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="56" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">APARTMENT</text>
<text x="70" y="158" font-size="13" fill="#52514e">☁ ISP (internet)</text>
<path d="M120,166 L120,196" fill="none" stroke="#52514e" stroke-width="2"/>
<rect x="64" y="198" width="220" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="80" y="222" font-size="14.5" font-weight="700" fill="#0b0b0b">AX6000 router</text>
<text x="80" y="242" font-size="12" fill="#52514e">192.168.1.1 · WAN←ISP · 8×LAN</text>
<rect x="64" y="290" width="220" height="52" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="80" y="312" font-size="14" font-weight="700" fill="#0b0b0b">Synology NAS · .13</text>
<text x="80" y="330" font-size="12" fill="#52514e">on an AX6000 LAN port</text>
<path d="M174,262 L174,290" fill="none" stroke="#2a78d6" stroke-width="2"/>
<text x="70" y="376" font-size="12.5" fill="#52514e">📶 wifi clients (phones, laptops)</text>
<path d="M110,262 C104,272 106,278 100,286 C106,294 104,300 100,308 C106,316 104,322 100,330 C106,338 104,344 100,352 C104,358 102,362 98,366" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<!-- in-wall run apartment -> garage -->
<path d="M284,230 C450,230 540,228 616,228" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<text x="330" y="218" font-size="12.5" font-weight="700" fill="#2a78d6">in-wall run → garage</text>
<!-- ═════════ GARAGE — RACK ═════════ -->
<rect x="560" y="100" width="640" height="680" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="576" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE — RACK</text>
<!-- switch -->
<rect x="600" y="150" width="560" height="150" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
<text x="616" y="176" font-size="14.5" font-weight="700" fill="#0b0b0b">TL-SG105PE · 5-port gigabit PoE switch</text>
<text x="616" y="194" font-size="12" fill="#52514e">mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare)</text>
<g font-size="11.5" text-anchor="middle">
<rect x="616" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="664" y="227" font-weight="700" fill="#0b0b0b">P1</text>
<text x="664" y="242" fill="#52514e">← apartment</text>
<rect x="722" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="770" y="227" font-weight="700" fill="#0b0b0b">P2</text>
<text x="770" y="242" fill="#52514e">← 4G router</text>
<rect x="828" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="876" y="227" font-weight="700" fill="#0b0b0b">P3</text>
<text x="876" y="242" fill="#52514e">← UPS mgmt</text>
<rect x="934" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="982" y="227" font-weight="700" fill="#0b0b0b">P4 ⚡PoE</text>
<text x="982" y="242" fill="#52514e">← camera</text>
<rect x="1040" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="1088" y="227" font-weight="700" fill="#0b0b0b">P5</text>
<text x="1088" y="242" fill="#52514e">← R730 eno1</text>
</g>
<text x="616" y="284" font-size="12" fill="#52514e">every cable below re-plugs old-switch → PE on camera day (≈3 min)</text>
<!-- 4G router -->
<rect x="600" y="360" width="250" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="616" y="384" font-size="14" font-weight="700" fill="#0b0b0b">4G router · 192.168.1.7</text>
<text x="616" y="403" font-size="12" fill="#52514e">~cellular uplink (out-of-band)</text>
<path d="M770,300 L770,360" fill="none" stroke="#2a78d6" stroke-width="2"/>
<path d="M856,392 C866,386 864,380 874,376 C866,370 868,364 876,360" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<text x="884" y="380" font-size="12" fill="#52514e">📡 cellular</text>
<!-- UPS -->
<rect x="600" y="452" width="250" height="56" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="616" y="476" font-size="14" font-weight="700" fill="#0b0b0b">UPS (Huawei)</text>
<text x="616" y="494" font-size="12" fill="#52514e">network mgmt card</text>
<path d="M876,300 C876,340 800,410 720,452" fill="none" stroke="#2a78d6" stroke-width="2"/>
<!-- R730 -->
<rect x="600" y="540" width="560" height="220" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
<text x="616" y="566" font-size="14.5" font-weight="700" fill="#0b0b0b">Dell R730 · PVE host · 192.168.1.127</text>
<g font-size="11.5">
<rect x="616" y="582" width="128" height="38" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="628" y="598" font-weight="700" fill="#0b0b0b">eno1 · LAN1</text>
<text x="628" y="613" fill="#52514e">← switch P5 · 1GbE</text>
<rect x="756" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="768" y="598" font-weight="700" fill="#52514e">eno2 · LAN2</text>
<text x="768" y="613" fill="#8a8984">dark · fallback leg</text>
<rect x="896" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
<text x="908" y="598" fill="#8a8984">eno3 / eno4</text>
<text x="908" y="613" fill="#8a8984">free, uncabled</text>
<rect x="1036" y="582" width="108" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
<text x="1048" y="598" fill="#8a8984">iDRAC · .4</text>
<text x="1048" y="613" fill="#8a8984">shared-LOM/eno1</text>
</g>
<text x="616" y="648" font-size="12" fill="#52514e">no other network cables — everything else on this host is VIRTUAL:</text>
<text x="616" y="668" font-size="12" fill="#52514e">pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM …</text>
<text x="616" y="696" font-size="12" fill="#8a8984">(power: host + switch fed from the UPS — power wiring not drawn)</text>
<path d="M1088,300 C1088,420 720,500 680,582" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<text x="1100" y="330" font-size="12.5" font-weight="700" fill="#2a78d6">LAN1 cable</text>
<!-- ═════════ GARAGE ENTRANCE ═════════ -->
<rect x="1280" y="100" width="280" height="200" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="1296" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
<rect x="1304" y="150" width="232" height="110" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="1320" y="176" font-size="14" font-weight="700" fill="#0b0b0b">vermont-garage camera</text>
<text x="1320" y="196" font-size="12" fill="#52514e">HiLook IPC-T241H-C · 10.0.30.70</text>
<text x="1320" y="214" font-size="12" fill="#52514e">powered over the data cable (PoE)</text>
<text x="1320" y="232" font-size="12" fill="#52514e">outdoor · armored conduit</text>
<path d="M982,210 C982,150 1140,140 1304,180" fill="none" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
<text x="1080" y="136" font-size="12.5" font-weight="700" fill="#52514e">single cat6 in conduit · data + PoE power (camera day)</text>
<!-- legend -->
<g transform="translate(40,780)" font-size="12.5">
<line x1="0" y1="-4" x2="44" y2="-4" stroke="#2a78d6" stroke-width="2.5"/>
<text x="52" y="0" fill="#0b0b0b">copper, in place</text>
<line x1="190" y1="-4" x2="234" y2="-4" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
<text x="242" y="0" fill="#0b0b0b">camera-day cable / dark port</text>
<path d="M450,-4 C456,-10 454,-14 460,-18" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<text x="470" y="0" fill="#0b0b0b">radio (wifi / cellular)</text>
<text x="650" y="0" fill="#52514e">total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3</text>
</g>
</svg>


@ -0,0 +1,99 @@
# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable
Status: accepted (2026-07-02, rev 3 — single-switch)
![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg)
![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg)
The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook
IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is
physically exposed outside the apartment, so anything plugged into that cable
must land in a segment that can reach nothing. The original design doc
(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk
to pfSense" — but nothing in this network terminates dot1q on pfSense; the
site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean
untagged pfSense interface per segment.
**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old
garage TL-SG105E (Viktor prefers not running two switches; retired unit
becomes a cold spare, its 192.168.1.6 mgmt IP passes to the PE). Five ports,
all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged
VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1`
carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable.
pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site
idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged
vNIC; pfSense still terminates no dot1q itself). The earlier dedicated
`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving
net3 back to vmbr2 restores pure physical isolation in one `qm set`).
This narrows the earlier 802.1Q objection rather than contradicting it: the
rejection assumed *unmanaged* switches, where any LAN device could inject
tagged frames; with the managed PE as the only device on eno1, VLAN-30
membership is {camera port, trunk port} only, so tag-30 ingress from every
other port — and from the exposed camera cable — is dropped or contained.
Cameras are untrusted: default-deny on dCCTV with a single
NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8)
may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static
route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the
10.0.20.0/22 trusted source-IP allowlist.
## Traffic on the trunk — how one cable carries two networks
The LAN1 cable is shared, but the two networks on it diverge at `vmbr0`
(the vlan-aware bridge on the PVE host), and only ONE of them ever touches
pfSense:
- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it
between the trunk, the host's own IP (192.168.1.127), and pfSense `net0`,
where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home
LAN's gateway is and remains the AX6000; home-LAN traffic never transits
pfSense. Consequently a pfSense (or R730 VM-level) outage does not affect
the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave
the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the
4G router survives the whole rack being down.
- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers
VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera
segment's gateway, firewall and sole exit. "Camera → AX6000 → internet"
is impossible by construction, not merely by firewall rule.
- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed
out of its WAN toward the AX6000. Load-wise the trunk gained only the
camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic.
![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg)
*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)*
## Considered options
- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan
read this way) — rejected: any LAN device could inject tagged frames into
vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is
undefined. Rev 3 adopts the tagged path ONLY because the managed PE now
polices VLAN-30 membership at the single entry point to eno1; no bridge
reconfiguration was needed (vmbr0 was already vlan-aware).
- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role**
(rev 1/2 as-built) — superseded by rev 3: it forced either a second switch
(6 connections vs 5 ports once the PE also replaced the old switch) or new
hardware. Strongest isolation of all options; kept dormant as the fallback.
- **AX6000 as the camera gateway** — rejected earlier in the design (consumer
router, no inter-VLAN firewall).
## Consequences
- The switch is now a single point of failure and load-bearing for everything
in the rack (apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV),
and its VLAN table + mgmt password are part of the isolation boundary: the
Easy Smart mgmt UI answers on every port, so the password is the gate between
a compromised camera and the switch config. All 5 ports are consumed: the
next camera forces an 8-port PoE upgrade (the wiring plan already fits it).
- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical
leg); eno3/eno4 remain free.
- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6
(Kea reservation by MAC).
- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a
port-VLAN split (conflated the two devices); rev 2 split into two switches
after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3
consolidated back to one switch — the PE replacing the SG105E — per
Viktor's preference, moving CCTV onto a managed tagged trunk.
- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra
NVDEC stream.


@ -0,0 +1,178 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="880" viewBox="0 0 1600 880" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
<!-- ADR-0017 rev 3 dCCTV topology (single switch, VLAN-30 trunk on LAN1).
Colors: reference dataviz palette (light mode). blue #2a78d6 = home LAN ·
violet #4a3aa7 = dCCTV · aqua #1baf7a = dKubernetes ·
yellow #eda100 = dManagementsVms · green #008300 allow · red #e34948 deny -->
<defs>
<marker id="arrGreen" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#008300"/>
</marker>
<marker id="arrRed" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#e34948"/>
</marker>
<marker id="arrGray" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#52514e"/>
</marker>
</defs>
<rect width="1600" height="880" fill="#fcfcfb"/>
<text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable</text>
<text x="40" y="66" font-size="15" fill="#52514e">Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1</text>
<!-- camera -> everything else (denied) -->
<path d="M240,168 C520,104 900,104 1148,140" fill="none" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
<g transform="translate(560,111)">
<circle r="11" fill="#fcfcfb" stroke="#e34948" stroke-width="2.5"/>
<path d="M-5,-5 L5,5 M5,-5 L-5,5" stroke="#e34948" stroke-width="2.5"/>
</g>
<text x="588" y="100" font-size="13.5" font-weight="700" fill="#e34948">DENY · camera → LAN / other segments / internet (default deny on dCCTV)</text>
<!-- GARAGE ENTRANCE -->
<rect x="40" y="128" width="240" height="180" rx="10" fill="#4a3aa7" fill-opacity="0.06" stroke="#4a3aa7" stroke-opacity="0.35"/>
<text x="56" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
<rect x="64" y="170" width="192" height="112" rx="8" fill="#ffffff" stroke="#4a3aa7" stroke-width="2"/>
<text x="80" y="196" font-size="15" font-weight="700" fill="#0b0b0b">vermont-garage</text>
<text x="80" y="216" font-size="12.5" fill="#52514e">HiLook IPC-T241H-C · pure IR</text>
<text x="80" y="234" font-size="12.5" fill="#52514e">10.0.30.70 (Kea reservation)</text>
<text x="80" y="252" font-size="12.5" fill="#52514e">DNS: garage-cam.viktorbarzin.lan</text>
<text x="80" y="270" font-size="12.5" fill="#52514e">PoE from switch · cloud/P2P off</text>
<path d="M256,284 C330,330 412,368 417,430" fill="none" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5" marker-end="url(#arrGray)"/>
<text x="330" y="322" font-size="12" fill="#52514e">cat6 in conduit · PoE → P4</text>
<!-- RACK zone: single switch -->
<rect x="40" y="360" width="560" height="265" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="56" y="384" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">RACK — GARAGE · ONE SWITCH</text>
<rect x="64" y="396" width="512" height="176" rx="8" fill="#4a3aa7" fill-opacity="0.04" stroke="#4a3aa7" stroke-width="2"/>
<text x="80" y="420" font-size="15" font-weight="700" fill="#0b0b0b">TL-SG105PE <tspan font-size="12.5" font-weight="400" fill="#52514e">replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used</tspan></text>
<g font-size="11.5" text-anchor="middle">
<rect x="80" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="124" y="454" font-weight="700" fill="#0b0b0b">P1 · V1</text>
<text x="124" y="470" fill="#52514e">apartment</text>
<text x="124" y="484" fill="#52514e">uplink</text>
<rect x="178" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="222" y="454" font-weight="700" fill="#0b0b0b">P2 · V1</text>
<text x="222" y="470" fill="#52514e">4G router</text>
<text x="222" y="484" fill="#52514e">192.168.1.7</text>
<rect x="276" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="320" y="454" font-weight="700" fill="#0b0b0b">P3 · V1</text>
<text x="320" y="470" fill="#52514e">UPS mgmt</text>
<rect x="374" y="436" width="88" height="56" rx="6" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="418" y="454" font-weight="700" fill="#0b0b0b">P4 · V30</text>
<text x="418" y="470" fill="#52514e">camera</text>
<text x="418" y="484" fill="#52514e">PoE ON</text>
<rect x="472" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.10" stroke="#4a3aa7" stroke-width="2" stroke-dasharray="0"/>
<text x="516" y="454" font-weight="700" fill="#0b0b0b">P5 · trunk</text>
<text x="516" y="470" fill="#52514e">V1 untagged</text>
<text x="516" y="484" fill="#4a3aa7">+ V30 tagged</text>
</g>
<text x="80" y="516" font-size="12" fill="#52514e">802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged}</text>
<text x="80" y="534" font-size="12" fill="#52514e">tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path</text>
<text x="80" y="558" font-size="12" fill="#8a8984">old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports</text>
<!-- trunk: two parallel lines to eno1 -->
<path d="M560,458 C630,458 640,428 692,420" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<path d="M560,466 C632,466 644,436 692,428" fill="none" stroke="#4a3aa7" stroke-width="2.5"/>
<text x="588" y="404" font-size="12" font-weight="700" fill="#0b0b0b">LAN1 cable</text>
<!-- R730 / PVE zone -->
<rect x="680" y="330" width="880" height="440" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="696" y="356" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK)</text>
<g font-size="12">
<rect x="700" y="400" width="150" height="46" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="712" y="419" font-weight="700" fill="#0b0b0b">eno1 → vmbr0</text>
<text x="712" y="436" fill="#52514e">untag V1 + tag 30</text>
<rect x="700" y="471" width="150" height="46" rx="6" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="712" y="490" font-weight="700" fill="#52514e">eno2 → vmbr2</text>
<text x="712" y="507" fill="#8a8984">dormant fallback leg</text>
<rect x="700" y="542" width="150" height="46" rx="6" fill="#0b0b0b" fill-opacity="0.04" stroke="#8a8984"/>
<text x="712" y="561" font-weight="700" fill="#0b0b0b">vmbr1</text>
<text x="712" y="578" fill="#52514e">internal · tags 10/20</text>
</g>
<!-- pfSense VM -->
<rect x="890" y="388" width="300" height="230" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="906" y="414" font-size="15" font-weight="700" fill="#0b0b0b">pfSense (VM 101)</text>
<text x="906" y="432" font-size="12" fill="#52514e">gateway + firewall for every segment</text>
<g font-size="12">
<rect x="906" y="444" width="268" height="34" rx="5" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="916" y="465" fill="#0b0b0b">net0 · WAN <tspan fill="#52514e">192.168.1.2 · vmbr0 untagged</tspan></text>
<rect x="906" y="484" width="268" height="34" rx="5" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
<text x="916" y="505" fill="#0b0b0b">net1 · dManagementsVms <tspan fill="#52514e">10.0.10.1</tspan></text>
<rect x="906" y="524" width="268" height="34" rx="5" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
<text x="916" y="545" fill="#0b0b0b">net2 · dKubernetes <tspan fill="#52514e">10.0.20.1</tspan></text>
<rect x="906" y="564" width="268" height="34" rx="5" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="916" y="585" fill="#0b0b0b">net3 · dCCTV <tspan fill="#52514e">10.0.30.1/24 · vmbr0 tag 30</tspan></text>
</g>
<path d="M850,415 L890,458" fill="none" stroke="#2a78d6" stroke-width="1.6" opacity="0.6"/>
<path d="M850,430 L890,581" fill="none" stroke="#4a3aa7" stroke-width="2"/>
<path d="M850,565 L890,501" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
<path d="M850,565 L890,541" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
<!-- k8s VMs -->
<rect x="1240" y="388" width="290" height="230" rx="8" fill="#1baf7a" fill-opacity="0.07" stroke="#1baf7a"/>
<text x="1256" y="414" font-size="15" font-weight="700" fill="#0b0b0b">k8s VMs · 10.0.20.0/24</text>
<text x="1256" y="434" font-size="12.5" fill="#52514e">vmbr1 tag 20 · pod egress SNATs</text>
<text x="1256" y="450" font-size="12.5" fill="#52514e">to node IPs</text>
<rect x="1256" y="464" width="258" height="66" rx="6" fill="#ffffff" stroke="#1baf7a"/>
<text x="1268" y="486" font-size="13.5" font-weight="700" fill="#0b0b0b">Frigate · k8s-node1 (T4)</text>
<text x="1268" y="504" font-size="12" fill="#52514e">detect sub / record main</text>
<text x="1268" y="520" font-size="12" fill="#52514e">gpumem budget 2300 MiB</text>
<rect x="1256" y="540" width="258" height="52" rx="6" fill="#ffffff" stroke="#1baf7a"/>
<text x="1268" y="562" font-size="13.5" font-weight="700" fill="#0b0b0b">go2rtc LB 10.0.20.204</text>
<text x="1268" y="580" font-size="12" fill="#52514e">restream → HA live view (MSE/HLS)</text>
<!-- HOME LAN zone -->
<rect x="1148" y="128" width="412" height="180" rx="10" fill="#2a78d6" fill-opacity="0.06" stroke="#2a78d6" stroke-opacity="0.4"/>
<text x="1164" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">HOME LAN 192.168.1.0/24</text>
<rect x="1164" y="168" width="180" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1176" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">AX6000 · .1</text>
<text x="1176" y="208" font-size="11.5" fill="#52514e">+ route 10.0.30.0/24 → .2</text>
<rect x="1164" y="236" width="180" height="52" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1176" y="258" font-size="13.5" font-weight="700" fill="#0b0b0b">ha-sofia · .8</text>
<text x="1176" y="275" font-size="11.5" fill="#52514e">Frigate card + hikvision_next</text>
<rect x="1360" y="168" width="184" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1372" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">apartment clients</text>
<text x="1372" y="208" font-size="11.5" fill="#52514e">laptops, phones</text>
<rect x="1360" y="236" width="184" height="52" rx="6" fill="#ffffff" stroke="#52514e" stroke-dasharray="5,4"/>
<text x="1372" y="256" font-size="11.5" font-weight="700" fill="#52514e">CAMERA DAY: static route</text>
<text x="1372" y="272" font-size="11.5" fill="#52514e">10.0.30.0/24 via 192.168.1.2</text>
<path d="M1254,308 C1150,352 950,372 790,400" fill="none" stroke="#2a78d6" stroke-width="2" opacity="0.6"/>
<text x="1010" y="374" font-size="12" fill="#2a78d6">apartment uplink · switch P1 · trunk · eno1</text>
<!-- FLOWS -->
<path d="M1256,497 C1010,690 330,730 120,650 C40,618 40,380 96,286" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="620" y="700" font-size="13.5" font-weight="700" fill="#008300">ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all)</text>
<path d="M1164,262 C820,282 470,268 302,176 C286,167 278,166 270,172" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="484" y="216" font-size="13.5" font-weight="700" fill="#008300">ALLOW · ha-sofia → camera :80 ISAPI + :554</text>
<text x="484" y="234" font-size="12" fill="#52514e">enters pfSense WAN · reply-to off · needs the AX6000 route</text>
<path d="M280,232 C660,200 860,320 936,386" fill="none" stroke="#008300" stroke-width="2" opacity="0.85" marker-end="url(#arrGreen)"/>
<text x="740" y="322" font-size="12.5" font-weight="700" fill="#008300">ALLOW · camera → 10.0.30.1:123 (NTP)</text>
<!-- LEGEND -->
<g transform="translate(40,800)" font-size="12.5">
<rect x="0" y="0" width="18" height="18" rx="4" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="26" y="14" fill="#0b0b0b">home LAN / VLAN 1</text>
<rect x="200" y="0" width="18" height="18" rx="4" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="226" y="14" fill="#0b0b0b">CCTV / VLAN 30 / dCCTV 10.0.30.0/24</text>
<rect x="500" y="0" width="18" height="18" rx="4" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
<text x="526" y="14" fill="#0b0b0b">dKubernetes</text>
<rect x="640" y="0" width="18" height="18" rx="4" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
<text x="666" y="14" fill="#0b0b0b">dManagementsVms</text>
<line x1="820" y1="9" x2="860" y2="9" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="870" y="14" fill="#0b0b0b">allowed flow</text>
<line x1="980" y1="9" x2="1020" y2="9" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
<text x="1030" y="14" fill="#0b0b0b">denied</text>
<line x1="1100" y1="9" x2="1140" y2="9" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5"/>
<text x="1150" y="14" fill="#0b0b0b">camera-day step</text>
<text x="1320" y="14" fill="#52514e">ADR-0017 · rev 3</text>
</g>
</svg>


@ -0,0 +1,47 @@
# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster
Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she
shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`)
and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare
Pages** under `<english-name>.viktorbarzin.me`, kept fresh by **one shared in-cluster
CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes
(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The
existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync)
migrates onto this and is retired.
Why off-infra serving: these are her sites, shown to teachers/parents — they must survive
homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster
site down). With Pages, a homelab outage degrades to "content frozen until we're back",
never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/
Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA
secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never
wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The
deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an
accident.
## Considered options
- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no
Cloudflare Pages dependency — but her sites share the homelab's fate and each site
spends cluster resources to serve static files a free CDN serves better.
- **Pages for new sites only**: less work now, two patterns and two runbooks forever.
- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but
Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault.
## Consequences
- Registration is one entry in the `sites` map (name, Content folder, optional Entry
file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config
together. Names are English, picked by Viktor (`most` → `bridge` set the precedent).
- The internal split-horizon zone learns Valia sites from a ConfigMap the
`technitium-ingress-dns-sync` script consumes — declaratively, including **removal**
(the previous static-CNAME approach was add-only; a retired site left a stale record).
- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on
the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs
deployed.
- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no
per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't
update" reports, consistent with the alert-noise-reduction posture. Revisit if a
silent stall actually bites.
- If the homelab is down, content updates pause; the sites keep serving last-deployed
content. Accepted degradation.


@ -0,0 +1,97 @@
# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
inbound overhaul, with sender-MTA retry (15 days, sender-dependent) as the only
outage protection — a documented "No Backup MX" decision made after ForwardEmail's
forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
Routing proved pass-through-only. Viktor now wants inbound mail to survive
homelab outages **without loss** (2026-07-04): delayed delivery is fine,
mid-outage reading is not required, and the budget is **$0** — a hard
constraint that eliminated every managed option (see below).
We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
public IP, MX preference 20; primary untouched at 1). It accepts everything
for the domain (catch-all — every RCPT is valid; reputation may only ever
4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
deliver a DSN, its only egress is the drain), and drains to the primary over
**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
mid-outage break-glass since headscale itself lives in the cluster); TLS via
certbot HTTP-01 (port 80 permanently open — LE validation is
multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
On the primary, the drain stream (one /32) is enabled at the layers that
actually bite — `check_client_access` permits past
`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
exception, and rspamd `external_relay` (score against the *original* sender
IP) with the reject action capped to tag/fold so drained spam can never force
the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
reachability (recurring probe — Oracle publishes no commitment), drain
end-to-end, and a live failover test that includes a high-spam-score and a
>10 MB message. Two independent adversarial reviews (2026-07-04) shaped this
final form. Design:
[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
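
The queueing and drain behaviour described above might translate to a Postfix `main.cf` fragment along these lines. This is only a sketch: the helper name `backup_mx_main_cf` is hypothetical, draining to `mail.viktorbarzin.me` is an assumption (the text fixes only the port), and the real configuration lives in the cloud-init of `stacks/backup-mx/`, which may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a main.cf fragment for the store-and-forward relay.
#   $1 = relayed domain, $2 = primary MX host (assumed), $3 = drain port
backup_mx_main_cf() {
  local domain=$1 primary=$2 port=$3
  cat <<EOF
# Accept mail for the domain as a relay, for any recipient (catch-all).
relay_domains = ${domain}
relay_recipient_maps =
# Drain to the primary on the alternate port; brackets suppress MX lookup.
transport_maps = inline:{ ${domain}=relay:[${primary}]:${port} }
# Hold queued mail for 30 days; give up on undeliverable bounces after 1 day.
maximal_queue_lifetime = 30d
bounce_queue_lifetime = 1d
EOF
}

backup_mx_main_cf viktorbarzin.me mail.viktorbarzin.me 2526
```

The postscreen and DNSBL deferral settings are deliberately left out of the sketch, since the text does not spell out how the 4xx-only behaviour is configured.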
## Considered options
- **Roller Network free Secondary MX** — v1 of this decision, killed at the
validation gates the same day: free tier caps at 200 relayed messages or
10 MB per rolling 7 days, and overage suspends the domain for 48 h
answering **SMTP 5xx** (permanent bounces) — since spammers target backup
MXes even while the primary is up, background spam alone can hold it
suspended, making it *worse than no backup MX*. Free accounts are also
being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
the documented fallback if the OCI route sours.)
- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
12–24 h, barely beating sender retry); filtering black-box; not free.
- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
plan is a 6-month credit; Azure has no always-free VM and blocks 25;
Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
is the only standing free option.
- **Harden-only** (5xx-misconfig guards + paging) — does not address
multi-day outages or short-retry senders; deferred as a complementary
track.
## Consequences
- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
Terraform + cloud-init, patched by unattended-upgrades, scraped by the
cluster's Prometheus (exporters on the reserved public IP, allowlisted to
the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
besides). Never a backup target itself.
- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
free allowance in June 2026 and terminated over-limit instances, and
publishes no commitment that inbound 25 stays open. Mitigations:
**Pay-As-You-Go conversion is a required prerequisite** (exempts idle
reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
the queue being empty outside outages (a surprise reclamation loses
coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
once.
- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
the original IP via `external_relay`), and content scoring stay on — spam
arriving via the backup is tagged and folded to Junk, never bounced. The VM
is deliberately NOT in the primary's `mynetworks` (a compromised VM must
not relay through us).
- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
VM. Stated and accepted (6× better than the status quo).
- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
off-premises; accepted (same class as Brevo holding outbound today).
- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
host found dangling during design — inert today; must list `mx2` when
fixed) needs 1–2 more → schedule the next record purge proactively.
- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
`vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
failure semantics change (a "failing" probe may now mean "delayed via mx2,
drains shortly" — noted in alert description).


@ -329,6 +329,12 @@ Two independent grants make up "browser access" for a user:
the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
a token by deleting its `<user>-browser-token` Secret).
Because the SA is the user's DEFAULT kubectl credential, other per-namespace
port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf`
grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's
agent can upload drawings via the port-forward + `X-Authentik-Username` recipe
in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too.
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM


@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin.
| Visibility | Packages | Pull mechanism |
|------------|----------|----------------|
| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson |
Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
@ -188,6 +188,8 @@ reconciled — the workflows were added to the GitHub lineage via PR):
| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) |
| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) |
**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is


@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).
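
The suffix-scoped deletion described for Valia sites can be sketched as a set difference. This is an illustration only, not the real `technitium-ingress-dns-sync` script: the function name `stale_pages_cnames` and the two-column `name target` input format are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the stale-CNAME selection.
#   $1 = desired map, one "name target" pair per line (from the ConfigMap)
#   $2 = current internal CNAMEs, one "name target" pair per line
# Prints the names to delete: only records that point at *.pages.dev AND are
# no longer in the desired map. Anything with another target is never printed.
stale_pages_cnames() {
  awk '
    # First file: remember every desired name. FILENAME is compared instead of
    # NR==FNR so an empty desired map does not swallow the second file.
    FILENAME == ARGV[1] { want[$1] = 1; next }
    $2 ~ /\.pages\.dev\.?$/ && !($1 in want) { print $1 }
  ' "$1" "$2"
}
```

Because the target suffix is checked before the name, a record such as `mail → mail.viktorbarzin.me` cannot be selected even if the desired map is empty.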
## NodeLocal DNSCache
@ -368,6 +368,7 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
| TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
| TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
| A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
| CNAME (CF Pages) | 2 | `<project>.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` |
### Proxied vs Non-Proxied
@ -513,6 +514,7 @@ For external `.viktorbarzin.me` records:
1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)
## Incident History


@ -161,6 +161,17 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail
DB: MySQL (mysql.dbaas.svc.cluster.local)
```
### Paperless ingest mailbox (docs@)
`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in
`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that
paperless-ngx polls over IMAP; family members forward document emails to it
and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve
(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap,
mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`)
discards mail from non-allowlisted senders at delivery. Full flow, sender map,
and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md).
## DNS Records
All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`.
@ -300,6 +311,21 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External
## Troubleshooting
### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin)
Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`:
`postfix/cleanup: warning: tcp:localhost:10001 lookup error` +
`sender_canonical_maps map lookup problem ... message not accepted, try again later`.
Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`)
came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it
`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. Postfix then
tempfails every message (inbound AND submission); senders retry so nothing is
lost, and the roundtrip probe alerts within the hour.
Fix: `supervisorctl restart postsrsd` inside the container; if the fresh
process spins again (it did once), `kubectl -n mailserver delete pod` for a
full re-init — that healed it. Root cause not pinned down (one-off bad init;
postsrsd 1.10).
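
The "supervisor says RUNNING but nothing is bound" check above can be scripted. A minimal sketch, assuming the standard `ss -ltn` column layout; the helper name `ports_listening` and the `$pod` variable in the usage note are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds only if
# every port given as an argument has a LISTEN socket.
ports_listening() {
  local out port
  out=$(cat)
  for port in "$@"; do
    # Column 4 of `ss -ltn` is "Local Address:Port"; match the trailing :port.
    printf '%s\n' "$out" \
      | awk -v p="$port" '$1 == "LISTEN" && $4 ~ (":" p "$") { found = 1 }
                          END { exit !found }' \
      || return 1
  done
}
```

Usage, with `$pod` standing in for the mailserver pod name: `kubectl -n mailserver exec "$pod" -- ss -ltn | ports_listening 10001 10002 || echo "postsrsd not bound"`. A failure here together with the `queue file write error` signature points at the postsrsd spin.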
### Inbound mail not arriving
1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside


@ -1,10 +1,10 @@
# Networking Architecture
Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed)
Last updated: 2026-07-02 (dCCTV segment added — dedicated pfSense leg for the garage camera, ADR-0017)
## Overview
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
The homelab network is built on three isolated segments behind pfSense (management VLAN 10, Kubernetes VLAN 20, and the physically-legged dCCTV camera segment — see ADR-0017) with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
## Architecture Diagram
@ -24,9 +24,14 @@ graph TB
CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik]
subgraph "Proxmox Host (eno1)"
subgraph "Proxmox Host (eno1, eno2)"
vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
vmbr1[vmbr1 Internal<br/>VLAN-aware]
vmbr2[vmbr2 Bridge<br/>eno2 → TL-SG105PE]
subgraph "dCCTV - 10.0.30.0/24<br/>ADR-0017"
Camera[vermont-garage<br/>10.0.30.70]
end
subgraph "VLAN 10 - Management<br/>10.0.10.0/24"
Proxmox[Proxmox Host<br/>10.0.10.1]
@ -71,6 +76,9 @@ graph TB
vmbr1 -.VLAN 20.- Tech
vmbr1 -.VLAN 20.- Master
vmbr1 -.VLAN 20.- Node1
vmbr2 -.physical link.- eno2
vmbr2 -.untagged.- Camera
vmbr2 -.pfSense net3 = dCCTV 10.0.30.1.- pfSense
```
## Components
@ -81,6 +89,7 @@ graph TB
| phpIPAM | v1.7.0 | phpipam.viktorbarzin.me | IP address management, device inventory, DNS sync |
| vmbr0 | Linux bridge | 192.168.1.127/24 | Physical bridge on eno1, uplink to LAN |
| vmbr1 | Linux bridge (VLAN-aware) | Internal | VLAN trunk for VM isolation |
| vmbr2 | Linux bridge | Physical (eno2) | DORMANT fallback leg for dCCTV (ADR-0017 rev 3) — live dCCTV rides vmbr0 tag 30 over the LAN1 trunk |
| Technitium DNS | Container | 10.0.20.201 (LB) / 10.96.0.53 (ClusterIP) | Internal DNS (viktorbarzin.lan) + full recursive resolver |
| Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
| Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
@ -90,6 +99,22 @@ graph TB
| MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
| Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |
## CCTV Segment (dCCTV) — as-built 2026-07-02
Isolated camera segment for owned cameras at the Sofia site (first: `vermont-garage`, HiLook IPC-T241H-C at the garage entrance). Decision + rejected alternatives: `docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md`.
**Physical path (rev 3, single switch)**: camera → TL-SG105PE PoE port (untagged VLAN 30) → trunk port (home LAN untagged + CCTV **tagged 30**) → the existing LAN1 cable → R730 `eno1` → `vmbr0` (vlan-aware) → pfSense `net3`/vtnet3 = `vmbr0 tag=30` = interface **dCCTV `10.0.30.1/24`**. The TL-SG105PE **replaces** the old garage TL-SG105E (retired to cold spare) and carries everything: apartment uplink, 4G router `192.168.1.7`, UPS mgmt (VLAN 1), camera (VLAN 30), trunk (all 5 ports used). VLAN-30 membership is {camera port, trunk port} only, so tagged injection from other ports is dropped. `eno2`/`vmbr2` remain dormant as the fallback physical leg (rev 2).
**Addressing**: Kea DHCP pool `10.0.30.100-199`; devices get MAC reservations (camera `10.0.30.70`; the PE switch mgmt inherits the retired switch's `192.168.1.6` on the home LAN). Kea DDNS auto-registers names in Technitium; `phpipam-pfsense-import` picks up leases hourly.
**Firewall** (all on pfSense):
- dCCTV in: pass `udp OPT4-net → 10.0.30.1:123` (NTP) — everything else hits the interface's default deny. Cameras cannot reach LAN, other segments, or the internet.
- WAN in (home LAN side): pass `192.168.1.8` (ha-sofia) → `10.0.30.70:80` (ISAPI/hikvision_next) and `:554` (RTSP), reply-to disabled on both.
- dKubernetes is allow-all, so cluster Frigate/go2rtc pulls RTSP with no extra rule (pod egress SNATs to node IPs).
- Home-LAN clients need the **AX6000 static route** `10.0.30.0/24 via 192.168.1.2` (camera-day step) to reach the camera UI.
**Consumers**: cluster Frigate (`/srv/nfs/frigate/config/config.yml` — NOT Terraform) pulls `rtsp://10.0.30.70:554` main+sub as `vermont-garage`; HA integrates via Frigate plus direct hikvision_next for tamper events.
## IPAM & DNS Auto-Registration
Devices are automatically discovered, named, and registered in DNS without manual intervention.
@ -207,6 +232,8 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up
- blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, travel, netbox
- **Non-proxied domains** (grey cloud, direct IP resolution):
- mail, wg, headscale, immich, calibre, vaultwarden, and other services requiring direct connections
- **Internal-IP domains** (grey cloud, A → `10.0.20.203` Traefik LB, `ingress_factory` `dns_type = "internal"`):
- highlights-immich, highlights-immich-emo — publicly *resolvable* but only *routable* from home LANs / WG sites / VPN (spokes policy-route `10.0.0.0/8` down the tunnel, so kiosk devices with baked-in URLs need no per-site DNS overrides). The record is reachability, not a gate — enforcement is the `home-lans-only` Traefik ipAllowList (Sofia/London/Valchedrym LANs + 10/8) on the ingress. See `docs/plans/2026-07-04-immich-frame-lan-only-design.md`.
- CNAME records for proxied domains point to Cloudflared tunnel FQDNs
### Ingress Flow
@ -261,7 +288,7 @@ Traefik chain:
1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients), tripit (`tripit-rate-limit`, 100/1000, photo-tab thumbnail bursts), health (`health-rate-limit`, 100/1000, SPA shell + API burst per page), and dawarich (`dawarich-rate-limit`, 100/1000 — the Rails app self-serves all fingerprinted assets and the map adds an API burst per load; the default burst 429'd the asset tail and risked dropping OwnTracks/mobile location POSTs on the same host).
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
Additional middleware:
@ -552,7 +579,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, and tripit/health/authentik/dawarich each 100/1000 (SPA or asset-heavy page loads bursting past the default from one client IP).
### Large Downloads or Uploads Truncate / Fail Partway


@ -0,0 +1,103 @@
# Vault Token Renewer Self-Heal Design
**Date**: 2026-07-03
**Status**: Approved (brainstorm complete; implementation pending)
**Owner**: wizard@devvm
**Supersedes**: the "version-only, no self-heal" scope choice recorded in
`docs/runbooks/vault-token-renew-devvm.md` (2026-06-07)
## Problem
`wizard@devvm` holds a maintenance-free periodic Vault token
(`token-devvm-wizard`, `period=768h`, renewed daily by the
`vault-token-renew` user timer) precisely so no weekly re-login is needed.
But `~/.vault-token` is the Vault CLI's default token sink, so any
`vault login -method=oidc` — which the infra docs themselves instruct before
applies — overwrites it with a 7-day OIDC token. The renewer's drift guard
(deliberately detect-only) then refuses to renew the foreign token and fails
the unit daily, into a log nobody watches.
Observed consequence: a self-perpetuating weekly-expiry loop. The OIDC token
expires after 7 days → Vault 403s → the natural response is another
`vault login -method=oidc` → clobbers again. Drift persisted unnoticed
2026-06-18 → 06-26 and 2026-06-29 → 07-03 (memory #7121); Viktor experienced
it as "the token expires maybe once a week".
**Goal**: `vault login -method=oidc` becomes harmless on devvm. The renewer
converts any admin-capable clobber back into the permanent periodic token,
unattended. (Chosen over "never log in" doc-fixes and over instant path-unit
healing — see Alternatives.)
## Decisions
| # | Decision | Notes |
|---|----------|-------|
| 1 | Heal in the existing renewer's drift branch, at its nightly run | ~20-line diff to an already-tested script; no new units. A few-hours window holding the 7-day OIDC token is harmless (heal window 24h ≪ 7d TTL) |
| 2 | Heal = *attempt* re-mint using the foreign token itself; let Vault's 403 decide | No policy-list guessing — identity-vs-token-policies burned us before (memory #4211). OIDC tokens carry `vault-admin` via `identity_policies`, so the create succeeds |
| 3 | Weak foreign token (create denied) → keep today's loud DRIFT failure | A read-only clobber (e.g. the 2026-06-05 `kubernetes-woodpecker-default` incident) signals a misbehaving agent flow; auto-papering over it would hide the offender. Log gains a "heal denied — investigate what wrote it" suffix |
| 4 | Do NOT revoke the clobbering OIDC token | It may still back the user's live login session; it ages out in 7 days on its own |
| 5 | After a successful heal, revoke stale `token-devvm-wizard` accessors | Anti-sprawl: each heal would otherwise strand the previous periodic **admin** token server-side for up to 32 days. Walk `auth/token/accessors`, revoke every `display_name=token-devvm-wizard` except the just-minted one. Runs only on heal (rare), never on the happy path |
| 6 | Minted-token sanity check before writing the file | Look up the new token; require `display_name=token-devvm-wizard`. Write via temp file + `mv` + `chmod 600` so a failed mint can never truncate `~/.vault-token` |
| 7 | Keep timer cadence (daily) and all happy-path behavior unchanged | |
| 8 | No notification plumbing in this change | devvm alerting is tracked separately (beads `code-aslh`). Heal events are logged; heal-denied/FAIL still fail the unit |
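
Decision 6's "a failed mint can never truncate `~/.vault-token`" property could look roughly like this. It is a sketch, not the renewer's actual code: the function name `write_token_atomically` is hypothetical, and the permissions are set on the temp file before the rename rather than after it.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: replace a token file without ever exposing a partial
# or empty file at the destination path.
write_token_atomically() {
  local token=$1 dest=$2 tmp
  # Refuse to write an empty token (e.g. a failed mint).
  [ -n "$token" ] || return 1
  # Temp file in the same directory, so mv is a same-filesystem rename.
  tmp=$(mktemp "${dest}.XXXXXX") || return 1
  if printf '%s' "$token" > "$tmp" && chmod 600 "$tmp" && mv -f "$tmp" "$dest"; then
    return 0
  fi
  rm -f "$tmp"
  return 1
}
```

Any failure before the `mv` leaves the previous token file untouched, which is the point of the decision.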
## Behavior matrix
| Token found in `~/.vault-token` | Before | After |
|---|---|---|
| Our periodic token | renew-self, log `OK` | unchanged |
| Foreign, admin-capable (OIDC login) | log `DRIFT`, exit 1 | re-mint periodic token with it, sanity-check, atomic write, revoke stale periodic accessors, log `HEALED: re-minted from foreign dn=<dn> (revoked N stale)`, exit 0 |
| Foreign, weak (read-only k8s clobber) | log `DRIFT`, exit 1 | log `DRIFT … heal denied — foreign token lacks create authority; investigate what wrote it`, exit 1 |
| Vault unreachable / lookup fails | log `FAIL`, exit 1 | unchanged |
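
The "After" column of the matrix reduces to a small pure decision, sketched here for illustration. The function `heal_outcome` and its three-argument shape are hypothetical and not the structure of `vault-token-renew`; in particular, the real script only attempts the create when drift was detected.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the post-change behavior matrix.
#   $1 = display_name found by the token lookup
#   $2 = exit code of the lookup (non-zero: Vault unreachable / lookup failed)
#   $3 = exit code of the re-mint attempt made with the foreign token
# Prints the log verb; returns the unit's exit status.
heal_outcome() {
  local dn=$1 lookup_rc=$2 create_rc=$3
  if [ "$lookup_rc" -ne 0 ]; then
    echo FAIL; return 1
  elif [ "$dn" = "token-devvm-wizard" ]; then
    echo OK; return 0          # our periodic token: plain renew-self
  elif [ "$create_rc" -eq 0 ]; then
    echo HEALED; return 0      # admin-capable clobber: re-minted
  else
    echo DRIFT; return 1       # weak clobber: Vault denied the create
  fi
}
```

Note that the foreign token's policies are never inspected; the outcome of the create attempt is the only authority, as decision 2 requires.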
Re-mint command (identical to the manual recovery the DRIFT log already
prescribes):
```
vault token create -orphan -period=768h \
-policy=vault-admin -policy=sops-admin -display-name=devvm-wizard
```
## Testing
- **Unit** (`scripts/test-vault-token-renew.sh`, existing source-the-functions
harness): new pure functions for (a) the stale-accessor revoke filter
(match on `display_name`, exclude the current accessor) and (b) the
minted-token sanity predicate; regression cases for the existing drift
predicate stay green.
- **Live, post-deploy** (on devvm):
1. Mint a fake 1h admin token (`-display-name=fake-oidc`,
`-policy=vault-admin -policy=sops-admin`), write to `~/.vault-token`,
start the service → expect `HEALED`, file holds `token-devvm-wizard`.
2. Mint a fake 10m no-privilege token (`-policy=default`), write it, start
the service → expect `DRIFT … heal denied`, unit `failed`; restore real
token.
3. Revoke both fakes; one-off sweep of stale periodic accessors left by the
June 26 / July 3 manual re-mints.
## Docs & rollout
- Same commit rewrites the runbook's "Drift guard & recovery" section:
self-heal is the recovery for admin-capable clobbers; manual re-mint remains
only for weak clobbers (or a dead token with no admin-capable replacement in
the file).
- `vault login -method=oidc` instructions across the docs stay as-is — the
login is now harmless by design.
- Deploy per the runbook's manual model: `install -m 0755` to
`~/.local/bin/vault-token-renew`. Units unchanged — no daemon-reload.
- After landing: update memories #4204/#4211 (gotcha now self-healing).
## Alternatives considered
- **Instant heal** (systemd path unit + protected source-copy of the token):
strictly more capable (seconds-latency, heals weak clobbers too, zero
re-minting), but 2 new units + a second secret file + inotify re-trigger
edge cases — machinery disproportionate to the residual risk. Revisit only
if the few-hour heal window ever bites.
- **Vault CLI `token_helper` interception**: right interception point in
theory, but a helper bug breaks every `vault` CLI call, Terraform reads
`~/.vault-token` natively anyway, and it adds latency inside login. Rejected.
- **Docs-only ("never log in")**: rejected by user — the login should keep
working, not become forbidden knowledge.
- **Raise the OIDC role's 7-day `token_max_ttl`**: shared role, affects every
OIDC user; rejected previously for the same reason (memory #4205).


@ -0,0 +1,443 @@
# Vault Token Renewer Self-Heal Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make `vault login -method=oidc` harmless on devvm — the nightly renewer re-mints the permanent periodic token from any admin-capable clobber of `~/.vault-token`, unattended.
**Architecture:** Extend the drift branch of `scripts/vault-token-renew.sh` (deployed to `~/.local/bin/vault-token-renew`, driven by an existing systemd user timer). On drift, *attempt* the re-mint with the clobbering token itself and let Vault's 403 be the authority; sanity-check the minted token, replace the file atomically, then revoke stale `token-devvm-wizard` leftovers. Weak clobbers keep today's loud failure. Design: `docs/plans/2026-07-03-vault-token-self-heal-design.md`.
**Tech Stack:** bash + jq + vault CLI; existing test harness `scripts/test-vault-token-renew.sh` (sources the script, `vtr_main` is guarded).
**Working copy:** everything below runs in the worktree
`~/code/infra/.worktrees/vault-token-self-heal` on branch `wizard/vault-token-self-heal`.
Per repo policy, EVERY git command in this git-crypt repo worktree carries:
`-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false`
(abbreviated as `$GCFLAGS` below; define once per shell:
`GCFLAGS="-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false"`
and use it unquoted: `git $GCFLAGS <verb> …`).
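Why the unquoted form matters can be seen in a minimal sketch (the `count_args` helper is hypothetical and exists only for this illustration): bash has to word-split `$GCFLAGS` so that git receives three separate `-c key=value` pairs, whereas quoting it would pass one long argument that git cannot parse.

```bash
# Illustration only: count_args is a hypothetical helper, not part of the repo.
GCFLAGS="-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false"

count_args() { echo "$#"; }

unquoted=$(count_args $GCFLAGS)     # word-split into separate "-c" and "key=value" words
quoted=$(count_args "$GCFLAGS")     # one single argument

echo "unquoted=$unquoted quoted=$quoted"
```

A bash array expanded as `"${ARR[@]}"` would be the shellcheck-clean alternative, but the plan's plain-string form is kept here because that is what the steps below use.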
---
### Task 1: Unit tests for the two new pure functions (RED)
**Files:**
- Modify: `scripts/test-vault-token-renew.sh` (append before the final `printf`/exit lines)
- [ ] **Step 1: Append the failing tests**
Insert this block immediately after the existing "parse + decide end-to-end" section (after the line `no "oidc: parse+decide refused" …`, before the final `printf '\n%d passed…'`):
```bash
# --- vtr_accessor: parse accessor out of lookup JSON ---
LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}'
eq "accessor parsed" "acc-new" "$(vtr_accessor "$LOOKUP_NEW")"
eq "accessor absent -> empty" "" "$(vtr_accessor '{"data":{"display_name":"x"}}')"
# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard
# --- tokens are swept; the just-minted token, foreign tokens, and anything with an
# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe).
STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}'
ok "older periodic token is stale" vtr_is_stale_periodic "$STALE_OURS" "acc-new"
no "the just-minted token is kept" vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new"
no "foreign oidc token never swept" vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new"
no "woodpecker token never swept" vtr_is_stale_periodic "$LOOKUP_WP" "acc-new"
no "missing accessor never swept" vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new"
no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" ""
```
(`LOOKUP_OIDC` / `LOOKUP_WP` and the `ok`/`no`/`eq` helpers already exist in the file.)
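
For readers without the test file open, the helpers can be pictured roughly as follows. This is a hypothetical reconstruction under the assumption that `ok`/`no` take a label followed by a command and `eq` takes a label, the wanted value, and the actual value (which is how the block above calls them); the real definitions in `scripts/test-vault-token-renew.sh` may differ.

```bash
# Hypothetical reconstruction of the harness helpers; the real ones may differ.
pass=0
fail=0

# ok <name> <cmd...>: passes when the command succeeds.
ok() {
  local name=$1; shift
  if "$@"; then pass=$((pass + 1)); else fail=$((fail + 1)); printf 'FAIL: %s\n' "$name"; fi
}

# no <name> <cmd...>: passes when the command fails.
no() {
  local name=$1; shift
  if "$@"; then fail=$((fail + 1)); printf 'FAIL: %s\n' "$name"; else pass=$((pass + 1)); fi
}

# eq <name> <want> <got>: passes when the two strings are identical.
eq() {
  local name=$1 want=$2 got=$3
  if [ "$want" = "$got" ]; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
    printf 'FAIL: %s (want=%q got=%q)\n' "$name" "$want" "$got"
  fi
}

ok "true succeeds" true
no "false fails" false
eq "strings match" "acc-new" "acc-new"
printf '\n%d passed, %d failed\n' "$pass" "$fail"
```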
- [ ] **Step 2: Run tests, verify they fail**
Run: `bash scripts/test-vault-token-renew.sh`
Expected: FAILs / `command not found` for `vtr_accessor` and `vtr_is_stale_periodic`; the 17 pre-existing tests stay green.
### Task 2: Implement the pure functions (GREEN)
**Files:**
- Modify: `scripts/vault-token-renew.sh` (insert after `vtr_drift_ok()`, before `vtr_main()`)
- [ ] **Step 1: Add the two functions**
```bash
# vtr_accessor <lookup-json> -> the token accessor (empty if absent).
vtr_accessor() {
printf '%s' "$1" | jq -r '.data.accessor // ""'
}
# vtr_is_stale_periodic <lookup-json> <keep-accessor> -> 0 if this lookup
# describes one of OUR periodic tokens (display name matches) that is NOT the
# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise.
# Name-only on purpose (no policy check): anything named token-devvm-wizard
# that isn't the current token is garbage from a previous mint. An empty
# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know
# which token is current).
vtr_is_stale_periodic() {
local dn acc
[ -n "${2:-}" ] || return 1
dn=$(vtr_display_name "$1")
acc=$(vtr_accessor "$1")
[ "$dn" = "$EXPECTED_DN" ] || return 1
[ -n "$acc" ] || return 1
[ "$acc" != "$2" ]
}
```
- [ ] **Step 2: Run tests, verify all pass**
Run: `bash scripts/test-vault-token-renew.sh`
Expected: `25 passed, 0 failed`, exit 0.
- [ ] **Step 3: Commit**
```bash
cd ~/code/infra/.worktrees/vault-token-self-heal
git $GCFLAGS add scripts/vault-token-renew.sh scripts/test-vault-token-renew.sh
git $GCFLAGS commit -m "vault-token-renew: pure helpers for the self-heal revoke filter
vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic
decides which old token-devvm-wizard tokens a heal may revoke (never the
just-minted one, never foreign tokens, nothing when the keeper is unknown).
TDD red-green for the heal branch that lands next."
```
### Task 3: The heal branch (`vtr_heal` + `vtr_main` wiring)
**Files:**
- Modify: `scripts/vault-token-renew.sh`
- [ ] **Step 1: Add `vtr_heal` after `vtr_is_stale_periodic()`, before `vtr_main()`**
```bash
# vtr_heal <foreign-dn> <log-file> -> 0 if ~/.vault-token was re-minted back to
# our periodic admin token using the foreign token's own authority, 1 if the
# heal was denied or failed (caller exits non-zero; the unit goes failed).
#
# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md):
# an OIDC login — which the infra docs prescribe before applies — clobbers
# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed
# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the
# clobbering token itself and let Vault's authz decide — a read-only clobber
# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud
# failure, because it signals a misbehaving flow that someone should look at.
vtr_heal() {
local foreign_dn="$1" log="$2"
local errf new_token new_info new_dn new_pols new_acc tmp
errf=$(mktemp)
if ! new_token=$(vault token create -orphan -period=768h \
-policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
-field=token 2>"$errf") || [ -z "$new_token" ]; then
printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
"$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log"
rm -f "$errf"
return 1
fi
rm -f "$errf"
# Sanity: the minted token must itself pass the drift guard before it may
# replace ~/.vault-token.
if ! new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then
printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \
"$(date -Is)" "$new_info" >>"$log"
return 1
fi
new_dn=$(vtr_display_name "$new_info")
new_pols=$(vtr_policies_csv "$new_info")
if ! vtr_drift_ok "$new_dn" "$new_pols"; then
printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \
"$(date -Is)" "$new_dn" "$new_pols" >>"$log"
return 1
fi
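# Atomic replace: mktemp files are 0600 from birth; same-filesystem mv.
# Check every step explicitly (errexit, if set, is off inside `vtr_heal … || exit 1`):
# a failed or partial write must never be moved over the live token file.
if ! tmp=$(mktemp "$HOME/.vault-token.XXXXXX") || ! printf '%s' "$new_token" >"$tmp" || ! mv "$tmp" "$HOME/.vault-token"; then
[ -z "${tmp:-}" ] || rm -f "$tmp"
printf '%s FAIL: heal minted a token but could not write ~/.vault-token\n' "$(date -Is)" >>"$log"
return 1
fi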
# Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would
# otherwise strand the prior periodic ADMIN token server-side for up to 32d.
# The clobbering foreign token is deliberately NOT revoked: it may still back
# the user's live login session, and it ages out on its own (7d for OIDC).
local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0
new_acc=$(vtr_accessor "$new_info")
if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then
while IFS= read -r a; do
[ -n "$a" ] || continue
a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue
if vtr_is_stale_periodic "$a_info" "$new_acc"; then
VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1))
fi
done < <(printf '%s' "$accessors" | jq -r '.[]')
sweep="revoked $revoked stale periodic token(s)"
fi
printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \
"$(date -Is)" "$foreign_dn" "$sweep" >>"$log"
}
```
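The atomic-replace step in `vtr_heal` can be tried in isolation. The sketch below is a stand-alone illustration, not the deployed code: `atomic_write` is a hypothetical helper, and it writes into a throwaway directory rather than `~/.vault-token`. It shows the two properties the comment relies on: the temp file is created next to the destination (so `mv` is a same-filesystem rename), and it is private from the moment `mktemp` creates it.

```bash
# Sketch of the mktemp + mv pattern in a throwaway directory
# (hypothetical helper; never touches the real ~/.vault-token).
atomic_write() {   # <dest> <content>
  local dest=$1 content=$2 tmp
  tmp=$(mktemp "$dest.XXXXXX") || return 1
  if printf '%s' "$content" >"$tmp" && mv "$tmp" "$dest"; then
    return 0
  fi
  rm -f "$tmp"     # never leave a half-written temp file behind
  return 1
}

sandbox=$(mktemp -d)
atomic_write "$sandbox/token" "hvs.example-not-a-real-token"
perms=$(ls -l "$sandbox/token" | cut -c1-10)
content=$(cat "$sandbox/token")
echo "perms=$perms"
```

Because the mode comes from `mktemp` and `mv` preserves it, no separate `chmod 600` is needed on this path, unlike the manual re-mint command, which redirects straight into the file.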
- [ ] **Step 2: Rewire the drift branch in `vtr_main`**
Replace this exact block (comment + if):
```bash
# Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive.
# On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token
# with a read-only woodpecker token, and this script then silently renewed THAT
# for two days — masking the loss of write access. So before renewing, confirm
# the token is our periodic admin token; if it has drifted, fail loudly (systemd
# marks the unit failed) instead of keeping someone else's token alive.
if ! vtr_drift_ok "$dn" "$pols"; then
printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
"$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log"
exit 1
fi
```
with:
```bash
# Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not
# keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was
# silently renewed for two days, masking lost write access). But detect-only
# drift proved worse in practice: an OIDC login — which the infra docs
# prescribe before applies — clobbers this file too, and the resulting DRIFT
# failures went unnoticed for weeks while access degraded to a 7-day token
# (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal):
# re-mint the periodic token with the clobbering token's own authority.
# Vault's authz keeps the old guarantee — a token that couldn't legitimately
# hold vault-admin is denied the mint, and we still fail loud.
if ! vtr_drift_ok "$dn" "$pols"; then
vtr_heal "$dn" "$log" || exit 1
exit 0
fi
```
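The exit-code contract of the rewired branch can be checked without Vault. The following is a sketch with stubbed `vtr_drift_ok` / `vtr_heal` (the stubs and the `simulate` wrapper are hypothetical; the real functions talk to Vault): no drift falls through to the renew path, a successful heal exits 0 without renewing, and a denied heal exits 1 so systemd marks the unit failed.

```bash
# Stubs only: the real vtr_drift_ok / vtr_heal talk to Vault.
simulate() {   # <drift_ok exit code> <heal exit code>
  local drift_rc=$1 heal_rc=$2
  (
    vtr_drift_ok() { return "$drift_rc"; }
    vtr_heal() { return "$heal_rc"; }
    if ! vtr_drift_ok; then
      vtr_heal || exit 1
      exit 0
    fi
    echo "renewed"   # stands in for the normal renew path
  )
}

for args in "0 0" "1 0" "1 1"; do
  rc=0
  simulate $args || rc=$?
  echo "drift_ok=${args% *} heal=${args#* } -> exit=$rc"
done
```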
- [ ] **Step 3: Syntax + lint + regression check**
Run: `bash -n scripts/vault-token-renew.sh && bash scripts/test-vault-token-renew.sh; command -v shellcheck >/dev/null && shellcheck scripts/vault-token-renew.sh`
Expected: syntax OK, `25 passed, 0 failed`; shellcheck (if installed) reports nothing new.
- [ ] **Step 4: Commit**
```bash
git $GCFLAGS add scripts/vault-token-renew.sh
git $GCFLAGS commit -m "vault-token-renew: self-heal the periodic token on admin-capable clobber
Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC
login the docs prescribe kept clobbering ~/.vault-token with a 7-day token,
and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry
loop, twice in June). On drift the renewer now re-mints the periodic token
with the clobbering token's own authority (Vault's 403 is the judge — no
policy guessing), sanity-checks it, replaces the file atomically, and
revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still
fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md"
```
### Task 4: Docs — runbook + test-file header
**Files:**
- Modify: `docs/runbooks/vault-token-renew-devvm.md` (the `## Drift guard & recovery` section + the healthy-log-line note + `## Tests`)
- Modify: `scripts/test-vault-token-renew.sh` (header comment only)
- [ ] **Step 1: Replace the runbook's `## Drift guard & recovery` section with:**
```markdown
## Drift guard & self-heal
`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
overwrites it. Two confirmed clobber vectors:
1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs
prescribe this login before applies, so it recurs — it went unnoticed for
weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires
weekly".
2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
**cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days.
Since 2026-07-03 the renewer **self-heals**
(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token
it attempts the re-mint **with the clobbering token's own authority** and lets
Vault's authz decide:
- **Admin-capable clobber (OIDC login)** → re-mints the periodic token,
sanity-checks it against the drift guard, atomically replaces
`~/.vault-token`, revokes stale `token-devvm-wizard` leftovers
(anti-sprawl), logs
`HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))`
and exits 0. The clobbering token is NOT revoked — it may still back a live
login session; it ages out on its own.
- **Weak clobber (read-only k8s token)** → the mint is denied; logs
`DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it`
and exits non-zero (unit `failed`). Deliberately loud: this signals a
misbehaving agent flow — exactly the 2026-06-05 case.
**Manual recovery** is only needed for the weak-clobber case (the DRIFT log
line still contains the exact command) — run the
[mint/re-mint](#mint--re-mint-the-token) block.
```
- [ ] **Step 2: In the runbook's `## Health check` section**, after the "A healthy log line looks like…" sentence, add:
```markdown
After an OIDC login you'll instead see, at the next nightly run:
`<ts> HEALED: re-minted periodic token from foreign dn="oidc-…" (revoked N stale periodic token(s))` — that's the self-heal working as designed.
```
- [ ] **Step 3: In the runbook's `## Tests` section**, replace the first sentence with:
```markdown
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision,
the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
case), and the self-heal's revoke filter (which stale periodic tokens a heal
may sweep).
```
- [ ] **Step 4: Update the test file's header comment** (lines 2-7) to:
```bash
# Unit tests for the pure functions in vault-token-renew.sh.
# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard
# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign
# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker
# clobber be silently renewed for two days, and (b) the self-heal's revoke
# filter — which stale token-devvm-wizard tokens a heal may sweep.
# Run: bash infra/scripts/test-vault-token-renew.sh
```
- [ ] **Step 5: Run tests once more, then commit**
Run: `bash scripts/test-vault-token-renew.sh`
Expected: `25 passed, 0 failed`.
```bash
git $GCFLAGS add docs/runbooks/vault-token-renew-devvm.md scripts/test-vault-token-renew.sh
git $GCFLAGS commit -m "vault-token-renew runbook: document the self-heal behavior
Drift guard section rewritten: admin-capable clobbers now self-heal at the
nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure;
manual re-mint is only the weak-clobber recovery now."
```
### Task 5: Deploy + live verification (on devvm, as wizard)
**Files:** none (host deploy + live checks)
- [ ] **Step 1: Install from the worktree**
```bash
install -m 0755 ~/code/infra/.worktrees/vault-token-self-heal/scripts/vault-token-renew.sh ~/.local/bin/vault-token-renew
```
(Units unchanged — no `daemon-reload` needed.)
- [ ] **Step 2: Live case 1 — admin-capable clobber heals**
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
export XDG_RUNTIME_DIR=/run/user/$(id -u)
FAKE_ADMIN=$(vault token create -ttl=1h -policy=vault-admin -policy=sops-admin -display-name=fake-oidc -field=token)
printf '%s' "$FAKE_ADMIN" > ~/.vault-token
systemctl --user start vault-token-renew.service; echo "exit=$?"
tail -1 ~/.local/state/vault-token-renew.log
vault token lookup | grep -E 'display_name|period'
```
Expected: `exit=0`; log line `HEALED: re-minted periodic token from foreign dn="token-fake-oidc" (revoked N stale periodic token(s))` with N ≥ 1 (the pre-clobber periodic token is itself swept as stale — by design — along with any strays from the June 26 / July 3 manual re-mints); lookup shows `display_name token-devvm-wizard`, `period 768h`. Note: `FAKE_ADMIN` is a child of the swept old token, so the cascade revokes it too — no cleanup needed.
- [ ] **Step 3: Verify exactly ONE periodic token remains server-side**
```bash
for a in $(vault list -format=json auth/token/accessors | jq -r '.[]'); do
vault token lookup -format=json -accessor "$a" 2>/dev/null \
| jq -r 'select(.data.display_name=="token-devvm-wizard") | .data.accessor'
done
```
Expected: exactly one line, matching `vault token lookup -format=json | jq -r .data.accessor`.
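To turn the "exactly one line" expectation into a pass/fail check, the loop's output could be piped into a small filter. This is a sketch only; `exactly_one_accessor` is a hypothetical helper and the canned input below stands in for the real loop output.

```bash
# Hypothetical helper: succeeds only if stdin holds exactly one non-empty line
# and that line equals the expected accessor.
exactly_one_accessor() {   # <expected-accessor>
  local want=$1 line last="" n=0
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    n=$((n + 1))
    last=$line
  done
  [ "$n" -eq 1 ] && [ "$last" = "$want" ]
}

# Canned input stands in for the accessor loop's output.
if printf 'acc-new\n' | exactly_one_accessor "acc-new"; then
  echo "single periodic token: OK"
fi
if ! printf 'acc-old\nacc-new\n' | exactly_one_accessor "acc-new"; then
  echo "sprawl detected"
fi
```

In a live run the expected value would be `"$(vault token lookup -format=json | jq -r .data.accessor)"`, as in the step's own expectation.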
- [ ] **Step 4: Live case 2 — weak clobber stays a loud failure**
```bash
GOOD=$(cat ~/.vault-token)
FAKE_WEAK=$(vault token create -ttl=10m -policy=default -display-name=fake-weak -field=token)
printf '%s' "$FAKE_WEAK" > ~/.vault-token
systemctl --user start vault-token-renew.service; echo "exit=$?"
systemctl --user is-failed vault-token-renew.service
tail -1 ~/.local/state/vault-token-renew.log
printf '%s' "$GOOD" > ~/.vault-token && chmod 600 ~/.vault-token
vault token revoke "$FAKE_WEAK" >/dev/null
```
Expected: `exit=1` (start reports the oneshot failure), `is-failed` prints `failed`, log line `DRIFT: ~/.vault-token is dn="token-fake-weak" — heal denied, foreign token lacks create authority (… permission denied …); investigate what wrote it. Manual re-mint: …`.
- [ ] **Step 5: Happy path still green**
```bash
systemctl --user start vault-token-renew.service; echo "exit=$?"
tail -1 ~/.local/state/vault-token-renew.log
```
Expected: `exit=0`, log `OK renewed (dn=token-devvm-wizard ttl=2764800s)`.
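When repeating these live checks, the last log line can be reduced to a single state word. The sketch below keys on the four markers that appear in this plan (`HEALED:`, `DRIFT:`, `FAIL:`, `OK renewed`); `classify_renew_log` is a hypothetical helper and the sample lines use made-up timestamps.

```bash
# Hypothetical triage helper: maps one renewer log line to a state word.
classify_renew_log() {   # <log line>
  case "$1" in
    *"HEALED: "*)   echo "healed" ;;
    *"DRIFT: "*)    echo "drift" ;;
    *"FAIL: "*)     echo "fail" ;;
    *"OK renewed"*) echo "ok" ;;
    *)              echo "unknown" ;;
  esac
}

# Sample lines (timestamps are illustrative).
classify_renew_log "2026-07-04T03:00:00+00:00 OK renewed (dn=token-devvm-wizard ttl=2764800s)"
classify_renew_log "2026-07-04T03:00:00+00:00 DRIFT: ~/.vault-token is dn=token-fake-weak"
```

On devvm it would be fed with `"$(tail -1 ~/.local/state/vault-token-renew.log)"`.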
### Task 6: Land on master + cleanup
- [ ] **Step 1: Merge latest master into the branch, re-verify, push**
```bash
cd ~/code/infra/.worktrees/vault-token-self-heal
git $GCFLAGS fetch forgejo
git $GCFLAGS merge forgejo/master
bash scripts/test-vault-token-renew.sh
git $GCFLAGS push forgejo HEAD:master
```
Expected: clean merge (or already up to date), `25 passed, 0 failed`, push accepted. Non-fast-forward → fetch, merge, push again.
- [ ] **Step 2: Watch CI to completion**
The push fires the infra Woodpecker `default.yml` (terragrunt apply for changed stacks). This change touches only `scripts/` + `docs/` → expect a fast success / no-op apply. Check (Forgejo-forge infra repo = Woodpecker repo id 82):
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
vault kv get -format=json secret/ci/global | jq -r '.data.data | keys[]' # find the woodpecker admin token key
WP_TOKEN=$(vault kv get -field=<that-key> secret/ci/global)
curl -s -H "Authorization: Bearer $WP_TOKEN" 'https://ci.viktorbarzin.me/api/repos/82/pipelines?perPage=1' | jq '.[0] | {number, status, commit: .commit[0:8]}'
```
Expected: the pipeline for the pushed commit reaches `status: "success"` (poll until terminal). If it fails, fix before proceeding.
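"Poll until terminal" could be scripted along these lines. This is a sketch under assumptions: `poll_until_terminal` is a hypothetical helper, it takes any command that prints the current status, and it treats `pending`, `running` and `created` as the only non-terminal states, which should be checked against the Woodpecker version in use.

```bash
# Hypothetical poller: runs <status-cmd...> until it prints a terminal status.
# Returns 0 on "success", 1 on any other terminal status, 2 on timeout.
poll_until_terminal() {   # <max-tries> <sleep-seconds> <status-cmd...>
  local tries=$1 pause=$2 status
  shift 2
  while [ "$tries" -gt 0 ]; do
    status=$("$@") || status="unknown"
    case "$status" in
      pending|running|created|unknown|"") ;;   # not terminal yet: keep polling
      *)
        printf '%s\n' "$status"
        [ "$status" = "success" ] && return 0
        return 1
        ;;
    esac
    tries=$((tries - 1))
    sleep "$pause"
  done
  printf 'timeout\n'
  return 2
}

# Demo with a stub that reports success immediately.
demo_status() { echo "success"; }
poll_until_terminal 3 0 demo_status
```

For the real check, the status command would wrap the `curl … | jq -r '.[0].status'` call from the step above, with a pause of a few seconds between tries.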
- [ ] **Step 3: Remove worktree + branch, reconcile main checkout**
```bash
git -C ~/code/infra $GCFLAGS worktree remove .worktrees/vault-token-self-heal
git -C ~/code/infra $GCFLAGS branch -d wizard/vault-token-self-heal
git -C ~/code/infra status --porcelain # expect clean before pulling
git -C ~/code/infra $GCFLAGS pull --ff-only forgejo master
```
Expected: worktree gone, branch deleted (already merged), main checkout fast-forwards to the landed commit.
### Task 7: Memory + wrap-up
- [ ] **Step 1: Update the stale memories** (they say the drift guard is detect-only / recovery is manual):
```bash
homelab memory recall "vault periodic token renewer drift" # confirm ids 4204, 4211, 7121 still say detect-only
homelab memory update 4211 "<original gotcha content, amended: since 2026-07-03 the renewer SELF-HEALS admin-capable clobbers at its nightly run (re-mints the periodic token with the clobbering token's authority + revokes stale token-devvm-wizard leftovers; weak clobbers still fail loudly). An OIDC login on devvm is now harmless. Design: infra docs/plans/2026-07-03-vault-token-self-heal-design.md>"
homelab memory update 7121 "<original content, amended: PLAYBOOK OBSOLETE for admin clobbers (self-heal shipped 2026-07-03); manual re-mint only needed for weak/read-only clobbers>"
```
(Fetch each memory's current text first and preserve it — amend, don't replace wholesale.)
- [ ] **Step 2: End-of-task extraction** — dispatch the standard M.3 memory-mining subagent per `~/.claude/rules/execution.md`, then give the final summary.
---
## Plan self-review (done at write time)
- **Spec coverage**: heal-on-admin-clobber (T3), loud-fail-on-weak (T3 + live T5.4), no-revoke-foreign (T3 comment + design decision 4), anti-sprawl sweep + fail-safe filter (T2/T3, live T5.3), minted-token sanity + atomic write (T3), unit tests (T1/T2), runbook (T4), deploy + live sim (T5), memory updates (T7). ✓
- **Placeholders**: `<that-key>` in T6.2 is a deliberate discovery step (key name verified live from Vault, not invented). No other TBDs. ✓
- **Name consistency**: `vtr_accessor`, `vtr_is_stale_periodic`, `vtr_heal`, `EXPECTED_DN` match across tasks; test count 17→25 consistent (8 new cases). ✓


@ -0,0 +1,335 @@
# Backup MX — self-hosted store-and-forward relay on Oracle Always-Free — design
Date: 2026-07-04 (v3 — post-challenge; v2 Oracle pivot same day) · Status: design,
pre-implementation · ADR: [0019](../adr/0019-backup-mx-self-hosted-oracle-relay.md)
v3 incorporates two independent adversarial-challenge reviews (same day). Their
material corrections are marked **[CH]** throughout — the largest: the v2 drain
path would never have drained (primary-side smtpd rejects), monitoring-over-
tailnet was fiction (no cluster→tailnet route exists), and the VM's bounce
model was wrong (it can never deliver a DSN).
## Goal
Inbound mail for `viktorbarzin.me` must survive homelab outages without loss.
Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is
acceptable; budget is $0** (hard constraint — reaffirmed after the Rollernet
gates failed). A store-and-forward backup MX queues mail while the homelab is
down and re-delivers when it returns.
Out of scope, explicitly:
- Reading new mail *during* an outage.
- Outbound mail during outages.
- The "primary up but hard-bouncing 5xx" misconfig class — a backup MX is
never consulted when the primary answers. Separate hardening/alerting track.
Known residual limit (state it plainly): an outage **longer than 30 days**
loses the queued mail *silently* — the VM cannot emit a bounce to anyone
(egress 25 blocked), so no sender ever learns. Accepted; 30 days is already
6× the sender-retry status quo.
## v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04)
v1 selected Roller Network's free Secondary MX. The validation gates killed it
before any DNS change:
- **G2 FAILED**: the [free-accounts policy](https://rollernet.us/policy/free-accounts.html)
caps free mail service at **200 relayed messages or 10 MB per rolling 7
days**; overage → domain suspended **48 h answering SMTP 5xx** (permanent
bounces), repeatable. Spammers deliberately target backup MXes even while
the primary is up, so background spam alone can hold the domain suspended —
worse than no backup MX.
- **G1 SHAKY**: same policy page says free accounts are being discontinued.
- **G3 PASSED** (for posterity): `mail{,2}.rollernet.us` present valid LE
certs over STARTTLS.
- Signup is Cloudflare-Turnstile-gated — moot given G1/G2.
Viktor's decision: stay free → self-host on Oracle Always-Free. **[CH]** The
external challenger re-searched the free landscape (DNSExit, KisoLabs,
DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed:
no credible free managed backup-MX or free VM with a usable port-25 story
exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and
is US-regions-only (wrong continent).
## Decision
A minimal **Postfix store-and-forward relay** (`mx2.viktorbarzin.me`) on an
Oracle Cloud **Always-Free** compute instance, published as a lower-preference
MX. It accepts mail for `viktorbarzin.me` when the primary is unreachable,
queues up to 30 days, and drains to the primary when it returns. No mailboxes,
no third-party terms — the queue-lifetime and reject-behavior knobs are ours.
## Architecture
```
┌── pri 1 mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
sender MTA ──► MX lookup ┤ ▲
└── pri 20 mx2.viktorbarzin.me │ drain: smtp to
(Oracle VM, Postfix relay, │ mail.viktorbarzin.me:2526
queue ≤ 30 days) ───────────────────┘ (pfSense WAN NAT rdr
2526 → 10.0.20.1:25,
existing HAProxy frontend)
```
- **Normal operation**: senders use pri 1; the VM idles (spammers targeting
the backup + transient-blip retries get relayed onward immediately).
- **Outage**: senders fall back to pri 20 → VM accepts + queues → Postfix
retries the primary on its native schedule → queue drains after recovery
through the standard external ingress path (PROXY v2 → :2525 → rspamd →
Dovecot).
- **Custom drain port**: Oracle blocks **egress TCP 25** tenancy-wide
(post-2021; exemptions unreliable) — the VM cannot reach
`mail.viktorbarzin.me:25`. One pfSense WAN NAT rule `TCP 2526 →
10.0.20.1:25` reuses the existing HAProxy frontend unchanged. **[CH]
Verified against the runbook**: the frontend binds `*:25` on pfSense (not
strictly 10.0.20.1), rdr dst-port rewrite is the existing production
pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides
with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 **to**
the VM is unaffected by Oracle's egress-only block per practitioner
evidence (iRedMail/mailcow on OCI: receive works, send doesn't) — **to be
proven at gate O2 before any DNS change** (Oracle publishes no positive
commitment).
## Oracle account & instance
- **Account**: Viktor creates it (human signup; card for identity, $0
charged). **Home region is fixed at signup and Always-Free compute exists
only there — choose `eu-frankfurt-1` deliberately; there is no
try-another-region fallback without a new account. [CH]**
- **[CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation**:
Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days; an
idle Postfix box qualifies) and demonstrably changes free-tier terms without
notice, enforcing by termination (June 2026: A1 allowance silently halved,
over-limit instances shut down). PAYG keeps Always-Free resources free and
exempts them from idle reclamation.
- **Shape**: `VM.Standard.E2.1.Micro` (x86, 1/8 OCPU burst, 1 GB RAM; 2
always-free instances allowed; ample for queue-only Postfix — and untouched
by the 2026 A1 cuts). ARM A1 fallback is **unreliable** (halved quota,
chronic Frankfurt capacity) — treat E2.1.Micro availability as the gate.
- **[CH] Reserved public IP is mandatory** (`oci_core_public_ip`, reserved):
an ephemeral IP rotates on stop/start and would silently break all four
IP-keyed controls at once (pfSense NAT source-restriction, the primary's
smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape
allowlist) — discovered only at the next outage's drain.
- **OS**: Ubuntu 24.04. **[CH] OCI Ubuntu images ship an OS-level iptables
ruleset (`/etc/iptables/rules.v4`) that ACCEPTs 22 and REJECTs everything
else, independent of security lists** — cloud-init must insert ACCEPT rules
for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2
fails on day 1 with a correct security list.
- **Credentials**: OCI API key for Terraform → Vault `secret/viktor`
(`oci_*`); web login → Vaultwarden item `Oracle Cloud (backup MX)`.
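
The OS-level firewall step might look like the following dry-run sketch, which only prints the commands cloud-init would run. Assumptions: the image persists rules through `netfilter-persistent` (matching the `/etc/iptables/rules.v4` path above), and `emit_fw_rules` is a hypothetical helper; the exporter ports are source-restricted to the homelab WAN address given in the networking section.

```bash
# Dry-run sketch: prints the firewall commands instead of running them.
HOMELAB_WAN="176.12.22.76/32"   # homelab WAN address from the networking section

emit_fw_rules() {
  local port
  # -I with no rule number inserts at position 1, i.e. ahead of the image's REJECT.
  for port in 25 80; do
    printf 'iptables -I INPUT -p tcp --dport %s -j ACCEPT\n' "$port"
  done
  # Exporter ports: only reachable from the homelab WAN address.
  for port in 9100 9154; do
    printf 'iptables -I INPUT -p tcp -s %s --dport %s -j ACCEPT\n' "$HOMELAB_WAN" "$port"
  done
  # Assumption: rules are persisted via netfilter-persistent.
  printf 'netfilter-persistent save\n'
}

emit_fw_rules
```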
## Networking & security posture
- **Ingress on the VM**: TCP 25 world-open (the service). **[CH] TCP 80
world-open permanently** — Let's Encrypt validation is multi-perspective
with no published source IPs, so it cannot be source-scoped, and a
"open-only-during-renewal" toggle is unspecified automation whose realistic
failure mode is an expired cert at day ~90. Nothing listens on 80 outside
certbot's seconds-long renewal windows; connection-refused surface is
negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32
(176.12.22.76) in both the Oracle security list and the VM firewall.
- **No public SSH**: management rides the headscale tailnet — cloud-init
enrolls via a **preauth key for a dedicated non-OIDC headscale user** with
node tag `tag:backup-mx` (headscale 0.28.0 file-mode ACL, content in Vault
`secret/headscale` → `headscale_acl`); SSH bound to the tailnet interface.
ACL grant: `group:admin → tag:backup-mx:22` (cluster pods are NOT tailnet
members — see monitoring). **[CH] Outage caveat**: headscale's control
plane + DERP live in the cluster, so mid-outage tailnet reachability is
cached-netmap best-effort — the runbook documents the **OCI instance
console connection as break-glass** management. (Also fix `vpn.md`'s stale
"0.23.x / OIDC-only" claims while in there.)
- **VM compromise blast radius**: plaintext of outage-queued mail + a relay
surface contained by `relay_domains = viktorbarzin.me` only, no submission
ports, no SASL, no local delivery. The VM is deliberately NOT added to the
primary's `mynetworks` (that would let a compromised VM relay arbitrary
mail *through* the primary) — per-stage exemptions instead, below.
## Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene)
- `relay_domains = viktorbarzin.me`; `mydestination =` (empty).
- **[CH]** `smtpd_relay_restrictions = permit_mynetworks,
reject_unauth_destination` — explicit 5xx for foreign-domain RCPTs (the
default tail is `defer_unauth_destination`, whose 4xx invites every relay
probe to retry forever).
- **[CH]** `relay_recipient_maps` explicitly set to the wildcard form
(`@viktorbarzin.me OK`) — documents accept-all-recipients as a decision
(the domain is catch-all; every RCPT is valid by definition).
- `transport_maps`: `viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526`.
- `maximal_queue_lifetime = 30d`. **[CH]** `bounce_queue_lifetime = 1d` and
`delay_warning_time = 0` — this host can never deliver a DSN to anyone
(egress 25 blocked; its only egress is 2526 to the primary), so undeliverable
bounces must be discarded quickly or they rot in the queue for a month and
permanently poison the queue-depth alert.
- **[CH]** `message_size_limit = 209715200` — exactly the primary's 200 MB
(`POSTFIX_MESSAGE_SIZE_LIMIT`, mailserver main.tf:88). The stock 10 MB
default would 552-reject large legitimate mail during outages — the exact
loss mode this project exists to prevent. Equal, never higher (higher
recreates drain-time rejects).
- **[CH] postscreen on the VM in 4xx-only posture**: pregreet test ON
(fire-and-forget bots don't retry; real MTAs do — the whole design already
rests on sender retry, so 4xx filtering is loss-free by construction),
optionally `postscreen_dnsbl_action = defer` with a conservative threshold.
v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly banned)
with 4xx tempfail (harmless); without any hygiene the backup is a 24/7
spam backdoor since spammers deliberately deliver to the highest-numbered
MX. Zero 5xx from reputation, ever.
- `inet_protocols = ipv4` **[CH]** — the primary publishes an AAAA (HE
tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted
v6 attempt per delivery.
- `smtpd_tls_cert_file` = LE cert for `mx2.viktorbarzin.me` (opportunistic
STARTTLS inbound; `smtp_tls_security_level = may` on the drain leg).
- Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day
accumulation for a personal domain.
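
Pulled together, the non-postscreen settings above might render to a `main.cf` fragment like the one this sketch prints. Assumptions: the two `hash:` map paths and the certbot certificate paths are illustrative and not taken from the design, `render_main_cf` is a hypothetical helper, and the postscreen settings are left out on purpose.

```bash
# Sketch: renders the relay-only settings listed above as main.cf lines.
# Map paths and certificate paths are assumptions, not taken from the design.
render_main_cf() {
  local size_limit=$((200 * 1024 * 1024))   # 200 MB, equal to the primary's limit
  cat <<EOF
relay_domains = viktorbarzin.me
mydestination =
smtpd_relay_restrictions = permit_mynetworks, reject_unauth_destination
relay_recipient_maps = hash:/etc/postfix/relay_recipients
transport_maps = hash:/etc/postfix/transport
maximal_queue_lifetime = 30d
bounce_queue_lifetime = 1d
delay_warning_time = 0
message_size_limit = ${size_limit}
inet_protocols = ipv4
smtp_tls_security_level = may
smtpd_tls_security_level = may
smtpd_tls_cert_file = /etc/letsencrypt/live/mx2.viktorbarzin.me/fullchain.pem
smtpd_tls_key_file = /etc/letsencrypt/live/mx2.viktorbarzin.me/privkey.pem
EOF
}

render_main_cf
```

Under these assumptions the recipient map would hold the single line `@viktorbarzin.me OK` and the transport map `viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526`, both compiled with `postmap`.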
## TLS
certbot standalone HTTP-01 for `mx2.viktorbarzin.me` (no Cloudflare API token
on an internet-facing VM). Port 80 permanently open (see above); certbot renew
timer. The MTA-STS follow-up (separate task; policy host currently dangling —
below) must list `mx2.viktorbarzin.me` when implemented.
## Primary-side drain enablement **[CH — this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]**
The v2 exemptions targeted postscreen DNSBL (which is **off** on the primary —
`ENABLE_DNSBL` unset) and rspamd SPF/DMARC scoring — but missed the three
mechanisms that would actually break the drain. All are keyed on the VM's
reserved /32 (the PROXY-v2-recovered client IP):
1. **`reject_unknown_client_hostname` bypass** — the primary sets
`POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1` (main.tf:89); an Oracle IP
without full FCrDNS (PTR needs an Oracle SR; limited on free accounts)
would be **450-deferred on every drain attempt → the queue never drains →
silent mass-expiry at day 30** (this host cannot deliver DSNs). Fix: `check_client_access` permit for the VM /32
early in `smtpd_client_restrictions`, and a matching permit at the sender
stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope
senders — drained self-addressed/bounced mail would 5xx). Attempt the
Oracle PTR anyway (belt and braces).
2. **Anvil rate-limit exception** — `smtpd_client_message_rate_limit = 30`/min
keys on the VM's IP at drain; a >3,600-message backlog would throttle for
hours and false-fire the queue alert. Add the VM /32 to
`smtpd_client_event_limit_exceptions`.
3. **rspamd: evaluate the original sender, never 5xx the drain stream** — via
the existing override.d ConfigMap pattern (same mount as
`dkim_signing.conf`): (a) configure rspamd's **`external_relay`** module
(ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the
*original* client IP parsed from the VM's Received header — this keeps
DMARC protection for the entire drain stream instead of v2's blanket
disable; (b) cap rspamd's **action at the VM /32 to tag/fold — never
milter-reject**: the primary's default reject tier (DMS default, active
since only dkim_signing is overridden today) would 5xx high-score spam at
DATA, forcing the VM to generate DSNs to forged senders = classic
backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in
the catch-all's Junk instead. Validate the external_relay ↔ settings-rule
interplay at gate O5 with a high-spam-score message.
4. postscreen permit for the /32 (harmless; pregreet never trips a real
Postfix client and DNSBL is off — kept for future-proofing only).
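
The rate-limit arithmetic behind item 2 can be checked with a small shell sketch. It assumes the drain is throttled to exactly the configured per-minute limit and ignores Postfix retry backoff, so it is a lower bound rather than a prediction; `drain_minutes` is a hypothetical helper, not part of Postfix.

```bash
# Hypothetical helper: minutes needed to drain a backlog when the receiver
# accepts at most RATE messages per minute from a single client IP.
# Assumption: the sender always delivers at the limit (no retry backoff).
drain_minutes() {
  # $1 = backlog (messages), $2 = rate limit (messages per minute)
  # Ceiling division so a partial final minute still counts.
  echo $(( ($1 + $2 - 1) / $2 ))
}

drain_minutes 3600 30   # backlog and limit taken from the text
```

With the figures from the text this prints 120, i.e. two hours of throttling for a 3,600-message backlog, which is the delay the exception is meant to avoid.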
## Our-side changes (Terraform unless noted)
1. **New stack `stacks/backup-mx/`** (Tier 1): OCI provider (creds from
Vault), VCN + subnet + security list + **reserved public IP** +
`VM.Standard.E2.1.Micro` + cloud-init (`templatefile`): **OS iptables
ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule
(persisted)**, postfix + config above, certbot, tailscale→headscale
enrollment (preauth key from Vault), node_exporter, postfix_exporter,
unattended-upgrades.
2. **DNS** — `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: A
`mx2.viktorbarzin.me` → reserved IP (non-proxied), MX pref 20 → `mx2`.
**[CH] Live zone count verified: 195/200 → 197/200 after this change; only
3 slots remain and the MTA-STS follow-up needs 12 → plan the next
record-purge now, not at collision time.**
3. **pfSense (live network device — approved as part of this plan)**: WAN NAT
rdr `TCP 2526 → 10.0.20.1:25` + firewall rule, source-restricted to the
reserved IP. **[CH] Scripted** (extend the existing
`scripts/pfsense-*-haproxy*.php` bootstrap-script family), not
hand-clicked — keeps the git-rebuildable parity the rest of the pfSense
mail config has. Config.xml rides the nightly backup.
4. **Mailserver stack**: the four-layer drain enablement above (client+sender
`check_client_access` permits, anvil exception, rspamd external_relay +
action cap, postscreen permit) — all keyed to one /32, via the existing
`postfix_cf` / `user-patches.sh` / rspamd-override hook points (verified
present: main.tf:129-144, 222-281, 467-474).
5. **Monitoring [CH — replaces v2's tailnet scraping, which had no transport:
no cluster→tailnet route exists and no existing target is scraped that
way]**: Prometheus scrapes `node_exporter`/`postfix_exporter` on the VM's
**public reserved IP**, allowed only from the homelab WAN /32 (Oracle SL +
VM firewall); blackbox TCP:25 from the cluster (`BackupMxDown`, warning);
MX-set drift assertion (both MX records present). Alerts:
`BackupMxQueueStuck` = **non-bounce** queue depth > 0 for 2 h while the
primary is healthy (gate on the existing `MailServerDown`/roundtrip
series, machine-readable — not prose); bounce residue is excluded by the
1-day bounce lifetime. Note: during a full homelab outage Prometheus
itself is down — queue growth is unobservable live under ANY transport;
what we actually watch is the post-recovery drain. A WAN-IP change stales
the Oracle allowlist → visible as ScrapeTargetDown (self-signaling).
**Probe semantics note**: once mx2 exists, the Brevo roundtrip probe's
mail fails over to mx2 on transient primary blips and arrives minutes late
via the drain — `EmailRoundtripFailing` may then mean "delayed via mx2",
not "lost"; note in the alert description and runbook.
6. **Docs (same commit as implementation)**: rewrite `mailserver.md` §"No
Backup MX", new runbook `docs/runbooks/backup-mx.md` (`postqueue -p`,
forced drain `postqueue -f`, cert renewal, **OCI console break-glass**, VM
rebuild from stack, Oracle account facts incl. PAYG + home-region lock),
`vpn.md` headscale-version/OIDC staleness fix, monitoring rows.
### MTA-STS finding (unchanged; no action in this change)
`_mta-sts` TXT is published but `mta-sts.viktorbarzin.me` has no record and
nothing serves the policy — MTA-STS is inert today. When fixed, the policy
MUST include `mx: mx2.viktorbarzin.me` (and budget its DNS records against the
3 remaining zone slots).
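
The zone-slot budget quoted in this plan is simple enough to recompute before any future record is added. The sketch below uses only the figures stated in the text (195 records before the change, two records added for mx2, a cap of 200); `zone_slots_left` is a hypothetical helper.

```bash
# Hypothetical helper: free record slots after a planned change.
# $1 = records currently in the zone, $2 = records to add, $3 = zone cap
zone_slots_left() {
  echo $(( $3 - ($1 + $2) ))
}

zone_slots_left 195 2 200   # A + MX for mx2 against the 200-record cap
```

This prints 3, matching the remaining slots the MTA-STS follow-up has to fit into.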
## Validation gates (in order; any failure → stop and report)
| # | Gate | Method | Failure handling |
|---|------|--------|------------------|
| O1 | Oracle account (home region `eu-frankfurt-1`, **fixed forever at signup**), **PAYG conversion done**, E2.1.Micro capacity | Viktor signs up + converts; TF apply | A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor |
| O2 | Inbound TCP 25 reachable from the internet (after the OS-iptables fix) | `nc -zv <reserved-ip> 25` from outside + recurring Uptime-Kuma TCP monitor (keeps proving it — Oracle publishes no commitment) | Stop; decision returns to Viktor |
| O3 | Drain works: VM → `mail.viktorbarzin.me:2526` delivers end-to-end | Test message injected on the VM | Debug pfSense NAT / HAProxy path |
| O4 | LE cert issued | certbot standalone | STARTTLS is opportunistic — non-blocking for go-live; fix before MTA-STS |
| O5 | Live failover test — **hardened [CH]** | presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo **plus a high-spam-score message and a >10 MB message** → confirm queued (`postqueue -p`) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on the VM, roundtrip probe recovers | Debug or roll back (remove MX record) |
## Failure modes
Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP
changes, short-retry senders. If pfSense is down the drain waits — Postfix
retries until it heals.
Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox
access; **outages > 30 days lose queued mail silently (no DSN possible)**.
Simultaneous Oracle+homelab outage = status quo ante (sender retries).
Newly introduced, accepted:
- **A pet outside the cluster** — deliberately cattle: rebuilt from TF +
cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a
backup target.
- **Oracle free-tier caprice [CH — upgraded from v2's framing]**: Oracle has
silently cut Always-Free allowances and terminated over-limit instances
(June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe,
`BackupMxDown`, and the fact that outside an active outage the queue is
empty — a surprise reclamation loses nothing, only coverage until rebuilt.
Rollernet Basic ($30/yr) stays the documented fallback if OCI sours.
- **Spam hygiene**: 4xx-only postscreen on the VM (pregreet + conservative
DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by
rspamd, never bounced.
- Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant;
accepted).
## Rollback
Remove the MX + A records; wait for `postqueue -p` empty; `terraform destroy`
on `backup-mx`; delete the pfSense NAT rule (scripted); drop the mailserver
/32 exemptions. Order matters: MX record first.
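
The "wait for `postqueue -p` empty" step can be scripted rather than eyeballed. The sketch below is a minimal, assumption-laden version: it relies on `postqueue -p` printing the line `Mail queue is empty` when nothing is queued, and `queue_is_empty` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds only when the text on stdin is Postfix's
# empty-queue report. Assumption: `postqueue -p` prints exactly
# "Mail queue is empty" when the queue holds no messages.
queue_is_empty() {
  grep -qx 'Mail queue is empty'
}

# Usage sketch on the VM (not executed here):
#   until postqueue -p | queue_is_empty; do sleep 60; done
```

Running the loop after the MX record is removed and before `terraform destroy` keeps the ordering stated above.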
## Viktor's manual steps (everything else is mine)
1. Create the Oracle Cloud account — **home region `eu-frankfurt-1`** (fixed
forever), card for identity, $0 charged.
2. **Convert the tenancy to Pay-As-You-Go** (required — idle-reclamation
exemption; Always-Free stays $0).
3. Hand me the tenancy OCID + a console user → I mint the API key, store
creds (Vault + Vaultwarden), and build the stack.
4. Approve the (scripted) pfSense NAT rule when I reach that step.


@ -0,0 +1,89 @@
# Drone Logbook (Open DroneLog) — Design
**Date:** 2026-07-04
**Status:** Approved (Viktor, 2026-07-04)
**Owner request:** "I have a DJI Mini 4 Pro. I'm interested in github.com/ViktorBarzin/drone-logbook" → self-host it in the cluster.
## Goal
Self-host [Open DroneLog](https://github.com/arpanghosh8453/open-dronelog) (upstream of the
`ViktorBarzin/drone-logbook` fork) at **https://dronelog.viktorbarzin.me** so Viktor can import
DJI Fly flight logs from his DJI Mini 4 Pro and analyze them privately: telemetry charts, 3D map
replay, per-flight and lifetime stats. All data stays in the cluster (single DuckDB database).
## Decisions (interview, 2026-07-04)
| Question | Decision |
|---|---|
| Deployment form | Self-hosted Docker web app in k8s (not desktop app, not hosted webapp) |
| Exposure | Public `dronelog.viktorbarzin.me`, **Authentik forward-auth** (`auth = "required"`) |
| Log ingestion | **Both** manual web upload *and* a server-side auto-import drop folder from day one |
| Image source | **Upstream** `ghcr.io/arpanghosh8453/open-dronelog:latest` — NOT the fork |
| Fork disposition | Fork is 0 ahead / 372 behind, adds nothing; delete or park it. Only revive (sync + ADR-0002 GHA build) if Viktor starts modifying the code |
## Architecture
New Tier-1 stack `stacks/drone-logbook/`, modeled line-by-line on `stacks/freshrss/`
(the closest existing shape: single upstream-image app, own data volume, Keel-updated):
- **Namespace** `drone-logbook`, tier `4-aux`, label `keel.sh/enrolled=true` → Kyverno injects
Keel poll annotations → auto-upgrades as upstream releases (project is actively maintained).
- **Deployment** (1 replica, `Recreate` — DuckDB is single-writer/embedded):
- image `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx frontend + Axum REST backend, port 80)
- memory request=limit **512Mi** (DuckDB import/analytics spikes), cpu request 25m, no cpu limit
- standard `KYVERNO_LIFECYCLE_V1` / `KEEL_IGNORE_IMAGE` / `KEEL_LIFECYCLE_V1` lifecycle ignores
- **App data** `/data/drone-logbook` (DuckDB db, cached DJI decryption keys, uploaded originals):
**`proxmox-lvm-encrypted` block PVC** `drone-logbook-data-encrypted`, 2Gi, topolvm autoresize →
10Gi ceiling. Encrypted class because flight logs are GPS traces of home/travel — sensitive data
defaults to `proxmox-lvm-encrypted` per the storage decision rule (`.claude/CLAUDE.md`).
Embedded DBs stay off NFS (same rationale documented in the freshrss stack: NFS only for static files).
- **Backup CronJob** `drone-logbook-backup` (mandatory for every proxmox-lvm app): daily 01:30
file copy of the data volume → NFS `/srv/nfs/drone-logbook-backup` (dated dirs, 30-day retention,
Pushgateway metrics), pod-affinity co-scheduled with the app pod (RWO volume). 01:30 sits outside
the 00:00/08:00/16:00 sync-import windows so the DuckDB file is quiescent; retained upload
originals make even a torn copy recoverable by re-import. `nfs-mirror` (02:00) ships it to sda →
Synology offsite. Vaultwarden pattern.
- **Sync drop folder**: static NFS volume (`modules/kubernetes/nfs_volume`)
`192.168.1.127:/srv/nfs/drone-logbook/sync-logs`, mounted **read-only** at `/sync-logs`;
`SYNC_LOGS_PATH=/sync-logs`, `SYNC_INTERVAL="0 0 */8 * * *"` (every 8 h).
Any producer (Nextcloud sync, scp, a future phone pipeline) drops `.txt` logs there; the app
imports them automatically. `KEEP_UPLOADED_FILES=true` keeps re-importable originals in the PVC.
- **Ingress** via `ingress_factory`: `name = "dronelog"`, `auth = "required"` (Authentik
forward-auth), `dns_type = "proxied"`. External Uptime Kuma HTTPS monitor comes automatically
with the ingress annotation. Homepage tile (group "Media & Entertainment", icon `mdi-quadcopter`).
- **Secrets**: Vault KV `secret/drone-logbook` (`profile_creation_pass`) → ExternalSecret
(`vault-kv` ClusterSecretStore) → k8s secret `drone-logbook-secrets` → env
`PROFILE_CREATION_PASS`. Gates profile create/delete even for other Authentik-logged-in users.
No plan-time secret reads needed (no `data "kubernetes_secret"`).
No `DJI_API_KEY` — bundled default is fine at personal import volume; add later if rate-limited.
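
The claim that the 01:30 backup sits outside the sync-import windows follows from the hour field `*/8` in `SYNC_INTERVAL`. A small sketch makes the firing hours explicit; `step_hours` is a hypothetical helper that only models the hour field of a cron step expression.

```bash
# Hypothetical helper: list the hours (0-23) matched by a cron hour field
# of the form "*/STEP". $1 = STEP.
step_hours() {
  h=0
  out=""
  while [ "$h" -lt 24 ]; do
    out="${out:+$out }$h"
    h=$(( h + $1 ))
  done
  echo "$out"
}

step_hours 8   # hour field of SYNC_INTERVAL="0 0 */8 * * *"
```

This prints `0 8 16`, so the imports start at 00:00, 08:00 and 16:00 and the 01:30 backup does not coincide with a scheduled scan.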
## Operational notes
- **DJI egress dependency**: importing a *new* log file requires the pod to reach DJI's servers
once (flight-log decryption key fetch; keys are then cached in the data dir). Remember this when
egress enforcement lands (Security wave 1, beads `code-8ywc`).
- The web UI is desktop-first; mobile is functional but basic.
- NFS host prerequisite: `/srv/nfs/drone-logbook/sync-logs` (root:www-data, 2775 — same shape as
sibling dirs) and `/srv/nfs/drone-logbook-backup` created on 192.168.1.127 and recorded in
`secrets/nfs_directories.txt`. `/srv/nfs` is exported whole-tree, so no `/etc/exports`
(`scripts/pve-nfs-exports`) change.
- Backup story = the daily app-level backup CronJob (above) + the host `daily-backup` LVM-snapshot
leg + original log files retained both in the drop folder and in the data volume
(`KEEP_UPLOADED_FILES=true`).
## Alternatives considered
- **Build from the fork** (`ghcr.io/viktorbarzin/...` via GHA, ADR-0002): rejected for now — fork
has zero custom commits; a build chain adds maintenance for no benefit. Revisit if code changes
are wanted.
- **`auth = "app"` + app profile passwords** (would enable the `opendronelog-sync` native uploader
from anywhere): rejected — a single app password guarding GPS traces of home/travel on the open
internet is weaker than Authentik; the sync drop folder covers automated ingestion instead.
- **Internal-only (.lan + VPN)**: rejected — Authentik-gated public matches the rest of the
homelab and works without VPN while traveling.
- **NFS for the DuckDB data**: rejected — embedded-DB-on-NFS locking risk; freshrss precedent
keeps app DB data on proxmox-lvm.
## Implementation
See `2026-07-04-drone-logbook-plan.md`.


@ -0,0 +1,542 @@
# Drone Logbook (Open DroneLog) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Deploy Open DroneLog (DJI flight-log analyzer) at https://dronelog.viktorbarzin.me — new Tier-1 stack `stacks/drone-logbook/`, upstream image, Authentik-gated, with a DuckDB data PVC and an NFS auto-import drop folder.
**Architecture:** Single Deployment running `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx + Axum + DuckDB, port 80) in namespace `drone-logbook`; data on a `proxmox-lvm-encrypted` PVC (GPS logs = sensitive data), `/sync-logs` drop folder on static NFS, daily backup CronJob to `/srv/nfs/drone-logbook-backup` (vaultwarden pattern), `ingress_factory` with `auth = "required"`, Keel auto-upgrades via namespace enrollment. Modeled line-by-line on `stacks/freshrss/`. Design: `2026-07-04-drone-logbook-design.md`.
**Tech Stack:** Terraform/Terragrunt (Tier-1 PG state), Vault KV + ESO, ingress_factory, nfs_volume module, Keel/Kyverno.
Terraform is exempt from TDD (execution.md); each task ends with a concrete verification instead.
---
### Task 1: Vault secret
**Files:** none (Vault KV only)
- [ ] **Step 1.1: Create `secret/drone-logbook` with a generated profile-creation password**
```bash
vault kv put secret/drone-logbook profile_creation_pass="$(openssl rand -base64 24)"
```
- [ ] **Step 1.2: Verify**
```bash
vault kv get -field=profile_creation_pass secret/drone-logbook | wc -c
```
Expected: `32` (the 32-character base64 value; `-field` output carries no trailing newline when piped). Never echo the value itself.
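
The 32-character figure comes from base64 arithmetic: every 3 input bytes become 4 output characters. A minimal sketch, with `b64_len` as a hypothetical helper:

```bash
# Hypothetical helper: length of the padded base64 encoding of N raw bytes.
# $1 = number of input bytes
b64_len() {
  echo $(( ($1 + 2) / 3 * 4 ))
}

b64_len 24   # byte count passed to `openssl rand -base64 24`
```

This prints 32, the number of characters stored in `profile_creation_pass`.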
### Task 2: NFS drop folder on 192.168.1.127
**Files:**
- Modify: `secrets/nfs_directories.txt` (git-crypt'd — **edit from the MAIN checkout only**, never the worktree; sorted list, add `drone-logbook-backup` and `drone-logbook/sync-logs`)
- [ ] **Step 2.1: Create the directories** — world-writable + setgid like `vaultwarden-backup` (the `/srv/nfs` export root-squashes, so pod-root writes land as `nobody`):
```bash
ssh root@192.168.1.127 'mkdir -p /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && chown -R root:www-data /srv/nfs/drone-logbook /srv/nfs/drone-logbook-backup && chmod 2777 /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && ls -ld /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup'
```
Expected: `drwxrwsrwx ... root www-data ...` for both.
No `/etc/exports` (`scripts/pve-nfs-exports`) change — `/srv/nfs` is exported whole-tree.
- [ ] **Step 2.2: Record them in the declarative list (MAIN checkout, plaintext there)** — insert `drone-logbook-backup` and `drone-logbook/sync-logs` (after `diun`, before `etcd-backup`) in `~/code/infra/secrets/nfs_directories.txt`, then commit that single file to master:
```bash
git -C ~/code/infra add secrets/nfs_directories.txt
git -C ~/code/infra commit -m "nfs_directories: add drone-logbook/sync-logs + drone-logbook-backup
Drop folder for the new drone-logbook stack's auto-import (SYNC_LOGS_PATH)
plus its backup target. Directories created on 192.168.1.127 root:www-data 2777."
git -C ~/code/infra push forgejo master
```
(Trivial single-file exception per execution.md; encrypted files cannot be edited from the worktree.)
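
The mode string expected in Step 2.1 can be sanity-checked on any Linux box before touching the NFS host. The sketch below assumes a `stat` that supports `-c '%A'` (GNU coreutils or BusyBox); `mode_string` is a hypothetical helper that works on a throwaway temp directory.

```bash
# Hypothetical helper: show how `ls -ld` would render a directory with the
# given octal mode. Uses a temp dir, so nothing on the NFS host is touched.
# Assumption: `stat -c '%A'` is available (GNU coreutils or BusyBox).
mode_string() {
  d=$(mktemp -d) || return 1
  chmod "$1" "$d"
  stat -c '%A' "$d"
  rmdir "$d"
}

mode_string 2777   # world-writable + setgid, as used in Step 2.1
```

On such a system this prints `drwxrwsrwx`; the same helper with `2775` prints `drwxrwsr-x`, which is the quickest way to tell the two modes apart in an `ls -ld` listing.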
### Task 3: Stack files (in the `wizard/drone-logbook` worktree)
**Files:**
- Create: `stacks/drone-logbook/main.tf` (content below)
- Create: `stacks/drone-logbook/terragrunt.hcl` (content below)
- Create: `stacks/drone-logbook/secrets` → symlink to `../../secrets`
- (`backend.tf`, `tiers.tf`, `cloudflare_provider.tf`, `providers.tf`, `.terraform.lock.hcl` are terragrunt-generated and **gitignored** — do NOT create or commit them; the tracked copies in old stacks like freshrss predate the ignore rule)
- [ ] **Step 3.1: `terragrunt.hcl`**
```hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}
```
- [ ] **Step 3.2: `main.tf`** — exact content:
```hcl
variable "tls_secret_name" {
type = string
sensitive = true
}
variable "nfs_server" { type = string }
# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog) — self-hosted
# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the
# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest.
# Design: docs/plans/2026-07-04-drone-logbook-design.md
resource "kubernetes_namespace" "drone_logbook" {
metadata {
name = "drone-logbook"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "drone-logbook-secrets"
namespace = "drone-logbook"
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "drone-logbook-secrets"
}
dataFrom = [{
extract = {
key = "drone-logbook"
}
}]
}
}
depends_on = [kubernetes_namespace.drone_logbook]
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
tls_secret_name = var.tls_secret_name
}
# DuckDB database + cached DJI decryption keys + uploaded originals.
# Embedded DB -> block storage, not NFS (same rationale as freshrss data).
# Encrypted class: flight logs are GPS traces of home/travel (sensitive data
# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md).
resource "kubernetes_persistent_volume_claim" "data" {
wait_until_bound = false
metadata {
name = "drone-logbook-data-encrypted"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "2Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and PVCs
# can't shrink; without this every apply tries to revert the size.
ignore_changes = [spec[0].resources[0].requests]
}
}
# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands
# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL.
module "nfs_sync_logs" {
source = "../../modules/kubernetes/nfs_volume"
name = "drone-logbook-sync-logs"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/drone-logbook/sync-logs"
storage = "5Gi"
}
resource "kubernetes_deployment" "drone_logbook" {
metadata {
name = "drone-logbook"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
labels = {
app = "drone-logbook"
"kubernetes.io/cluster-service" = "true"
tier = local.tiers.aux
}
}
spec {
replicas = 1
strategy {
# DuckDB is single-writer; never overlap two pods on the same volume
type = "Recreate"
}
selector {
match_labels = {
app = "drone-logbook"
}
}
template {
metadata {
labels = {
app = "drone-logbook"
"kubernetes.io/cluster-service" = "true"
}
}
spec {
container {
name = "drone-logbook"
image = "ghcr.io/arpanghosh8453/open-dronelog:latest"
env {
name = "RUST_LOG"
value = "info"
}
env {
# keep re-importable originals under /data/drone-logbook/uploaded
name = "KEEP_UPLOADED_FILES"
value = "true"
}
env {
name = "SYNC_LOGS_PATH"
value = "/sync-logs"
}
env {
# 6-field cron (sec min hour dom mon dow): scan drop folder every 8h
name = "SYNC_INTERVAL"
value = "0 0 */8 * * *"
}
env {
name = "PROFILE_CREATION_PASS"
value_from {
secret_key_ref {
name = "drone-logbook-secrets"
key = "profile_creation_pass"
}
}
}
volume_mount {
name = "data"
mount_path = "/data/drone-logbook"
}
volume_mount {
name = "sync-logs"
mount_path = "/sync-logs"
read_only = true
}
port {
name = "http"
container_port = 80
protocol = "TCP"
}
resources {
requests = {
cpu = "25m"
memory = "512Mi"
}
limits = {
memory = "512Mi"
}
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
}
}
volume {
name = "sync-logs"
persistent_volume_claim {
claim_name = module.nfs_sync_logs.claim_name
}
}
}
}
}
depends_on = [kubernetes_manifest.external_secret]
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "drone_logbook" {
metadata {
name = "drone-logbook"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
labels = {
"app" = "drone-logbook"
}
}
spec {
selector = {
app = "drone-logbook"
}
port {
port = "80"
target_port = "80"
}
}
}
# -----------------------------------------------------------------------------
# Backup — required for every proxmox-lvm(-encrypted) app: daily copy of the
# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror ->
# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import
# windows, so the DuckDB file is quiescent; uploaded originals make even a
# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the
# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern.
# -----------------------------------------------------------------------------
module "nfs_backup" {
source = "../../modules/kubernetes/nfs_volume"
name = "drone-logbook-backup-host"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/drone-logbook-backup"
}
resource "kubernetes_cron_job_v1" "backup" {
metadata {
name = "drone-logbook-backup"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
}
spec {
concurrency_policy = "Replace"
failed_jobs_history_limit = 5
schedule = "30 1 * * *"
starting_deadline_seconds = 300
successful_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata {}
spec {
affinity {
pod_affinity {
required_during_scheduling_ignored_during_execution {
label_selector {
match_labels = {
app = "drone-logbook"
}
}
topology_key = "kubernetes.io/hostname"
}
}
}
container {
name = "drone-logbook-backup"
image = "docker.io/library/alpine"
command = ["/bin/sh", "-c", <<-EOT
set -euxo pipefail
_t0=$(date +%s)
now=$(date +"%Y_%m_%d_%H_%M")
mkdir -p /backup/$now
cp -a /data/. /backup/$now/
# Rotate — 30 day retention
find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} +
_dur=$(($(date +%s) - _t0))
_out_bytes=$(du -sb /backup/$now | awk '{print $1}')
wget -qO- --post-data "backup_duration_seconds $${_dur}
backup_output_bytes $${_out_bytes}
backup_last_success_timestamp $(date +%s)
" "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true
EOT
]
volume_mount {
name = "data"
mount_path = "/data"
read_only = true
}
volume_mount {
name = "backup"
mount_path = "/backup"
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
}
}
volume {
name = "backup"
persistent_volume_claim {
claim_name = module.nfs_backup.claim_name
}
}
dns_config {
option {
name = "ndots"
value = "2"
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# https://dronelog.viktorbarzin.me
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "required" # Authentik forward-auth — flight logs are GPS traces of home/travel
dns_type = "proxied"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
name = "dronelog"
service_name = "drone-logbook"
tls_secret_name = var.tls_secret_name
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Drone Logbook"
"gethomepage.dev/description" = "DJI flight log analyzer"
"gethomepage.dev/icon" = "mdi-quadcopter"
"gethomepage.dev/group" = "Media & Entertainment"
"gethomepage.dev/pod-selector" = ""
}
}
```
- [ ] **Step 3.3: Boilerplate**
```bash
WT=~/code/infra/.worktrees/drone-logbook   # worktree root; later steps reference $WT
ln -s ../../secrets "$WT/stacks/drone-logbook/secrets"
```
- [ ] **Step 3.4: Format check**
```bash
terraform fmt -check -diff $WT/stacks/drone-logbook/ || terraform fmt $WT/stacks/drone-logbook/
```
Expected: no diff (or auto-fixed).
- [ ] **Step 3.5: Commit on the branch (files by name, git-crypt filter flags per execution.md)**
```bash
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \
add docs/plans/2026-07-04-drone-logbook-design.md docs/plans/2026-07-04-drone-logbook-plan.md \
stacks/drone-logbook/main.tf stacks/drone-logbook/terragrunt.hcl stacks/drone-logbook/secrets \
.claude/reference/service-catalog.md
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \
commit -m "drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me
Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro
(fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog).
Upstream ghcr image with Keel auto-upgrade, DuckDB data on proxmox-lvm-encrypted PVC,
NFS /sync-logs drop folder auto-imported every 8h, Authentik-gated ingress,
PROFILE_CREATION_PASS from Vault via ESO. Design + plan in docs/plans/."
```
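
The PVC annotations in `main.tf` (`increase = 100%`, `storage_limit = 10Gi`, starting at 2Gi) imply a short growth sequence. The sketch below models it under the assumption that each resize adds the stated percentage of the current size and is clamped to the limit; `resize_steps` is a hypothetical helper and the real autoresizer may round differently.

```bash
# Hypothetical model of PVC growth, sizes in Gi.
# $1 = initial size, $2 = increase in percent, $3 = storage limit
# Assumption: each step adds $2 percent of the current size, capped at $3.
resize_steps() {
  size=$1
  out=$1
  while [ "$size" -lt "$3" ]; do
    size=$(( size + size * $2 / 100 ))
    if [ "$size" -gt "$3" ]; then size=$3; fi
    out="$out $size"
  done
  echo "$out"
}

resize_steps 2 100 10   # values from the PVC annotations
```

Under that assumption the volume goes 2, 4, 8, then 10 Gi, so three resizes reach the ceiling.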
### Task 4: Land and apply
- [ ] **Step 4.1: Presence claim** (CI apply mutates shared infra)
```bash
~/code/scripts/presence claim infra:drone-logbook --purpose "deploy new drone-logbook stack (Open DroneLog) via CI apply"
```
- [ ] **Step 4.2: Merge latest master into the branch, push to master**
```bash
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false fetch forgejo
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false merge forgejo/master
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master
```
Non-fast-forward → another agent landed first: fetch, merge, push again. Branch-protection rejection → fall back to PR via Forgejo API (token = password in `~/.git-credentials`).
- [ ] **Step 4.3: Watch the CI apply to completion** — Woodpecker pipeline on the infra repo (`ci.viktorbarzin.me`), then confirm live:
```bash
kubectl get ns drone-logbook && kubectl -n drone-logbook get deploy,pvc,pods,externalsecret,cronjob
kubectl -n drone-logbook rollout status deploy/drone-logbook --timeout=300s
```
Expected: namespace present, ExternalSecret `SecretSynced`, data PVC `Bound` (the NFS PVCs bind on first pod/job use), CronJob `drone-logbook-backup` scheduled `30 1 * * *`, pod `Running 1/1`.
- [ ] **Step 4.4: Cleanup worktree + branch; release presence**
```bash
git -C ~/code/infra worktree remove .worktrees/drone-logbook
git -C ~/code/infra branch -d wizard/drone-logbook
git -C ~/code/infra pull --ff-only # only if main checkout clean/quiescent
~/code/scripts/presence release infra:drone-logbook
```
### Task 5: End-to-end verification
- [ ] **Step 5.1: Ingress + Authentik gate**
```bash
curl -sI https://dronelog.viktorbarzin.me | head -5
```
Expected: `302` redirect into Authentik (NOT `200`, NOT `404`).
- [ ] **Step 5.2: App alive behind the gate** (bypass ingress via port-forward, read-only debug)
```bash
kubectl -n drone-logbook port-forward svc/drone-logbook 18080:80 &
sleep 2; curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:18080/; kill %1
```
Expected: `200`.
- [ ] **Step 5.3: Sync folder visible in-pod**
```bash
kubectl -n drone-logbook exec deploy/drone-logbook -- ls -ld /sync-logs /data/drone-logbook
```
Expected: both directories listed; `/sync-logs` read-only mount.
- [ ] **Step 5.4: Monitor + homepage** — Uptime Kuma external monitor for `dronelog.viktorbarzin.me` auto-created (ingress annotation); homepage tile under "Media & Entertainment".
- [ ] **Step 5.5: Functional import** — Viktor uploads a real Mini 4 Pro `.txt` log via the web UI (or drops it in `/srv/nfs/drone-logbook/sync-logs`); confirms flight appears with charts/map. Requires pod egress to DJI once per new log (decryption key). If an upstream sample log is available, the agent may pre-verify import via the REST API through the port-forward.
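
The Step 5.1 expectation (a `302` into Authentik, not `200` or `404`) can be turned into a pass/fail check. This is a sketch only: `gate_status` is a hypothetical helper that reads `curl -sI` output on stdin and looks at nothing but the status code on the first line.

```bash
# Hypothetical helper: classify the ingress gate from `curl -sI` output.
# Assumption: the first line looks like "HTTP/2 302" or "HTTP/1.1 200 OK".
gate_status() {
  code=$(head -n 1 | awk '{print $2}' | tr -d '\r')
  case "$code" in
    302) echo "gated" ;;             # redirected to the login flow
    200) echo "open" ;;              # served without auth: gate is missing
    *)   echo "unexpected:$code" ;;  # e.g. 404: ingress or DNS problem
  esac
}

# Usage sketch (not executed here):
#   curl -sI https://dronelog.viktorbarzin.me | gate_status
```

Anything other than `gated` means the gate or the ingress needs a closer look before Step 5.5.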


@ -0,0 +1,125 @@
# immich-frame: LAN-only access, Portals untouched (2026-07-04)
## Goal
Strangers must no longer be able to view `highlights-immich.viktorbarzin.me`
(Viktor's London Portal Plus frame) or `highlights-immich-emo.viktorbarzin.me`
(Emo's Sofia Portal Mini frame) — pages or ImmichFrame API. Both were
`auth = "none"`, Cloudflare-proxied, fully public.
Who keeps access (per Viktor, this session): the two Portals plus **any
household device on the Sofia, London, or Valchedrym home networks**. No
public access, no tailnet requirement. Hard constraint: the Portal app is a
WebView with the URL **baked in at APK build time** (`portal-immich-frame`,
`-PframeUrl`), so the exact URLs must keep loading from where the Portals sit
— zero app rebuilds, zero device touches, zero router changes.
## Design
Two cooperating pieces — the gate and the reachability pointer:
1. **The gate — `home-lans-only` Traefik middleware** (traefik stack, next to
`local-only`): `ipAllowList` of `192.168.1.0/24` (Sofia LAN), `10.0.0.0/8`
(VLANs, K8s pods `10.10.0.0/16`, services `10.96.0.0/12`, WG tunnel
`10.3.2.0/24`), `192.168.8.0/24` (London LAN), `192.168.9.0/24` (London
GUEST net — post-rollout discovery: the Portal Plus actually leases here,
`Portal-75AE8F9C2A8A` = `192.168.9.198`, added same day), `192.168.0.0/24`
(Valchedrym LAN), `fc00::/7`, `fe80::/10`. Attached to both frame
ingresses via `extra_middlewares`. Everyone else gets a Traefik 403 —
including direct-to-WAN-IP requests carrying the right SNI, which DNS
changes alone cannot stop. A **separate** middleware rather than a widened
`local-only`, because widening would silently grant the remote LANs access
to the 9 admin surfaces using it (Prometheus, iDRAC, Loki, …).
2. **The pointer — `dns_type = "internal"`** (new `ingress_factory` tier,
Viktor's idea): a **non-proxied public A record → `10.0.20.203`** (module
var `internal_lb_ip`). Outsiders resolve it but get an unroutable RFC1918
address; every household resolver path delivers a working answer with no
config anywhere: Sofia LAN already gets the internal CNAME from Technitium,
London/Valchedrym resolve the public record via any upstream and
policy-route `10.0.0.0/8` down the WireGuard tunnel. IPv4-only (spokes
route no internal v6 range).
Interlock (the reason both flip together): with a *proxied* record, public
traffic arrives from cloudflared **pod IPs inside 10/8** and would sail
through the allowlist. `internal` removes the Cloudflare path entirely (CF
edge stops serving the hostname), so every request reaches Traefik with its
real source IP (ETP=Local). Verified: no wildcard `*.viktorbarzin.me` record
exists to resurrect public resolution.
`auth` stays `"none"` — there is still no *user* auth by design (kiosk
WebView; forward-auth would 302 the device to a login it can't complete, and
Emo's Google-only account can't log in inside a WebView at all); the
convention comment now names the ipAllowList as the gate.
### Resulting flows
| Client | Path | Result |
|---|---|---|
| Emo's Portal Mini (Sofia LAN) | Technitium CNAME → `.203` direct (unchanged) | allowed (`192.168.1.x`) |
| Viktor's Portal Plus (London GUEST net) | public A → `10.0.20.203` → WG tunnel | allowed (`192.168.9.x`) |
| Household browsers (any of the 3 LANs) | same as above | allowed |
| In-cluster checks (`homelab browser`, blackbox) | CoreDNS → Technitium → `.203` | allowed (pod IP in 10/8) |
| Stranger, resolves hostname | gets `10.0.20.203` | unroutable |
| Stranger, hits WAN IP with SNI | pfSense NAT → Traefik (real source IP) | **403** |
| Stranger, via Cloudflare | no proxied record | CF edge won't serve the host |
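
The allow/deny outcomes in the table above can be sketched as a plain CIDR membership check. This is a minimal illustration only, not Traefik's implementation: the helper names (`ip_to_int`, `in_cidr`, `gate`) are hypothetical, and only the IPv4 ranges from the design are modelled (the `fc00::/7` and `fe80::/10` entries are left out).

```bash
# Hypothetical sketch of the home-lans-only decision (IPv4 ranges only).
ALLOWLIST="192.168.1.0/24 10.0.0.0/8 192.168.8.0/24 192.168.9.0/24 192.168.0.0/24"

# ip_to_int <dotted-quad> -> the address as a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the dotted quad into four positional parameters
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_cidr <ip> <network/prefix> -> exit 0 if the address is inside the network.
in_cidr() {
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# gate <source-ip> -> prints "allowed" or "403", mirroring the Result column.
gate() {
  for cidr in $ALLOWLIST; do
    if in_cidr "$1" "$cidr"; then
      echo allowed
      return 0
    fi
  done
  echo 403
  return 1
}
```

For example, `gate 192.168.9.198` (the Portal Plus lease on the London guest net) matches `192.168.9.0/24`, while a public source address matches nothing and falls through to the 403 branch.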
### Rejected alternatives
- **ImmichFrame `AuthenticationSecret`** (supported upstream: web input field
or `?authsecret=` param + bearer API): real auth from anywhere, but family
browsers would face a secret prompt (fails "household devices just work"),
the secret leaks into URLs/analytics/APK, and robust rollout needs APK
rebuild + USB-adb sideload on both Portals (the Sofia one is high-friction).
- **Authentik forward-auth / `auth = "public"`**: WebView can't complete SSO
(Google blocks WebView logins; session expiry silently bricks an appliance);
the anonymous outpost is an audit trail, not a gate.
- **Remove DNS + London router AdGuardHome rewrites**: works, but adds an
out-of-band, un-IaC'd router dependency the internal-IP record makes
unnecessary. Kept as documented fallback if resolver-side private-IP
filtering ever appears in the London path.
## Pre-verified facts (2026-07-04)
- London Flint 2 DNS chain returns RFC1918 answers unfiltered
(`nslookup 10.0.20.203.nip.io 127.0.0.1` on the router → `10.0.20.203`;
dnsmasq `rebind_protection '0'`, no AdGuardHome rebind filtering).
- Technitium already CNAMEs both hostnames → apex → `10.0.20.203`
(`technitium-ingress-dns-sync` is ingress-driven, not DNS-record-driven, so
the internal answer survives the Cloudflare record swap).
- Pod CIDR `10.10.0.0/16`, service CIDR `10.96.0.0/12` — inside `10.0.0.0/8`.
- No public wildcard record in the zone.
## Blast radius & cleanups
- `external_monitor = false` set explicitly on both ingresses: the
external-monitor-sync default opt-in would otherwise keep the now-doomed
`[External] highlights-immich*` uptime-kuma monitors alive and red. Verify
the sync drops them post-apply.
- rybbit CF worker: `highlights-immich` removed from `SITE_IDS` (`index.js`)
and `wrangler.toml` routes — off Cloudflare the route can never fire.
Requires a `wrangler deploy` to take effect (route removal is hygiene, not
functional).
- Homepage dashboard link keeps working from LANs (hostname unchanged).
- Docs updated in the same change: `.claude/CLAUDE.md` (DNS tier +
external-monitor mechanism), `AGENTS.md`, `docs/architecture/networking.md`
(Internal-IP domains category). The `portal-immich-frame` repo's glossary
("public, login-less URL") updated separately in that repo.
## Failure-mode delta
London frame now depends on the WG tunnel instead of Cloudflare+cloudflared
(the app self-heals with 5s retries; tunnel-flap modes documented in
`docs/architecture/vpn.md`). A Traefik LB renumber must update
`internal_lb_ip` in the module alongside the split-horizon apex record.
Cutover window: cached proxied answers keep working ≤ ~5 min TTL, then the
WebView's own retry picks up the new path.
## Verification & rollback
Verify: public dig → `10.0.20.203` (both hosts); Technitium dig → `.203`;
curl from devvm (10/8) → 200; external vantage (WebFetch/cloud) → unreachable
or 403; middleware attached on both ingresses; Emo's frame renders via
`homelab browser`; London Portal image fetches visible in Traefik access logs
from `192.168.8.x`. Rollback: `git revert` + apply traefik/immich — records
and middleware chain restore (`allow_overwrite = true` re-adopts the records).

@ -129,3 +129,40 @@ heavy user between 12-16G even with RAM free; bump to 16/20 if that bites.
storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
correct pairing. A famous tool that "does OOM" still has to be proven to fire
under *your* configuration.
## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed
The soft-cap layer of this design was falsified in production on 2026-07-02
(~15:42-16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide
alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside
t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With
`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked
every allocating task of the cgroup in `mem_cgroup_handle_over_high`
(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`)
— including the `t3 serve` event loop (~0.5G RSS, pure collateral). The accept
queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104]
Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`,
and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by
hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G→1.4G
and the service recovered in seconds with no restart).
The Verification bullet above — a soft-capped balloon "throttled to a crawl,
making no progress and **harming nothing**" — holds only when the hog is alone
in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl
IS the harm: a hog that stabilises below `MemoryMax` never triggers the local
OOM the design counted on, so the band converts "runaway dies" into "everyone
in the cgroup stalls forever".
**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work
cgroup definitions** — `scripts/t3-serve@.service`, the `user-.slice.d`
drop-in, and `docker.slice` (`setup-devvm.sh` §10a/§10c). A runaway now runs
unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately
(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills
the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom — the layers
the stress tests actually validated — are unchanged. Applied live via
`daemon-reload` + runtime `set-property` on the running cgroups; no session
restarts.
Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max` — it is
an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill
beats throttle-and-pray for multi-tenant interactive services.
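
The livelock signature described above (high `full` pressure, a climbing `high` counter, `oom_kill=0`) can be read from the cgroup v2 files `memory.pressure` and `memory.events`. A minimal sketch, assuming the standard cgroup v2 file formats; the function names and the 50% threshold are hypothetical choices for illustration, not part of the deployed tooling.

```bash
# Extract the "full avg60" percentage from memory.pressure text on stdin.
pressure_full_avg60() {
  awk '$1 == "full" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg60=/) { sub(/^avg60=/, "", $i); print $i }
  }'
}

# events_counter <key>: extract one counter (e.g. high, oom_kill) from
# memory.events text on stdin.
events_counter() {
  awk -v key="$1" '$1 == key { print $2 }'
}

# looks_livelocked <memory.pressure text> <memory.events text>
# Exit 0 when the cgroup is stalled hard (full avg60 >= 50, an assumed
# threshold) while the kernel has not OOM-killed anything in it.
looks_livelocked() {
  avg60=$(printf '%s\n' "$1" | pressure_full_avg60)
  kills=$(printf '%s\n' "$2" | events_counter oom_kill)
  awk -v a="$avg60" -v k="$kills" 'BEGIN { exit !(a + 0 >= 50 && k + 0 == 0) }'
}
```

Usage, with `CG` as a hypothetical variable holding the unit's cgroup directory: `looks_livelocked "$(cat "$CG/memory.pressure")" "$(cat "$CG/memory.events")" && echo "stalled, not dying"`.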

@ -0,0 +1,135 @@
# Paperless-ngx Mail Ingest (docs@viktorbarzin.me)
Last updated: 2026-07-03 (initial build)
Forward any email with document attachments to **`docs@viktorbarzin.me`** and
paperless-ngx ingests the attachments, owned by the paperless account mapped
from the **sender** (From) address. Built entirely from existing parts: a
docker-mailserver mailbox + Dovecot sieve, and paperless-ngx's native mail
consumer (the same machinery as the `utility:` rules).
## Flow
```
family member forwards email ──> MX ──> docker-mailserver
│ postfix virtual: docs@ has an explicit self-alias (extra/aliases.txt),
│ so the @domain catch-all (→ spam@, swept by TripIt) does NOT apply
▼
Dovecot LMTP delivery to docs@
│ per-user sieve (docs@viktorbarzin.me.dovecot.sieve): sender NOT in
│ allowlist → discard (decision 2026-07-03: unmatched = ignore & delete)
▼
docs@ INBOX ── paperless-ngx mail task (every 10 min, PAPERLESS_EMAIL_TASK_CRON
│ default) applies mail rules in order: filter_from = <sender>
│ → consume attachments (ALL parts incl. inline; see design
│ notes: Apple Mail marks real PDFs inline), owner = mapped user,
│ tag = email-ingest, title = mail subject
▼
consumed mail is MOVED to the "Processed" IMAP folder (audit trail);
INBOX stays empty in steady state
## Sender → paperless account map (as built)
| Sender (From) | Paperless user | Rule |
|--------------------------|----------------|-----------------|
| me@viktorbarzin.me | root (id 3) | forward: Viktor (me@) |
| vbarzin@gmail.com | root (id 3) | forward: Viktor (gmail) |
| viktorbarzin@meta.com | root (id 3) | forward: Viktor (meta) |
| ancaelena98@gmail.com | anca (id 4) | forward: Anca |
| emil.barzin@gmail.com | emo (id 7) | forward: Emo |
The map lives in **two places by design** — keep them in sync:
1. **Delivery gate (infra, Terraform):**
`stacks/mailserver/modules/mailserver/extra/docs-at-viktorbarzin.me.dovecot.sieve`
— senders not listed here are discarded at delivery (spam control + the
"ignore and delete unmatched" behaviour; paperless cannot express
"delete without ingesting", so this must happen before the mailbox).
2. **Owner map (paperless DB, via API/UI):** one mail rule per sender on the
`docs@viktorbarzin.me` mail account. DB-state like workflows — NOT
Terraform.
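
The two-stage behaviour above (the sieve gate discards unknown senders, the mail rules pick the owner) amounts to a lookup from sender to paperless user. A sketch of that lookup using the table's addresses; the function name `owner_for` is hypothetical, and lower-casing the sender before matching is an assumption of this sketch rather than a statement about how the sieve or paperless compare addresses.

```bash
# owner_for <from-address> -> the paperless user that would own the document,
# or "discard" when the sender is not in the allowlist.
owner_for() {
  sender=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$sender" in
    me@viktorbarzin.me|vbarzin@gmail.com|viktorbarzin@meta.com) echo root ;;
    ancaelena98@gmail.com) echo anca ;;
    emil.barzin@gmail.com) echo emo ;;
    *) echo discard ;;
  esac
}
```

Because the map lives in two places, a new sender has to be added to both the sieve allowlist and a mail rule; in this sketch that corresponds to adding one `case` arm.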
## Add a family member / sender
1. Add the address to the sieve allowlist file above; commit; apply the
`mailserver` stack (normal apply is enough — the sieve CM key is not under
`ignore_changes`; Reloader restarts the pod).
2. Clone an existing `forward:` mail rule in the paperless admin UI
(Mail → Rules) or via API, changing `filter_from` and the rule **owner**
(documents are owned by the rule owner — `assign_owner_from_rule=true`).
Keep: action = Move to `Processed`, attachment type = **process all files
including inline** (`attachment_type=2` — NOT attachments-only, see design
notes), consumption scope = attachments only, tag `email-ingest`, order
after the existing rules.
## Operations
- **Trigger a fetch immediately** (instead of waiting ≤10 min):
`kubectl -n paperless-ngx exec deploy/paperless-ngx -c paperless-ngx -- s6-setuidgid paperless python3 manage.py mail_fetcher`
The `s6-setuidgid paperless` is **required**: `kubectl exec` runs as root, and a
root-run fetcher downloads attachments root-owned into the scratch dir, which
the celery consumer (uid 1000) then can't read — `PermissionError` on
`/tmp/paperless/paperless-mail-*/...`, consume task FAILURE (hit during the
2026-07-03 build E2E). The mail correctly stays in INBOX for retry (the move
action is a chord callback on successful consumption). Recover: `rm -rf
/tmp/paperless/paperless-mail-*` (as root) and let the next scheduled fetch
re-process.
- **Mailbox credentials:** Vault `secret/platform` → `mailserver_accounts`
JSON, key `docs@viktorbarzin.me` (also used by the paperless mail account).
- **Inspect the mailbox:**
`python3 -c` IMAP to `mailserver.mailserver.svc.cluster.local:993` (in-cluster,
from a pod) or `mail.viktorbarzin.me:993` (externally / devvm).
- **Paperless-side logs:** `kubectl -n paperless-ngx logs deploy/paperless-ngx | grep -i mail`
(also Loki, ns `paperless-ngx`). Rule/account state: `GET /api/mail_rules/`,
`GET /api/mail_accounts/` with the admin token
(k8s secret `paperless-ngx-secrets`, field `api_token`).
- **Account/mailbox provisioning:** adding/rotating anything in
`mailserver_accounts` requires the ConfigMap replace workaround —
`scripts/tg apply mailserver -- -replace=module.mailserver.kubernetes_config_map.mailserver_config`
— because `postfix-accounts.cf` is under `ignore_changes`
(non-deterministic bcrypt; see the module comment).
## Design notes / caveats
- **Why not the catch-all?** Mail to unknown `@viktorbarzin.me` addresses
lands in `spam@`, which the TripIt `ingest-plans` CronJob sweeps every
15 min: it marks everything `\Seen`, LLM-parses mail from linked senders and
replies with ack/failure emails. Forwarded bank statements would get
"couldn't parse a trip" replies. `docs@` being a real mailbox bypasses that
path entirely; TripIt, the `smoke-test@` roundtrip probe, and `dmarc@` are
untouched.
- **Spoofing:** the sender match is on the From header. Rspamd verifies
SPF/DKIM/DMARC on inbound mail, but gmail.com publishes `p=none`, so a
crafted spoof could ingest documents into a family member's account. Accepted
risk (worst case: unwanted documents appear, visible + deletable in
paperless).
- **Not PDF-only:** any attachment type paperless supports is consumed
(PDF, images, Office via the existing tika+gotenberg pipeline).
- **Inline attachments ARE processed (`attachment_type=2`, flipped
2026-07-03):** the rules originally used attachments-only (1) to skip
signature logos, but the very first real forward (Apple Mail, Viktor's
client) attached the invoice PDF with `Content-Disposition: inline`;
paperless matched the rule, consumed nothing, and recorded
`PROCESSED_WO_CONSUMPTION` (which, like any ProcessedMail row, blocks that
UID from ever being re-processed — delete the row via `manage.py shell` to
retry). Trade-off: signature/inline images in forwards may be ingested as
junk docs (tagged `email-ingest`, easy to spot). If that gets noisy, add
`filter_attachment_filename_exclude` patterns to the rules using the
actually-observed junk filenames — do NOT flip back to attachments-only.
- **No dedicated alerting** (deliberate, 2026-07-03): mail-task errors surface
in paperless logs; the mailserver inbound path is covered by
`email-roundtrip-monitor`. Revisit if forwards start silently failing.
- **Workflows:** the global `payslip-webhook` + `claude-mcp-readers
auto-permission` workflows fire for mail-ingested docs like any other
consumption source (verified pre-build; payslip receiver does its own
filtering).
## Rollback
1. Disable/delete the 5 `forward:` mail rules + the `docs@` mail account
(paperless admin UI or API).
2. Revert the infra commit (aliases.txt entry, sieve file, CM key + mount).
3. Remove `docs@viktorbarzin.me` from Vault `mailserver_accounts`, then apply
with the `-replace` workaround above. Mail to docs@ then falls back to the
catch-all (spam@) like any unknown address.

@ -109,10 +109,17 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
node_memory_SwapFree_bytes{instance="devvm"}
```
Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`;
a runaway agent now OOMs alone inside the cgroup instead of taking the box
(and the WS server) with it.
Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`):
per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and
`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog
plateauing between high and max never OOMs and the kernel high-throttle stalls
the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on
2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch
`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`,
`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable).
A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling
the WS server with it. Post-mortem addendum:
`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`.
## 4. Known root causes (2026-06-10 investigation)

@ -0,0 +1,98 @@
# Valia sites — add / update / retire
Off-infra static sites authored by Valia (ADR-0018, CONTEXT.md "Valia site").
Serving: Cloudflare Pages. Freshness: the `valia-sites-sync` CronJob
(`valia-sites` ns) mirrors each Content folder every 10 minutes and deploys
only when the folder's manifest hash changed. Registry: `local.sites` in
`stacks/valia-sites/main.tf` — one entry per site drives everything (Pages
project, custom domain, public CNAME, internal split-horizon CNAME, sync).
Current sites: `bridge` (ОбУ „Отец Паисий“ — "мост"), `stem95su` (95. СУ STEM
board).
## Add a site
1. Valia shares the Drive folder with **vbarzin@gmail.com** (viewer is enough —
the pipeline is strictly read-only towards Drive).
2. Get the folder id from its URL (`drive.google.com/drive/folders/<ID>`).
3. Pick the **English** subdomain name (Viktor's call — CONTEXT.md naming rule).
4. Add one entry to `local.sites` in `stacks/valia-sites/main.tf`:
```hcl
<name> = {
folder_id = "<ID>"
src_path = "" # or "sub/folder" if servable files live deeper
entry_file = "index.html" # or whatever her main HTML file is called
manage_dns = true
}
```
5. Commit + push; CI applies. Within ~10 min the sync deploys content and the
site serves at `https://<name>.viktorbarzin.me` (custom-domain TLS takes
~5-10 min extra on first attach; CF returns 522 for the hostname until
then). Internal LAN/VLAN/pod resolution appears when the hourly
`technitium-ingress-dns-sync` next runs — trigger it early with:
`kubectl create job --from=cronjob/technitium-ingress-dns-sync valia-dns-now -n technitium`
## Content rules (what Valia's folder must look like)
- The **entry file** must exist — the sync stages a copy as `index.html` at
deploy time, so `/` works; the original filename keeps working too (deep
links survive). If the folder is empty or the entry file is missing, the
sync **skips the site and leaves it as-is** (never wipes a live site).
- Google-native files (Docs/Sheets) are **ignored** (`--drive-skip-gdocs`) —
only real files (`.html`, images, …) deploy. Gemini's HTML exports are fine.
- Per-file limit 25 MB (Cloudflare Pages), 20k files max — far beyond a
1-page site.
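
The deploy-only-on-change behaviour and the missing-entry-file guard described above can be sketched as follows. This is an illustration under stated assumptions, not the CronJob's actual code: the function names are hypothetical, the manifest is assumed to be a hash over every file's path and content in a local mirror, and `sha256sum` (GNU coreutils) is assumed to be available.

```bash
# manifest_hash <dir> -> one hash covering every file's path and content.
manifest_hash() {
  (cd "$1" && find . -type f -exec sha256sum {} + \
    | LC_ALL=C sort -k 2 | sha256sum | cut -d ' ' -f 1)
}

# sync_decision <dir> <entry_file> <last_deployed_hash>
# Prints: skip (guard tripped), unchanged, or deploy.
sync_decision() {
  if [ ! -f "$1/$2" ]; then
    echo skip        # empty folder or missing entry file: keep the live site
    return 0
  fi
  if [ "$(manifest_hash "$1")" = "$3" ]; then
    echo unchanged   # same manifest as the last deploy: nothing to do
  else
    echo deploy
  fi
}
```

Clearing the stored hash for a site (the third argument here) makes the next run print `deploy`, which is the same effect as deleting the site's key from the state ConfigMap.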
## Update a site
Nothing to do: Valia edits the folder, the site follows within ~10 minutes.
Force it early: `kubectl create job --from=cronjob/valia-sites-sync sync-now -n valia-sites`
## Rename / retire a site
Rename = retire + add (Pages projects can't be renamed). Retire:
1. Delete the entry from `local.sites`; commit + push. TF destroys the public
CNAME + custom domain + Pages project; the internal record is removed by
the next `technitium-ingress-dns-sync` run (its deletion pass drops any
internal `*.pages.dev` CNAME that left the `valia-sites-dns` ConfigMap —
scoped so it can never touch non-Pages records).
2. That's all — no manual DNS cleanup (the pre-ADR-0018 add-only gotcha is
fixed by the deletion pass).
## Failure modes / debugging
- **Visibility is failed-Job-only by choice** (ADR-0018): no alerts, no
notifications. Check: `kubectl get jobs -n valia-sites | tail`, logs of the
last `valia-sites-sync-*` pod.
- **Drive auth broken** (`FATAL … Drive list failed`): the shared
`secret/valia-sites.rclone_conf` token died. The GCP OAuth app
(`home-lab-1700868541205`) must stay published to "Production" or refresh
tokens expire weekly (same constraint as the old stem95su conf, which this
one was copied from). Re-mint and `vault kv patch secret/valia-sites
rclone_conf=@…`.
- **Wrangler auth broken**: `secret/valia-sites.cloudflare_pages_token` is a
SCOPED token (Pages Read+Write on the account, id
`355d2c9d11579bdad1e9498dafca30d5`) — re-mint via
`POST /user/tokens` with the Global API Key (`secret/platform`), patch
Vault. Do NOT put the Global API Key in the pod.
- **Site serves stale content**: check the state CM
(`kubectl get cm valia-sites-state -n valia-sites -o yaml`) — deleting a
site's key forces a redeploy on the next run.
- **`GUARD … skipping`** in logs: Valia's folder is empty or renamed the
entry file — the site deliberately kept its last content. Fix the folder or
update `entry_file`.
## History
- stem95su served in-cluster (nginx + NFS + its own rclone CronJob) until
2026-07-03, when it was cut over to this pattern and the old stack retired
(ADR-0018). The blocking 42.9 MB `stem_video.mp4` was compressed to 21.4 MB
(same 1080p, ~2.5 Mbps H.264) and replaced in Valia's folder with Viktor's
explicit one-time OK. `secret/stem95su` is superseded by
`secret/valia-sites`; `/srv/nfs/stem-site` on the PVE host is a harmless
leftover.
- bridge started as a hand-deployed wrangler experiment (2026-07-03, memory
id 7085) and was adopted into the stack the same day.

@ -82,33 +82,48 @@ tail -5 ~/.local/state/vault-token-renew.log # recent results
A healthy log line looks like:
`<ts> OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h).
## Drift guard & recovery
After an OIDC login you'll instead see, at the next nightly run:
`<ts> HEALED: re-minted periodic token from foreign dn=oidc-… (revoked N stale periodic token(s))`
— that's the self-heal working as designed.
## Drift guard & self-heal
`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
overwrites it. Two confirmed clobber vectors:
1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
can't push past the OIDC role's 7-day `token_max_ttl`).
can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs
prescribe this login before applies, so it recurs — it went unnoticed for
weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires
weekly".
2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
**cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for
two days — reads worked, writes silently 403'd.
**cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days.
To stop the renewer from silently keeping a foreign token alive, it runs a
**drift guard** first: it refuses to renew unless the token is
`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and
exits non-zero (the systemd unit goes `failed`) rather than renewing someone
else's token. Symptom in the log:
Since 2026-07-03 the renewer **self-heals**
(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token
it attempts the re-mint **with the clobbering token's own authority** and lets
Vault's authz decide:
`<ts> DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...`
- **Admin-capable clobber (OIDC login)** → re-mints the periodic token,
sanity-checks it against the drift guard, atomically replaces
`~/.vault-token`, revokes stale `token-devvm-wizard` leftovers
(anti-sprawl), logs
`HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))`
and exits 0. The clobbering token is NOT revoked — it may still back a live
login session; it ages out on its own.
- **Weak clobber (read-only k8s token)** → the mint is denied; logs
`DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it`
and exits non-zero (unit `failed`). Deliberately loud: this signals a
misbehaving agent flow — exactly the 2026-06-05 case.
**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the
[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does
**not** auto-recover (a deliberate scope choice — version-only, no self-heal);
recovery is the manual re-mint above.
**Manual recovery** is only needed for the weak-clobber case (the DRIFT log
line still contains the exact command) — run the
[mint/re-mint](#mint--re-mint-the-token) block.
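
The three outcomes above (renew, heal, fail loud) can be summarised as a small decision function. A sketch only: `renew_action` and its third argument are hypothetical. In the real renewer the heal outcome is decided by Vault's authz when the mint is attempted, which this sketch models as a plain yes/no input.

```bash
EXPECTED_DN="token-devvm-wizard"
REQUIRED_POLICY="vault-admin"

# renew_action <display-name> <policies-csv> <mint-allowed: yes|no>
# Prints: renew (our periodic token), heal (foreign token that is allowed to
# mint a replacement), or fail (foreign token without that authority).
renew_action() {
  case ",$2," in
    *",$REQUIRED_POLICY,"*) has_policy=yes ;;
    *) has_policy=no ;;
  esac
  if [ "$1" = "$EXPECTED_DN" ] && [ "$has_policy" = yes ]; then
    echo renew
  elif [ "$3" = yes ]; then
    echo heal
  else
    echo fail
  fi
}
```

An OIDC login maps to `heal`, and the read-only `kubernetes-woodpecker-default` clobber maps to `fail`, which is the case that still needs the manual re-mint.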
## Tests
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision
and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
case). Run: `bash infra/scripts/test-vault-token-renew.sh`.
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision,
the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
case), and the self-heal's revoke filter (which stale periodic tokens a heal
may sweep). Run: `bash infra/scripts/test-vault-token-renew.sh`.

@ -127,20 +127,29 @@ variable "anti_ai_scraping" {
variable "dns_type" {
type = string
default = "none"
description = "Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to public IP), or 'none'"
description = <<-EOT
Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to
public IP), 'internal' (A to the internal Traefik LB IP, resolvable from
any resolver but only ROUTABLE from home LANs / WG sites / VPN; the record
is a reachability pointer, NOT a gate: pair it with an ipAllowList via
extra_middlewares, e.g. traefik-home-lans-only@kubernetescrd, because
direct-to-WAN-IP requests with the right SNI still hit Traefik), or 'none'.
EOT
validation {
condition = contains(["proxied", "non-proxied", "none"], var.dns_type)
error_message = "dns_type must be 'proxied', 'non-proxied', or 'none'."
condition = contains(["proxied", "non-proxied", "internal", "none"], var.dns_type)
error_message = "dns_type must be 'proxied', 'non-proxied', 'internal', or 'none'."
}
}
# Uptime Kuma external monitor: when true, annotate the ingress so the
# external-monitor-sync CronJob creates a `[External] <name>` monitor pointing
# at https://<host>. Null means "follow dns_type": enabled when proxied.
# at https://<host>. Null means "follow dns_type": enabled when the ingress
# has a PUBLIC DNS record (proxied or non-proxied; 'internal' records are not
# externally reachable, so no external monitor).
variable "external_monitor" {
type = bool
default = null
description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type == 'proxied')."
description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type is 'proxied' or 'non-proxied')."
}
variable "external_monitor_name" {
@ -171,6 +180,15 @@ variable "public_ipv6" {
default = "2001:470:6e:43d::2"
}
# Internal Traefik LB IP used by dns_type = "internal" records. Tracks the
# dedicated MetalLB IP from stacks/traefik (ETP=Local). A future LB renumber
# must update this default alongside the split-horizon apex record; see
# docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*.
variable "internal_lb_ip" {
type = string
default = "10.0.20.203"
}
variable "homepage_group" {
type = string
default = null # auto-detect from namespace
@ -201,8 +219,10 @@ locals {
)
# External monitor enabled by default when the ingress has a public DNS
# record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type != "none")
# record (either CF-proxied or direct A/AAAA). 'internal' records resolve
# publicly but are unroutable from outside, so they get no external monitor.
# Explicit bool overrides.
effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type == "proxied" || var.dns_type == "non-proxied")
# Emit the annotation when effective is true (positive signal), or when the
# caller explicitly set external_monitor=false (opt-out). When the caller
@ -424,3 +444,19 @@ resource "cloudflare_record" "non_proxied_aaaa" {
zone_id = var.cloudflare_zone_id
allow_overwrite = true
}
# 'internal': a publicly-resolvable A record carrying the INTERNAL Traefik LB
# IP. Outsiders resolve it but can't route to it; home-LAN/WG-site/VPN clients
# reach Traefik directly (the WG spokes policy-route 10.0.0.0/8 through the
# tunnel), so kiosk devices with baked-in URLs need no DNS overrides anywhere.
# IPv4-only on purpose: the spokes route no internal IPv6 range.
resource "cloudflare_record" "internal_a" {
count = var.dns_type == "internal" ? 1 : 0
name = local.dns_name
content = var.internal_lb_ip
proxied = false
ttl = 1
type = "A"
zone_id = var.cloudflare_zone_id
allow_overwrite = true
}

@ -21,12 +21,19 @@ WorkingDirectory=/home/%i
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
Restart=on-failure
RestartSec=5
# Memory containment (2026-06-10): agent children live in this cgroup; a
# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm —
# every >20s stall fires the t3 client watchdog (visible "disconnects") —
# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally,
# and forbid swap so stalls can't smear into minutes-long freezes.
MemoryHigh=12G
# Memory containment (2026-06-10, amended 2026-07-02): agent children live in
# this cgroup; a runaway agent (10.8G anon on a 23G host) swap-thrashed the
# whole devvm — every >20s stall fires the t3 client watchdog (visible
# "disconnects") — then global-OOMed. Cap the cgroup so a runaway OOMs early
# and locally, and forbid swap so stalls can't smear into minutes-long freezes.
# MemoryHigh is DELIBERATELY infinity — do not add a soft band below MemoryMax:
# with swap=0 a hog that plateaus between high and max is unreclaimable but
# never OOMs, and the kernel's high-throttle stalls EVERY task in the cgroup
# (the t3 event loop included) indefinitely. A 12.3G agent ugrep livelocked
# this unit for ~50min on 2026-07-02 exactly this way. Straight-to-OOM at
# MemoryMax is the containment; OOMPolicy=continue below keeps the server up.
# See docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md addendum.
MemoryHigh=infinity
MemoryMax=16G
MemorySwapMax=0
# Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10

@ -1,10 +1,11 @@
#!/usr/bin/env bash
# Unit tests for the pure drift-guard functions in vault-token-renew.sh.
# Sources the script (vtr_main is guarded) and exercises the decision logic that
# decides whether ~/.vault-token is OUR periodic admin token (renew) or a foreign
# token that clobbered the file (refuse, fail loud). This is exactly the logic
# whose ABSENCE let the 2026-06-05 woodpecker-token clobber be silently renewed
# for two days. Run: bash infra/scripts/test-vault-token-renew.sh
# Unit tests for the pure functions in vault-token-renew.sh.
# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard
# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign
# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker
# clobber be silently renewed for two days, and (b) the self-heal's revoke
# filter — which stale token-devvm-wizard tokens a heal may sweep.
# Run: bash infra/scripts/test-vault-token-renew.sh
set -uo pipefail
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=/dev/null
@ -53,5 +54,21 @@ ok "ours: parse+decide renews" vtr_drift_ok "$(vtr_display_name "$LOOKUP_
no "woodpecker: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_WP")" "$(vtr_policies_csv "$LOOKUP_WP")"
no "oidc: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_OIDC")" "$(vtr_policies_csv "$LOOKUP_OIDC")"
# --- vtr_accessor: parse accessor out of lookup JSON ---
LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}'
eq "accessor parsed" "acc-new" "$(vtr_accessor "$LOOKUP_NEW")"
eq "accessor absent -> empty" "" "$(vtr_accessor '{"data":{"display_name":"x"}}')"
# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard
# --- tokens are swept; the just-minted token, foreign tokens, and anything with an
# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe).
STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}'
ok "older periodic token is stale" vtr_is_stale_periodic "$STALE_OURS" "acc-new"
no "the just-minted token is kept" vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new"
no "foreign oidc token never swept" vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new"
no "woodpecker token never swept" vtr_is_stale_periodic "$LOOKUP_WP" "acc-new"
no "missing accessor never swept" vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new"
no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" ""
printf '\n%d passed, %d failed\n' "$pass" "$fail"
(( fail == 0 ))
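
The assertions above call `ok`, `no`, and `eq`, whose definitions sit outside this hunk. A minimal sketch of what such helpers might look like follows; the bodies are an assumption inferred from how the tests use them (label first, then a command or a pair of strings), not the script's actual code.

```bash
#!/usr/bin/env bash
# Hypothetical reconstruction of the test helpers; the real ones may differ.
pass=0
fail=0

# ok <label> <cmd...>: passes when the command exits 0.
ok() {
  local label="$1"; shift
  if "$@"; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1)); printf 'FAIL (expected success): %s\n' "$label"
  fi
}

# no <label> <cmd...>: passes when the command exits non-zero.
no() {
  local label="$1"; shift
  if "$@"; then
    fail=$((fail + 1)); printf 'FAIL (expected refusal): %s\n' "$label"
  else
    pass=$((pass + 1))
  fi
}

# eq <label> <expected> <actual>: passes on exact string equality.
eq() {
  local label="$1" expected="$2" actual="$3"
  if [ "$expected" = "$actual" ]; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1)); printf 'FAIL: %s (want %q, got %q)\n' "$label" "$expected" "$actual"
  fi
}
```

Counting into `pass` and `fail` rather than exiting on the first failure lets the closing `(( fail == 0 ))` report every broken case in one run.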

@ -45,6 +45,94 @@ vtr_drift_ok() {
printf ',%s,' "$pols" | grep -q ",$REQUIRED_POLICY," || return 1
}
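
The last line of `vtr_drift_ok` wraps the policy list in commas before matching. A small sketch of that membership idiom, using a hypothetical helper name `csv_has` and adding `-F` (an assumption, not in the original) so the policy name is matched literally:

```bash
#!/usr/bin/env bash
# csv_has <csv> <item> -> 0 if <item> is an exact element of the
# comma-separated list, 1 otherwise. Hypothetical helper for illustration.
csv_has() {
  # Padding both sides with commas turns "is an element" into a plain
  # substring test, so "vault-admin" cannot match inside "vault-admin-ro".
  printf ',%s,' "$1" | grep -qF ",$2,"
}
```

For example, `csv_has "default,sops-admin,vault-admin" vault-admin` succeeds, while `csv_has "default,vault-admin-ro" vault-admin` does not.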
# vtr_accessor <lookup-json> -> the token accessor (empty if absent).
vtr_accessor() {
printf '%s' "$1" | jq -r '.data.accessor // ""'
}
# vtr_is_stale_periodic <lookup-json> <keep-accessor> -> 0 if this lookup
# describes one of OUR periodic tokens (display name matches) that is NOT the
# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise.
# Name-only on purpose (no policy check): anything named token-devvm-wizard
# that isn't the current token is garbage from a previous mint. An empty
# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know
# which token is current).
vtr_is_stale_periodic() {
local dn acc
[ -n "${2:-}" ] || return 1
dn=$(vtr_display_name "$1")
acc=$(vtr_accessor "$1")
[ "$dn" = "$EXPECTED_DN" ] || return 1
[ -n "$acc" ] || return 1
[ "$acc" != "$2" ]
}
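
The guard order in `vtr_is_stale_periodic` can be seen more easily on already-parsed fields. The sketch below is an illustration only: `is_stale_fields` is a hypothetical name, it skips the `jq` parsing, and it assumes `EXPECTED_DN` holds `token-devvm-wizard` as the comments and tests suggest.

```bash
#!/usr/bin/env bash
# Assumed value; the real script defines EXPECTED_DN outside this hunk.
EXPECTED_DN="token-devvm-wizard"

# is_stale_fields <display-name> <accessor> <keep-accessor>
# -> 0 only for one of our tokens that is not the one to keep.
is_stale_fields() {
  local dn="$1" acc="$2" keep="$3"
  [ -n "$keep" ] || return 1              # unknown current token: sweep nothing
  [ "$dn" = "$EXPECTED_DN" ] || return 1  # foreign token: never swept
  [ -n "$acc" ] || return 1               # no accessor: cannot revoke safely
  [ "$acc" != "$keep" ]                   # ours, and not the fresh one
}
```

Every guard fails closed: any missing piece of information results in "keep", so a parsing problem can at worst leave a stale token behind, never revoke the live one.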
# vtr_heal <foreign-dn> <log-file> -> 0 if ~/.vault-token was re-minted back to
# our periodic admin token using the foreign token's own authority, 1 if the
# heal was denied or failed (caller exits non-zero; the unit goes failed).
#
# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md):
# an OIDC login — which the infra docs prescribe before applies — clobbers
# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed
# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the
# clobbering token itself and let Vault's authz decide — a read-only clobber
# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud
# failure, because it signals a misbehaving flow that someone should look at.
vtr_heal() {
local foreign_dn="$1" log="$2"
local errf new_token new_info new_dn new_pols new_acc tmp
errf=$(mktemp)
if ! new_token=$(vault token create -orphan -period=768h \
-policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
-field=token 2>"$errf") || [ -z "$new_token" ]; then
printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
"$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log"
rm -f "$errf"
return 1
fi
rm -f "$errf"
# Sanity: the minted token must itself pass the drift guard before it may
# replace ~/.vault-token.
if ! new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then
printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \
"$(date -Is)" "$new_info" >>"$log"
return 1
fi
new_dn=$(vtr_display_name "$new_info")
new_pols=$(vtr_policies_csv "$new_info")
if ! vtr_drift_ok "$new_dn" "$new_pols"; then
printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \
"$(date -Is)" "$new_dn" "$new_pols" >>"$log"
return 1
fi
# Atomic replace: mktemp files are 0600 from birth; same-filesystem mv.
tmp=$(mktemp "$HOME/.vault-token.XXXXXX")
printf '%s' "$new_token" >"$tmp"
mv "$tmp" "$HOME/.vault-token"
# Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would
# otherwise strand the prior periodic ADMIN token server-side for up to 32d.
# The clobbering foreign token is deliberately NOT revoked: it may still back
# the user's live login session, and it ages out on its own (7d for OIDC).
local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0
new_acc=$(vtr_accessor "$new_info")
if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then
while IFS= read -r a; do
[ -n "$a" ] || continue
a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue
if vtr_is_stale_periodic "$a_info" "$new_acc"; then
VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1))
fi
done < <(printf '%s' "$accessors" | jq -r '.[]')
sweep="revoked $revoked stale periodic token(s)"
fi
printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \
"$(date -Is)" "$foreign_dn" "$sweep" >>"$log"
}
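
The "atomic replace" step in `vtr_heal` relies on `mktemp` creating the file with owner-only permissions and on `mv` being a rename within one filesystem. A standalone sketch of that pattern, with a hypothetical function name and an added cleanup on write failure that the original does not have:

```bash
#!/usr/bin/env bash
# write_secret_atomically <target-path> <secret>
# Hypothetical helper: readers see either the old file or the complete new
# one, never a half-written token.
write_secret_atomically() {
  local target="$1" secret="$2" tmp
  # Create the temp file next to the target so mv stays on one filesystem.
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  if ! printf '%s' "$secret" >"$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  mv -f "$tmp" "$target"
}
```

Because the temp file is never more permissive than 0600, there is no window in which the token is readable by other users, which a `> file && chmod 600 file` sequence cannot guarantee.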
vtr_main() {
set -euo pipefail
export PATH="/usr/local/bin:/usr/bin:/bin:${PATH:-}"
@ -61,16 +149,19 @@ vtr_main() {
dn=$(vtr_display_name "$info")
pols=$(vtr_policies_csv "$info")
# Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive.
# On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token
# with a read-only woodpecker token, and this script then silently renewed THAT
# for two days — masking the loss of write access. So before renewing, confirm
# the token is our periodic admin token; if it has drifted, fail loudly (systemd
# marks the unit failed) instead of keeping someone else's token alive.
# Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not
# keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was
# silently renewed for two days, masking lost write access). But detect-only
# drift proved worse in practice: an OIDC login — which the infra docs
# prescribe before applies — clobbers this file too, and the resulting DRIFT
# failures went unnoticed for weeks while access degraded to a 7-day token
# (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal):
# re-mint the periodic token with the clobbering token's own authority.
# Vault's authz keeps the old guarantee — a token that couldn't legitimately
# hold vault-admin is denied the mint, and we still fail loud.
if ! vtr_drift_ok "$dn" "$pols"; then
printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
"$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log"
exit 1
vtr_heal "$dn" "$log" || exit 1
exit 0
fi
# `vault token renew` with no argument renews the calling token (renew-self).

@ -244,9 +244,15 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
# virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22).
# t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped
# user-<uid>.slice (all ssh/tmux work). Design — per user, on BOTH trees:
# MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard,
# MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at
# the ceiling instead), plus fair-share CPU/IO weights.
# MemoryMax=16G hard + MemorySwapMax=0 (work never touches disk swap → no
# thrash; a runaway is cgroup-OOM-killed locally at the ceiling), plus
# fair-share CPU/IO weights.
# NO MemoryHigh soft band (removed 2026-07-02; was 12G "throttle to a crawl"):
# with swap=0, a hog that PLATEAUS between high and max is unreclaimable but
# never OOMs — the kernel parks every task of the cgroup in
# mem_cgroup_handle_over_high and the whole tree stalls indefinitely. A 12.3G
# agent ugrep livelocked t3-serve@wizard (t3 down ~50min) exactly this way.
# Cap-and-kill, never throttle-and-pray — see the post-mortem addendum.
# BACKSTOP = earlyoom, NOT systemd-oomd. We first shipped systemd-oomd but it is
# INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim
# (pgscan rising), and a no-swap anon workload never reclaims — verified live, a
@ -260,12 +266,16 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
# 10a) per-user caps + fair-share weights on EVERY user-<uid>.slice (ssh/tmux)
install -d -m 0755 /etc/systemd/system/user-.slice.d
cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF'
# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22).
# Applies to EACH user-<uid>.slice = all of one user's ssh/tmux work. Mirrors the
# t3-serve@.service caps so a user is bounded in whichever surface they work in.
# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22;
# MemoryHigh dropped 2026-07-02). Applies to EACH user-<uid>.slice = all of one
# user's ssh/tmux work. Mirrors the t3-serve@.service caps so a user is bounded
# in whichever surface they work in. MemoryHigh stays infinity: with swap=0 a
# hog plateauing in a high..max band livelocks the entire slice (every ssh/tmux
# session of that user) instead of dying — straight-to-OOM at MemoryMax is the
# containment (see post-mortem addendum 2026-07-02).
[Slice]
MemoryAccounting=yes
MemoryHigh=12G
MemoryHigh=infinity
MemoryMax=16G
MemorySwapMax=0
CPUAccounting=yes
@ -294,12 +304,14 @@ cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF'
# All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so
# they share one bounded budget and a runaway container is capped at MemoryMax
# (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice.
# setup-devvm.sh §10, 2026-06-22.
# setup-devvm.sh §10, 2026-06-22; MemoryHigh dropped 2026-07-02 — a container
# plateauing in the high..max band would throttle-livelock EVERY container in
# the slice (see post-mortem addendum); MemoryMax OOM is the containment.
[Unit]
Description=Docker containers slice (capped)
[Slice]
MemoryAccounting=yes
MemoryHigh=6G
MemoryHigh=infinity
MemoryMax=8G
MemorySwapMax=0
CPUAccounting=yes
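
One way to confirm that a slice really has no soft band after this change is to read the cgroup v2 control files, where an unlimited setting appears as the literal string `max`. The sketch below is an assumption-laden illustration: `has_throttle_band` is a hypothetical helper, and the directory passed to it depends on where the host mounts the unified hierarchy.

```bash
#!/usr/bin/env bash
# has_throttle_band <cgroup-dir>
# -> 0 if memory.high is finite and below memory.max (a throttle band exists)
# -> 1 if memory.high is "max" or not below memory.max (no band)
# -> 2 if the control files cannot be read
has_throttle_band() {
  local dir="$1" high max
  high=$(cat "$dir/memory.high") || return 2
  max=$(cat "$dir/memory.max") || return 2
  [ "$high" != "max" ] || return 1   # no soft limit at all
  [ "$max" != "max" ] || return 0    # finite high under an unlimited max
  [ "$high" -lt "$max" ]             # both finite: band exists if high < max
}
```

A call such as `has_throttle_band /sys/fs/cgroup/user.slice/user-1000.slice` (path and uid are examples only) should return 1 once `MemoryHigh=infinity` is in effect.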

Binary file not shown.

@ -235,6 +235,12 @@ resource "cloudflare_record" "keyserver" {
zone_id = var.cloudflare_zone_id
}
# bridge.viktorbarzin.me (Cloudflare Pages, "мост" school site) moved to
# stacks/valia-sites (ADR-0018); all Valia-site records live there now.
# State handoff was a manual `tg state rm` (2026-07-03): the CI terraform
# (<1.7) rejects removed{} blocks even at the stack root, so declarative
# forget wasn't available. valia-sites imported the live record by id.
# Enable HTTP/3 (QUIC) for Cloudflare-proxied domains
resource "cloudflare_zone_settings_override" "http3" {
zone_id = var.cloudflare_zone_id

@ -16,7 +16,7 @@ resource "kubernetes_namespace" "dawarich" {
name = "dawarich"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
@ -330,7 +330,7 @@ resource "kubernetes_deployment" "dawarich" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
@ -458,6 +458,13 @@ module "ingress" {
namespace = kubernetes_namespace.dawarich.metadata[0].name
name = "dawarich"
tls_secret_name = var.tls_secret_name
# Rails serves all its fingerprinted assets itself and the map view adds an
# API burst per page load; the default 10/50 limiter 429s the asset tail
# from a single client IP (and risks dropping OwnTracks/mobile ingestion
# POSTs on the same host). Dedicated 100/1000 limiter defined in
# stacks/traefik/modules/traefik/middleware.tf.
skip_default_rate_limit = true
extra_middlewares = ["traefik-dawarich-rate-limit@kubernetescrd"]
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Dawarich"

@ -1511,6 +1511,34 @@ resource "null_resource" "pg_instagram_poster_db" {
}
}
# Create tasks database for the tasks PWA (Reminders-style front-end over
# Nextcloud CalDAV; FastAPI + SvelteKit SPA, see ~/code/tasks). Stores
# Connected Accounts (Fernet-encrypted Nextcloud app passwords) + sync state.
# Role password is managed by Vault Database Secrets Engine (static role
# `pg-tasks`, 7d rotation). Tables are created by alembic on app startup.
resource "null_resource" "pg_tasks_db" {
depends_on = [null_resource.pg_cluster]
triggers = {
db_name = "tasks"
username = "tasks"
}
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'tasks'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE tasks WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'tasks'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE DATABASE tasks OWNER tasks"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE tasks TO tasks"
'
EOT
}
}
# Old PostgreSQL deployment kept commented for rollback reference
# resource "kubernetes_deployment" "postgres" {
# metadata {

@ -0,0 +1,360 @@
variable "tls_secret_name" {
type = string
sensitive = true
}
variable "nfs_server" { type = string }
# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog): self-hosted
# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the
# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest.
# Design: docs/plans/2026-07-04-drone-logbook-design.md
resource "kubernetes_namespace" "drone_logbook" {
metadata {
name = "drone-logbook"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "drone-logbook-secrets"
namespace = "drone-logbook"
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "drone-logbook-secrets"
}
dataFrom = [{
extract = {
key = "drone-logbook"
}
}]
}
}
depends_on = [kubernetes_namespace.drone_logbook]
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
tls_secret_name = var.tls_secret_name
}
# DuckDB database + cached DJI decryption keys + uploaded originals.
# Embedded DB -> block storage, not NFS (same rationale as freshrss data).
# Encrypted class: flight logs are GPS traces of home/travel (sensitive data
# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md).
resource "kubernetes_persistent_volume_claim" "data" {
wait_until_bound = false
metadata {
name = "drone-logbook-data-encrypted"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "2Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and PVCs
# can't shrink; without this every apply tries to revert the size.
ignore_changes = [spec[0].resources[0].requests]
}
}
# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands
# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL.
module "nfs_sync_logs" {
source = "../../modules/kubernetes/nfs_volume"
name = "drone-logbook-sync-logs"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/drone-logbook/sync-logs"
storage = "5Gi"
}
resource "kubernetes_deployment" "drone_logbook" {
metadata {
name = "drone-logbook"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
labels = {
app = "drone-logbook"
"kubernetes.io/cluster-service" = "true"
tier = local.tiers.aux
}
}
spec {
replicas = 1
strategy {
# DuckDB is single-writer; never overlap two pods on the same volume
type = "Recreate"
}
selector {
match_labels = {
app = "drone-logbook"
}
}
template {
metadata {
labels = {
app = "drone-logbook"
"kubernetes.io/cluster-service" = "true"
}
}
spec {
container {
name = "drone-logbook"
image = "ghcr.io/arpanghosh8453/open-dronelog:latest"
env {
name = "RUST_LOG"
value = "info"
}
env {
# keep re-importable originals under /data/drone-logbook/uploaded
name = "KEEP_UPLOADED_FILES"
value = "true"
}
env {
name = "SYNC_LOGS_PATH"
value = "/sync-logs"
}
env {
# 6-field cron (sec min hour dom mon dow): scan drop folder every 8h
name = "SYNC_INTERVAL"
value = "0 0 */8 * * *"
}
env {
name = "PROFILE_CREATION_PASS"
value_from {
secret_key_ref {
name = "drone-logbook-secrets"
key = "profile_creation_pass"
}
}
}
volume_mount {
name = "data"
mount_path = "/data/drone-logbook"
}
volume_mount {
name = "sync-logs"
mount_path = "/sync-logs"
read_only = true
}
port {
name = "http"
container_port = 80
protocol = "TCP"
}
resources {
requests = {
cpu = "25m"
memory = "512Mi"
}
limits = {
memory = "512Mi"
}
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
}
}
volume {
name = "sync-logs"
persistent_volume_claim {
claim_name = module.nfs_sync_logs.claim_name
}
}
}
}
}
depends_on = [kubernetes_manifest.external_secret]
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "drone_logbook" {
metadata {
name = "drone-logbook"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
labels = {
"app" = "drone-logbook"
}
}
spec {
selector = {
app = "drone-logbook"
}
port {
port = "80"
target_port = "80"
}
}
}
# -----------------------------------------------------------------------------
# Backup, required for every proxmox-lvm(-encrypted) app: daily copy of the
# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror ->
# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import
# windows, so the DuckDB file is quiescent; uploaded originals make even a
# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the
# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern.
# -----------------------------------------------------------------------------
module "nfs_backup" {
source = "../../modules/kubernetes/nfs_volume"
name = "drone-logbook-backup-host"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/drone-logbook-backup"
}
resource "kubernetes_cron_job_v1" "backup" {
metadata {
name = "drone-logbook-backup"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
}
spec {
concurrency_policy = "Replace"
failed_jobs_history_limit = 5
schedule = "30 1 * * *"
starting_deadline_seconds = 300
successful_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata {}
spec {
affinity {
pod_affinity {
required_during_scheduling_ignored_during_execution {
label_selector {
match_labels = {
app = "drone-logbook"
}
}
topology_key = "kubernetes.io/hostname"
}
}
}
container {
name = "drone-logbook-backup"
image = "docker.io/library/alpine"
command = ["/bin/sh", "-c", <<-EOT
set -euxo pipefail
_t0=$(date +%s)
now=$(date +"%Y_%m_%d_%H_%M")
mkdir -p /backup/$now
cp -a /data/. /backup/$now/
# Rotate: 30-day retention
find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} +
_dur=$(($(date +%s) - _t0))
_out_bytes=$(du -sb /backup/$now | awk '{print $1}')
wget -qO- --post-data "backup_duration_seconds $${_dur}
backup_output_bytes $${_out_bytes}
backup_last_success_timestamp $(date +%s)
" "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true
EOT
]
volume_mount {
name = "data"
mount_path = "/data"
read_only = true
}
volume_mount {
name = "backup"
mount_path = "/backup"
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
}
}
volume {
name = "backup"
persistent_volume_claim {
claim_name = module.nfs_backup.claim_name
}
}
dns_config {
option {
name = "ndots"
value = "2"
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
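
The retention step inside the backup job can be tried in isolation. The sketch below wraps the same `find` expression in a hypothetical `rotate_backups` function; the function name and its parameters are illustrative and not part of the stack.

```bash
#!/usr/bin/env bash
# rotate_backups <backup-root> <days>
# Deletes top-level snapshot directories whose mtime is more than <days>
# days old. -mindepth/-maxdepth 1 keeps find from touching the root itself
# or descending into the snapshots it is about to remove.
rotate_backups() {
  local root="$1" days="$2"
  find "$root" -maxdepth 1 -mindepth 1 -type d -mtime "+$days" -exec rm -rf {} +
}
```

Two details are worth keeping in mind: `-mtime +30` counts whole 24-hour periods, so a directory is removed once it is at least 31 days old, and the age comes from the directory's modification time rather than from the timestamp in its name.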
# https://dronelog.viktorbarzin.me
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "required" # Authentik forward-auth: flight logs are GPS traces of home/travel
dns_type = "proxied"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
name = "dronelog"
service_name = "drone-logbook"
tls_secret_name = var.tls_secret_name
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Drone Logbook"
"gethomepage.dev/description" = "DJI flight log analyzer"
"gethomepage.dev/icon" = "mdi-quadcopter"
"gethomepage.dev/group" = "Media & Entertainment"
"gethomepage.dev/pod-selector" = ""
}
}

@ -0,0 +1 @@
../../secrets

@ -0,0 +1,8 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}

@ -10,7 +10,7 @@ resource "kubernetes_namespace" "excalidraw" {
name = "excalidraw"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
@ -45,6 +45,15 @@ resource "kubernetes_deployment" "excalidraw" {
app = "excalidraw"
tier = local.tiers.aux
}
# Keel rolls new ghcr:latest digests (k8s-portal pattern). Values here are
# recreate-correct seeds only; the keys are in ignore_changes below, so
# the live annotations win on an existing deployment.
annotations = {
"keel.sh/policy" = "force"
"keel.sh/trigger" = "poll"
"keel.sh/match-tag" = "true"
"keel.sh/pollSchedule" = "@every 5m"
}
}
spec {
replicas = 1
@ -67,9 +76,19 @@ resource "kubernetes_deployment" "excalidraw" {
}
}
spec {
# GHCR pull secret: the ghcr-credentials Secret in this namespace is
# cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy
# (allowlisted private-ghcr namespaces only; ADR-0002). Source of
# truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf.
image_pull_secrets {
name = "ghcr-credentials"
}
container {
image = "viktorbarzin/excalidraw-library:v4"
image_pull_policy = "IfNotPresent"
# ADR-0002: GHA-built (.github/workflows/build-excalidraw.yml),
# PRIVATE ghcr; Keel rolls new :latest digests. DockerHub
# viktorbarzin/excalidraw-library:v4 is the frozen rollback image.
image = "ghcr.io/viktorbarzin/excalidraw-library:latest"
image_pull_policy = "Always"
name = "excalidraw"
port {
container_port = 8080
@ -107,7 +126,7 @@ resource "kubernetes_deployment" "excalidraw" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],

@ -4,18 +4,28 @@ A self-hosted Excalidraw library with per-user drawing storage and management.
## Features
- Dashboard to manage all your drawings
- Dashboard to manage all your drawings (create, open, rename, delete)
- Per-user storage (via Authentik SSO headers)
- Create, edit, and delete drawings
- Rename drawings from the dashboard or by clicking the drawing name in the editor
- Native Excalidraw export via the editor's hamburger menu: "Save to..."
(.excalidraw file) and "Export image..." (PNG / SVG / clipboard)
- Autosave (2s debounce) + manual save (Ctrl+S or menu "Save now")
- Persistent storage via NFS
## Docker Image
```
viktorbarzin/excalidraw-library:v4
ghcr.io/viktorbarzin/excalidraw-library:latest
```
Available on Docker Hub: https://hub.docker.com/r/viktorbarzin/excalidraw-library
Built by GitHub Actions (`.github/workflows/build-excalidraw.yml` in the infra
repo, ADR-0002) on every master push touching `stacks/excalidraw/project/**`;
tags `:latest` + `:<git-sha>`. The package is PRIVATE — cluster pulls use the
Kyverno-synced `ghcr-credentials` secret. Keel polls `:latest` and rolls the
deployment on digest change.
The legacy manually-built DockerHub image `viktorbarzin/excalidraw-library:v4`
is frozen as the rollback target; nothing pushes to it anymore.
## Configuration
@ -39,54 +49,13 @@ Mount a persistent volume to the `DATA_DIR` path. Drawings are stored as `.excal
└── my-diagram.excalidraw
```
The filename (without extension) is both the drawing ID and its display name;
renaming a drawing renames the file (`os.Rename`, mtime preserved).
## Deployment
### Docker
```bash
docker run -d \
--name excalidraw-rooms \
-p 8080:8080 \
-v /path/to/storage:/data \
viktorbarzin/excalidraw-library:v4
```
### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: excalidraw
spec:
replicas: 1
selector:
matchLabels:
app: excalidraw
template:
metadata:
labels:
app: excalidraw
spec:
containers:
- name: excalidraw
image: viktorbarzin/excalidraw-library:v4
ports:
- containerPort: 8080
env:
- name: DATA_DIR
value: /data
- name: PORT
value: "8080"
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
nfs:
server: 192.168.1.127
path: /srv/nfs/excalidraw
```
Deployed by the `stacks/excalidraw` Terraform stack (namespace `excalidraw`,
service `draw`, ingress `draw.viktorbarzin.me` with `auth = "required"`).
### With Authentik SSO
@ -96,23 +65,7 @@ The application reads user identity from Authentik headers:
- `X-Authentik-Email` - Displayed in UI
- `X-Authentik-Name` - Displayed in UI
Configure your ingress to pass these headers:
```yaml
annotations:
nginx.ingress.kubernetes.io/auth-response-headers: "X-authentik-username,X-authentik-email,X-authentik-name"
```
## Building
```bash
# Build the Docker image
docker build -t excalidraw-library .
# Or build locally
go build -o excalidraw-library .
./excalidraw-library
```
Requests without `X-Authentik-Username` fall back to the `anonymous` user.
## API Endpoints
@ -122,10 +75,25 @@ go build -o excalidraw-library .
| GET | `/api/drawings` | List all drawings for current user |
| GET | `/api/drawings/:id` | Get drawing data |
| PUT | `/api/drawings/:id` | Save drawing |
| PATCH | `/api/drawings/:id` | Rename drawing — body `{"name": "<new-name>"}`; returns `{"status":"renamed","id":"<new-id>"}`; 409 if the target name exists |
| DELETE | `/api/drawings/:id` | Delete drawing |
| GET | `/api/user` | Get current user info |
| GET | `/draw/:id` | Open drawing in editor |
Rename names are sanitized server-side to `[a-zA-Z0-9-_]` (other characters
become `-`; a trailing `.excalidraw` is stripped). Existing IDs are accepted
as-is for backward compatibility with API clients.
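
To predict the ID a rename will produce, the rule above can be approximated client-side. This is a rough shell sketch, not the server's code: `sanitize_name` here is a hypothetical helper, and it only matches the server for ASCII input (non-ASCII characters may yield a different number of hyphens).

```bash
#!/usr/bin/env bash
# sanitize_name <raw-name> -> prints the sanitized ID, or nothing if no
# meaningful characters remain. Approximation of the documented rule.
sanitize_name() {
  local name="$1"
  # Trim leading and trailing whitespace.
  name="${name#"${name%%[![:space:]]*}"}"
  name="${name%"${name##*[![:space:]]}"}"
  # Strip a trailing .excalidraw extension.
  name="${name%.excalidraw}"
  # Replace every character outside [a-zA-Z0-9_-] with a hyphen.
  name=$(printf '%s' "$name" | LC_ALL=C sed 's/[^a-zA-Z0-9_-]/-/g')
  # A name made only of hyphens is treated as empty.
  if [ -z "${name//-/}" ]; then
    return 0
  fi
  printf '%s' "$name"
}
```

For instance, `sanitize_name ' my diagram.excalidraw '` prints `my-diagram`, which is the `id` the PATCH endpoint would be expected to return for that name.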
## Development
```bash
# Run tests
go test ./...
# Run locally
DATA_DIR=/tmp/excalidraw-data go run .
```
## License
MIT

@ -9,6 +9,7 @@ import (
"net/http"
"os"
"path/filepath"
"regexp"
"sort"
"strings"
"time"
@ -63,6 +64,21 @@ func getUsername(r *http.Request) string {
return username
}
var invalidNameChars = regexp.MustCompile(`[^a-zA-Z0-9-_]`)
// sanitizeName normalizes a user-supplied drawing name into a safe file ID
// (same charset the dashboard applies on create). Returns "" if nothing
// meaningful remains.
func sanitizeName(name string) string {
name = strings.TrimSpace(name)
name = strings.TrimSuffix(name, ".excalidraw")
name = invalidNameChars.ReplaceAllString(name, "-")
if strings.Trim(name, "-") == "" {
return ""
}
return name
}
// getUserDataDir returns the data directory for a specific user and ensures it exists
func getUserDataDir(username string) string {
userDir := filepath.Join(dataDir, username)
@ -168,6 +184,41 @@ func handleDrawing(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "saved", "id": id})
case http.MethodPatch:
var req struct {
Name string `json:"name"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "Invalid JSON body", http.StatusBadRequest)
return
}
newID := sanitizeName(req.Name)
if newID == "" {
http.Error(w, "Invalid name", http.StatusBadRequest)
return
}
if _, err := os.Stat(filePath); err != nil {
if os.IsNotExist(err) {
http.Error(w, "Drawing not found", http.StatusNotFound)
} else {
http.Error(w, err.Error(), http.StatusInternalServerError)
}
return
}
if newID != id {
newPath := filepath.Join(userDataDir, newID+".excalidraw")
if _, err := os.Stat(newPath); err == nil {
http.Error(w, "A drawing with that name already exists", http.StatusConflict)
return
}
if err := os.Rename(filePath, newPath); err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "renamed", "id": newID})
case http.MethodDelete:
if err := os.Remove(filePath); err != nil {
if os.IsNotExist(err) {
@ -264,6 +315,8 @@ const dashboardHTML = `<!DOCTYPE html>
.btn:hover { background: #5b4cdb; }
.btn-danger { background: #e74c3c; }
.btn-danger:hover { background: #c0392b; }
.btn-secondary { background: #3d3d5c; }
.btn-secondary:hover { background: #4a4a70; }
.btn-small { padding: 0.4rem 0.8rem; font-size: 0.85rem; }
.drawings { display: grid; gap: 1rem; }
.drawing {
@ -342,11 +395,11 @@ const dashboardHTML = `<!DOCTYPE html>
<div id="modal" class="modal">
<div class="modal-content">
<h2>New Drawing</h2>
<h2 id="modal-title">New Drawing</h2>
<input type="text" id="drawingName" placeholder="Drawing name..." autofocus>
<div class="modal-actions">
<button class="btn" style="background:#444" onclick="hideModal()">Cancel</button>
<button class="btn" onclick="createDrawing()">Create</button>
<button class="btn" id="modal-confirm" onclick="confirmModal()">Create</button>
</div>
</div>
</div>
@ -369,31 +422,63 @@ const dashboardHTML = `<!DOCTYPE html>
}
}
function drawingRow(d) {
var row = document.createElement('div');
row.className = 'drawing';
var info = document.createElement('div');
info.className = 'drawing-info';
var nameLink = document.createElement('a');
nameLink.className = 'drawing-name';
nameLink.href = '/draw/' + encodeURIComponent(d.id);
nameLink.textContent = d.name;
var meta = document.createElement('div');
meta.className = 'drawing-meta';
meta.textContent = 'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' +
new Date(d.modified).toLocaleTimeString() + ' - ' + formatSize(d.size);
info.appendChild(nameLink);
info.appendChild(meta);
var actions = document.createElement('div');
actions.className = 'drawing-actions';
var open = document.createElement('a');
open.className = 'btn btn-small';
open.href = '/draw/' + encodeURIComponent(d.id);
open.textContent = 'Open';
var rename = document.createElement('button');
rename.className = 'btn btn-small btn-secondary';
rename.textContent = 'Rename';
rename.onclick = function() { showRenameModal(d.id); };
var del = document.createElement('button');
del.className = 'btn btn-small btn-danger';
del.textContent = 'Delete';
del.onclick = function() { deleteDrawing(d.id); };
actions.appendChild(open);
actions.appendChild(rename);
actions.appendChild(del);
row.appendChild(info);
row.appendChild(actions);
return row;
}
async function loadDrawings() {
const resp = await fetch('/api/drawings');
const drawings = await resp.json();
const container = document.getElementById('drawings');
container.replaceChildren();
if (!drawings || drawings.length === 0) {
container.innerHTML = '<div class="empty">No drawings yet. Create your first one!</div>';
var empty = document.createElement('div');
empty.className = 'empty';
empty.textContent = 'No drawings yet. Create your first one!';
container.appendChild(empty);
return;
}
container.innerHTML = drawings.map(function(d) {
return '<div class="drawing">' +
'<div class="drawing-info">' +
'<a href="/draw/' + d.id + '" class="drawing-name">' + d.name + '</a>' +
'<div class="drawing-meta">' +
'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' + new Date(d.modified).toLocaleTimeString() +
' - ' + formatSize(d.size) +
'</div>' +
'</div>' +
'<div class="drawing-actions">' +
'<a href="/draw/' + d.id + '" class="btn btn-small">Open</a>' +
'<button class="btn btn-small btn-danger" onclick="deleteDrawing(\'' + d.id + '\')">Delete</button>' +
'</div>' +
'</div>';
}).join('');
drawings.forEach(function(d) {
container.appendChild(drawingRow(d));
});
}
function formatSize(bytes) {
@ -402,18 +487,64 @@ const dashboardHTML = `<!DOCTYPE html>
return (bytes / (1024 * 1024)).toFixed(1) + ' MB';
}
function showNewModal() {
var modalAction = null; // invoked with the input value on confirm
function showModal(title, confirmLabel, initialValue, action) {
document.getElementById('modal-title').textContent = title;
document.getElementById('modal-confirm').textContent = confirmLabel;
var input = document.getElementById('drawingName');
input.value = initialValue || '';
modalAction = action;
document.getElementById('modal').classList.add('active');
document.getElementById('drawingName').focus();
input.focus();
input.select();
}
function showNewModal() {
showModal('New Drawing', 'Create', '', createDrawing);
}
function showRenameModal(id) {
showModal('Rename Drawing', 'Rename', id, function(value) {
renameDrawing(id, value);
});
}
function hideModal() {
document.getElementById('modal').classList.remove('active');
document.getElementById('drawingName').value = '';
modalAction = null;
}
async function createDrawing() {
var name = document.getElementById('drawingName').value.trim();
function confirmModal() {
if (modalAction) modalAction(document.getElementById('drawingName').value);
}
async function renameDrawing(id, newName) {
newName = (newName || '').trim();
if (!newName || newName === id) {
hideModal();
return;
}
var resp = await fetch('/api/drawings/' + encodeURIComponent(id), {
method: 'PATCH',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name: newName })
});
if (resp.status === 409) {
alert('A drawing with that name already exists.');
return; // keep the modal open so the user can pick another name
}
if (!resp.ok) {
alert('Rename failed: ' + await resp.text());
return;
}
hideModal();
loadDrawings();
}
async function createDrawing(name) {
name = (name || '').trim();
if (!name) {
name = 'drawing-' + Date.now();
}
@ -446,7 +577,7 @@ const dashboardHTML = `<!DOCTYPE html>
}
document.getElementById('drawingName').addEventListener('keypress', function(e) {
if (e.key === 'Enter') createDrawing();
if (e.key === 'Enter') confirmModal();
});
document.getElementById('modal').addEventListener('click', function(e) {


@ -0,0 +1,249 @@
package main
import (
"encoding/json"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"strings"
"testing"
)
const testDrawing = `{"type":"excalidraw","version":2,"source":"excalidraw-library","elements":[{"id":"e1"}],"appState":{"viewBackgroundColor":"#ffffff"}}`
func setupDataDir(t *testing.T) {
t.Helper()
dataDir = t.TempDir()
}
// doDrawing sends a request to handleDrawing for the given user and returns the recorder.
func doDrawing(t *testing.T, method, id, body, user string) *httptest.ResponseRecorder {
t.Helper()
reader := strings.NewReader(body) // an empty body yields an empty reader
req := httptest.NewRequest(method, "/api/drawings/"+id, reader)
if user != "" {
req.Header.Set("X-Authentik-Username", user)
}
w := httptest.NewRecorder()
handleDrawing(w, req)
return w
}
func listDrawings(t *testing.T, user string) []Drawing {
t.Helper()
req := httptest.NewRequest(http.MethodGet, "/api/drawings", nil)
if user != "" {
req.Header.Set("X-Authentik-Username", user)
}
w := httptest.NewRecorder()
handleListDrawings(w, req)
if w.Code != http.StatusOK {
t.Fatalf("list: expected 200, got %d", w.Code)
}
var drawings []Drawing
if err := json.Unmarshal(w.Body.Bytes(), &drawings); err != nil {
t.Fatalf("list: bad JSON: %v", err)
}
return drawings
}
func TestPutGetRoundtrip(t *testing.T) {
setupDataDir(t)
if w := doDrawing(t, http.MethodPut, "foo", testDrawing, "alice"); w.Code != http.StatusOK {
t.Fatalf("PUT: expected 200, got %d: %s", w.Code, w.Body.String())
}
w := doDrawing(t, http.MethodGet, "foo", "", "alice")
if w.Code != http.StatusOK {
t.Fatalf("GET: expected 200, got %d", w.Code)
}
if w.Body.String() != testDrawing {
t.Errorf("GET: content mismatch: %s", w.Body.String())
}
}
func TestGetMissing(t *testing.T) {
setupDataDir(t)
if w := doDrawing(t, http.MethodGet, "nope", "", "alice"); w.Code != http.StatusNotFound {
t.Fatalf("expected 404, got %d", w.Code)
}
}
func TestListDrawings(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "one", testDrawing, "alice")
doDrawing(t, http.MethodPut, "two", testDrawing, "alice")
drawings := listDrawings(t, "alice")
if len(drawings) != 2 {
t.Fatalf("expected 2 drawings, got %d", len(drawings))
}
ids := map[string]bool{drawings[0].ID: true, drawings[1].ID: true}
if !ids["one"] || !ids["two"] {
t.Errorf("unexpected ids: %v", ids)
}
for _, d := range drawings {
if d.Name != d.ID {
t.Errorf("name should equal id: %+v", d)
}
}
}
func TestDelete(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusOK {
t.Fatalf("DELETE: expected 200, got %d", w.Code)
}
if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound {
t.Fatalf("GET after delete: expected 404, got %d", w.Code)
}
if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusNotFound {
t.Fatalf("second DELETE: expected 404, got %d", w.Code)
}
}
func TestPerUserIsolation(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "secret", testDrawing, "alice")
if w := doDrawing(t, http.MethodGet, "secret", "", "bob"); w.Code != http.StatusNotFound {
t.Fatalf("bob should not see alice's drawing, got %d", w.Code)
}
if drawings := listDrawings(t, "bob"); len(drawings) != 0 {
t.Fatalf("bob's list should be empty, got %d", len(drawings))
}
}
// --- rename (PATCH) ---
func renameReq(t *testing.T, id, newName, user string) *httptest.ResponseRecorder {
t.Helper()
return doDrawing(t, http.MethodPatch, id, `{"name":`+strconv(newName)+`}`, user)
}
// strconv JSON-quotes a string via json.Marshal (unrelated to the standard strconv package).
func strconv(s string) string {
b, _ := json.Marshal(s)
return string(b)
}
func TestRenameSuccess(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
w := renameReq(t, "foo", "bar", "alice")
if w.Code != http.StatusOK {
t.Fatalf("PATCH: expected 200, got %d: %s", w.Code, w.Body.String())
}
var resp map[string]string
if err := json.Unmarshal(w.Body.Bytes(), &resp); err != nil {
t.Fatalf("PATCH: bad JSON: %v", err)
}
if resp["id"] != "bar" || resp["status"] != "renamed" {
t.Errorf("unexpected response: %v", resp)
}
if w := doDrawing(t, http.MethodGet, "bar", "", "alice"); w.Code != http.StatusOK || w.Body.String() != testDrawing {
t.Errorf("GET new id: code=%d content=%q", w.Code, w.Body.String())
}
if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound {
t.Errorf("GET old id: expected 404, got %d", w.Code)
}
}
func TestRenameConflict(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "a", testDrawing, "alice")
doDrawing(t, http.MethodPut, "b", testDrawing, "alice")
if w := renameReq(t, "a", "b", "alice"); w.Code != http.StatusConflict {
t.Fatalf("expected 409, got %d", w.Code)
}
// both drawings intact
for _, id := range []string{"a", "b"} {
if w := doDrawing(t, http.MethodGet, id, "", "alice"); w.Code != http.StatusOK {
t.Errorf("drawing %q should be intact, got %d", id, w.Code)
}
}
}
func TestRenameMissing(t *testing.T) {
setupDataDir(t)
if w := renameReq(t, "nope", "new", "alice"); w.Code != http.StatusNotFound {
t.Fatalf("expected 404, got %d", w.Code)
}
}
func TestRenameSameName(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
w := renameReq(t, "foo", "foo", "alice")
if w.Code != http.StatusOK {
t.Fatalf("same-name rename: expected 200, got %d: %s", w.Code, w.Body.String())
}
if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusOK {
t.Errorf("drawing should be intact, got %d", w.Code)
}
}
func TestRenameInvalidNames(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
for _, name := range []string{"", " ", "../..", "---"} {
if w := renameReq(t, "foo", name, "alice"); w.Code != http.StatusBadRequest {
t.Errorf("rename to %q: expected 400, got %d", name, w.Code)
}
}
// malformed body
if w := doDrawing(t, http.MethodPatch, "foo", `{not json`, "alice"); w.Code != http.StatusBadRequest {
t.Errorf("malformed body: expected 400, got %d", w.Code)
}
}
func TestRenameSanitization(t *testing.T) {
setupDataDir(t)
cases := []struct{ in, want string }{
{"My Drawing!", "My-Drawing-"},
{"net diag.excalidraw", "net-diag"}, // .excalidraw suffix stripped, not mangled
{"a/b\\c", "a-b-c"},
}
for _, c := range cases {
doDrawing(t, http.MethodPut, "src", testDrawing, "alice")
w := renameReq(t, "src", c.in, "alice")
if w.Code != http.StatusOK {
t.Errorf("rename to %q: expected 200, got %d: %s", c.in, w.Code, w.Body.String())
continue
}
var resp map[string]string
json.Unmarshal(w.Body.Bytes(), &resp)
if resp["id"] != c.want {
t.Errorf("rename to %q: expected id %q, got %q", c.in, c.want, resp["id"])
}
// file must be inside the user dir under the sanitized name
if _, err := os.Stat(filepath.Join(dataDir, "alice", c.want+".excalidraw")); err != nil {
t.Errorf("rename to %q: expected file %q on disk: %v", c.in, c.want, err)
}
doDrawing(t, http.MethodDelete, resp["id"], "", "alice")
}
}
func TestRenameTraversalStaysInUserDir(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
w := renameReq(t, "foo", "../../../etc/passwd", "alice")
if w.Code == http.StatusOK {
var resp map[string]string
json.Unmarshal(w.Body.Bytes(), &resp)
if strings.Contains(resp["id"], "/") || strings.Contains(resp["id"], "..") {
t.Fatalf("traversal characters survived: %q", resp["id"])
}
if _, err := os.Stat(filepath.Join(dataDir, "alice", resp["id"]+".excalidraw")); err != nil {
t.Fatalf("renamed file escaped user dir: %v", err)
}
}
// nothing outside the data dir
if _, err := os.Stat(filepath.Join(dataDir, "..", "etc")); err == nil {
t.Fatal("file escaped the data dir")
}
}


@ -8,41 +8,41 @@
* { margin: 0; padding: 0; }
html, body { width: 100%; height: 100%; overflow: hidden; }
#root { width: 100%; height: 100%; }
.toolbar {
position: fixed;
top: 10px;
left: 10px;
z-index: 1000;
.top-right-ui {
display: flex;
align-items: center;
gap: 8px;
background: rgba(255,255,255,0.95);
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
}
.top-right-ui a, .top-right-ui button {
display: inline-flex;
align-items: center;
gap: 6px;
padding: 8px 12px;
border: 1px solid transparent;
border-radius: 8px;
box-shadow: 0 2px 8px rgba(0,0,0,0.15);
}
.toolbar button, .toolbar a {
padding: 6px 14px;
border: none;
border-radius: 6px;
cursor: pointer;
font-size: 14px;
background: #6c5ce7;
color: white;
font-size: 13px;
text-decoration: none;
display: inline-block;
box-shadow: 0 1px 4px rgba(0,0,0,0.12);
max-width: 40vw;
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
}
.toolbar button:hover, .toolbar a:hover { background: #5b4cdb; }
.toolbar .secondary { background: #ddd; color: #333; }
.toolbar .secondary:hover { background: #ccc; }
.toolbar .title {
font-weight: 600;
padding: 6px 0;
color: #333;
.top-right-ui.theme-light a, .top-right-ui.theme-light button {
background: #ffffff;
color: #1b1b1f;
}
.top-right-ui.theme-dark a, .top-right-ui.theme-dark button {
background: #232329;
color: #e9ecef;
}
.top-right-ui button:hover, .top-right-ui a:hover { border-color: #a29bfe; }
.status {
position: fixed;
bottom: 10px;
right: 10px;
right: 60px;
padding: 6px 12px;
background: rgba(0,0,0,0.7);
color: white;
@ -51,6 +51,7 @@
z-index: 1000;
opacity: 0;
transition: opacity 0.3s;
pointer-events: none;
}
.status.show { opacity: 1; }
.loading {
@ -67,11 +68,6 @@
</style>
</head>
<body>
<div class="toolbar">
<a href="/" class="secondary">Back to Library</a>
<span class="title" id="title">Loading...</span>
<button onclick="saveDrawing()">Save</button>
</div>
<div id="root">
<div class="loading">
<div>Loading Excalidraw...</div>
@ -81,16 +77,33 @@
<div id="status" class="status">Saved</div>
<script>
// Replaces #root with an error panel (safe DOM methods, no innerHTML).
function showFatal(title, detail) {
var root = document.getElementById('root');
root.replaceChildren();
var panel = document.createElement('div');
panel.className = 'loading error';
var titleEl = document.createElement('div');
titleEl.textContent = title;
panel.appendChild(titleEl);
if (detail) {
var detailEl = document.createElement('div');
detailEl.style.fontSize = '0.9rem';
detailEl.textContent = detail;
panel.appendChild(detailEl);
}
root.appendChild(panel);
}
// Get drawing ID from URL path: /draw/{id}
var pathParts = window.location.pathname.split('/');
var drawingId = pathParts[pathParts.length - 1] || pathParts[pathParts.length - 2];
if (!drawingId) {
document.getElementById('root').innerHTML = '<div class="loading error">No drawing ID specified</div>';
showFatal('No drawing ID specified');
throw new Error('No drawing ID');
}
document.getElementById('title').textContent = drawingId;
document.title = drawingId + ' - Excalidraw';
var excalidrawAPI = null;
@ -159,6 +172,46 @@
autoSaveTimeout = setTimeout(saveDrawing, 2000);
}
// Renames the current drawing via the API. Returns the new ID, or null
// if the rename was cancelled or failed.
async function renameCurrentDrawing() {
var newName = window.prompt('Rename drawing', drawingId);
if (newName === null) return null;
newName = newName.trim();
if (!newName || newName === drawingId) return null;
// A pending autosave would resurrect the old file after the rename.
clearTimeout(autoSaveTimeout);
var resp;
try {
resp = await fetch('/api/drawings/' + drawingId, {
method: 'PATCH',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name: newName })
});
} catch (e) {
showStatus('Rename failed!');
return null;
}
if (resp.status === 409) {
window.alert('A drawing with that name already exists.');
return null;
}
if (!resp.ok) {
window.alert('Rename failed: ' + (await resp.text()));
return null;
}
var result = await resp.json();
drawingId = result.id;
document.title = drawingId + ' - Excalidraw';
window.history.replaceState(null, '', '/draw/' + encodeURIComponent(drawingId));
showStatus('Renamed');
// Flush any unsaved changes to the new file.
saveDrawing();
return drawingId;
}
// Load scripts dynamically
function loadScript(src) {
return new Promise(function(resolve, reject) {
@ -197,33 +250,76 @@
updateLoadStatus('Rendering Excalidraw...');
// Create Excalidraw component
var e = React.createElement;
var MainMenu = ExcalidrawLib.MainMenu;
// Native default menu items, existence-guarded so a library
// update that drops one degrades gracefully.
function defaultItem(name) {
var C = MainMenu && MainMenu.DefaultItems && MainMenu.DefaultItems[name];
return C ? e(C, { key: name }) : null;
}
function App() {
return React.createElement(ExcalidrawLib.Excalidraw, {
var nameState = React.useState(drawingId);
var name = nameState[0], setName = nameState[1];
function onRename() {
renameCurrentDrawing().then(function(newId) {
if (newId) setName(newId);
});
}
// The menu is where the native export features live:
// Export = "Save to..." (.excalidraw), SaveAsImage =
// "Export image..." (PNG / SVG / clipboard).
var menu = MainMenu ? e(MainMenu, { key: 'menu' },
e(MainMenu.Item, { key: 'back', onSelect: function() { window.location.href = '/'; } }, 'Back to Library'),
e(MainMenu.Item, { key: 'save', onSelect: saveDrawing }, 'Save now'),
e(MainMenu.Item, { key: 'rename', onSelect: onRename }, 'Rename drawing…'),
MainMenu.Separator ? e(MainMenu.Separator, { key: 'sep1' }) : null,
defaultItem('LoadScene'),
defaultItem('Export'),
defaultItem('SaveAsImage'),
MainMenu.Separator ? e(MainMenu.Separator, { key: 'sep2' }) : null,
defaultItem('ClearCanvas'),
defaultItem('ToggleTheme'),
defaultItem('ChangeCanvasBackground'),
defaultItem('Help')
) : null;
return e(ExcalidrawLib.Excalidraw, {
initialData: initialData ? {
elements: initialData.elements || [],
appState: initialData.appState || {}
} : undefined,
UIOptions: { canvasActions: { toggleTheme: true } },
excalidrawAPI: function(api) {
excalidrawAPI = api;
console.log('Excalidraw API ready');
},
onChange: onChange
});
onChange: onChange,
renderTopRightUI: function(isMobile, appState) {
return e('div', { className: 'top-right-ui theme-' + (appState.theme || 'light') },
e('a', { key: 'home', href: '/', title: 'Back to Library' }, '← Library'),
e('button', {
key: 'name',
title: 'Click to rename',
onClick: onRename
}, name + ' ✎')
);
}
}, menu);
}
var root = ReactDOM.createRoot(document.getElementById('root'));
root.render(React.createElement(App));
root.render(e(App));
console.log('Excalidraw rendered successfully');
} catch (e) {
console.error('Init error:', e);
document.getElementById('root').innerHTML =
'<div class="loading error">' +
'<div>Failed to load Excalidraw</div>' +
'<div style="font-size:0.9rem">' + e.message + '</div>' +
'</div>';
} catch (err) {
console.error('Init error:', err);
showFatal('Failed to load Excalidraw', err.message);
}
}

stacks/excalidraw/rbac.tf Normal file

@ -0,0 +1,49 @@
# emo's Claude Excalidraw upload RBAC.
#
# emo's agent uploads drawings with `kubectl -n excalidraw port-forward svc/draw`
# + `PUT /api/drawings/<name>` carrying the X-Authentik-Username header (the
# documented recipe in emo's ~/.claude/CLAUDE.md; the app sits behind Authentik
# forward-auth, so direct curl gets redirected). His hands-off credential is the
# chrome-service/emo-browser ServiceAccount kubeconfig (stacks/chrome-service/rbac.tf);
# its cluster-wide grant (oidc-power-user-readonly) is read-only, so pods/portforward
# must be granted per namespace. This is the excalidraw-namespace grant
# (Viktor's call, 2026-07-02; same pattern as the chrome-service one).
#
# TRADE-OFF (accepted): port-forward into this namespace bypasses the Authentik
# ingress and the drawings API trusts the X-Authentik-Username header, so the SA
# can read/write ANY user's drawings, not only emo's. The namespace runs nothing
# but the drawings app, and the same class of trade-off was already accepted for
# the shared browser (CDP reach into Viktor's sessions).
resource "kubernetes_role" "portforward" {
metadata {
name = "excalidraw-portforward"
namespace = kubernetes_namespace.excalidraw.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods/portforward"]
verbs = ["create"]
}
}
resource "kubernetes_role_binding" "emo_browser_portforward" {
metadata {
name = "emo-browser-portforward"
namespace = kubernetes_namespace.excalidraw.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.portforward.metadata[0].name
}
subject {
kind = "ServiceAccount"
# Defined in stacks/chrome-service/rbac.tf; referenced by name across
# stacks, same as that file references the oidc-power-user-readonly
# ClusterRole. get/list on pods+services (needed to resolve svc/draw) comes
# from the SA's cluster-read binding there.
name = "emo-browser"
namespace = "chrome-service"
}
}


@ -166,6 +166,33 @@ resource "kubernetes_deployment" "f1-stream" {
name = "DISCORD_CHANNELS"
value = var.discord_f1_channel_ids
}
# Replays feature (app repo ADR-0002). optional=true so the pod still
# starts before the Reddit app credentials exist; the app treats missing
# creds as "replays off" (logs "Replays pipeline disabled"). The
# ExternalSecret above uses dataFrom.extract on the Vault "f1-stream"
# key, so adding reddit_client_id / reddit_client_secret there auto-syncs
# them into this Secret; no ExternalSecret change needed, just a pod
# restart to pick them up.
env {
name = "REDDIT_CLIENT_ID"
value_from {
secret_key_ref {
name = "f1-stream-secrets"
key = "reddit_client_id"
optional = true
}
}
}
env {
name = "REDDIT_CLIENT_SECRET"
value_from {
secret_key_ref {
name = "f1-stream-secrets"
key = "reddit_client_secret"
optional = true
}
}
}
# Verifier connects to in-cluster headed Chromium pool; see
# stacks/chrome-service/. Falls back to in-process headless if unset.
# 2026-06-04: migrated WS (:3000 / path-token) -> CDP (:9222 /


@ -117,8 +117,9 @@ resource "kubernetes_deployment" "frigate" {
limits = {
memory = "10Gi"
"nvidia.com/gpu" = "1"
# GPU VRAM budget (ADR-0016): detector + ffmpeg decode (~1.9 GiB).
"viktorbarzin.me/gpumem" = "2000"
# GPU VRAM budget (ADR-0016): detector + ffmpeg decode (~1.9 GiB),
# +~250 MiB NVDEC headroom for the vermont-garage camera (ADR-0017).
"viktorbarzin.me/gpumem" = "2300"
}
}
env {


@ -34,7 +34,7 @@ resource "kubernetes_config_map" "frame_config_emo" {
Accounts:
- ImmichServerUrl: http://immich.viktorbarzin.me
ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]}
ImagesFromDays: 730
ImagesFromDays: 365
EOF
}
}
@ -73,7 +73,9 @@ resource "kubernetes_deployment" "immich-frame-emo" {
}
spec {
container {
image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
# immich_v3: upstream compat tag for Immich v3; see frame.tf for the
# full story; repin to a versioned tag once upstream releases v3 support.
image = "ghcr.io/immichframe/immichframe:immich_v3"
name = "immich-frame-emo"
resources {
requests = {
@ -142,14 +144,21 @@ resource "kubernetes_service" "immich-frame-emo" {
module "ingress_emo" {
source = "../../modules/kubernetes/ingress_factory"
# Photo-frame kiosk display on Emo's Portal: headless browser pulling images
# via an Immich API key (no user login). Forward-auth would 302 the device to
# Authentik with no way to complete login.
# auth = "none": photo-frame kiosk; headless browser with API key; no user login.
auth = "none"
dns_type = "proxied"
namespace = "immich"
name = "highlights-immich-emo"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame-emo"
# Photo-frame kiosk display on Emo's Portal Mini (Sofia LAN): WebView
# pulling images via an Immich API key; no user login possible on the
# device. Same LAN-only gating as frame.tf: home-lans-only ipAllowList +
# dns_type "internal" (Emo's Portal already resolves this host internally
# via Technitium; the public internal-IP record covers any resolver).
# LAN-only design: docs/plans/2026-07-04-immich-frame-lan-only-design.md.
# auth = "none": kiosk WebView, no user auth by design; gated by the home-lans-only ipAllowList instead.
auth = "none"
dns_type = "internal"
extra_middlewares = ["traefik-home-lans-only@kubernetescrd"]
# Not externally reachable: explicit opt-out so external-monitor-sync
# drops the old [External] monitor instead of default-opting it back in.
external_monitor = false
namespace = "immich"
name = "highlights-immich-emo"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame-emo"
}


@ -69,7 +69,11 @@ resource "kubernetes_deployment" "immich-frame" {
}
spec {
container {
image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
# immich_v3 is the upstream compat tag for Immich v3 servers; every
# versioned release (up to v1.0.33.0) crashes deserializing v3 API
# responses (immichFrame/immichFrame#653). Pin back to a vX.Y.Z.W tag
# once a stable release ships v3 support (upstream PR #654).
image = "ghcr.io/immichframe/immichframe:immich_v3"
name = "immich-frame"
resources {
requests = {
@ -138,14 +142,23 @@ resource "kubernetes_service" "immich-frame" {
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# Photo-frame kiosk display: runs in headless browser mode on a TV/frame
# device and pulls images via an Immich API key (no user login). Forward-auth
# would 302 the device to Authentik with no way to complete login.
# auth = "none": photo-frame kiosk display; headless browser with API key; no user login; forward-auth breaks device automation.
auth = "none"
dns_type = "proxied"
namespace = "immich"
name = "highlights-immich"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame"
# Photo-frame kiosk display (Viktor's London Portal Plus WebView) pulls
# images via an Immich API key; no user login possible on the device, so
# forward-auth would 302 it to Authentik with no way to complete login.
# The GATE is network-level: the home-lans-only ipAllowList (Sofia/London/
# Valchedrym LANs + 10/8) 403s everyone else, and dns_type "internal"
# publishes the Traefik LB IP publicly so the Portal's baked-in URL resolves
# from any resolver yet routes only via the home LANs / WG tunnel.
# LAN-only design: docs/plans/2026-07-04-immich-frame-lan-only-design.md.
# auth = "none": kiosk WebView, no user auth by design; gated by the home-lans-only ipAllowList instead.
auth = "none"
dns_type = "internal"
extra_middlewares = ["traefik-home-lans-only@kubernetescrd"]
# Not externally reachable: explicit opt-out so external-monitor-sync
# drops the old [External] monitor instead of default-opting it back in.
external_monitor = false
namespace = "immich"
name = "highlights-immich"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame"
}
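
The comments above describe the gate as network-level: an ipAllowList admits the home LANs plus 10/8 and answers 403 to everyone else. A minimal sketch of that check follows, assuming a plain prefix match on the client address. The prefix list is illustrative only: `10.0.0.0/8` is named in the comment, `192.168.9.0/24` is the London guest network from the commit message, and the remaining LAN ranges are not shown in this diff, so they are omitted. `allowed` is a hypothetical helper, not Traefik's implementation.

```go
package main

import (
	"fmt"
	"net/netip"
)

// homeLANs is an illustrative subset of the allowlist; the real
// middleware lists more ranges than appear in this diff.
var homeLANs = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("192.168.9.0/24"),
}

// allowed reports whether src parses as an IP address that falls inside
// one of the prefixes. Unparsable input is treated as not allowed.
func allowed(src string, prefixes []netip.Prefix) bool {
	addr, err := netip.ParseAddr(src)
	if err != nil {
		return false
	}
	for _, p := range prefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	for _, ip := range []string{"192.168.9.198", "10.0.20.203", "8.8.8.8"} {
		fmt.Printf("%s allowed: %v\n", ip, allowed(ip, homeLANs))
	}
}
```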


@ -15,7 +15,7 @@ locals {
variable "immich_version" {
type = string
# Change me to upgrade
default = "v2.7.5"
default = "v3.0.0"
}
variable "proxmox_host" { type = string }
variable "redis_host" { type = string }
@ -492,7 +492,7 @@ resource "kubernetes_deployment" "immich-postgres" {
}
spec {
container {
image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
name = "immich-postgresql"
port {
container_port = 5432
@ -882,7 +882,7 @@ resource "kubernetes_cron_job_v1" "clip-index-prewarm" {
restart_policy = "Never"
container {
name = "prewarm"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
# command overrides the postgres entrypoint; runs psql directly.
command = [
"psql", "-v", "ON_ERROR_STOP=1", "-c",
@ -964,7 +964,7 @@ resource "kubernetes_cron_job_v1" "immich-search-probe" {
}
init_container {
name = "measure"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
command = ["/bin/bash", "-c", <<-EOT
set -uo pipefail
OUT=/shared/metrics.prom


@ -43,6 +43,11 @@ locals {
# ghcr.io/passionprojectsanca/book-plotter (built by GHA in Anca's repo,
# under her own org's ghcr). The deployment references the cloned secret.
"plotting-book",
# excalidraw: infra-owned image migrated from manual DockerHub pushes to
# PRIVATE ghcr.io/viktorbarzin/excalidraw-library (ADR-0002, built by
# .github/workflows/build-excalidraw.yml). The deployment references the
# cloned secret.
"excalidraw",
]
}


@ -19,3 +19,12 @@ plans@viktorbarzin.me spam@viktorbarzin.me
# to trips@, or every verification/recovery send is rejected (550 sender). Also
# routes any inbound trips@ to spam@.
trips@viktorbarzin.me spam@viktorbarzin.me
# docs@ -> docs@: explicit self-alias for the paperless-ngx ingest MAILBOX
# (a real account in secret/platform.mailserver_accounts). Without this the
# @domain catch-all above (Vault-side aliases) rewrites docs@ to spam@ and the
# mail lands in the TripIt-swept catch-all mailbox instead. Same pattern as
# me@ -> me@. Delivery-time sender allowlist: docs-at-viktorbarzin.me
# .dovecot.sieve (mounted as docs@viktorbarzin.me.dovecot.sieve).
# Runbook: docs/runbooks/paperless-mail-ingest.md
docs@viktorbarzin.me docs@viktorbarzin.me


@ -0,0 +1,17 @@
# Sender allowlist for the paperless-ngx ingest mailbox docs@viktorbarzin.me.
# Family members forward document emails here; paperless-ngx polls the INBOX
# over IMAP and maps each sender to a paperless account (1 mail rule per
# sender). Decision (Viktor, 2026-07-03): mail from any OTHER sender is
# ignored and deleted — discarded here at LMTP delivery, before paperless
# ever sees it. This also keeps spam to the guessable address out entirely.
#
# Keep this list in sync with the paperless mail rules (the sender -> owner
# map). Add-a-sender procedure: docs/runbooks/paperless-mail-ingest.md
if not address :is "from" ["me@viktorbarzin.me",
"vbarzin@gmail.com",
"viktorbarzin@meta.com",
"ancaelena98@gmail.com",
"emil.barzin@gmail.com"] {
discard;
stop;
}
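
The sieve rule above keeps a message only when its From address is on the list and discards everything else. The same decision can be sketched in Go for reasoning about edge cases. This is an illustration under stated assumptions: `keepMessage` is a hypothetical helper, the addresses are placeholders rather than the real list, and only a single From address is handled. The case-insensitive comparison mirrors Sieve's default `i;ascii-casemap` comparator.

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
)

// keepMessage reports whether a From header value names an allowlisted
// sender. Headers that fail to parse are treated as not allowed, which
// corresponds to the discard branch of the sieve rule.
func keepMessage(fromHeader string, allowlist []string) bool {
	addr, err := mail.ParseAddress(fromHeader)
	if err != nil {
		return false
	}
	for _, a := range allowlist {
		// Compare the whole address, ignoring ASCII case.
		if strings.EqualFold(addr.Address, a) {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder addresses, not the list from the sieve file.
	allowlist := []string{"alice@example.com", "bob@example.org"}
	for _, from := range []string{
		"Alice <alice@example.com>",
		"BOB@EXAMPLE.ORG",
		"Mallory <mallory@example.net>",
	} {
		fmt.Printf("%-32s keep=%v\n", from, keepMessage(from, allowlist))
	}
}
```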


@ -14,10 +14,15 @@ variable "nfs_server" { type = string }
locals {
_account_set = keys(var.mailserver_accounts)
_virtual_lines = split("\n", format("%s%s", var.postfix_account_aliases, file("${path.module}/extra/aliases.txt")))
# NOTE: the length guard must live in a ternary, not a leading `&&` operand.
# Terraform only short-circuits && / || from v1.6 on; the older terraform
# pinned in the infra-ci image, `split(" ", line)[1]` was still evaluated
# for blank/comment lines and failed the whole plan with "Invalid index"
# (first hit by CI pipeline #469, 2026-07-03). A conditional expression is
# lazy on every terraform version.
postfix_virtual = join("\n", [
for line in local._virtual_lines : line
if !(
length(split(" ", line)) == 2 &&
if length(split(" ", line)) != 2 ? true : !(
contains(local._account_set, split(" ", line)[0]) &&
contains(local._account_set, split(" ", line)[1]) &&
split(" ", line)[0] != split(" ", line)[1]
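
The filter in this hunk can be restated outside HCL to make the guard explicit: a line is indexed only after its field count has been checked, and a line is dropped only when it maps one real account to a different real account. A minimal sketch follows; `filterVirtual` and the sample addresses are hypothetical and chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// filterVirtual mirrors the postfix_virtual comprehension: lines that do
// not split into exactly two fields (blank lines, comments) pass through
// untouched, and two-field lines are dropped only when both sides are
// known accounts and differ from each other.
func filterVirtual(lines []string, accounts map[string]bool) []string {
	var kept []string
	for _, line := range lines {
		parts := strings.Split(line, " ")
		if len(parts) != 2 {
			// Guard first: parts[1] is never touched for these lines.
			kept = append(kept, line)
			continue
		}
		src, dst := parts[0], parts[1]
		if accounts[src] && accounts[dst] && src != dst {
			continue // account-to-other-account alias: filtered out
		}
		kept = append(kept, line)
	}
	return kept
}

func main() {
	accounts := map[string]bool{"a@example.com": true, "b@example.com": true, "docs@example.com": true}
	lines := []string{
		"# comment line",
		"",
		"docs@example.com docs@example.com", // self-alias: kept
		"a@example.com b@example.com",       // account -> other account: dropped
		"alias@example.com a@example.com",   // source is not an account: kept
	}
	for _, l := range filterVirtual(lines, accounts) {
		fmt.Printf("%q\n", l)
	}
}
```

Under this reading, a self-alias such as the docs@ entry survives the filter because source and destination are equal.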
@ -110,6 +115,12 @@ resource "kubernetes_config_map" "mailserver_config" {
"postfix-main.cf" = var.postfix_cf
"postfix-virtual.cf" = local.postfix_virtual
# Per-user Dovecot sieve for the paperless-ngx ingest mailbox: DMS installs
# any /tmp/docker-mailserver/<login>.dovecot.sieve at startup. ConfigMap
# keys can't contain '@', so the key is sanitized ("-at-") and the
# volume_mount below restores the real filename.
"docs-at-viktorbarzin.me.dovecot.sieve" = file("${path.module}/extra/docs-at-viktorbarzin.me.dovecot.sieve")
KeyTable = "mail._domainkey.viktorbarzin.me viktorbarzin.me:mail:/etc/opendkim/keys/viktorbarzin.me-mail.key\n"
SigningTable = "*@viktorbarzin.me mail._domainkey.viktorbarzin.me\n"
TrustedHosts = "127.0.0.1\nlocalhost\n"
@ -404,6 +415,12 @@ resource "kubernetes_deployment" "mailserver" {
sub_path = "postfix-virtual.cf"
read_only = true
}
volume_mount {
name = "config"
mount_path = "/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve"
sub_path = "docs-at-viktorbarzin.me.dovecot.sieve"
read_only = true
}
volume_mount {
name = "config"
mount_path = "/tmp/docker-mailserver/fetchmail.cf"
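To see what the `postfix_virtual` filter in the first hunk of this file keeps and drops, here is a rough stand-alone sketch in POSIX sh and awk. It is not the Terraform expression: `filter_virtual` is a hypothetical helper, awk splits on runs of whitespace where Terraform's `split(" ", line)` splits on single spaces, and the sample addresses are made up. It assumes, going only by the visible part of the hunk, that a line is dropped when it has exactly two fields, both are known accounts, and they differ.

```shell
#!/bin/sh
# filter_virtual ACCOUNTS  (hypothetical helper, not part of the repo)
# Reads alias lines on stdin; ACCOUNTS is a space-separated list of mailboxes.
filter_virtual() {
  awk -v accounts="$1" '
    BEGIN {
      n = split(accounts, list, " ")
      for (i = 1; i <= n; i++) acct[list[i]] = 1
    }
    # Length guard first: anything that is not exactly "src dst" is kept
    # without ever looking at the second field.
    NF != 2 { print; next }
    # Drop account -> different-account lines; keep everything else.
    ($1 in acct) && ($2 in acct) && $1 != $2 { next }
    { print }
  '
}

# Example: the first line is dropped, the blank line and the external alias stay.
printf '%s\n' 'a@example.test b@example.test' '' 'a@example.test ext@example.net' |
  filter_virtual 'a@example.test b@example.test'
```

Putting the field-count check ahead of any field access mirrors the reason the Terraform version moved the length guard into a conditional expression.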


@ -60,6 +60,10 @@ locals {
# t3 dispatch probe surface (auth="none" path carve-out on /probe): WS echo
# + healthz for the t3-probe drop-attribution client (stacks/t3code).
"t3-probe-ws" = "https://t3.viktorbarzin.me/probe/healthz"
# tasks PWA icons + manifest (auth="none" path carve-out, stacks/tasks
# module.ingress_icons): macOS/iOS/Android icon fetchers carry no session
# cookies, so an Authentik 302 here breaks Add-to-Dock icons.
"tasks-icons" = "https://tasks.viktorbarzin.me/apple-touch-icon.png"
# NOTE: openclaw task-webhook (auth="none") is intentionally NOT probed: it
# has no public DNS record (NXDOMAIN, external_monitor=false), so there is no
# externally GET-able URL to probe. Its carve-out is internal-only.


@ -18,7 +18,6 @@ const SITE_IDS = {
"stacks.viktorbarzin.me": "b38fda4285df",
"f1.viktorbarzin.me": "7e69786f66d5",
"frigate.viktorbarzin.me": "0d4044069ff5",
"highlights-immich.viktorbarzin.me": "602167601c6b",
"immich.viktorbarzin.me": "35eedb7a3d2b",
"mail.viktorbarzin.me": "082f164faa7d",
"navidrome.viktorbarzin.me": "8a3844ff75ba",


@ -28,7 +28,6 @@ routes = [
{ pattern = "stacks.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "f1.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "frigate.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "highlights-immich.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "immich.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "mail.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "navidrome.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },


@ -1,122 +0,0 @@
# Automatic Google Drive -> site sync (added 2026-06-09; supersedes the
# earlier on-demand-only model now that content is actively maintained).
#
# A CronJob mirrors the READ-ONLY Drive folder "claude" (servable content in
# subfolder "stem claude/files/") onto the NFS content volume every 10 min via
# rclone. rclone is delta-aware: an unchanged run lists ~33 files' metadata and
# transfers nothing, so the schedule is cheap (not a 24MB re-download). nginx
# keeps serving the same volume read-only; updates appear within ~5s (actimeo).
#
# Drive is treated strictly READ-ONLY: scope=drive.readonly and rclone only ever
# reads the remote (sync gdrive: -> /data), never writes back.
#
# TOKEN LONGEVITY: the GCP OAuth app (project home-lab-1700868541205) MUST be
# published to "Production" or its refresh token expires ~weekly and this job
# fails. After publishing, re-mint the token and refresh
# `secret/stem95su.rclone_conf`. A failed run surfaces as a failed Job.
resource "kubernetes_manifest" "rclone_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "stem95su-rclone"
namespace = kubernetes_namespace.stem95su.metadata[0].name
}
spec = {
refreshInterval = "1h"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = { name = "stem95su-rclone" }
data = [{
secretKey = "rclone.conf"
remoteRef = {
key = "stem95su"
property = "rclone_conf"
}
}]
}
}
depends_on = [kubernetes_namespace.stem95su]
}
resource "kubernetes_cron_job_v1" "gdrive_sync" {
metadata {
name = "stem95su-gdrive-sync"
namespace = kubernetes_namespace.stem95su.metadata[0].name
labels = { run = "stem95su", component = "gdrive-sync" }
}
spec {
schedule = "*/10 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 2
failed_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
template {
metadata { labels = { run = "stem95su", component = "gdrive-sync" } }
spec {
restart_policy = "OnFailure"
container {
name = "rclone"
image = "docker.io/rclone/rclone:1.74.3"
# Mirror Drive folder -> /data. Guard: hard-fail on auth/list error
# (so an expired token is visible); skip quietly if the source is
# empty / missing the dashboard (never wipe the live site);
# --max-delete caps catastrophic deletes from a partial listing.
command = ["/bin/sh", "-c", <<-EOT
set -eu
cp /config/rclone.conf /tmp/rc.conf
SRC="gdrive:stem claude/files"
LIST=$(rclone --config /tmp/rc.conf lsf "$SRC" --files-only) || { echo "FATAL: Drive list failed (auth/network)"; exit 1; }
N=$(printf '%s\n' "$LIST" | grep -c . || true)
if [ "$N" -lt 1 ] || ! printf '%s\n' "$LIST" | grep -qx "stem_board.html"; then
echo "GUARD: source N=$N / stem_board.html missing -- skipping, site untouched"; exit 0
fi
echo "source OK ($N files) -- mirroring to /data"
rclone --config /tmp/rc.conf sync "$SRC" /data --exclude ".DS_Store" --fast-list --transfers 4 --max-delete 25 -v
EOT
]
resources {
requests = { cpu = "10m", memory = "64Mi" }
limits = { memory = "192Mi" }
}
volume_mount {
name = "rclone-config"
mount_path = "/config"
read_only = true
}
volume_mount {
name = "content"
mount_path = "/data"
}
}
volume {
name = "rclone-config"
secret { secret_name = "stem95su-rclone" }
}
volume {
name = "content"
persistent_volume_claim {
claim_name = module.nfs_content.claim_name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
depends_on = [kubernetes_manifest.rclone_external_secret]
}


@ -1,173 +1,9 @@
# STEM educational platform for 95. СУ "Проф. Иван Шишманов" (Sofia).
# Public, open static site at stem95su.viktorbarzin.me. Self-contained HTML
# pages + media authored externally (Gemini exports), served by a stock nginx
# straight off the PVE host NFS, NOT baked into an image, so content can be
# updated out-of-band (Nextcloud "PVE NFS Pool" or rsync to /srv/nfs/stem-site)
# without a rebuild. Auto-backed-up offsite by the existing nfs-mirror job.
resource "kubernetes_namespace" "stem95su" {
metadata {
name = "stem95su"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.stem95su.metadata[0].name
tls_secret_name = var.tls_secret_name
}
# Content lives on the PVE host NFS. NOTE: the nfs_volume module creates only
# the K8s PV+PVC; the export subdir (/srv/nfs/stem-site) must already exist on
# 192.168.1.127 or the pod fails to mount (mount.nfs exit 32). It is created
# during deploy and re-created on demand if ever lost.
module "nfs_content" {
source = "../../modules/kubernetes/nfs_volume"
name = "stem95su-content"
namespace = kubernetes_namespace.stem95su.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/stem-site"
storage = "1Gi"
access_modes = ["ReadWriteMany"]
}
# Minimal nginx server block: serve the static dir, with the dashboard
# (stem_board.html) as the directory index so "/" loads the platform home.
# All other pages/assets are reached by their exact filenames (the dashboard
# links to them by name; those must not be renamed).
resource "kubernetes_config_map" "nginx_conf" {
metadata {
name = "stem95su-nginx-conf"
namespace = kubernetes_namespace.stem95su.metadata[0].name
}
data = {
"default.conf" = <<-EOT
server {
listen 80;
server_name _;
root /usr/share/nginx/html;
index stem_board.html index.html;
}
EOT
}
}
resource "kubernetes_deployment" "stem95su" {
metadata {
name = "stem95su"
namespace = kubernetes_namespace.stem95su.metadata[0].name
labels = {
run = "stem95su"
tier = local.tiers.aux
}
}
spec {
replicas = 1
selector {
match_labels = {
run = "stem95su"
}
}
template {
metadata {
labels = {
run = "stem95su"
}
}
spec {
container {
image = "nginx:1.28-alpine"
name = "nginx"
resources {
limits = {
memory = "64Mi"
}
requests = {
cpu = "10m"
memory = "64Mi"
}
}
port {
container_port = 80
}
volume_mount {
name = "content"
mount_path = "/usr/share/nginx/html"
read_only = true
}
volume_mount {
name = "nginx-conf"
mount_path = "/etc/nginx/conf.d"
read_only = true
}
readiness_probe {
http_get {
path = "/"
port = 80
}
initial_delay_seconds = 3
period_seconds = 10
}
}
volume {
name = "content"
persistent_volume_claim {
claim_name = module.nfs_content.claim_name
}
}
volume {
name = "nginx-conf"
config_map {
name = kubernetes_config_map.nginx_conf.metadata[0].name
}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "stem95su" {
metadata {
name = "stem95su"
namespace = kubernetes_namespace.stem95su.metadata[0].name
labels = {
run = "stem95su"
}
}
spec {
selector = {
run = "stem95su"
}
port {
name = "http"
port = "80"
target_port = "80"
}
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "none": public static educational site for 95. СУ, open to the internet by design. CrowdSec + ai-bot-block gate bots; no login.
auth = "none"
namespace = kubernetes_namespace.stem95su.metadata[0].name
name = "stem95su"
service_name = kubernetes_service.stem95su.metadata[0].name
port = "80"
host = "stem95su"
dns_type = "proxied"
tls_secret_name = var.tls_secret_name
}
# stem95su moved OFF-INFRA to Cloudflare Pages (ADR-0018 cutover, 2026-07-03):
# registry entry `stem95su` in stacks/valia-sites; runbook
# docs/runbooks/valia-sites.md. This stack intentionally declares NOTHING:
# the apply that landed this file destroyed the old in-cluster serving
# (nginx + NFS content PVC + ingress + per-site gdrive-sync CronJob +
# namespace). Directory kept only so the destroy could run through CI;
# safe to delete the dir + its PG state schema in a later cleanup.
# Harmless leftovers (manual cleanup if ever wanted): /srv/nfs/stem-site on
# the PVE host, and Vault secret/stem95su (superseded by secret/valia-sites).


@ -1,9 +0,0 @@
variable "tls_secret_name" {
type = string
sensitive = true
}
variable "nfs_server" {
type = string
default = "192.168.1.127"
}

53
stacks/tasks/imports.tf Normal file

@ -0,0 +1,53 @@
# One-shot adoption of the live tasks-stack resources that exist in-cluster but
# were never persisted to Terraform state: pipeline 477 (2026-07-03, the stack's
# first apply) died mid-`[tasks] apply` after creating the resources, before
# the pg backend write, so `tasks.states` stayed empty and every later apply
# would create-fail with `namespaces "tasks" already exists` (same class as the
# monitoring alert-digest adoption in stacks/monitoring/imports.tf). Importing
# reconciles them into state so `terraform apply` UPDATES instead of failing to
# create. These blocks are idempotent (a no-op once the resources are in state)
# and may be removed after the next green apply. Defs: main.tf.
# (module.ingress_icons is deliberately NOT here: it does not exist live yet;
# the same apply creates it.)
import {
to = kubernetes_namespace.tasks
id = "tasks"
}
import {
to = kubernetes_manifest.external_secret
id = "apiVersion=external-secrets.io/v1,kind=ExternalSecret,namespace=tasks,name=tasks-secrets"
}
import {
to = kubernetes_manifest.db_external_secret
id = "apiVersion=external-secrets.io/v1,kind=ExternalSecret,namespace=tasks,name=tasks-db-creds"
}
import {
to = kubernetes_deployment.tasks
id = "tasks/tasks"
}
import {
to = kubernetes_service.tasks
id = "tasks/tasks"
}
import {
to = kubernetes_network_policy_v1.tasks_ingress
id = "tasks/tasks-ingress"
}
import {
to = module.ingress.kubernetes_ingress_v1.proxied-ingress
id = "tasks/tasks"
}
# Cloudflare record ID looked up via the API (zone fd2c5dd4 / record for
# tasks.viktorbarzin.me, CNAME to the cfargotunnel target, proxied).
import {
to = module.ingress.cloudflare_record.proxied[0]
id = "fd2c5dd4efe8fe38958944e74d0ced6d/a8e6901a074c5255d09700d93eaaf705"
}
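The two `kubernetes_manifest` import IDs above follow the `apiVersion=...,kind=...,namespace=...,name=...` form. When adopting further manifests, composing that string mechanically avoids typos. The sketch below does this in POSIX sh; `manifest_import_id` is a hypothetical helper, not something in this repo, and the optional-namespace handling for cluster-scoped objects is an assumption based on the provider's documented ID format.

```shell
#!/bin/sh
# manifest_import_id API_VERSION KIND NAMESPACE NAME  (hypothetical helper)
# Pass an empty NAMESPACE for cluster-scoped objects.
manifest_import_id() {
  if [ -n "$3" ]; then
    printf 'apiVersion=%s,kind=%s,namespace=%s,name=%s\n' "$1" "$2" "$3" "$4"
  else
    printf 'apiVersion=%s,kind=%s,name=%s\n' "$1" "$2" "$4"
  fi
}

# Reproduces the ID used for kubernetes_manifest.external_secret above.
manifest_import_id external-secrets.io/v1 ExternalSecret tasks tasks-secrets
```

The printed string can be pasted into the `id` field of an `import` block.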

378
stacks/tasks/main.tf Normal file

@ -0,0 +1,378 @@
variable "image_tag" {
type = string
default = "latest"
description = "tasks image tag. Running tag is set by the Woodpecker deploy (kubectl set image)."
}
variable "postgresql_host" { type = string }
variable "tls_secret_name" {
type = string
sensitive = true
}
locals {
namespace = "tasks"
# ADR-0002: built on GHA from the public GitHub mirror, pushed to ghcr
# (public package, so anonymous pulls). Running tag is managed by the
# Woodpecker deploy (kubectl set image); the image ref below is
# ignore_changes'd (KEEL_IGNORE_IMAGE), so this base only matters on
# (re)create.
image = "ghcr.io/viktorbarzin/tasks:${var.image_tag}"
labels = {
app = "tasks"
}
}
resource "kubernetes_namespace" "tasks" {
metadata {
name = local.namespace
labels = {
tier = local.tiers.aux
"istio-injection" = "disabled"
# Opt into Keel auto-update (inject-keel-annotations ClusterPolicy).
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label.
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# App secrets. Seed these in Vault before applying:
# secret/tasks
# fernet_key: Fernet key encrypting the per-user Nextcloud app passwords
# stored in the Connected Accounts table (tasks ADR-0002).
#
# DB: CNPG database `tasks` (created in dbaas, null_resource.pg_tasks_db);
# role password managed via the Vault database engine; see
# static-creds/pg-tasks. Alembic runs migrations on app startup (no init
# container needed).
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "tasks-secrets"
namespace = local.namespace
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "tasks-secrets"
template = {
metadata = {
annotations = {
"reloader.stakater.com/match" = "true"
}
}
}
}
data = [
{ secretKey = "TASKS_FERNET_KEY", remoteRef = { key = "tasks", property = "fernet_key" } },
]
}
}
depends_on = [kubernetes_namespace.tasks]
}
# DB credentials from Vault database engine (7-day rotation).
# Builds the asyncpg DSN consumed by the FastAPI app as TASKS_DB_DSN.
# Pre-req in dbaas: CNPG cluster has DB `tasks`, role `tasks`, and Vault
# role `static-creds/pg-tasks`.
resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "tasks-db-creds"
namespace = local.namespace
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-database"
kind = "ClusterSecretStore"
}
target = {
name = "tasks-db-creds"
template = {
metadata = {
annotations = {
"reloader.stakater.com/match" = "true"
}
}
data = {
TASKS_DB_DSN = "postgresql+asyncpg://tasks:{{ .password }}@${var.postgresql_host}:5432/tasks"
DB_PASSWORD = "{{ .password }}"
}
}
}
data = [{
secretKey = "password"
remoteRef = {
key = "static-creds/pg-tasks"
property = "password"
}
}]
}
}
depends_on = [kubernetes_namespace.tasks]
}
resource "kubernetes_deployment" "tasks" {
metadata {
name = "tasks"
namespace = kubernetes_namespace.tasks.metadata[0].name
labels = merge(local.labels, {
tier = local.tiers.aux
})
annotations = {
# Reloader restarts the pod when tasks-secrets / tasks-db-creds change
# (both carry reloader.stakater.com/match=true); required because the
# DB password rotates every 7 days and is read only at startup.
"reloader.stakater.com/search" = "true"
}
}
spec {
# Single leader: the CalDAV sync engine wants one writer per user's
# sync-token cursor; the SPA is served by the same process.
replicas = 1
strategy {
type = "Recreate"
}
selector {
match_labels = local.labels
}
template {
metadata {
labels = local.labels
annotations = {
# Prometheus scrapes the service-endpoints (annotations live on the
# Service below); the pod annotations here let the kubernetes-pods
# SD job also discover /metrics directly.
"prometheus.io/scrape" = "true"
"prometheus.io/path" = "/metrics"
"prometheus.io/port" = "8000"
}
}
spec {
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "tasks"
image = local.image
port {
container_port = 8000
}
# TASKS_FERNET_KEY via tasks-secrets; TASKS_DB_DSN via tasks-db-creds.
env_from {
secret_ref { name = "tasks-secrets" }
}
env_from {
secret_ref { name = "tasks-db-creds" }
}
# Wall-clock zone for all-day due dates (DUE;VALUE=DATE) and the
# Today/Scheduled smart views.
env {
name = "TASKS_LOCAL_TZ"
value = "Europe/Sofia"
}
# SECURITY INVARIANT: DEV_USER must NEVER be set here. It is the
# dev-only identity fallback: when present the backend treats every
# request as that user, bypassing the Authentik forward-auth
# identity (X-authentik-username) entirely. Production identity
# comes ONLY from the header Traefik/Authentik injects.
readiness_probe {
http_get {
path = "/healthz"
port = 8000
}
initial_delay_seconds = 5
period_seconds = 10
}
liveness_probe {
http_get {
path = "/healthz"
port = 8000
}
initial_delay_seconds = 30
period_seconds = 30
}
resources {
requests = { cpu = "100m", memory = "384Mi" }
limits = { memory = "384Mi" }
}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Woodpecker deploy sets the running tag
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
depends_on = [
kubernetes_manifest.external_secret,
kubernetes_manifest.db_external_secret,
]
}
resource "kubernetes_service" "tasks" {
metadata {
name = "tasks"
namespace = kubernetes_namespace.tasks.metadata[0].name
labels = local.labels
annotations = {
# Prometheus kubernetes-service-endpoints SD scrapes /metrics here.
"prometheus.io/scrape" = "true"
"prometheus.io/path" = "/metrics"
"prometheus.io/port" = "8000"
}
}
spec {
type = "ClusterIP"
selector = local.labels
port {
name = "http"
port = 8000
target_port = 8000
}
}
}
# Kyverno ClusterPolicy `sync-tls-secret` auto-clones the wildcard TLS
# secret into every namespace, so we don't need a setup_tls_secret module.
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "required": Authentik forward-auth gates EVERY request; the app
# has no login of its own and blindly trusts the X-authentik-username
# header the outpost injects, so Authentik is the only thing standing
# between strangers and everyone's tasks. Do NOT relax this tier (tasks
# design decision #3; pairs with the NetworkPolicy below, SEC-1).
auth = "required"
dns_type = "proxied"
namespace = kubernetes_namespace.tasks.metadata[0].name
name = "tasks"
port = 8000
tls_secret_name = var.tls_secret_name
}
# Carve-out for the PWA icon assets + web manifest. macOS Safari's
# "Add to Dock" (and every other OS icon fetcher: iOS Add-to-Home-Screen,
# Android install prompt) fetches these in a cookie-less context; behind
# forward-auth it got the Authentik 302 and fell back to a letter monogram.
# Traefik prioritises these longer path prefixes over the main "/" router,
# so ONLY these five static files bypass Authentik; the SPA shell and /api
# stay gated by the main ingress above (and the app itself 401s /api
# without the identity header). Guarded against regression by the
# tasks-icons entry in the Authentik walling-off probe
# (stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf).
module "ingress_icons" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "none": public static icons + manifest, no user data; required for
# OS icon fetchers (Safari Add-to-Dock etc.) that carry no session and
# cannot complete the Authentik redirect dance.
auth = "none"
namespace = kubernetes_namespace.tasks.metadata[0].name
name = "tasks-icons"
service_name = kubernetes_service.tasks.metadata[0].name
port = 8000
ingress_path = [
"/apple-touch-icon.png",
"/favicon.png",
"/pwa-192x192.png",
"/pwa-512x512.png",
"/manifest.webmanifest",
]
full_host = "tasks.viktorbarzin.me" # MUST match the main ingress host; otherwise the factory derives tasks-icons.viktorbarzin.me and the carve-out never matches.
dns_type = "none" # host record already owned by the main tasks ingress
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false # Five static icons + a manifest; nothing for scrapers to mine.
homepage_enabled = false # path carve-out, not its own dashboard tile
}
# --- NetworkPolicy: scoped pod ingress (security-review finding SEC-1). ---
# The app trusts X-authentik-username unconditionally, so its ENTIRE auth
# model depends on requests only ever arriving through Traefik (where the
# Authentik forward-auth middleware sets that header). Any pod that could
# reach the pod IP directly could spoof the header and read/write anyone's
# tasks; hence ingress is restricted to:
# - TCP/8000 from the traefik namespace (user traffic, post-forward-auth);
# - TCP/8000 from the monitoring namespace (Prometheus /metrics scrape).
# The cluster has no default-deny, so this NP only takes effect inside the
# tasks ns; pods elsewhere remain unaffected. (Same shape as
# chrome-service's chrome-service-ws-ingress.)
resource "kubernetes_network_policy_v1" "tasks_ingress" {
metadata {
name = "tasks-ingress"
namespace = kubernetes_namespace.tasks.metadata[0].name
}
spec {
pod_selector {
match_labels = local.labels
}
policy_types = ["Ingress"]
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "traefik"
}
}
}
ports {
port = "8000"
protocol = "TCP"
}
}
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "monitoring"
}
}
}
ports {
port = "8000"
protocol = "TCP"
}
}
}
}


@ -0,0 +1,23 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
dependency "vault" {
config_path = "../vault"
skip_outputs = true
}
dependency "external-secrets" {
config_path = "../external-secrets"
skip_outputs = true
}
inputs = {
# Override per-deploy in CI / commit.
image_tag = "latest"
}


@ -873,6 +873,14 @@ resource "kubernetes_cluster_role" "ingress_dns_sync" {
resources = ["services"]
verbs = ["get", "list"]
}
# Read the Valia-sites internal-DNS feed (written by stacks/valia-sites,
# ADR-0018) so the sync can reconcile off-infra Pages CNAMEs declaratively.
rule {
api_groups = [""]
resources = ["configmaps"]
resource_names = ["valia-sites-dns"]
verbs = ["get"]
}
}
resource "kubernetes_cluster_role_binding" "ingress_dns_sync" {
@ -1002,6 +1010,42 @@ resource "kubernetes_cron_job_v1" "technitium_ingress_dns_sync" {
echo "mail-auth: MX present"
fi
# Valia sites (ADR-0018): off-infra Cloudflare Pages sites.
# The internal zone is authoritative (superset rule above), so
# these public-only names must exist here or every internal
# client NXDOMAINs on them. Reconciled DECLARATIVELY from the
# ConfigMap valia-sites-dns (written by stacks/valia-sites):
# ensure/update every entry, and DELETE stale records that
# left the map (site retired/renamed). Deletion is scoped to
# CNAMEs targeting *.pages.dev; nothing else is ever touched.
# Targets resolve upstream to CF edge IPs; no hairpin involved.
VALIA=$$(kubectl get configmap valia-sites-dns -n technitium -o go-template='{{range $$k, $$v := .data}}{{$$k}} {{$$v}}{{"\n"}}{{end}}' 2>/dev/null || true)
if [ -n "$$VALIA" ]; then
printf '%s\n' "$$VALIA" | while read -r VNAME VTARGET; do
[ -z "$$VNAME" ] && continue
CUR=$$(curl -sf "$$TECH_API/api/zones/records/get?token=$$TOKEN&zone=$$ZONE&domain=$$VNAME.$$ZONE" | grep -o '"cname":"[^"]*"' | head -1 | cut -d'"' -f4)
if [ "$$CUR" = "$$VTARGET" ]; then
echo "valia: $$VNAME.$$ZONE ok"
continue
fi
if [ -n "$$CUR" ]; then
curl -sf -G "$$TECH_API/api/zones/records/delete" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$VNAME.$$ZONE" --data-urlencode "type=CNAME" --data-urlencode "cname=$$CUR" > /dev/null || true
fi
R=$$(curl -sf -G "$$TECH_API/api/zones/records/add" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$VNAME.$$ZONE" --data-urlencode "type=CNAME" --data-urlencode "cname=$$VTARGET" --data-urlencode "ttl=3600") || true
echo "$$R" | grep -q '"status":"ok"' && echo "valia: set $$VNAME.$$ZONE -> $$VTARGET" || echo "valia: FAILED $$VNAME.$$ZONE -- $$R"
done
# Deletion pass: zone CNAMEs targeting *.pages.dev that are
# no longer in the map. ZONE_DUMP predates this run's adds,
# but just-set names are in $VALIA so they're never deleted.
printf '%s' "$$ZONE_DUMP" | tr ',' '\n' | awk -F'"' '/"name":/{n=$$4} /"cname":/{print n" "$$4}' | grep '\.pages\.dev *$$' | while read -r RNAME RTARGET; do
SHORT=$${RNAME%%.$$ZONE}
printf '%s\n' "$$VALIA" | grep -q "^$$SHORT " && continue
curl -sf -G "$$TECH_API/api/zones/records/delete" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$RNAME" --data-urlencode "type=CNAME" --data-urlencode "cname=$$RTARGET" > /dev/null && echo "valia: removed stale $$RNAME -> $$RTARGET"
done
else
echo "valia: CM valia-sites-dns absent/unreadable -- skipping Pages CNAMEs this run"
fi
# Pin the .lan ingress anchor A record to the LIVE Traefik LB IP.
# *.viktorbarzin.lan ingress hosts CNAME to ingress.viktorbarzin.lan,
# so a Traefik LB IP move that misses the .lan zone silently breaks
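The Valia deletion pass in the hunk above is, in effect, a scoped set difference: existing CNAMEs that target `*.pages.dev`, minus the names still present in the ConfigMap. The sketch below isolates that step in POSIX sh. It assumes both inputs are already plain "name target" lines (the real script derives them from the Technitium zone dump and the ConfigMap) and uses single `$` because it runs outside a Terraform heredoc. `stale_records` and the sample names are hypothetical.

```shell
#!/bin/sh
# stale_records DESIRED EXISTING  (hypothetical helper)
# Both arguments hold one "name target" pair per line.
# Prints the EXISTING pairs that target *.pages.dev and whose name is not in DESIRED.
stale_records() {
  desired=$1
  printf '%s\n' "$2" | grep '\.pages\.dev *$' | while read -r name target; do
    # Exact match on the name column; skip records that are still wanted.
    if printf '%s\n' "$desired" | awk -v n="$name" '$1 == n { found = 1 } END { exit !found }'; then
      continue
    fi
    printf '%s %s\n' "$name" "$target"
  done
}

want=$(printf '%s\n' 'bridge bridge.pages.dev' 'stem95su stem95su.pages.dev')
have=$(printf '%s\n' 'bridge bridge.pages.dev' 'old old.pages.dev' 'mail mx.example.net')
stale_records "$want" "$have"   # prints: old old.pages.dev
```

Filtering on the `.pages.dev` suffix before comparing names keeps unrelated records (here `mail`) out of the deletion candidates, which is the scoping the comment in the script describes.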


@ -119,6 +119,41 @@ resource "kubernetes_manifest" "middleware_local_only" {
depends_on = [helm_release.traefik]
}
# IP allowlist for household access across ALL home sites: Sofia LAN + the
# WireGuard spoke LANs (London, Valchedrym) + 10/8 (VLANs, K8s pods/services,
# WG tunnel IPs). Deliberately a SEPARATE middleware from `local-only`:
# widening local-only would grant the remote LANs access to the admin surfaces
# that use it (Prometheus, iDRAC, Loki, etc.). Use for family-facing services
# (e.g. the immich-frame kiosks) that every household device may open but the
# public internet must not. Pair with ingress_factory `dns_type = "internal"`:
# a Cloudflare-proxied record would deliver public traffic from cloudflared
# POD IPs (inside 10/8) and silently bypass this allowlist.
resource "kubernetes_manifest" "middleware_home_lans_only" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "home-lans-only"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
ipAllowList = {
sourceRange = [
"192.168.1.0/24", # Sofia LAN (hub site)
"10.0.0.0/8", # VLANs, K8s pod/svc CIDRs, WG tunnel subnet
"192.168.8.0/24", # London LAN (via WG tunnel)
"192.168.9.0/24", # London GUEST net the Portal Plus actually leases here (Portal-75AE8F9C2A8A = 192.168.9.198)
"192.168.0.0/24", # Valchedrym LAN (via WG tunnel)
"fc00::/7",
"fe80::/10",
]
}
}
}
depends_on = [helm_release.traefik]
}
# HTTPS redirect middleware
resource "kubernetes_manifest" "middleware_redirect_https" {
manifest = {
@ -368,6 +403,33 @@ resource "kubernetes_manifest" "middleware_authentik_rate_limit" {
depends_on = [helm_release.traefik]
}
# Dawarich-specific rate limit. The Rails app serves all its fingerprinted
# assets itself (JS/CSS chunks, SVG store badges, favicons, webmanifest) and
# the map view adds a points/API burst on load; a single page load from one
# client IP blows past the default 10/50 limiter and 429s the asset tail
# (seventh instance of the burst pattern, after ha-sofia, ActualBudget, noVNC,
# tripit, health and authentik). Background location ingestion (OwnTracks
# bridge + mobile api_key POSTs) rides the same host, so 429s here also risk
# dropped pings. Burst absorbs a couple of full page loads back-to-back.
resource "kubernetes_manifest" "middleware_dawarich_rate_limit" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "dawarich-rate-limit"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
rateLimit = {
average = 100
burst = 1000
}
}
}
depends_on = [helm_release.traefik]
}
# Compress responses to clients at the entrypoint level (outermost).
# Applied at websecure entrypoint so all responses get compressed.
# Uses includedContentTypes (whitelist) instead of excludedContentTypes:


@ -175,6 +175,12 @@ locals {
STORY_SOURCE_MODE = "web"
SCRIPT_WRITER_MODE = "chat"
PLACE_RESOLVER_MODE = "wikipedia"
# Saved Place preview photos (tripit ADR-0035/0040): the Wikipedia lead-image
# fetcher behind manual-add-time photos and the backfill sweep. Same
# fake-default gap as the resolver above: never set, so prod silently ran the
# fake and hand-added places (and any backfill) would store placeholder
# PNGs instead of real photos.
PLACE_PHOTO_PROVIDER = "wikipedia"
}
}

368
stacks/valia-sites/main.tf Normal file

@ -0,0 +1,368 @@
# Valia sites (ADR-0018): small static sites authored by Valia in Google Drive,
# served OFF-INFRA on Cloudflare Pages, mirrored by the in-cluster CronJob below
# every 10 minutes. Registering a new site = one entry in local.sites (plus
# Valia sharing the folder with vbarzin@gmail.com). Full runbook:
# docs/runbooks/valia-sites.md
#
# Per site this stack fans out:
# - cloudflare_pages_project + custom domain <name>.viktorbarzin.me
# - public proxied CNAME <name> -> <project>.pages.dev (manage_dns gate)
# - internal split-horizon CNAME via ConfigMap valia-sites-dns consumed by
# the technitium-ingress-dns-sync script (declarative: add/update/REMOVE)
# - a slot in the shared sync CronJob (rclone mirror -> wrangler deploy)
locals {
cloudflare_account_id = "02e035473cfc4834fb10c5d35470d8b4" # vbarzin@gmail.com's account (not a secret)
# THE site registry. Keys are the public subdomain (English, Viktor picks; see
# CONTEXT.md "Valia site"). folder_id = the Drive folder Valia shared (the
# Content folder); src_path = subfolder holding servable files ("" = root);
# entry_file = what / must serve (staged as index.html at deploy time).
# manage_dns = false parks a site's public CNAME + internal record while the
# name is still owned elsewhere (used for the stem95su ingress cutover).
sites = {
bridge = {
folder_id = "1YWwAtSTsJD9HOzckGRIFXigWqCgYSGEa" # "мост" ОбУ Отец Паисий
src_path = ""
entry_file = "index.html"
manage_dns = true
}
stem95su = {
folder_id = "1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_" # "claude" 95. СУ STEM board
src_path = "stem claude/files"
entry_file = "stem_board.html"
manage_dns = true
}
}
dns_managed_sites = { for k, v in local.sites : k => v if v.manage_dns }
}
# ---------------------------------------------------------------------------
# Cloudflare Pages: project + custom domain per site
# ---------------------------------------------------------------------------
resource "cloudflare_pages_project" "site" {
for_each = local.sites
account_id = local.cloudflare_account_id
name = each.key
production_branch = "main"
}
# bridge was created by hand (wrangler) on 2026-07-03; adopt, don't recreate.
import {
to = cloudflare_pages_project.site["bridge"]
id = "02e035473cfc4834fb10c5d35470d8b4/bridge"
}
resource "cloudflare_pages_domain" "site" {
for_each = local.sites
account_id = local.cloudflare_account_id
project_name = cloudflare_pages_project.site[each.key].name
domain = "${each.key}.viktorbarzin.me"
}
import {
to = cloudflare_pages_domain.site["bridge"]
id = "02e035473cfc4834fb10c5d35470d8b4/bridge/bridge.viktorbarzin.me"
}
# Public proxied CNAME. Gated on manage_dns: a site whose name is still served
# by an in-cluster ingress keeps its ingress_factory record until cutover
# (two records can't share one name).
resource "cloudflare_record" "site" {
for_each = local.dns_managed_sites
zone_id = var.cloudflare_zone_id
name = each.key
content = cloudflare_pages_project.site[each.key].subdomain
type = "CNAME"
proxied = true
ttl = 1
}
# bridge's record predates this stack (created 2026-07-03 in stacks/cloudflared,
# handed off via removed{} there); adopt by id.
import {
to = cloudflare_record.site["bridge"]
id = "fd2c5dd4efe8fe38958944e74d0ced6d/ff4fb6f4900744d4b22de50d3fdd219b"
}
# ---------------------------------------------------------------------------
# Internal split-horizon DNS feed (docs/architecture/dns.md "superset rule"):
# the technitium-ingress-dns-sync script reads this CM and reconciles internal
# CNAMEs for every entry, including deleting stale *.pages.dev records when
# an entry disappears (site retired/renamed).
# ---------------------------------------------------------------------------
resource "kubernetes_config_map" "valia_sites_dns" {
metadata {
name = "valia-sites-dns"
namespace = "technitium"
labels = { "app.kubernetes.io/managed-by" = "valia-sites" }
}
data = { for k, v in local.dns_managed_sites : k => cloudflare_pages_project.site[k].subdomain }
}
# ---------------------------------------------------------------------------
# The shared sync CronJob
# ---------------------------------------------------------------------------
resource "kubernetes_namespace" "valia_sites" {
metadata {
name = "valia-sites"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Secrets: shared drive.readonly rclone conf + the SCOPED CF Pages token
# (Pages Read/Write only; the Global API Key never enters a pod).
resource "kubernetes_manifest" "sync_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "valia-sites-sync"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
spec = {
refreshInterval = "1h"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = { name = "valia-sites-sync" }
data = [
{
secretKey = "rclone.conf"
remoteRef = { key = "valia-sites", property = "rclone_conf" }
},
{
secretKey = "CLOUDFLARE_API_TOKEN"
remoteRef = { key = "valia-sites", property = "cloudflare_pages_token" }
},
{
secretKey = "CLOUDFLARE_ACCOUNT_ID"
remoteRef = { key = "valia-sites", property = "account_id" }
},
]
}
}
depends_on = [kubernetes_namespace.valia_sites]
}
# Site registry rendered for the job (folder ids aren't secrets).
resource "kubernetes_config_map" "sync_config" {
metadata {
name = "valia-sites-config"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
data = {
"sites.json" = jsonencode(local.sites)
}
}
# Last-deployed manifest hash per site, written by the job (merge-patch), so
# TF must never fight it over data.
resource "kubernetes_config_map" "sync_state" {
metadata {
name = "valia-sites-state"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
data = {}
lifecycle {
ignore_changes = [data]
}
}
resource "kubernetes_service_account" "sync" {
metadata {
name = "valia-sites-sync"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
}
resource "kubernetes_role" "sync_state" {
metadata {
name = "valia-sites-sync-state"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
rule {
api_groups = [""]
resources = ["configmaps"]
resource_names = ["valia-sites-state"]
verbs = ["get", "patch"]
}
}
resource "kubernetes_role_binding" "sync_state" {
metadata {
name = "valia-sites-sync-state"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.sync_state.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.sync.metadata[0].name
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
}
resource "kubernetes_cron_job_v1" "sync" {
metadata {
name = "valia-sites-sync"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
labels = { app = "valia-sites", component = "sync" }
}
spec {
schedule = "*/10 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 2
failed_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
template {
metadata { labels = { app = "valia-sites", component = "sync" } }
spec {
restart_policy = "OnFailure"
service_account_name = kubernetes_service_account.sync.metadata[0].name
container {
name = "sync"
image = "ghcr.io/viktorbarzin/valia-sites-sync:latest"
# Guards mirror stem95su's proven set: hard-fail on Drive
# list/auth errors (visible as a failed Job: the chosen
# visibility, ADR-0018), skip quietly when a folder is empty or
# missing its entry file (never wipe a live site), capped
# deletes. Deploy ONLY on remote-manifest change: CF Pages caps
# monthly deployments on the free tier, so 144 no-op
# deploys/day is not an option.
command = ["/bin/sh", "-c", <<-EOT
set -u
cp /config/rclone.conf /tmp/rc.conf
APISERVER="https://kubernetes.default.svc"
SA=/var/run/secrets/kubernetes.io/serviceaccount
KTOKEN=$$(cat $$SA/token); NS=$$(cat $$SA/namespace)
STATE_URL="$$APISERVER/api/v1/namespaces/$$NS/configmaps/valia-sites-state"
FAILED=0
for SITE in $$(jq -r 'keys[]' /sites/sites.json); do
FOLDER=$$(jq -r --arg s "$$SITE" '.[$$s].folder_id' /sites/sites.json)
SRC_PATH=$$(jq -r --arg s "$$SITE" '.[$$s].src_path' /sites/sites.json)
ENTRY=$$(jq -r --arg s "$$SITE" '.[$$s].entry_file' /sites/sites.json)
RC="rclone --config /tmp/rc.conf --drive-root-folder-id=$$FOLDER --drive-skip-gdocs"
# 1. Remote manifest (path;hash;size per line, per --format phs): metadata only, no download.
MANIFEST=$$($$RC lsf "gdrive:$$SRC_PATH" -R --files-only --format phs 2>/tmp/lsf.err) || {
echo "FATAL [$$SITE]: Drive list failed (auth/network):"; cat /tmp/lsf.err; FAILED=1; continue; }
N=$$(printf '%s\n' "$$MANIFEST" | grep -c . || true)
if [ "$$N" -lt 1 ] || ! printf '%s\n' "$$MANIFEST" | cut -d';' -f1 | grep -qx "$$ENTRY"; then
echo "GUARD [$$SITE]: N=$$N / $$ENTRY missing -- skipping, site untouched"; continue
fi
# Cloudflare Pages hard-caps files at 25 MB; deploying
# without an oversize file would silently break the pages
# that reference it, so skip the whole site instead (last
# deployed content keeps serving) and say so loudly.
OVERSIZE=$$(printf '%s\n' "$$MANIFEST" | awk -F';' '$$3 > 26214400 {print $$1" ("$$3" B)"}')
if [ -n "$$OVERSIZE" ]; then
echo "GUARD [$$SITE]: file(s) exceed the 25MB Pages limit -- skipping, site untouched:"; echo "$$OVERSIZE"; continue
fi
HASH=$$(printf '%s' "$$MANIFEST" | sha256sum | cut -d' ' -f1)
LAST=$$(curl -sf --cacert $$SA/ca.crt -H "Authorization: Bearer $$KTOKEN" "$$STATE_URL" | jq -r --arg s "$$SITE" '.data[$$s] // ""')
if [ "$$HASH" = "$$LAST" ]; then echo "OK [$$SITE]: unchanged"; continue; fi
# 2. Content changed: pull and deploy.
$$RC sync "gdrive:$$SRC_PATH" "/work/$$SITE" --exclude ".DS_Store" --fast-list --transfers 4 --max-delete 25 -v || {
echo "FATAL [$$SITE]: rclone sync failed"; FAILED=1; continue; }
if [ "$$ENTRY" != "index.html" ]; then
cp "/work/$$SITE/$$ENTRY" "/work/$$SITE/index.html"
fi
wrangler pages deploy "/work/$$SITE" --project-name="$$SITE" --branch=main --commit-dirty=true || {
echo "FATAL [$$SITE]: wrangler deploy failed"; FAILED=1; continue; }
curl -sf --cacert $$SA/ca.crt -H "Authorization: Bearer $$KTOKEN" \
-X PATCH -H "Content-Type: application/merge-patch+json" \
-d "{\"data\":{\"$$SITE\":\"$$HASH\"}}" "$$STATE_URL" > /dev/null || {
echo "WARN [$$SITE]: state patch failed (will redeploy next run)"; FAILED=1; }
echo "DEPLOYED [$$SITE]: $$HASH"
done
exit $$FAILED
EOT
]
env {
name = "CLOUDFLARE_API_TOKEN"
value_from {
secret_key_ref {
name = "valia-sites-sync"
key = "CLOUDFLARE_API_TOKEN"
}
}
}
env {
name = "CLOUDFLARE_ACCOUNT_ID"
value_from {
secret_key_ref {
name = "valia-sites-sync"
key = "CLOUDFLARE_ACCOUNT_ID"
}
}
}
resources {
requests = { cpu = "25m", memory = "128Mi" }
limits = { memory = "512Mi" }
}
volume_mount {
name = "rclone-config"
mount_path = "/config"
read_only = true
}
volume_mount {
name = "sites-config"
mount_path = "/sites"
read_only = true
}
volume_mount {
name = "work"
mount_path = "/work"
}
}
volume {
name = "rclone-config"
secret {
secret_name = "valia-sites-sync"
items {
key = "rclone.conf"
path = "rclone.conf"
}
}
}
volume {
name = "sites-config"
config_map { name = kubernetes_config_map.sync_config.metadata[0].name }
}
volume {
name = "work"
empty_dir {}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
depends_on = [kubernetes_manifest.sync_external_secret]
}
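
The guard sequence in the sync job above (empty folder, missing entry file, oversize file, then manifest hash) can be exercised offline with a small function. This is a minimal sketch, not code from this PR: `classify_manifest` and its `skip:*` / `deploy:*` return strings are hypothetical names, and it assumes the manifest is already in the `path;hash;size` layout that `rclone lsf --format phs` produces. It uses `grep -F` so the entry file name is matched literally, which differs slightly from the job's regex match.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): classify a manifest the way the
# sync loop does, without touching Drive, Cloudflare or the state ConfigMap.
#   $1 = manifest text, one "path;hash;size" line per file
#   $2 = entry file that must be present
# Prints "skip:empty", "skip:no-entry", "skip:oversize" or "deploy:<sha256>".
classify_manifest() {
  manifest=$1
  entry=$2
  limit=26214400  # 25 * 1024 * 1024 bytes, the same constant the job uses

  # Count non-empty lines; grep -c exits 1 on zero matches, hence || true.
  n=$(printf '%s\n' "$manifest" | grep -c . || true)
  if [ "$n" -lt 1 ]; then
    echo "skip:empty"; return 0
  fi

  # The entry file must appear as an exact path in column 1.
  if ! printf '%s\n' "$manifest" | cut -d';' -f1 | grep -qxF "$entry"; then
    echo "skip:no-entry"; return 0
  fi

  # Any file above the size limit skips the whole site.
  oversize=$(printf '%s\n' "$manifest" | awk -F';' -v max="$limit" '$3 + 0 > max + 0 {print $1}')
  if [ -n "$oversize" ]; then
    echo "skip:oversize"; return 0
  fi

  # Otherwise the manifest hash is what gets compared with the stored state.
  echo "deploy:$(printf '%s' "$manifest" | sha256sum | cut -d' ' -f1)"
}
```

A caller would compare the hash after `deploy:` with the value stored for the site and deploy only when the two differ, which is what keeps no-op runs from consuming Pages deployments.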


@ -0,0 +1,15 @@
# valia-sites-sync: everything the 10-min Content-folder mirror needs, baked in
# (no runtime installs — CronJob pods must not apk/npm on every start).
# rclone pinned to match the proven stem95su version; wrangler pinned to major 4.
FROM node:22-alpine
RUN apk add --no-cache curl unzip ca-certificates jq \
&& curl -fsSL https://downloads.rclone.org/v1.74.3/rclone-v1.74.3-linux-amd64.zip -o /tmp/rclone.zip \
&& unzip -j /tmp/rclone.zip '*/rclone' -d /usr/local/bin \
&& chmod +x /usr/local/bin/rclone \
&& rm /tmp/rclone.zip \
&& npm install -g wrangler@4 \
&& npm cache clean --force
# wrangler writes config/cache under $HOME; the CronJob runs as non-root node (uid 1000)
ENV HOME=/tmp


@ -0,0 +1,8 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}


@ -0,0 +1,3 @@
variable "cloudflare_zone_id" {
type = string
}


@ -675,6 +675,7 @@ resource "vault_database_secret_backend_connection" "postgresql" {
"pg-nextcloud-todos",
"pg-technitium",
"pg-goldmane-edges",
"pg-tasks",
]
postgresql {
@ -903,6 +904,17 @@ resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
rotation_period = 604800
}
# tasks PWA (Reminders-style front-end over Nextcloud CalDAV): 7-day rotation
# for the `tasks` CNPG role. Consumed by stacks/tasks via a vault-database
# ExternalSecret -> TASKS_DB_DSN (remoteRef static-creds/pg-tasks).
resource "vault_database_secret_backend_static_role" "pg_tasks" {
backend = vault_mount.database.path
db_name = vault_database_secret_backend_connection.postgresql.name
name = "pg-tasks"
username = "tasks"
rotation_period = 604800
}
# =============================================================================
# Kubernetes Secrets Engine: Dynamic K8s Credentials
# =============================================================================
