feat(f1-stream): wire optional REDDIT_* env for replays activation

Adds REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET to the f1-stream deployment, sourced from the f1-stream-secrets Secret with optional=true so the pod still starts before the credentials exist. This activates the replays feature (app repo ADR-0002) once reddit_client_id / reddit_client_secret are added to the Vault "f1-stream" key (auto-synced via the ExternalSecret's dataFrom.extract) and the pod is restarted. Dormant/no-op until then.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merge branch 'master' of https://forgejo.viktorbarzin.me/viktor/infra
2026-07-04 20:57:43 +00:00 · 2026-07-04 20:15:41 +00:00 · 2026-07-04 20:15:31 +00:00 · 2026-07-04 14:37:38 +00:00 · 2026-07-04 14:21:01 +00:00 · 2026-07-04 13:38:39 +00:00
78 changed files with 9165 additions and 2816 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -81,7 +81,7 @@
 | ytdlp | YouTube downloader | ytdlp |
 | wealthfolio | Finance tracking | wealthfolio |
 | audiobookshelf | Audiobook server (may be merged into ebooks stack) | audiobookshelf |
-| paperless-ngx | Document management | paperless-ngx |
+| paperless-ngx | Document management. Mail ingest: forward document emails to `docs@viktorbarzin.me` — sender maps 1:1 to a paperless account (runbook `paperless-mail-ingest.md`) | paperless-ngx |
 | jsoncrack | JSON visualizer | jsoncrack |
 | servarr | Media automation (Sonarr/Radarr/etc) | servarr |
 | aiostreams | Stremio stream aggregator (Real-Debrid + Torrentio/Comet/StremThru Torz/Knaben; **MediaFusion removed 2026-06-07** — broken upstream `500`). `auth=app` (own UUID+password); stream-probe tests **both series+movie paths** with per-source breakdown (`aiostreams_streams_{comet,torrentio,stremthru_torz,knaben}`) + `aiostreams_error_streams` + `aiostreams_movie_stream_count`, success gated on Comet (workhorse) being alive; weekly NFS config + Stremio-account-collection backups to `/srv/nfs/aiostreams-backup/`. PG-backed user config (Comet timeout bumped 5s→10s 2026-06-07). | servarr/aiostreams |
@@ -99,6 +99,7 @@
 | tor-proxy | Tor proxy | tor-proxy |
 | forgejo | Git forge. Open native self-signup (Turnstile captcha + email confirm) + Authentik & GitHub OAuth sign-in; see `docs/runbooks/forgejo-open-signups.md` | forgejo |
 | freshrss | RSS reader | freshrss |
+| drone-logbook | DJI flight-log analyzer (Open DroneLog, upstream image) — dronelog.viktorbarzin.me | drone-logbook |
 | navidrome | Music streaming | navidrome |
 | networking-toolbox | Network tools | networking-toolbox |
 | stirling-pdf | PDF tools | stirling-pdf |
@@ -120,7 +121,9 @@
 | status-page | Status page | status-page |
 | plotting-book | Book plotting/world-building app | plotting-book |
 | tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). `auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit |
-| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su |
+| tasks | Reminders-style tasks PWA over Nextcloud CalDAV (FastAPI + SvelteKit SPA same-origin, single container; code `~/code/tasks`, design `tasks/docs/2026-07-03-tasks-pwa-design.md`). Nextcloud stays the source of truth (VTODOs); the app is the front-end Apple Reminders stopped being. CNPG (`tasks` db, Vault static role `pg-tasks`) stores Connected Accounts — per-user Nextcloud app passwords Fernet-encrypted with `fernet_key` from `secret/tasks`. `auth=required` (Authentik forward-auth; identity = `X-authentik-username`, NO app-level login — `DEV_USER` must never be set in prod) at tasks.viktorbarzin.me (proxied). Exception: the five PWA icon/manifest files (`/apple-touch-icon.png`, `/favicon.png`, `/pwa-192x192.png`, `/pwa-512x512.png`, `/manifest.webmanifest`) are a path-scoped `auth=none` carve-out (`module.ingress_icons`) so cookie-less OS icon fetchers (macOS Safari Add-to-Dock, mobile home-screen installs) get the real icon instead of the Authentik 302; guarded by the `tasks-icons` walloff-probe target. NetworkPolicy `tasks-ingress` (SEC-1) restricts pod ingress to traefik + monitoring namespaces so the trusted header can't be spoofed pod-to-pod. GHA → public ghcr `tasks` → Woodpecker deploy (ADR-0002). | tasks |
+| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me — **a Valia site on Cloudflare Pages since 2026-07-03** (ADR-0018): registry entry in `stacks/valia-sites`, synced from Drive folder "claude" every 10 min, deploy-on-change. The old in-cluster stack (nginx off PVE NFS + per-site rclone CronJob) is RETIRED — stacks/stem95su is a tombstone; `secret/stem95su` superseded by `secret/valia-sites`; `stem_video.mp4` was compressed 42.9→21.4MB (25MB Pages cap) with Viktor's OK. See docs/runbooks/valia-sites.md. | — |
+| valia-sites | **Valia-site registry + sync** (ADR-0018): all sites authored by Valia serve OFF-INFRA on Cloudflare Pages (`bridge` + `stem95su` live). One map entry in `stacks/valia-sites/main.tf` per site fans out Pages project + custom domain + public CNAME + internal split-horizon CNAME (ConfigMap `valia-sites-dns` → technitium sync, declarative incl. removal). CronJob `valia-sites-sync` (`*/10`, image ghcr `valia-sites-sync`) mirrors each Drive Content folder (rclone `drive.readonly`, stem95su-style guards + 25MB Pages-cap guard) and wrangler-deploys ONLY on manifest change (free-tier deploy cap). Secrets `secret/valia-sites` (shared rclone conf + SCOPED CF Pages token — Global API Key never in pods). Failed-Job-only visibility by choice. Runbook: docs/runbooks/valia-sites.md. | valia-sites |
 | trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek |

 ## Cloudflare Domains
@@ -130,7 +133,7 @@
 blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
 audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
 changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
-travel, netbox, phpipam, tripit, t3, stem95su
+travel, netbox, phpipam, tripit, t3, stem95su, tasks
 ```

 ### Non-Proxied (Direct DNS)
--- a/.github/workflows/build-excalidraw.yml
+++ b/.github/workflows/build-excalidraw.yml
@@ -0,0 +1,42 @@
+name: Build excalidraw-library
+
+# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind
+# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls
+# ghcr:latest and rolls the deployment. Replaces the manual DockerHub pushes
+# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image).
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/excalidraw/project/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-go@v5
+        with:
+          go-version: '1.21'
+      - run: go test ./...
+        working-directory: stacks/excalidraw/project
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/excalidraw/project
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/excalidraw-library:latest
+            ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }}
--- a/.github/workflows/build-valia-sites-sync.yml
+++ b/.github/workflows/build-valia-sites-sync.yml
@@ -0,0 +1,39 @@
+name: Build valia-sites-sync
+
+# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public).
+# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob.
+# Rebuilds are rare (tool pins only change deliberately) → dispatch + path.
+# Security note: no untrusted event inputs are interpolated anywhere (only
+# github.actor / github.sha / GITHUB_TOKEN — same shape as the other
+# build-*.yml workflows in this repo).
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/valia-sites/sync-image/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/valia-sites/sync-image
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/valia-sites-sync:latest
+            ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }}
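Reviewer note: the catalog entry above says the sync job carries an empty-source guard and a 25MB Pages-cap guard, and deploys only on manifest change. The following is a minimal Go sketch of that gating logic, not the contents of the `valia-sites-sync` image. The `remoteFile` type, every function name, the digest layout, and reading the cap as 25 MiB are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sort"
)

// pagesMaxFileBytes is an assumed reading of the "25MB Pages cap" (25 MiB).
const pagesMaxFileBytes = 25 * 1024 * 1024

// remoteFile is a hypothetical listing entry for one file in a Content folder.
type remoteFile struct {
	Path string
	Size int64
	Hash string
}

// checkSource refuses listings that should never reach the live site.
func checkSource(files []remoteFile) error {
	if len(files) == 0 {
		return errors.New("empty source listing, refusing to sync")
	}
	for _, f := range files {
		if f.Size > pagesMaxFileBytes {
			return fmt.Errorf("%s is %d bytes, over the per-file cap", f.Path, f.Size)
		}
	}
	return nil
}

// manifestDigest hashes the sorted (path, size, hash) tuples so that listing
// order does not affect the result.
func manifestDigest(files []remoteFile) string {
	sorted := append([]remoteFile(nil), files...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Path < sorted[j].Path })
	h := sha256.New()
	for _, f := range sorted {
		fmt.Fprintf(h, "%s\x00%d\x00%s\n", f.Path, f.Size, f.Hash)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// shouldDeploy reports whether a deploy is needed and returns the digest to
// remember for the next run.
func shouldDeploy(prevDigest string, files []remoteFile) (bool, string, error) {
	if err := checkSource(files); err != nil {
		return false, prevDigest, err
	}
	next := manifestDigest(files)
	return next != prevDigest, next, nil
}

func main() {
	files := []remoteFile{
		{Path: "stem_board.html", Size: 4096, Hash: "aa"},
		{Path: "stem_video.mp4", Size: 21_400_000, Hash: "bb"},
	}
	deploy, digest, err := shouldDeploy("", files)
	fmt.Println(deploy, err) // first run: digest differs from the empty previous value
	deploy, _, err = shouldDeploy(digest, files)
	fmt.Println(deploy, err) // unchanged folder: no deploy
	_, _, err = shouldDeploy(digest, nil)
	fmt.Println(err != nil) // empty listing is rejected before anything is touched
}
```

Running the guards before computing the digest means a partial or empty Drive listing fails the run outright instead of registering as a "change", which matches the stated goal that such states never wipe a live site.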
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -95,7 +95,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
 ## Key Paths
 - `stacks/<service>/main.tf` — service definition
 - `stacks/platform/modules/<service>/` — core infra modules
-- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"` or `"non-proxied"`)
+- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"`, `"non-proxied"`, or `"internal"` — a public A record carrying the internal Traefik LB IP for household-only services; pair with the `home-lans-only` ipAllowList middleware, never with `"proxied"`)
 - `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount)
 - `config.tfvars` — non-secret configuration (plaintext)
 - `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -118,6 +118,14 @@ _Avoid_: "external", "outside".
 `viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
 _Avoid_: bare "lan", "private", "intranet".

+**Segment**:
+One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q.
+_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept).
+
+**CCTV segment**:
+The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017).
+_Avoid_: "camera VLAN", "CCTV LAN".
+
 **Ingress auth**:
 The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
 _Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
@@ -229,6 +237,20 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**.
 **Anubis**:
 A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).

+### Externally-authored sites
+
+**Valia site**:
+A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `<name>.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`.
+_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**.
+
+**Content folder**:
+The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site.
+_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root).
+
+**Entry file**:
+The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring.
+_Avoid_: asking Valia to rename her files to fit hosting conventions.
+
 ## Relationships

 - A **Service** is defined by exactly one **Stack** — **flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
@@ -240,6 +262,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
 - A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
 - An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
 - Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
+- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra.

 ## Example dialogue

--- a/cli/VERSION
+++ b/cli/VERSION
@@ -1 +1 @@
-v0.11.0
+v0.12.0
--- a/cli/cmd_memory.go
+++ b/cli/cmd_memory.go
@@ -30,11 +30,21 @@ func memoryCommands() []Command {
 	}
 }

-// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
+// printMemories renders a {memories:[…]} response as one line per memory, or raw JSON.
 func printMemories(raw []byte, jsonOut bool) error {
+	fmt.Print(renderMemories(raw, jsonOut))
+	return nil
+}
+
+// renderMemories formats each memory as a single line with its FULL content
+// (newlines flattened to spaces). Content is deliberately never truncated: the
+// old 240-rune preview cut memories mid-sentence, misled agents into believing
+// no full-content read-back existed, and made blind `update --content` from
+// the preview silently destroy the stored tail. Full passthrough also can't
+// produce invalid UTF-8 (the old mid-rune cut crashed the recall hook).
+func renderMemories(raw []byte, jsonOut bool) string {
 	if jsonOut {
-		fmt.Println(string(raw))
-		return nil
+		return string(raw) + "\n"
 	}
 	var r struct {
 		Memories []struct {
@@ -46,36 +56,20 @@ func printMemories(raw []byte, jsonOut bool) error {
 		} `json:"memories"`
 	}
 	if err := json.Unmarshal(raw, &r); err != nil {
-		fmt.Println(string(raw))
-		return nil
+		return string(raw) + "\n"
 	}
 	if len(r.Memories) == 0 {
-		fmt.Println("(no memories)")
-		return nil
+		return "(no memories)\n"
 	}
+	var b strings.Builder
 	for _, m := range r.Memories {
-		c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
-		fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
+		c := strings.ReplaceAll(m.Content, "\n", " ")
+		fmt.Fprintf(&b, "#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
 		if m.Tags != "" {
-			fmt.Printf("       tags: %s\n", m.Tags)
+			fmt.Fprintf(&b, "       tags: %s\n", m.Tags)
 		}
 	}
-	return nil
-}
-
-// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
-// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
-// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
-// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
-// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
-// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
-// hook error" for Cyrillic-language users.
-func truncatePreview(s string, maxRunes int) string {
-	r := []rune(s)
-	if len(r) <= maxRunes {
-		return s
-	}
-	return string(r[:maxRunes]) + "…"
+	return b.String()
 }

 func memoryRecall(args []string) error {
--- a/cli/memory_test.go
+++ b/cli/memory_test.go
@@ -8,25 +8,53 @@ import (
 	"unicode/utf8"
 )

-func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
-	// Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits
-	// invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must
-	// cut on a rune boundary and always stay valid UTF-8.
-	long := strings.Repeat("я", 300) // 300 runes / 600 bytes
-	got := truncatePreview(long, 240)
+func TestRenderMemoriesFullContent(t *testing.T) {
+	// The pretty view must NOT truncate content: the old 240-rune preview cut
+	// memories mid-sentence, misled agents into thinking no full-content
+	// read-back existed, and made blind `update --content` from the preview
+	// destroy the stored tail. Full passthrough also removes the mid-rune-cut
+	// invalid-UTF-8 class by construction — nothing is ever sliced.
+	long := strings.Repeat("я", 300) + strings.Repeat("a", 300)
+	raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
+		{"id": 7, "content": long, "category": "facts", "tags": "t1,t2", "importance": 0.7},
+	}})
+	got := renderMemories(raw, false)
+	if !strings.Contains(got, long) {
+		t.Fatalf("content was truncated: %q", got)
+	}
+	if strings.Contains(got, "…") {
+		t.Fatalf("ellipsis in output — truncation still active: %q", got)
+	}
 	if !utf8.ValidString(got) {
-		t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
+		t.Fatalf("invalid UTF-8 in output: %q", got)
 	}
-	if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
-		t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
+	if !strings.Contains(got, "#7 [facts] (0.70) ") || !strings.Contains(got, "tags: t1,t2") {
+		t.Fatalf("line format broken: %q", got)
 	}
-	// Short multibyte strings pass through untouched (no ellipsis).
-	if got := truncatePreview("кратко", 240); got != "кратко" {
-		t.Fatalf("short string altered: %q", got)
+}
+
+func TestRenderMemoriesFlattensNewlinesToOneLine(t *testing.T) {
+	// Consumers (the recall hook, terminal skims) rely on one memory per line;
+	// multi-line content is flattened, never split across lines.
+	raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
+		{"id": 1, "content": "line one\nline two\nline three", "category": "facts", "importance": 0.5},
+	}})
+	got := renderMemories(raw, false)
+	if !strings.Contains(got, "line one line two line three") {
+		t.Fatalf("newlines not flattened: %q", got)
 	}
-	// ASCII boundary still works.
-	if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
-		t.Fatalf("ascii truncation wrong: %q", got)
+}
+
+func TestRenderMemoriesEdgeCases(t *testing.T) {
+	if got := renderMemories([]byte(`{"memories":[]}`), false); got != "(no memories)\n" {
+		t.Fatalf("empty list: %q", got)
+	}
+	// --json and unparseable responses pass through raw.
+	if got := renderMemories([]byte(`{"x":1}`), true); got != "{\"x\":1}\n" {
+		t.Fatalf("json passthrough: %q", got)
+	}
+	if got := renderMemories([]byte(`not json`), false); got != "not json\n" {
+		t.Fatalf("unparseable passthrough: %q", got)
 	}
 }

--- a/config.tfvars
+++ b/config.tfvars
--- a/docs/adr/0017-cctv-physical-cabling.svg
+++ b/docs/adr/0017-cctv-physical-cabling.svg
@@ -0,0 +1,126 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="820" viewBox="0 0 1600 820" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
+  <!-- ADR-0017: PHYSICAL cabling only — no VLANs, no flows. Solid = cable in
+       place today · dashed = camera-day work · ~~~ = radio. Palette: neutral
+       grays + blue for copper runs (reference dataviz palette text tokens). -->
+  <defs>
+    <marker id="dot" viewBox="0 0 8 8" refX="4" refY="4" markerWidth="5" markerHeight="5">
+      <circle cx="4" cy="4" r="3" fill="#52514e"/>
+    </marker>
+  </defs>
+
+  <rect width="1600" height="820" fill="#fcfcfb"/>
+
+  <text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — physical cabling (single-switch, rev 3)</text>
+  <text x="40" y="66" font-size="15" fill="#52514e">wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio</text>
+
+  <!-- ═════════ APARTMENT ═════════ -->
+  <rect x="40" y="100" width="330" height="330" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="56" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">APARTMENT</text>
+
+  <text x="70" y="158" font-size="13" fill="#52514e">☁ ISP (internet)</text>
+  <path d="M120,166 L120,196" fill="none" stroke="#52514e" stroke-width="2"/>
+
+  <rect x="64" y="198" width="220" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="80" y="222" font-size="14.5" font-weight="700" fill="#0b0b0b">AX6000 router</text>
+  <text x="80" y="242" font-size="12" fill="#52514e">192.168.1.1 · WAN←ISP · 8×LAN</text>
+
+  <rect x="64" y="290" width="220" height="52" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="80" y="312" font-size="14" font-weight="700" fill="#0b0b0b">Synology NAS · .13</text>
+  <text x="80" y="330" font-size="12" fill="#52514e">on an AX6000 LAN port</text>
+  <path d="M174,262 L174,290" fill="none" stroke="#2a78d6" stroke-width="2"/>
+
+  <text x="70" y="376" font-size="12.5" fill="#52514e">📶 wifi clients (phones, laptops)</text>
+  <path d="M110,262 C104,272 106,278 100,286 C106,294 104,300 100,308 C106,316 104,322 100,330 C106,338 104,344 100,352 C104,358 102,362 98,366" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
+
+  <!-- in-wall run apartment -> garage -->
+  <path d="M284,230 C450,230 540,228 616,228" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
+  <text x="330" y="218" font-size="12.5" font-weight="700" fill="#2a78d6">in-wall run → garage</text>
+
+  <!-- ═════════ GARAGE — RACK ═════════ -->
+  <rect x="560" y="100" width="640" height="680" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="576" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE — RACK</text>
+
+  <!-- switch -->
+  <rect x="600" y="150" width="560" height="150" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
+  <text x="616" y="176" font-size="14.5" font-weight="700" fill="#0b0b0b">TL-SG105PE · 5-port gigabit PoE switch</text>
+  <text x="616" y="194" font-size="12" fill="#52514e">mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare)</text>
+  <g font-size="11.5" text-anchor="middle">
+    <rect x="616" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="664" y="227" font-weight="700" fill="#0b0b0b">P1</text>
+    <text x="664" y="242" fill="#52514e">← apartment</text>
+    <rect x="722" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="770" y="227" font-weight="700" fill="#0b0b0b">P2</text>
+    <text x="770" y="242" fill="#52514e">← 4G router</text>
+    <rect x="828" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="876" y="227" font-weight="700" fill="#0b0b0b">P3</text>
+    <text x="876" y="242" fill="#52514e">← UPS mgmt</text>
+    <rect x="934" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984" stroke-dasharray="4,3"/>
+    <text x="982" y="227" font-weight="700" fill="#0b0b0b">P4 ⚡PoE</text>
+    <text x="982" y="242" fill="#52514e">← camera</text>
+    <rect x="1040" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="1088" y="227" font-weight="700" fill="#0b0b0b">P5</text>
+    <text x="1088" y="242" fill="#52514e">← R730 eno1</text>
+  </g>
+  <text x="616" y="284" font-size="12" fill="#52514e">every cable below re-plugs old-switch → PE on camera day (≈3 min)</text>
+
+  <!-- 4G router -->
+  <rect x="600" y="360" width="250" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="616" y="384" font-size="14" font-weight="700" fill="#0b0b0b">4G router · 192.168.1.7</text>
+  <text x="616" y="403" font-size="12" fill="#52514e">~cellular uplink (out-of-band)</text>
+  <path d="M770,300 L770,360" fill="none" stroke="#2a78d6" stroke-width="2"/>
+  <path d="M856,392 C866,386 864,380 874,376 C866,370 868,364 876,360" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
+  <text x="884" y="380" font-size="12" fill="#52514e">📡 cellular</text>
+
+  <!-- UPS -->
+  <rect x="600" y="452" width="250" height="56" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="616" y="476" font-size="14" font-weight="700" fill="#0b0b0b">UPS (Huawei)</text>
+  <text x="616" y="494" font-size="12" fill="#52514e">network mgmt card</text>
+  <path d="M876,300 C876,340 800,410 720,452" fill="none" stroke="#2a78d6" stroke-width="2"/>
+
+  <!-- R730 -->
+  <rect x="600" y="540" width="560" height="220" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
+  <text x="616" y="566" font-size="14.5" font-weight="700" fill="#0b0b0b">Dell R730 · PVE host · 192.168.1.127</text>
+  <g font-size="11.5">
+    <rect x="616" y="582" width="128" height="38" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="628" y="598" font-weight="700" fill="#0b0b0b">eno1 · LAN1</text>
+    <text x="628" y="613" fill="#52514e">← switch P5 · 1GbE</text>
+    <rect x="756" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
+    <text x="768" y="598" font-weight="700" fill="#52514e">eno2 · LAN2</text>
+    <text x="768" y="613" fill="#8a8984">dark · fallback leg</text>
+    <rect x="896" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
+    <text x="908" y="598" fill="#8a8984">eno3 / eno4</text>
+    <text x="908" y="613" fill="#8a8984">free, uncabled</text>
+    <rect x="1036" y="582" width="108" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
+    <text x="1048" y="598" fill="#8a8984">iDRAC · .4</text>
+    <text x="1048" y="613" fill="#8a8984">shared-LOM/eno1</text>
+  </g>
+  <text x="616" y="648" font-size="12" fill="#52514e">no other network cables — everything else on this host is VIRTUAL:</text>
+  <text x="616" y="668" font-size="12" fill="#52514e">pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM …</text>
+  <text x="616" y="696" font-size="12" fill="#8a8984">(power: host + switch fed from the UPS — power wiring not drawn)</text>
+
+  <path d="M1088,300 C1088,420 720,500 680,582" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
+  <text x="1100" y="330" font-size="12.5" font-weight="700" fill="#2a78d6">LAN1 cable</text>
+
+  <!-- ═════════ GARAGE ENTRANCE ═════════ -->
+  <rect x="1280" y="100" width="280" height="200" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="1296" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
+  <rect x="1304" y="150" width="232" height="110" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="1320" y="176" font-size="14" font-weight="700" fill="#0b0b0b">vermont-garage camera</text>
+  <text x="1320" y="196" font-size="12" fill="#52514e">HiLook IPC-T241H-C · 10.0.30.70</text>
+  <text x="1320" y="214" font-size="12" fill="#52514e">powered over the data cable (PoE)</text>
+  <text x="1320" y="232" font-size="12" fill="#52514e">outdoor · armored conduit</text>
+
+  <path d="M982,210 C982,150 1140,140 1304,180" fill="none" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
+  <text x="1080" y="136" font-size="12.5" font-weight="700" fill="#52514e">single cat6 in conduit · data + PoE power (camera day)</text>
+
+  <!-- legend -->
+  <g transform="translate(40,780)" font-size="12.5">
+    <line x1="0" y1="-4" x2="44" y2="-4" stroke="#2a78d6" stroke-width="2.5"/>
+    <text x="52" y="0" fill="#0b0b0b">copper, in place</text>
+    <line x1="190" y1="-4" x2="234" y2="-4" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
+    <text x="242" y="0" fill="#0b0b0b">camera-day cable / dark port</text>
+    <path d="M450,-4 C456,-10 454,-14 460,-18" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
+    <text x="470" y="0" fill="#0b0b0b">radio (wifi / cellular)</text>
+    <text x="650" y="0" fill="#52514e">total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3</text>
+  </g>
+</svg>
--- a/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md
+++ b/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md
@ -0,0 +1,99 @@
+# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable
+
+Status: accepted (2026-07-02, rev 3 — single-switch)
+
+![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg)
+
+![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg)
+
+The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook
+IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is
+physically exposed outside the apartment, so anything plugged into that cable
+must land in a segment that can reach nothing. The original design doc
+(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk
+to pfSense" — but nothing in this network terminates dot1q on pfSense; the
+site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean
+untagged pfSense interface per segment.
+
+**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old
+garage TL-SG105E (Viktor prefers not running two switches; retired unit
+becomes a cold spare, its 192.168.1.6 mgmt IP passes to the PE). Five ports,
+all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged
+VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1`
+carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable.
+pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site
+idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged
+vNIC; pfSense still terminates no dot1q itself). The earlier dedicated
+`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving
+net3 back to vmbr2 restores pure physical isolation in one `qm set`).
+This narrows the earlier 802.1Q objection rather than contradicting it: the
+rejection assumed *unmanaged* switches, where any LAN device could inject
+tagged frames; with the managed PE as the only device on eno1, VLAN-30
+membership is {camera port, trunk port} only, so tag-30 ingress from every
+other port — and from the exposed camera cable — is dropped or contained.
+Cameras are untrusted: default-deny on dCCTV with a single
+NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8)
+may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static
+route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the
+10.0.20.0/22 trusted source-IP allowlist.
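+
+As a sketch of how small that fallback is, the two placements of `net3`
+differ only in the bridge and tag on one NIC line of VM 101 (the pfSense
+VM in the topology diagram). The MAC below is a placeholder and the
+`virtio` model is an assumption inferred from the `vtnet3` name, not
+copied from the live config:
+
+```
+# rev 3 (current): tagged onto the shared vlan-aware bridge
+net3: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,tag=30
+# rev 2 fallback: untagged on the dedicated eno2 bridge
+net3: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr2
+```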
+
+## Traffic on the trunk — how one cable carries two networks
+
+The LAN1 cable is shared, but the two networks on it diverge at `vmbr0`
+(the vlan-aware bridge on the PVE host), and only ONE of them ever touches
+pfSense:
+
+- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it
+  between the trunk, the host's own IP (192.168.1.127) and pfSense `net0` —
+  where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home
+  LAN's gateway is and remains the AX6000; home-LAN traffic never transits
+  pfSense. Consequently a pfSense (or R730 VM-level) outage does not affect
+  the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave
+  the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the
+  4G router survives the whole rack being down.
+- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers
+  VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera
+  segment's gateway, firewall and sole exit. "Camera → AX6000 → internet"
+  is impossible by construction, not merely by firewall rule.
+- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed
+  out of its WAN toward the AX6000. Load-wise the trunk gained only the
+  camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic.
+
+![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg)
+
+*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)*
+
+## Considered options
+
+- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan
+  read this way) — rejected: any LAN device could inject tagged frames into
+  vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is
+  undefined. Rev 3 adopts the tagged path ONLY because the managed PE now
+  polices VLAN-30 membership at the single entry point to eno1; no bridge
+  reconfiguration was needed (vmbr0 was already vlan-aware).
+- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role**
+  (rev 1/2 as-built) — superseded by rev 3: it forced either a second switch
+  (6 connections vs 5 ports once the PE also replaced the old switch) or new
+  hardware. Strongest isolation of all options; kept dormant as the fallback.
+- **AX6000 as the camera gateway** — rejected earlier in the design (consumer
+  router, no inter-VLAN firewall).
+
+## Consequences
+
+- The switch is now single-point and load-bearing for everything in the rack
+  (apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV) AND its VLAN
+  table + mgmt password are part of the isolation boundary — the Easy Smart
+  mgmt UI answers on every port, so the password is the gate between a
+  compromised camera and the switch config. All 5 ports are consumed: the
+  next camera forces an 8-port PoE upgrade (the wiring plan already fits it).
+- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical
+  leg); eno3/eno4 remain free.
+- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6
+  (Kea reservation by MAC).
+- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a
+  port-VLAN split (conflated the two devices); rev 2 split into two switches
+  after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3
+  consolidated back to one switch — the PE replacing the SG105E — per
+  Viktor's preference, moving CCTV onto a managed tagged trunk.
+- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra
+  NVDEC stream.
--- a/docs/adr/0017-cctv-segment-topology.svg
+++ b/docs/adr/0017-cctv-segment-topology.svg
@ -0,0 +1,178 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="880" viewBox="0 0 1600 880" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
+  <!-- ADR-0017 rev 3 dCCTV topology (single switch, VLAN-30 trunk on LAN1).
+       Colors: reference dataviz palette (light mode). blue #2a78d6 = home LAN ·
+       violet #4a3aa7 = dCCTV · aqua #1baf7a = dKubernetes ·
+       yellow #eda100 = dManagementsVms · green #008300 allow · red #e34948 deny -->
+  <defs>
+    <marker id="arrGreen" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M0,0 L10,5 L0,10 z" fill="#008300"/>
+    </marker>
+    <marker id="arrRed" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M0,0 L10,5 L0,10 z" fill="#e34948"/>
+    </marker>
+    <marker id="arrGray" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
+      <path d="M0,0 L10,5 L0,10 z" fill="#52514e"/>
+    </marker>
+  </defs>
+
+  <rect width="1600" height="880" fill="#fcfcfb"/>
+
+  <text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable</text>
+  <text x="40" y="66" font-size="15" fill="#52514e">Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1</text>
+
+  <!-- camera -> everything else (denied) -->
+  <path d="M240,168 C520,104 900,104 1148,140" fill="none" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
+  <g transform="translate(560,111)">
+    <circle r="11" fill="#fcfcfb" stroke="#e34948" stroke-width="2.5"/>
+    <path d="M-5,-5 L5,5 M5,-5 L-5,5" stroke="#e34948" stroke-width="2.5"/>
+  </g>
+  <text x="588" y="100" font-size="13.5" font-weight="700" fill="#e34948">DENY · camera → LAN / other segments / internet (default deny on dCCTV)</text>
+
+  <!-- GARAGE ENTRANCE -->
+  <rect x="40" y="128" width="240" height="180" rx="10" fill="#4a3aa7" fill-opacity="0.06" stroke="#4a3aa7" stroke-opacity="0.35"/>
+  <text x="56" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
+  <rect x="64" y="170" width="192" height="112" rx="8" fill="#ffffff" stroke="#4a3aa7" stroke-width="2"/>
+  <text x="80" y="196" font-size="15" font-weight="700" fill="#0b0b0b">vermont-garage</text>
+  <text x="80" y="216" font-size="12.5" fill="#52514e">HiLook IPC-T241H-C · pure IR</text>
+  <text x="80" y="234" font-size="12.5" fill="#52514e">10.0.30.70 (Kea reservation)</text>
+  <text x="80" y="252" font-size="12.5" fill="#52514e">DNS: garage-cam.viktorbarzin.lan</text>
+  <text x="80" y="270" font-size="12.5" fill="#52514e">PoE from switch · cloud/P2P off</text>
+
+  <path d="M256,284 C330,330 412,368 417,430" fill="none" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5" marker-end="url(#arrGray)"/>
+  <text x="330" y="322" font-size="12" fill="#52514e">cat6 in conduit · PoE → P4</text>
+
+  <!-- RACK zone: single switch -->
+  <rect x="40" y="360" width="560" height="265" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="56" y="384" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">RACK — GARAGE · ONE SWITCH</text>
+
+  <rect x="64" y="396" width="512" height="176" rx="8" fill="#4a3aa7" fill-opacity="0.04" stroke="#4a3aa7" stroke-width="2"/>
+  <text x="80" y="420" font-size="15" font-weight="700" fill="#0b0b0b">TL-SG105PE <tspan font-size="12.5" font-weight="400" fill="#52514e">replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used</tspan></text>
+  <g font-size="11.5" text-anchor="middle">
+    <rect x="80"  y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="124" y="454" font-weight="700" fill="#0b0b0b">P1 · V1</text>
+    <text x="124" y="470" fill="#52514e">apartment</text>
+    <text x="124" y="484" fill="#52514e">uplink</text>
+    <rect x="178" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="222" y="454" font-weight="700" fill="#0b0b0b">P2 · V1</text>
+    <text x="222" y="470" fill="#52514e">4G router</text>
+    <text x="222" y="484" fill="#52514e">192.168.1.7</text>
+    <rect x="276" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="320" y="454" font-weight="700" fill="#0b0b0b">P3 · V1</text>
+    <text x="320" y="470" fill="#52514e">UPS mgmt</text>
+    <rect x="374" y="436" width="88" height="56" rx="6" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
+    <text x="418" y="454" font-weight="700" fill="#0b0b0b">P4 · V30</text>
+    <text x="418" y="470" fill="#52514e">camera</text>
+    <text x="418" y="484" fill="#52514e">PoE ON</text>
+    <rect x="472" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.10" stroke="#4a3aa7" stroke-width="2" stroke-dasharray="0"/>
+    <text x="516" y="454" font-weight="700" fill="#0b0b0b">P5 · trunk</text>
+    <text x="516" y="470" fill="#52514e">V1 untagged</text>
+    <text x="516" y="484" fill="#4a3aa7">+ V30 tagged</text>
+  </g>
+  <text x="80" y="516" font-size="12" fill="#52514e">802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged}</text>
+  <text x="80" y="534" font-size="12" fill="#52514e">tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path</text>
+  <text x="80" y="558" font-size="12" fill="#8a8984">old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports</text>
+
+  <!-- trunk: two parallel lines to eno1 -->
+  <path d="M560,458 C630,458 640,428 692,420" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
+  <path d="M560,466 C632,466 644,436 692,428" fill="none" stroke="#4a3aa7" stroke-width="2.5"/>
+  <text x="588" y="404" font-size="12" font-weight="700" fill="#0b0b0b">LAN1 cable</text>
+
+  <!-- R730 / PVE zone -->
+  <rect x="680" y="330" width="880" height="440" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="696" y="356" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK)</text>
+
+  <g font-size="12">
+    <rect x="700" y="400" width="150" height="46" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
+    <text x="712" y="419" font-weight="700" fill="#0b0b0b">eno1 → vmbr0</text>
+    <text x="712" y="436" fill="#52514e">untag V1 + tag 30</text>
+
+    <rect x="700" y="471" width="150" height="46" rx="6" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
+    <text x="712" y="490" font-weight="700" fill="#52514e">eno2 → vmbr2</text>
+    <text x="712" y="507" fill="#8a8984">dormant fallback leg</text>
+
+    <rect x="700" y="542" width="150" height="46" rx="6" fill="#0b0b0b" fill-opacity="0.04" stroke="#8a8984"/>
+    <text x="712" y="561" font-weight="700" fill="#0b0b0b">vmbr1</text>
+    <text x="712" y="578" fill="#52514e">internal · tags 10/20</text>
+  </g>
+
+  <!-- pfSense VM -->
+  <rect x="890" y="388" width="300" height="230" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="906" y="414" font-size="15" font-weight="700" fill="#0b0b0b">pfSense (VM 101)</text>
+  <text x="906" y="432" font-size="12" fill="#52514e">gateway + firewall for every segment</text>
+  <g font-size="12">
+    <rect x="906" y="444" width="268" height="34" rx="5" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="916" y="465" fill="#0b0b0b">net0 · WAN <tspan fill="#52514e">192.168.1.2 · vmbr0 untagged</tspan></text>
+    <rect x="906" y="484" width="268" height="34" rx="5" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
+    <text x="916" y="505" fill="#0b0b0b">net1 · dManagementsVms <tspan fill="#52514e">10.0.10.1</tspan></text>
+    <rect x="906" y="524" width="268" height="34" rx="5" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
+    <text x="916" y="545" fill="#0b0b0b">net2 · dKubernetes <tspan fill="#52514e">10.0.20.1</tspan></text>
+    <rect x="906" y="564" width="268" height="34" rx="5" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
+    <text x="916" y="585" fill="#0b0b0b">net3 · dCCTV <tspan fill="#52514e">10.0.30.1/24 · vmbr0 tag 30</tspan></text>
+  </g>
+  <path d="M850,415 L890,458" fill="none" stroke="#2a78d6" stroke-width="1.6" opacity="0.6"/>
+  <path d="M850,430 L890,581" fill="none" stroke="#4a3aa7" stroke-width="2"/>
+  <path d="M850,565 L890,501" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
+  <path d="M850,565 L890,541" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
+
+  <!-- k8s VMs -->
+  <rect x="1240" y="388" width="290" height="230" rx="8" fill="#1baf7a" fill-opacity="0.07" stroke="#1baf7a"/>
+  <text x="1256" y="414" font-size="15" font-weight="700" fill="#0b0b0b">k8s VMs · 10.0.20.0/24</text>
+  <text x="1256" y="434" font-size="12.5" fill="#52514e">vmbr1 tag 20 · pod egress SNATs</text>
+  <text x="1256" y="450" font-size="12.5" fill="#52514e">to node IPs</text>
+  <rect x="1256" y="464" width="258" height="66" rx="6" fill="#ffffff" stroke="#1baf7a"/>
+  <text x="1268" y="486" font-size="13.5" font-weight="700" fill="#0b0b0b">Frigate · k8s-node1 (T4)</text>
+  <text x="1268" y="504" font-size="12" fill="#52514e">detect sub / record main</text>
+  <text x="1268" y="520" font-size="12" fill="#52514e">gpumem budget 2300 MiB</text>
+  <rect x="1256" y="540" width="258" height="52" rx="6" fill="#ffffff" stroke="#1baf7a"/>
+  <text x="1268" y="562" font-size="13.5" font-weight="700" fill="#0b0b0b">go2rtc LB 10.0.20.204</text>
+  <text x="1268" y="580" font-size="12" fill="#52514e">restream → HA live view (MSE/HLS)</text>
+
+  <!-- HOME LAN zone -->
+  <rect x="1148" y="128" width="412" height="180" rx="10" fill="#2a78d6" fill-opacity="0.06" stroke="#2a78d6" stroke-opacity="0.4"/>
+  <text x="1164" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">HOME LAN 192.168.1.0/24</text>
+  <rect x="1164" y="168" width="180" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
+  <text x="1176" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">AX6000 · .1</text>
+  <text x="1176" y="208" font-size="11.5" fill="#52514e">+ route 10.0.30.0/24 → .2</text>
+  <rect x="1164" y="236" width="180" height="52" rx="6" fill="#ffffff" stroke="#2a78d6"/>
+  <text x="1176" y="258" font-size="13.5" font-weight="700" fill="#0b0b0b">ha-sofia · .8</text>
+  <text x="1176" y="275" font-size="11.5" fill="#52514e">Frigate card + hikvision_next</text>
+  <rect x="1360" y="168" width="184" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
+  <text x="1372" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">apartment clients</text>
+  <text x="1372" y="208" font-size="11.5" fill="#52514e">laptops, phones</text>
+  <rect x="1360" y="236" width="184" height="52" rx="6" fill="#ffffff" stroke="#52514e" stroke-dasharray="5,4"/>
+  <text x="1372" y="256" font-size="11.5" font-weight="700" fill="#52514e">CAMERA DAY: static route</text>
+  <text x="1372" y="272" font-size="11.5" fill="#52514e">10.0.30.0/24 via 192.168.1.2</text>
+
+  <path d="M1254,308 C1150,352 950,372 790,400" fill="none" stroke="#2a78d6" stroke-width="2" opacity="0.6"/>
+  <text x="1010" y="374" font-size="12" fill="#2a78d6">apartment uplink · switch P1 · trunk · eno1</text>
+
+  <!-- FLOWS -->
+  <path d="M1256,497 C1010,690 330,730 120,650 C40,618 40,380 96,286" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
+  <text x="620" y="700" font-size="13.5" font-weight="700" fill="#008300">ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all)</text>
+
+  <path d="M1164,262 C820,282 470,268 302,176 C286,167 278,166 270,172" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
+  <text x="484" y="216" font-size="13.5" font-weight="700" fill="#008300">ALLOW · ha-sofia → camera :80 ISAPI + :554</text>
+  <text x="484" y="234" font-size="12" fill="#52514e">enters pfSense WAN · reply-to off · needs the AX6000 route</text>
+
+  <path d="M280,232 C660,200 860,320 936,386" fill="none" stroke="#008300" stroke-width="2" opacity="0.85" marker-end="url(#arrGreen)"/>
+  <text x="740" y="322" font-size="12.5" font-weight="700" fill="#008300">ALLOW · camera → 10.0.30.1:123 (NTP)</text>
+
+  <!-- LEGEND -->
+  <g transform="translate(40,800)" font-size="12.5">
+    <rect x="0" y="0" width="18" height="18" rx="4" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="26" y="14" fill="#0b0b0b">home LAN / VLAN 1</text>
+    <rect x="200" y="0" width="18" height="18" rx="4" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
+    <text x="226" y="14" fill="#0b0b0b">CCTV / VLAN 30 / dCCTV 10.0.30.0/24</text>
+    <rect x="500" y="0" width="18" height="18" rx="4" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
+    <text x="526" y="14" fill="#0b0b0b">dKubernetes</text>
+    <rect x="640" y="0" width="18" height="18" rx="4" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
+    <text x="666" y="14" fill="#0b0b0b">dManagementsVms</text>
+    <line x1="820" y1="9" x2="860" y2="9" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
+    <text x="870" y="14" fill="#0b0b0b">allowed flow</text>
+    <line x1="980" y1="9" x2="1020" y2="9" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
+    <text x="1030" y="14" fill="#0b0b0b">denied</text>
+    <line x1="1100" y1="9" x2="1140" y2="9" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5"/>
+    <text x="1150" y="14" fill="#0b0b0b">camera-day step</text>
+    <text x="1320" y="14" fill="#52514e">ADR-0017 · rev 3</text>
+  </g>
+</svg>
--- a/docs/adr/0017-cctv-vlan-tagging.excalidraw
+++ b/docs/adr/0017-cctv-vlan-tagging.excalidraw
--- a/docs/adr/0017-cctv-vlan-tagging.svg
+++ b/docs/adr/0017-cctv-vlan-tagging.svg
--- a/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md
+++ b/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md
@ -0,0 +1,47 @@
+# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster
+
+Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she
+shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`)
+and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare
+Pages** under `<english-name>.viktorbarzin.me`, kept fresh by **one shared in-cluster
+CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes
+(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The
+existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync)
+migrates onto this and is retired.
+
+Why off-infra serving: these are her sites, shown to teachers/parents — they must survive
+homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster
+site down). With Pages, a homelab outage degrades to "content frozen until we're back",
+never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/
+Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA
+secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never
+wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The
+deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an
+accident.
+
+## Considered options
+
+- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no
+  Cloudflare Pages dependency — but her sites share the homelab's fate and each site
+  spends cluster resources to serve static files a free CDN serves better.
+- **Pages for new sites only**: less work now, two patterns and two runbooks forever.
+- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but
+  Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault.
+
+## Consequences
+
+- Registration is one entry in the `sites` map (name, Content folder, optional Entry
+  file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config
+  together. Names are English, picked by Viktor (Bulgarian "most" → `bridge` set the precedent).
+- The internal split-horizon zone learns Valia sites from a ConfigMap the
+  `technitium-ingress-dns-sync` script consumes — declaratively, including **removal**
+  (the previous static-CNAME approach was add-only; a retired site left a stale record).
+- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on
+  the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs
+  deployed.
+- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no
+  per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't
+  update" reports, consistent with the alert-noise-reduction posture. Revisit if a
+  silent stall actually bites.
+- If the homelab is down, content updates pause; the sites keep serving last-deployed
+  content. Accepted degradation.
--- a/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
+++ b/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
@ -0,0 +1,97 @@
+# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
+
+`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
+inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only
+outage protection — a documented "No Backup MX" decision made after ForwardEmail's
+forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
+Routing proved pass-through-only. Viktor now wants inbound mail to survive
+homelab outages **without loss** (2026-07-04): delayed delivery is fine,
+mid-outage reading is not required, and the budget is **$0** — a hard
+constraint that eliminated every managed option (see below).
+
+We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
+Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
+public IP, MX preference 20; primary untouched at 1). It accepts everything
+for the domain (catch-all — every RCPT is valid; reputation may only ever
+4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
+never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
+prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
+deliver a DSN, its only egress is the drain), and drains to the primary over
+**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
+frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
+tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
+mid-outage break-glass since headscale itself lives in the cluster); TLS via
+certbot HTTP-01 (port 80 permanently open — LE validation is
+multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
+`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
+also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
+On the primary, the drain stream (one /32) is enabled at the layers that
+actually bite — `check_client_access` permits past
+`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
+exception, and rspamd `external_relay` (score against the *original* sender
+IP) with the reject action capped to tag/fold so drained spam can never force
+the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
+reachability (recurring probe — Oracle publishes no commitment), drain
+end-to-end, and a live failover test that includes a high-spam-score and a
+>10 MB message. Two independent adversarial reviews (2026-07-04) shaped this
+final form. Design:
+[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
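+
+The queueing and drain behaviour above maps onto a handful of standard
+Postfix parameters. A minimal `main.cf` sketch for the relay, not the
+deployed config: the drain hostname is an assumption (the existing MX
+name), and postscreen, TLS and rate limits are left out:
+
+```
+# accept mail for the domain as a relay destination
+relay_domains = viktorbarzin.me
+# empty map: every RCPT in relay_domains is accepted (catch-all)
+relay_recipient_maps =
+# keep undeliverable mail queued for up to 30 days
+maximal_queue_lifetime = 30d
+# stop retrying undeliverable bounces after 1 day
+bounce_queue_lifetime = 1d
+# drain to the primary on 2526; brackets skip the MX lookup
+relay_transport = smtp:[mail.viktorbarzin.me]:2526
+```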
+
+## Considered options
+
+- **Roller Network free Secondary MX** — v1 of this decision, killed at the
+  validation gates the same day: free tier caps at 200 relayed messages or
+  10 MB per rolling 7 days, and overage suspends the domain for 48 h
+  answering **SMTP 5xx** (permanent bounces) — since spammers target backup
+  MXes even while the primary is up, background spam alone can hold it
+  suspended, making it *worse than no backup MX*. Free accounts are also
+  being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
+  the documented fallback if the OCI route sours.)
+- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
+  12–24 h, barely beating sender retry); filtering black-box; not free.
+- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
+  inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
+- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
+  blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
+  plan is a 6-month credit; Azure has no always-free VM and blocks 25;
+  Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
+  trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
+  is the only standing free option.
+- **Harden-only** (5xx-misconfig guards + paging) — does not address
+  multi-day outages or short-retry senders; deferred as a complementary
+  track.
+
+## Consequences
+
+- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
+  Terraform + cloud-init, patched by unattended-upgrades, scraped by the
+  cluster's Prometheus (exporters on the reserved public IP, allowlisted to
+  the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
+  scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
+  besides). Never a backup target itself.
+- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
+  free allowance in June 2026 and terminated over-limit instances, and
+  publishes no commitment that inbound 25 stays open. Mitigations:
+  **Pay-As-You-Go conversion is a required prerequisite** (exempts idle
+  reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
+  the queue being empty outside outages (a surprise reclamation loses
+  coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
+  once.
+- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
+  and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
+  the original IP via `external_relay`), and content scoring stay on — spam
+  arriving via the backup is tagged and folded to Junk, never bounced. The VM
+  is deliberately NOT in the primary's `mynetworks` (a compromised VM must
+  not relay through us).
+- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
+  VM. Stated and accepted (30 days is 6× the longest sender-retry window of
+  ~5 days, which is the only protection today).
+- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
+  off-premises; accepted (same class as Brevo holding outbound today).
+- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
+  host found dangling during design — inert today; must list `mx2` when
+  fixed) needs 1–2 more → schedule the next record purge proactively.
+- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
+  new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
+  `vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
+  failure semantics change (a "failing" probe may now mean "delayed via mx2,
+  drains shortly" — noted in alert description).
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@ -329,6 +329,12 @@ Two independent grants make up "browser access" for a user:
 the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
 a token by deleting its `<user>-browser-token` Secret).

+Because the SA is the user's DEFAULT kubectl credential, other per-namespace
+port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf`
+grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's
+agent can upload drawings via the port-forward + `X-Authentik-Username` recipe
+in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too.
+
 ## Limits + risks

 - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin.
 | Visibility | Packages | Pull mechanism |
 |------------|----------|----------------|
 | **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
-| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
+| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson |

 Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
 kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
@ -188,6 +188,8 @@ reconciled — the workflows were added to the GitHub lineage via PR):
 | android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
 | infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
 | infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
+| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) |
+| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) |

 **`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
 `drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
--- a/docs/architecture/dns.md
+++ b/docs/architecture/dns.md
@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons

 Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).

-**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
+**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).

 ## NodeLocal DNSCache

@ -368,6 +368,7 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
 | TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
 | TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
 | A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
+| CNAME (CF Pages) | 2 | `<project>.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` |

 ### Proxied vs Non-Proxied

@ -513,6 +514,7 @@ For external `.viktorbarzin.me` records:
 1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
 2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
 3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
+4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)

 ## Incident History

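A minimal sketch, not part of this commit, of the suffix-scoped delete rule described for the Valia-site CNAMEs in the dns.md hunk above. The function name and argument layout are hypothetical; the real `technitium-ingress-dns-sync` script is not shown in this diff and may differ (for example in how it treats an empty map or a trailing dot on the target).

```shell
# Hypothetical helper, for illustration only.
# is_stale_pages_cname <record-name> <cname-target> <desired-name>...
# Exit 0 only when the target ends in .pages.dev AND the record name is no
# longer among the desired names; every other record is left alone.
is_stale_pages_cname() {
  local name=$1 target=$2 want
  shift 2
  case "$target" in
    *.pages.dev) ;;      # in scope for deletion
    *) return 1 ;;       # never touch records pointing elsewhere
  esac
  for want in "$@"; do
    [ "$want" = "$name" ] && return 1   # still in the map: keep it
  done
  return 0
}
```

With the two sites named above as the desired set, `is_stale_pages_cname oldsite oldsite.pages.dev bridge stem95su` succeeds (stale), while the same call for `bridge`, or for any record whose target is not under `pages.dev`, fails and the record is kept. Note that this sketch treats an empty desired set as "everything under `pages.dev` is stale"; a production script would probably want a guard against that.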
--- a/docs/architecture/mailserver.md
+++ b/docs/architecture/mailserver.md
@ -161,6 +161,17 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail
  DB: MySQL (mysql.dbaas.svc.cluster.local)
 ```

+### Paperless ingest mailbox (docs@)
+
+`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in
+`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that
+paperless-ngx polls over IMAP; family members forward document emails to it
+and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve
+(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap,
+mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`)
+discards mail from non-allowlisted senders at delivery. Full flow, sender map,
+and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md).
+
 ## DNS Records

 All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`.
@ -300,6 +311,21 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External

 ## Troubleshooting

+### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin)
+
+Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`:
+`postfix/cleanup: warning: tcp:localhost:10001 lookup error` +
+`sender_canonical_maps map lookup problem ... message not accepted, try again later`.
+Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`)
+came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it
+`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. Postfix then
+tempfails every message (inbound AND submission); senders retry so nothing is
+lost, and the roundtrip probe alerts within the hour.
+Fix: `supervisorctl restart postsrsd` inside the container; if the fresh
+process spins again (it did once), `kubectl -n mailserver delete pod` for a
+full re-init — that healed it. Root cause not pinned down (one-off bad init;
+postsrsd 1.10).
+
 ### Inbound mail not arriving
 1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
 2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside
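A minimal sketch, not part of this commit, of the check behind the postsrsd troubleshooting entry added above: supervisor can report `RUNNING` while nothing listens on 10001/10002. The helper name is hypothetical; it only parses `ss -ltn` output passed on stdin.

```shell
# Hypothetical helper, for illustration only.
# Reads `ss -ltn` output on stdin. Exit 0 if both postsrsd ports named in the
# runbook entry (10001 and 10002) appear as listening local ports, 1 otherwise.
postsrsd_ports_bound() {
  local listing port
  listing=$(cat)
  for port in 10001 10002; do
    # Match ":<port>" followed by whitespace or end of line, so that a
    # longer port number sharing the prefix does not count.
    printf '%s\n' "$listing" | grep -Eq ":${port}([[:space:]]|\$)" || return 1
  done
  return 0
}
```

Inside the container this could be used as `ss -ltn | postsrsd_ports_bound || supervisorctl restart postsrsd`, which is the first fix step described above; the pod delete remains the fallback if the fresh process spins again.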
--- a/docs/architecture/networking.md
+++ b/docs/architecture/networking.md
@ -1,10 +1,10 @@
 # Networking Architecture

-Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed)
+Last updated: 2026-07-02 (dCCTV segment added — dedicated pfSense leg for the garage camera, ADR-0017)

 ## Overview

-The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
+The homelab network is built on three isolated segments behind pfSense (management VLAN 10, Kubernetes VLAN 20, and the dCCTV camera segment on VLAN 30; see ADR-0017) with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.

 ## Architecture Diagram

@ -24,9 +24,14 @@ graph TB

    CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik]

-    subgraph "Proxmox Host (eno1)"
+    subgraph "Proxmox Host (eno1, eno2)"
        vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
        vmbr1[vmbr1 Internal<br/>VLAN-aware]
+        vmbr2[vmbr2 Bridge<br/>eno2 → TL-SG105PE]
+
+        subgraph "dCCTV - 10.0.30.0/24<br/>ADR-0017"
+            Camera[vermont-garage<br/>10.0.30.70]
+        end

        subgraph "VLAN 10 - Management<br/>10.0.10.0/24"
            Proxmox[Proxmox Host<br/>10.0.10.1]
@ -71,6 +76,9 @@ graph TB
    vmbr1 -.VLAN 20.- Tech
    vmbr1 -.VLAN 20.- Master
    vmbr1 -.VLAN 20.- Node1
+    vmbr2 -.physical link.- eno2
+    vmbr2 -.untagged.- Camera
+    vmbr2 -.pfSense net3 = dCCTV 10.0.30.1.- pfSense
 ```

 ## Components
@ -81,6 +89,7 @@ graph TB
 | phpIPAM | v1.7.0 | phpipam.viktorbarzin.me | IP address management, device inventory, DNS sync |
 | vmbr0 | Linux bridge | 192.168.1.127/24 | Physical bridge on eno1, uplink to LAN |
 | vmbr1 | Linux bridge (VLAN-aware) | Internal | VLAN trunk for VM isolation |
+| vmbr2 | Linux bridge | Physical (eno2) | DORMANT fallback leg for dCCTV (ADR-0017 rev 3) — live dCCTV rides vmbr0 tag 30 over the LAN1 trunk |
 | Technitium DNS | Container | 10.0.20.201 (LB) / 10.96.0.53 (ClusterIP) | Internal DNS (viktorbarzin.lan) + full recursive resolver |
 | Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
 | Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
@ -90,6 +99,22 @@ graph TB
 | MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
 | Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |

+## CCTV Segment (dCCTV) — as-built 2026-07-02
+
+Isolated camera segment for owned cameras at the Sofia site (first: `vermont-garage`, HiLook IPC-T241H-C at the garage entrance). Decision + rejected alternatives: `docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md`.
+
+**Physical path (rev 3, single switch)**: camera → TL-SG105PE PoE port (untagged VLAN 30) → trunk port (home LAN untagged + CCTV **tagged 30**) → the existing LAN1 cable → R730 `eno1` → `vmbr0` (vlan-aware) → pfSense `net3`/vtnet3 = `vmbr0 tag=30` = interface **dCCTV `10.0.30.1/24`**. The TL-SG105PE **replaces** the old garage TL-SG105E (retired to cold spare) and carries everything: apartment uplink, 4G router `192.168.1.7`, UPS mgmt (VLAN 1), camera (VLAN 30), trunk — all 5 ports used. VLAN-30 membership is {camera port, trunk port} only, so tagged injection from other ports is dropped. `eno2`/`vmbr2` remain dormant as the fallback physical leg (rev 2).
+
+**Addressing**: Kea DHCP pool `10.0.30.100-199`; devices get MAC reservations (camera `10.0.30.70`; the PE switch mgmt inherits the retired switch's `192.168.1.6` on the home LAN). Kea DDNS auto-registers names in Technitium; `phpipam-pfsense-import` picks up leases hourly.
+
+**Firewall** (all on pfSense):
+- dCCTV in: pass `udp OPT4-net → 10.0.30.1:123` (NTP) — everything else hits the interface's default deny. Cameras cannot reach LAN, other segments, or the internet.
+- WAN in (home LAN side): pass `192.168.1.8` (ha-sofia) → `10.0.30.70:80` (ISAPI/hikvision_next) and `:554` (RTSP), reply-to disabled on both.
+- dKubernetes is allow-all, so cluster Frigate/go2rtc pulls RTSP with no extra rule (pod egress SNATs to node IPs).
+- Home-LAN clients need the **AX6000 static route** `10.0.30.0/24 via 192.168.1.2` (camera-day step) to reach the camera UI.
+
+**Consumers**: cluster Frigate (`/srv/nfs/frigate/config/config.yml` — NOT Terraform) pulls `rtsp://10.0.30.70:554` main+sub as `vermont-garage`; HA integrates via Frigate plus direct hikvision_next for tamper events.
+
 ## IPAM & DNS Auto-Registration

 Devices are automatically discovered, named, and registered in DNS without manual intervention.
@ -207,6 +232,8 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up
  - blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, travel, netbox
 - **Non-proxied domains** (grey cloud, direct IP resolution):
  - mail, wg, headscale, immich, calibre, vaultwarden, and other services requiring direct connections
+- **Internal-IP domains** (grey cloud, A → `10.0.20.203` Traefik LB, `ingress_factory` `dns_type = "internal"`):
+  - highlights-immich, highlights-immich-emo — publicly *resolvable* but only *routable* from home LANs / WG sites / VPN (spokes policy-route `10.0.0.0/8` down the tunnel, so kiosk devices with baked-in URLs need no per-site DNS overrides). The record is reachability, not a gate — enforcement is the `home-lans-only` Traefik ipAllowList (Sofia/London/Valchedrym LANs + 10/8) on the ingress. See `docs/plans/2026-07-04-immich-frame-lan-only-design.md`.
 - CNAME records for proxied domains point to Cloudflared tunnel FQDNs

 ### Ingress Flow
@ -261,7 +288,7 @@ Traefik chain:

 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
-3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
+3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients), tripit (`tripit-rate-limit`, 100/1000, photo-tab thumbnail bursts), health (`health-rate-limit`, 100/1000, SPA shell + API burst per page), and dawarich (`dawarich-rate-limit`, 100/1000 — the Rails app self-serves all fingerprinted assets and the map adds an API burst per load; the default burst 429'd the asset tail and risked dropping OwnTracks/mobile location POSTs on the same host).
 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).

 Additional middleware:
@ -552,7 +579,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che

 **Diagnosis**: Check Traefik middleware config for the affected IngressRoute.

-**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).
+**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, and tripit/health/authentik/dawarich each 100/1000 (SPA or asset-heavy page loads bursting past the default from one client IP).

 ### Large Downloads or Uploads Truncate / Fail Partway

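A minimal sketch, not part of this commit, of the arithmetic behind the "default burst 429'd the tail" remarks in the networking.md hunks above. It assumes a token bucket in which `burst` is the bucket size and treats the page load as instantaneous, so tokens refilled at the `average` rate during the burst are ignored. The function name is hypothetical.

```shell
# Hypothetical helper, for illustration only.
# rejected_in_burst <simultaneous-requests> <burst>
# Prints how many requests from one client IP would receive 429 when they all
# arrive at once against a full bucket of <burst> tokens.
rejected_in_burst() {
  local requests=$1 burst=$2
  if [ "$requests" -gt "$burst" ]; then
    echo $((requests - burst))
  else
    echo 0
  fi
}
```

Under these assumptions, `rejected_in_burst 70 50` prints `20` (a cold load of about 70 parallel requests against the shared default), while `rejected_in_burst 70 300` and `rejected_in_burst 70 1000` both print `0`, which is why the dedicated middlewares clear the symptom.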
--- a/docs/plans/2026-07-03-vault-token-self-heal-design.md
+++ b/docs/plans/2026-07-03-vault-token-self-heal-design.md
@ -0,0 +1,103 @@
+# Vault Token Renewer Self-Heal Design
+
+**Date**: 2026-07-03
+**Status**: Approved (brainstorm complete; implementation pending)
+**Owner**: wizard@devvm
+**Supersedes**: the "version-only, no self-heal" scope choice recorded in
+`docs/runbooks/vault-token-renew-devvm.md` (2026-06-07)
+
+## Problem
+
+`wizard@devvm` holds a maintenance-free periodic Vault token
+(`token-devvm-wizard`, `period=768h`, renewed daily by the
+`vault-token-renew` user timer) precisely so no weekly re-login is needed.
+But `~/.vault-token` is the Vault CLI's default token sink, so any
+`vault login -method=oidc` — which the infra docs themselves instruct before
+applies — overwrites it with a 7-day OIDC token. The renewer's drift guard
+(deliberately detect-only) then refuses to renew the foreign token and fails
+the unit daily, into a log nobody watches.
+
+Observed consequence: a self-perpetuating weekly-expiry loop. The OIDC token
+expires after 7 days → Vault 403s → the natural response is another
+`vault login -method=oidc` → clobbers again. Drift persisted unnoticed
+2026-06-18 → 06-26 and 2026-06-29 → 07-03 (memory #7121); Viktor experienced
+it as "the token expires maybe once a week".
+
+**Goal**: `vault login -method=oidc` becomes harmless on devvm. The renewer
+converts any admin-capable clobber back into the permanent periodic token,
+unattended. (Chosen over "never log in" doc-fixes and over instant path-unit
+healing — see Alternatives.)
+
+## Decisions
+
+| # | Decision | Notes |
+|---|----------|-------|
+| 1 | Heal in the existing renewer's drift branch, at its nightly run | ~20-line diff to an already-tested script; no new units. The clobbering 7-day OIDC token stays in the file only until the next nightly run (at most 24h ≪ 7d TTL), which is harmless |
+| 2 | Heal = *attempt* re-mint using the foreign token itself; let Vault's 403 decide | No policy-list guessing — identity-vs-token-policies burned us before (memory #4211). OIDC tokens carry `vault-admin` via `identity_policies`, so the create succeeds |
+| 3 | Weak foreign token (create denied) → keep today's loud DRIFT failure | A read-only clobber (e.g. the 2026-06-05 `kubernetes-woodpecker-default` incident) signals a misbehaving agent flow; auto-papering over it would hide the offender. Log gains a "heal denied — investigate what wrote it" suffix |
+| 4 | Do NOT revoke the clobbering OIDC token | It may still back the user's live login session; it ages out in 7 days on its own |
+| 5 | After a successful heal, revoke stale `token-devvm-wizard` accessors | Anti-sprawl: each heal would otherwise strand the previous periodic **admin** token server-side for up to 32 days. Walk `auth/token/accessors`, revoke every `display_name=token-devvm-wizard` except the just-minted one. Runs only on heal (rare), never on the happy path |
+| 6 | Minted-token sanity check before writing the file | Look up the new token; require `display_name=token-devvm-wizard`. Write via temp file + `mv` + `chmod 600` so a failed mint can never truncate `~/.vault-token` |
+| 7 | Keep timer cadence (daily) and all happy-path behavior unchanged | |
+| 8 | No notification plumbing in this change | devvm alerting is tracked separately (beads `code-aslh`). Heal events are logged; heal-denied/FAIL still fail the unit |
+
+## Behavior matrix
+
+| Token found in `~/.vault-token` | Before | After |
+|---|---|---|
+| Our periodic token | renew-self, log `OK` | unchanged |
+| Foreign, admin-capable (OIDC login) | log `DRIFT`, exit 1 | re-mint periodic token with it, sanity-check, atomic write, revoke stale periodic accessors, log `HEALED: re-minted from foreign dn=<dn> (revoked N stale)`, exit 0 |
+| Foreign, weak (read-only k8s clobber) | log `DRIFT`, exit 1 | log `DRIFT … heal denied — foreign token lacks create authority; investigate what wrote it`, exit 1 |
+| Vault unreachable / lookup fails | log `FAIL`, exit 1 | unchanged |
+
+Re-mint command (identical to the manual recovery the DRIFT log already
+prescribes):
+
+```bash
+vault token create -orphan -period=768h \
+  -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard
+```
+
+## Testing
+
+- **Unit** (`scripts/test-vault-token-renew.sh`, existing source-the-functions
+  harness): new pure functions for (a) the stale-accessor revoke filter
+  (match on `display_name`, exclude the current accessor) and (b) the
+  minted-token sanity predicate; regression cases for the existing drift
+  predicate stay green.
+- **Live, post-deploy** (on devvm):
+  1. Mint a fake 1h admin token (`-display-name=fake-oidc`,
+     `-policy=vault-admin -policy=sops-admin`), write to `~/.vault-token`,
+     start the service → expect `HEALED`, and the file now holds a token with `display_name=token-devvm-wizard`.
+  2. Mint a fake 10m no-privilege token (`-policy=default`), write it, start
+     the service → expect `DRIFT … heal denied`, unit `failed`; restore real
+     token.
+  3. Revoke both fakes; one-off sweep of stale periodic accessors left by the
+     June 26 / July 3 manual re-mints.
+
+## Docs & rollout
+
+- Same commit rewrites the runbook's "Drift guard & recovery" section:
+  self-heal is the recovery for admin-capable clobbers; manual re-mint remains
+  only for weak clobbers (or a dead token with no admin-capable replacement in
+  the file).
+- `vault login -method=oidc` instructions across the docs stay as-is — the
+  login is now harmless by design.
+- Deploy per the runbook's manual model: `install -m 0755` to
+  `~/.local/bin/vault-token-renew`. Units unchanged — no daemon-reload.
+- After landing: update memories #4204/#4211 (gotcha now self-healing).
+
+## Alternatives considered
+
+- **Instant heal** (systemd path unit + protected source-copy of the token):
+  strictly more capable (seconds-latency, heals weak clobbers too, zero
+  re-minting), but 2 new units + a second secret file + inotify re-trigger
+  edge cases — machinery disproportionate to the residual risk. Revisit only
+  if the few-hour heal window ever bites.
+- **Vault CLI `token_helper` interception**: right interception point in
+  theory, but a helper bug breaks every `vault` CLI call, Terraform reads
+  `~/.vault-token` natively anyway, and it adds latency inside login. Rejected.
+- **Docs-only ("never log in")**: rejected by user — the login should keep
+  working, not become forbidden knowledge.
+- **Raise the OIDC role's 7-day `token_max_ttl`**: shared role, affects every
+  OIDC user; rejected previously for the same reason (memory #4205).
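A minimal sketch, not part of this commit, that restates the design's behavior matrix as a pure decision function. The real renewer learns these facts from Vault calls; here they are passed in as 0/1 arguments, and the function name and argument order are hypothetical.

```shell
# Hypothetical helper, for illustration only.
# heal_outcome <lookup_ok> <is_ours> <create_allowed>   (each 0 or 1)
# Prints the log tag from the behavior matrix and returns the exit status the
# unit would end with.
heal_outcome() {
  local lookup_ok=$1 is_ours=$2 create_allowed=$3
  if [ "$lookup_ok" -ne 1 ]; then echo FAIL; return 1; fi          # Vault unreachable / lookup fails
  if [ "$is_ours" -eq 1 ]; then echo OK; return 0; fi              # our periodic token: renew-self
  if [ "$create_allowed" -eq 1 ]; then echo HEALED; return 0; fi   # admin-capable clobber: re-mint
  echo DRIFT; return 1                                             # weak clobber: stay loud
}
```

For example, `heal_outcome 1 0 1` prints `HEALED` and returns 0, and `heal_outcome 1 0 0` prints `DRIFT` and returns 1. In the actual design `create_allowed` is not known in advance: the script attempts the mint and lets Vault's 403 decide.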
--- a/docs/plans/2026-07-03-vault-token-self-heal-plan.md
+++ b/docs/plans/2026-07-03-vault-token-self-heal-plan.md
@ -0,0 +1,443 @@
+# Vault Token Renewer Self-Heal Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Make `vault login -method=oidc` harmless on devvm — the nightly renewer re-mints the permanent periodic token from any admin-capable clobber of `~/.vault-token`, unattended.
+
+**Architecture:** Extend the drift branch of `scripts/vault-token-renew.sh` (deployed to `~/.local/bin/vault-token-renew`, driven by an existing systemd user timer). On drift, *attempt* the re-mint with the clobbering token itself and let Vault's 403 be the authority; sanity-check the minted token, replace the file atomically, then revoke stale `token-devvm-wizard` leftovers. Weak clobbers keep today's loud failure. Design: `docs/plans/2026-07-03-vault-token-self-heal-design.md`.
+
+**Tech Stack:** bash + jq + vault CLI; existing test harness `scripts/test-vault-token-renew.sh` (sources the script, `vtr_main` is guarded).
+
+**Working copy:** everything below runs in the worktree
+`~/code/infra/.worktrees/vault-token-self-heal` on branch `wizard/vault-token-self-heal`.
+Per repo policy, EVERY git command in this git-crypt repo worktree carries:
+`-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false`
+(abbreviated as `$GCFLAGS` below; define once per shell:
+`GCFLAGS="-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false"`
+and use it unquoted: `git $GCFLAGS <verb> …`).
+
+---
+
+### Task 1: Unit tests for the two new pure functions (RED)
+
+**Files:**
+- Modify: `scripts/test-vault-token-renew.sh` (append before the final `printf`/exit lines)
+
+- [ ] **Step 1: Append the failing tests**
+
+Insert this block immediately after the existing "parse + decide end-to-end" section (after the line `no "oidc: parse+decide refused" …`, before the final `printf '\n%d passed…'`):
+
+```bash
+# --- vtr_accessor: parse accessor out of lookup JSON ---
+LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}'
+eq "accessor parsed"          "acc-new" "$(vtr_accessor "$LOOKUP_NEW")"
+eq "accessor absent -> empty" ""        "$(vtr_accessor '{"data":{"display_name":"x"}}')"
+
+# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard
+# --- tokens are swept; the just-minted token, foreign tokens, and anything with an
+# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe).
+STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}'
+ok "older periodic token is stale"      vtr_is_stale_periodic "$STALE_OURS" "acc-new"
+no "the just-minted token is kept"      vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new"
+no "foreign oidc token never swept"     vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new"
+no "woodpecker token never swept"       vtr_is_stale_periodic "$LOOKUP_WP" "acc-new"
+no "missing accessor never swept"       vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new"
+no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" ""
+```
+
+(`LOOKUP_OIDC` / `LOOKUP_WP` and the `ok`/`no`/`eq` helpers already exist in the file.)
+
+- [ ] **Step 2: Run tests, verify they fail**
+
+Run: `bash scripts/test-vault-token-renew.sh`
+Expected: FAILs / `command not found` for `vtr_accessor` and `vtr_is_stale_periodic`; the 17 pre-existing tests stay green.
+
+### Task 2: Implement the pure functions (GREEN)
+
+**Files:**
+- Modify: `scripts/vault-token-renew.sh` (insert after `vtr_drift_ok()`, before `vtr_main()`)
+
+- [ ] **Step 1: Add the two functions**
+
+```bash
+# vtr_accessor <lookup-json> -> the token accessor (empty if absent).
+vtr_accessor() {
+  printf '%s' "$1" | jq -r '.data.accessor // ""'
+}
+
+# vtr_is_stale_periodic <lookup-json> <keep-accessor> -> 0 if this lookup
+# describes one of OUR periodic tokens (display name matches) that is NOT the
+# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise.
+# Name-only on purpose (no policy check): anything named token-devvm-wizard
+# that isn't the current token is garbage from a previous mint. An empty
+# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know
+# which token is current).
+vtr_is_stale_periodic() {
+  local dn acc
+  [ -n "${2:-}" ] || return 1
+  dn=$(vtr_display_name "$1")
+  acc=$(vtr_accessor "$1")
+  [ "$dn" = "$EXPECTED_DN" ] || return 1
+  [ -n "$acc" ] || return 1
+  [ "$acc" != "$2" ]
+}
+```
+
+- [ ] **Step 2: Run tests, verify all pass**
+
+Run: `bash scripts/test-vault-token-renew.sh`
+Expected: `25 passed, 0 failed`, exit 0.
+
+- [ ] **Step 3: Commit**
+
+```bash
+cd ~/code/infra/.worktrees/vault-token-self-heal
+git $GCFLAGS add scripts/vault-token-renew.sh scripts/test-vault-token-renew.sh
+git $GCFLAGS commit -m "vault-token-renew: pure helpers for the self-heal revoke filter
+
+vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic
+decides which old token-devvm-wizard tokens a heal may revoke (never the
+just-minted one, never foreign tokens, nothing when the keeper is unknown).
+TDD red-green for the heal branch that lands next."
+```
+
+### Task 3: The heal branch (`vtr_heal` + `vtr_main` wiring)
+
+**Files:**
+- Modify: `scripts/vault-token-renew.sh`
+
+- [ ] **Step 1: Add `vtr_heal` after `vtr_is_stale_periodic()`, before `vtr_main()`**
+
+```bash
+# vtr_heal <foreign-dn> <log-file> -> 0 if ~/.vault-token was re-minted back to
+# our periodic admin token using the foreign token's own authority, 1 if the
+# heal was denied or failed (caller exits non-zero; the unit goes failed).
+#
+# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md):
+# an OIDC login — which the infra docs prescribe before applies — clobbers
+# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed
+# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the
+# clobbering token itself and let Vault's authz decide — a read-only clobber
+# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud
+# failure, because it signals a misbehaving flow that someone should look at.
+vtr_heal() {
+  local foreign_dn="$1" log="$2"
+  local errf new_token new_info new_dn new_pols new_acc tmp
+  errf=$(mktemp)
+  if ! new_token=$(vault token create -orphan -period=768h \
+        -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
+        -field=token 2>"$errf") || [ -z "$new_token" ]; then
+    printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
+      "$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log"
+    rm -f "$errf"
+    return 1
+  fi
+  rm -f "$errf"
+
+  # Sanity: the minted token must itself pass the drift guard before it may
+  # replace ~/.vault-token.
+  if ! new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then
+    printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \
+      "$(date -Is)" "$new_info" >>"$log"
+    return 1
+  fi
+  new_dn=$(vtr_display_name "$new_info")
+  new_pols=$(vtr_policies_csv "$new_info")
+  if ! vtr_drift_ok "$new_dn" "$new_pols"; then
+    printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \
+      "$(date -Is)" "$new_dn" "$new_pols" >>"$log"
+    return 1
+  fi
+
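+  # Atomic replace: mktemp files are 0600 from birth; same-filesystem mv.
+  # A failed write must not fall through to the HEALED log line below.
+  if ! tmp=$(mktemp "$HOME/.vault-token.XXXXXX") \
+      || ! printf '%s' "$new_token" >"$tmp" \
+      || ! mv "$tmp" "$HOME/.vault-token"; then
+    printf '%s FAIL: heal minted a token but could not write ~/.vault-token\n' \
+      "$(date -Is)" >>"$log"
+    if [ -n "$tmp" ]; then rm -f "$tmp"; fi
+    return 1
+  fi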
+
+  # Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would
+  # otherwise strand the prior periodic ADMIN token server-side for up to 32d.
+  # The clobbering foreign token is deliberately NOT revoked: it may still back
+  # the user's live login session, and it ages out on its own (7d for OIDC).
+  local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0
+  new_acc=$(vtr_accessor "$new_info")
+  if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then
+    while IFS= read -r a; do
+      [ -n "$a" ] || continue
+      a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue
+      if vtr_is_stale_periodic "$a_info" "$new_acc"; then
+        VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1))
+      fi
+    done < <(printf '%s' "$accessors" | jq -r '.[]')
+    sweep="revoked $revoked stale periodic token(s)"
+  fi
+
+  printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \
+    "$(date -Is)" "$foreign_dn" "$sweep" >>"$log"
+}
+```
+
+- [ ] **Step 2: Rewire the drift branch in `vtr_main`**
+
+Replace this exact block (comment + if):
+
+```bash
+  # Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive.
+  # On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token
+  # with a read-only woodpecker token, and this script then silently renewed THAT
+  # for two days — masking the loss of write access. So before renewing, confirm
+  # the token is our periodic admin token; if it has drifted, fail loudly (systemd
+  # marks the unit failed) instead of keeping someone else's token alive.
+  if ! vtr_drift_ok "$dn" "$pols"; then
+    printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
+      "$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log"
+    exit 1
+  fi
+```
+
+with:
+
+```bash
+  # Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not
+  # keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was
+  # silently renewed for two days, masking lost write access). But detect-only
+  # drift proved worse in practice: an OIDC login — which the infra docs
+  # prescribe before applies — clobbers this file too, and the resulting DRIFT
+  # failures went unnoticed for weeks while access degraded to a 7-day token
+  # (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal):
+  # re-mint the periodic token with the clobbering token's own authority.
+  # Vault's authz keeps the old guarantee — a token that couldn't legitimately
+  # hold vault-admin is denied the mint, and we still fail loud.
+  if ! vtr_drift_ok "$dn" "$pols"; then
+    vtr_heal "$dn" "$log" || exit 1
+    exit 0
+  fi
+```
+
+- [ ] **Step 3: Syntax + lint + regression check**
+
+Run: `bash -n scripts/vault-token-renew.sh && bash scripts/test-vault-token-renew.sh; command -v shellcheck >/dev/null && shellcheck scripts/vault-token-renew.sh`
+Expected: syntax OK, `25 passed, 0 failed`; shellcheck (if installed) reports nothing new.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git $GCFLAGS add scripts/vault-token-renew.sh
+git $GCFLAGS commit -m "vault-token-renew: self-heal the periodic token on admin-capable clobber
+
+Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC
+login the docs prescribe kept clobbering ~/.vault-token with a 7-day token,
+and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry
+loop, twice in June). On drift the renewer now re-mints the periodic token
+with the clobbering token's own authority (Vault's 403 is the judge — no
+policy guessing), sanity-checks it, replaces the file atomically, and
+revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still
+fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md"
+```
+
+### Task 4: Docs — runbook + test-file header
+
+**Files:**
+- Modify: `docs/runbooks/vault-token-renew-devvm.md` (the `## Drift guard & recovery` section + the healthy-log-line note + `## Tests`)
+- Modify: `scripts/test-vault-token-renew.sh` (header comment only)
+
+- [ ] **Step 1: Replace the runbook's `## Drift guard & recovery` section with:**
+
+```markdown
+## Drift guard & self-heal
+
+`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
+overwrites it. Two confirmed clobber vectors:
+
+1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
+   can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs
+   prescribe this login before applies, so it recurs — it went unnoticed for
+   weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires
+   weekly".
+2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
+   writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
+   **cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days.
+
+Since 2026-07-03 the renewer **self-heals**
+(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token
+it attempts the re-mint **with the clobbering token's own authority** and lets
+Vault's authz decide:
+
+- **Admin-capable clobber (OIDC login)** → re-mints the periodic token,
+  sanity-checks it against the drift guard, atomically replaces
+  `~/.vault-token`, revokes stale `token-devvm-wizard` leftovers
+  (anti-sprawl), logs
+  `HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))`
+  and exits 0. The clobbering token is NOT revoked — it may still back a live
+  login session; it ages out on its own.
+- **Weak clobber (read-only k8s token)** → the mint is denied; logs
+  `DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it`
+  and exits non-zero (unit `failed`). Deliberately loud: this signals a
+  misbehaving agent flow — exactly the 2026-06-05 case.
+
+**Manual recovery** is only needed for the weak-clobber case (the DRIFT log
+line still contains the exact command) — run the
+[mint/re-mint](#mint--re-mint-the-token) block.
+```
+
+- [ ] **Step 2: In the runbook's `## Health check` section**, after the "A healthy log line looks like…" sentence, add:
+
+```markdown
+After an OIDC login you'll instead see, at the next nightly run:
+`<ts> HEALED: re-minted periodic token from foreign dn="oidc-…" (revoked N stale periodic token(s))` — that's the self-heal working as designed.
+```
+
+- [ ] **Step 3: In the runbook's `## Tests` section**, replace the first sentence with:
+
+```markdown
+`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision,
+the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
+case), and the self-heal's revoke filter (which stale periodic tokens a heal
+may sweep).
+```
+
+- [ ] **Step 4: Update the test file's header comment** (lines 2–7) to:
+
+```bash
+# Unit tests for the pure functions in vault-token-renew.sh.
+# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard
+# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign
+# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker
+# clobber be silently renewed for two days, and (b) the self-heal's revoke
+# filter — which stale token-devvm-wizard tokens a heal may sweep.
+# Run: bash infra/scripts/test-vault-token-renew.sh
+```
+
+- [ ] **Step 5: Run tests once more, then commit**
+
+Run: `bash scripts/test-vault-token-renew.sh`
+Expected: `25 passed, 0 failed`.
+
+```bash
+git $GCFLAGS add docs/runbooks/vault-token-renew-devvm.md scripts/test-vault-token-renew.sh
+git $GCFLAGS commit -m "vault-token-renew runbook: document the self-heal behavior
+
+Drift guard section rewritten: admin-capable clobbers now self-heal at the
+nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure;
+manual re-mint is only the weak-clobber recovery now."
+```
+
+### Task 5: Deploy + live verification (on devvm, as wizard)
+
+**Files:** none (host deploy + live checks)
+
+- [ ] **Step 1: Install from the worktree**
+
+```bash
+install -m 0755 ~/code/infra/.worktrees/vault-token-self-heal/scripts/vault-token-renew.sh ~/.local/bin/vault-token-renew
+```
+
+(Units unchanged — no `daemon-reload` needed.)
+
+- [ ] **Step 2: Live case 1 — admin-capable clobber heals**
+
+```bash
+export VAULT_ADDR=https://vault.viktorbarzin.me
+export XDG_RUNTIME_DIR=/run/user/$(id -u)
+FAKE_ADMIN=$(vault token create -ttl=1h -policy=vault-admin -policy=sops-admin -display-name=fake-oidc -field=token)
+printf '%s' "$FAKE_ADMIN" > ~/.vault-token
+systemctl --user start vault-token-renew.service; echo "exit=$?"
+tail -1 ~/.local/state/vault-token-renew.log
+vault token lookup | grep -E 'display_name|period'
+```
+
+Expected: `exit=0`; log line `HEALED: re-minted periodic token from foreign dn="token-fake-oidc" (revoked N stale periodic token(s))` with N ≥ 1 (the pre-clobber periodic token is itself swept as stale — by design — along with any strays from the June 26 / July 3 manual re-mints); lookup shows `display_name token-devvm-wizard`, `period 768h`. Note: `FAKE_ADMIN` is a child of the swept old token, so the cascade revokes it too — no cleanup needed.
+
+- [ ] **Step 3: Verify exactly ONE periodic token remains server-side**
+
+```bash
+for a in $(vault list -format=json auth/token/accessors | jq -r '.[]'); do
+  vault token lookup -format=json -accessor "$a" 2>/dev/null \
+    | jq -r 'select(.data.display_name=="token-devvm-wizard") | .data.accessor'
+done
+```
+
+Expected: exactly one line, matching `vault token lookup -format=json | jq -r .data.accessor`.
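+
+The "exactly one line" expectation can be turned into a pass/fail check. A
+minimal sketch, assuming a hypothetical helper name (`check_single_periodic`
+is not part of `vault-token-renew.sh`); it reads the loop's output on stdin:
+
+```bash
+# check_single_periodic <current-accessor>
+# Hypothetical helper: succeeds only if stdin holds exactly one non-empty
+# line and that line equals the accessor of the token now in ~/.vault-token.
+check_single_periodic() {
+  local want="$1" lines count
+  lines=$(cat)
+  count=$(printf '%s\n' "$lines" | grep -c .)
+  [ "$count" -eq 1 ] && [ "$lines" = "$want" ]
+}
+```
+
+Usage: pipe the `for` loop above into
+`check_single_periodic "$(vault token lookup -format=json | jq -r .data.accessor)"`
+and check `$?`.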
+
+- [ ] **Step 4: Live case 2 — weak clobber stays a loud failure**
+
+```bash
+GOOD=$(cat ~/.vault-token)
+FAKE_WEAK=$(vault token create -ttl=10m -policy=default -display-name=fake-weak -field=token)
+printf '%s' "$FAKE_WEAK" > ~/.vault-token
+systemctl --user start vault-token-renew.service; echo "exit=$?"
+systemctl --user is-failed vault-token-renew.service
+tail -1 ~/.local/state/vault-token-renew.log
+printf '%s' "$GOOD" > ~/.vault-token && chmod 600 ~/.vault-token
+vault token revoke "$FAKE_WEAK" >/dev/null
+```
+
+Expected: `exit=1` (start reports the oneshot failure), `is-failed` prints `failed`, log line `DRIFT: ~/.vault-token is dn="token-fake-weak" — heal denied, foreign token lacks create authority (… permission denied …); investigate what wrote it. Manual re-mint: …`.
+
+- [ ] **Step 5: Happy path still green**
+
+```bash
+systemctl --user start vault-token-renew.service; echo "exit=$?"
+tail -1 ~/.local/state/vault-token-renew.log
+```
+
+Expected: `exit=0`, log `OK renewed (dn=token-devvm-wizard ttl=2764800s)`.
+
+### Task 6: Land on master + cleanup
+
+- [ ] **Step 1: Merge latest master into the branch, re-verify, push**
+
+```bash
+cd ~/code/infra/.worktrees/vault-token-self-heal
+git $GCFLAGS fetch forgejo
+git $GCFLAGS merge forgejo/master
+bash scripts/test-vault-token-renew.sh
+git $GCFLAGS push forgejo HEAD:master
+```
+
+Expected: clean merge (or already up to date), `25 passed, 0 failed`, push accepted. Non-fast-forward → fetch, merge, push again.
+
+- [ ] **Step 2: Watch CI to completion**
+
+The push fires the infra Woodpecker `default.yml` (terragrunt apply for changed stacks). This change touches only `scripts/` + `docs/` → expect a fast success / no-op apply. Check (Forgejo-forge infra repo = Woodpecker repo id 82):
+
+```bash
+export VAULT_ADDR=https://vault.viktorbarzin.me
+vault kv get -format=json secret/ci/global | jq -r '.data.data | keys[]'   # find the woodpecker admin token key
+WP_TOKEN=$(vault kv get -field=<that-key> secret/ci/global)
+curl -s -H "Authorization: Bearer $WP_TOKEN" 'https://ci.viktorbarzin.me/api/repos/82/pipelines?perPage=1' | jq '.[0] | {number, status, commit: .commit[0:8]}'
+```
+
+Expected: the pipeline for the pushed commit reaches `status: "success"` (poll until terminal). If it fails, fix before proceeding.
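+
+"Poll until terminal" could look like the sketch below. The function name,
+the retry defaults, and the set of failure states are assumptions; check the
+status strings against what the Woodpecker API actually returns:
+
+```bash
+# poll_pipeline <fetch-cmd> [interval-seconds] [max-tries]
+# <fetch-cmd> is any command or function that prints the current status.
+# Returns 0 on success, 1 on a failed terminal state, 2 if the fetch itself
+# fails, 3 if no terminal state was seen within max-tries.
+poll_pipeline() {
+  local fetch="$1" interval="${2:-10}" tries="${3:-60}" status
+  while [ "$tries" -gt 0 ]; do
+    status=$("$fetch") || return 2
+    case "$status" in
+      success) return 0 ;;
+      failure|error|killed) return 1 ;;   # assumed failure states
+    esac
+    tries=$((tries - 1))
+    sleep "$interval"
+  done
+  return 3
+}
+```
+
+Usage: wrap the `curl` call from the block above, with `jq -r '.[0].status'`
+as its filter, in a function and pass that function's name as `<fetch-cmd>`.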
+
+- [ ] **Step 3: Remove worktree + branch, reconcile main checkout**
+
+```bash
+git -C ~/code/infra $GCFLAGS worktree remove .worktrees/vault-token-self-heal
+git -C ~/code/infra $GCFLAGS branch -d wizard/vault-token-self-heal
+git -C ~/code/infra status --porcelain   # expect clean before pulling
+git -C ~/code/infra $GCFLAGS pull --ff-only forgejo master
+```
+
+Expected: worktree gone, branch deleted (already merged), main checkout fast-forwards to the landed commit.
+
+### Task 7: Memory + wrap-up
+
+- [ ] **Step 1: Update the stale memories** (they say the drift guard is detect-only / recovery is manual):
+
+```bash
+homelab memory recall "vault periodic token renewer drift"   # confirm ids 4204, 4211, 7121 still say detect-only
+homelab memory update 4211 "<original gotcha content, amended: since 2026-07-03 the renewer SELF-HEALS admin-capable clobbers at its nightly run (re-mints the periodic token with the clobbering token's authority + revokes stale token-devvm-wizard leftovers; weak clobbers still fail loudly). An OIDC login on devvm is now harmless. Design: infra docs/plans/2026-07-03-vault-token-self-heal-design.md>"
+homelab memory update 7121 "<original content, amended: PLAYBOOK OBSOLETE for admin clobbers — self-heal shipped 2026-07-03; manual re-mint only needed for weak/read-only clobbers>"
+```
+
+(Fetch each memory's current text first and preserve it — amend, don't replace wholesale.)
+
+- [ ] **Step 2: End-of-task extraction** — dispatch the standard M.3 memory-mining subagent per `~/.claude/rules/execution.md`, then give the final summary.
+
+---
+
+## Plan self-review (done at write time)
+
+- **Spec coverage**: heal-on-admin-clobber (T3), loud-fail-on-weak (T3 + live T5.4), no-revoke-foreign (T3 comment + design decision 4), anti-sprawl sweep + fail-safe filter (T2/T3, live T5.3), minted-token sanity + atomic write (T3), unit tests (T1/T2), runbook (T4), deploy + live sim (T5), memory updates (T7). ✓
+- **Placeholders**: `<that-key>` in T6.2 is a deliberate discovery step (key name verified live from Vault, not invented). No other TBDs. ✓
+- **Name consistency**: `vtr_accessor`, `vtr_is_stale_periodic`, `vtr_heal`, `EXPECTED_DN` match across tasks; test count 17→25 consistent (8 new cases). ✓
--- a/docs/plans/2026-07-04-backup-mx-design.md
+++ b/docs/plans/2026-07-04-backup-mx-design.md
@ -0,0 +1,335 @@
+# Backup MX — self-hosted store-and-forward relay on Oracle Always-Free — design
+
+Date: 2026-07-04 (v3 — post-challenge; v2 Oracle pivot same day) · Status: design,
+pre-implementation · ADR: [0019](../adr/0019-backup-mx-self-hosted-oracle-relay.md)
+
+v3 incorporates two independent adversarial-challenge reviews (same day). Their
+material corrections are marked **[CH]** throughout — the largest: the v2 drain
+path would never have drained (primary-side smtpd rejects),
+monitoring-over-tailnet was fiction (no cluster→tailnet route exists), and the
+VM's bounce model was wrong (it can never deliver a DSN).
+
+## Goal
+
+Inbound mail for `viktorbarzin.me` must survive homelab outages without loss.
+Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is
+acceptable; budget is $0** (hard constraint — reaffirmed after the Rollernet
+gates failed). A store-and-forward backup MX queues mail while the homelab is
+down and re-delivers when it returns.
+
+Out of scope, explicitly:
+
+- Reading new mail *during* an outage.
+- Outbound mail during outages.
+- The "primary up but hard-bouncing 5xx" misconfig class — a backup MX is
+  never consulted when the primary answers. Separate hardening/alerting track.
+
+Known residual limit (state it plainly): an outage **longer than 30 days**
+loses the queued mail *silently* — the VM cannot emit a bounce to anyone
+(egress 25 blocked), so no sender ever learns. Accepted; 30 days is already
+6× the sender-retry status quo (senders typically give up after about 5 days).
+
+## v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04)
+
+v1 selected Roller Network's free Secondary MX. The validation gates killed it
+before any DNS change:
+
+- **G2 FAILED**: the [free-accounts policy](https://rollernet.us/policy/free-accounts.html)
+  caps free mail service at **200 relayed messages or 10 MB per rolling 7
+  days**; overage → domain suspended **48 h answering SMTP 5xx** (permanent
+  bounces), repeatable. Spammers deliberately target backup MXes even while
+  the primary is up, so background spam alone can hold the domain suspended —
+  worse than no backup MX.
+- **G1 SHAKY**: same policy page says free accounts are being discontinued.
+- **G3 PASSED** (for posterity): `mail{,2}.rollernet.us` present valid LE
+  certs over STARTTLS.
+- Signup is Cloudflare-Turnstile-gated — moot given G1/G2.
+
+Viktor's decision: stay free → self-host on Oracle Always-Free. **[CH]** The
+external challenger re-searched the free landscape (DNSExit, KisoLabs,
+DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed:
+no credible free managed backup-MX or free VM with a usable port-25 story
+exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and
+is US-regions-only (wrong continent).
+
+## Decision
+
+A minimal **Postfix store-and-forward relay** (`mx2.viktorbarzin.me`) on an
+Oracle Cloud **Always-Free** compute instance, published as a lower-preference
+MX. It accepts mail for `viktorbarzin.me` when the primary is unreachable,
+queues up to 30 days, and drains to the primary when it returns. No mailboxes,
+no third-party terms — the queue-lifetime and reject-behavior knobs are ours.
+
+## Architecture
+
+```
+                         ┌── pri 1  mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
+sender MTA ──► MX lookup ┤                                        ▲
+                         └── pri 20 mx2.viktorbarzin.me           │ drain: smtp to
+                             (Oracle VM, Postfix relay,           │ mail.viktorbarzin.me:2526
+                              queue ≤ 30 days) ───────────────────┘ (pfSense WAN NAT rdr
+                                                                     2526 → 10.0.20.1:25,
+                                                                     existing HAProxy frontend)
+```
+
+- **Normal operation**: senders use pri 1; the VM mostly idles. What does
+  arrive (spam aimed at the backup, transient-blip retries) is relayed onward
+  immediately.
+- **Outage**: senders fall back to pri 20 → VM accepts + queues → Postfix
+  retries the primary on its native schedule → queue drains after recovery
+  through the standard external ingress path (PROXY v2 → :2525 → rspamd →
+  Dovecot).
+- **Custom drain port**: Oracle blocks **egress TCP 25** tenancy-wide
+  (post-2021; exemptions unreliable) — the VM cannot reach
+  `mail.viktorbarzin.me:25`. One pfSense WAN NAT rule `TCP 2526 →
+  10.0.20.1:25` reuses the existing HAProxy frontend unchanged. **[CH]
+  Verified against the runbook**: the frontend binds `*:25` on pfSense (not
+  strictly 10.0.20.1), rdr dst-port rewrite is the existing production
+  pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides
+  with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 **to**
+  the VM is unaffected by Oracle's egress-only block per practitioner
+  evidence (iRedMail/mailcow on OCI: receive works, send doesn't) — **to be
+  proven at gate O2 before any DNS change** (Oracle publishes no positive
+  commitment).
+
+## Oracle account & instance
+
+- **Account**: Viktor creates it (human signup; card for identity, $0
+  charged). **Home region is fixed at signup and Always-Free compute exists
+  only there — choose `eu-frankfurt-1` deliberately; there is no
+  try-another-region fallback without a new account. [CH]**
+- **[CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation**:
+  Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days — an
+  idle Postfix box qualifies) and demonstrably changes free-tier terms without
+  notice, enforcing by termination (June 2026: A1 allowance silently halved,
+  over-limit instances shut down). PAYG keeps Always-Free resources free and
+  exempts them from idle reclamation.
+- **Shape**: `VM.Standard.E2.1.Micro` (x86, 1/8 OCPU burst, 1 GB RAM; 2
+  always-free instances allowed; ample for queue-only Postfix — and untouched
+  by the 2026 A1 cuts). ARM A1 fallback is **unreliable** (halved quota,
+  chronic Frankfurt capacity) — treat E2.1.Micro availability as the gate.
+- **[CH] Reserved public IP is mandatory** (`oci_core_public_ip`, reserved):
+  an ephemeral IP rotates on stop/start and would silently break all four
+  IP-keyed controls at once (pfSense NAT source-restriction, the primary's
+  smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape
+  allowlist) — discovered only at the next outage's drain.
+- **OS**: Ubuntu 24.04. **[CH] OCI Ubuntu images ship an OS-level iptables
+  ruleset (`/etc/iptables/rules.v4`) that ACCEPTs 22 and REJECTs everything
+  else, independent of security lists** — cloud-init must insert ACCEPT rules
+  for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2
+  fails on day 1 with a correct security list.
+- **Credentials**: OCI API key for Terraform → Vault `secret/viktor`
+  (`oci_*`); web login → Vaultwarden item `Oracle Cloud (backup MX)`.
+
+## Networking & security posture
+
+- **Ingress on the VM**: TCP 25 world-open (the service). **[CH] TCP 80
+  world-open permanently** — Let's Encrypt validation is multi-perspective
+  with no published source IPs, so it cannot be source-scoped, and a
+  "open-only-during-renewal" toggle is unspecified automation whose realistic
+  failure mode is an expired cert at day ~90. Nothing listens on 80 outside
+  certbot's seconds-long renewal windows; connection-refused surface is
+  negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32
+  (176.12.22.76) in both the Oracle security list and the VM firewall.
+- **No public SSH**: management rides the headscale tailnet — cloud-init
+  enrolls via a **preauth key for a dedicated non-OIDC headscale user** with
+  node tag `tag:backup-mx` (headscale 0.28.0 file-mode ACL, content in Vault
+  `secret/headscale` → `headscale_acl`); SSH bound to the tailnet interface.
+  ACL grant: `group:admin → tag:backup-mx:22` (cluster pods are NOT tailnet
+  members — see monitoring). **[CH] Outage caveat**: headscale's control
+  plane + DERP live in the cluster, so mid-outage tailnet reachability is
+  cached-netmap best-effort — the runbook documents the **OCI instance
+  console connection as break-glass** management. (Also fix `vpn.md`'s stale
+  "0.23.x / OIDC-only" claims while in there.)
+- **VM compromise blast radius**: plaintext of outage-queued mail + a relay
+  surface contained by `relay_domains = viktorbarzin.me` only, no submission
+  ports, no SASL, no local delivery. The VM is deliberately NOT added to the
+  primary's `mynetworks` (that would let a compromised VM relay arbitrary
+  mail *through* the primary) — per-stage exemptions instead, below.
+
+## Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene)
+
+- `relay_domains = viktorbarzin.me`; `mydestination =` (empty).
+- **[CH]** `smtpd_relay_restrictions = permit_mynetworks,
+  reject_unauth_destination` — explicit 5xx for foreign-domain RCPTs (the
+  default tail is `defer_unauth_destination`, whose 4xx invites every relay
+  probe to retry forever).
+- **[CH]** `relay_recipient_maps` explicitly set to the wildcard form
+  (`@viktorbarzin.me OK`) — documents accept-all-recipients as a decision
+  (the domain is catch-all; every RCPT is valid by definition).
+- `transport_maps`: `viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526`.
+- `maximal_queue_lifetime = 30d`. **[CH]** `bounce_queue_lifetime = 1d` and
+  `delay_warning_time = 0` — this host can never deliver a DSN to anyone
+  (egress 25 blocked; its only egress is 2526 to the primary), so undeliverable
+  bounces must be discarded quickly or they rot in the queue for a month and
+  permanently poison the queue-depth alert.
+- **[CH]** `message_size_limit = 209715200` — exactly the primary's 200 MB
+  (`POSTFIX_MESSAGE_SIZE_LIMIT`, mailserver main.tf:88). The stock 10 MB
+  default would 552-reject large legitimate mail during outages — the exact
+  loss mode this project exists to prevent. Equal, never higher (higher
+  recreates drain-time rejects).
+- **[CH] postscreen on the VM in 4xx-only posture**: pregreet test ON
+  (fire-and-forget bots don't retry; real MTAs do — the whole design already
+  rests on sender retry, so 4xx filtering is loss-free by construction),
+  optionally `postscreen_dnsbl_action = defer` with a conservative threshold.
+  v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly banned)
+  with 4xx tempfail (harmless); without any hygiene the backup is a 24/7
+  spam backdoor since spammers deliberately deliver to the highest-numbered
+  MX. Zero 5xx from reputation, ever.
+- `inet_protocols = ipv4` **[CH]** — the primary publishes an AAAA (HE
+  tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted
+  v6 attempt per delivery.
+- `smtpd_tls_cert_file` = LE cert for `mx2.viktorbarzin.me` (opportunistic
+  STARTTLS inbound; `smtp_tls_security_level = may` on the drain leg).
+- Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day
+  accumulation for a personal domain.
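+
+Collected in one place, the settings above would render to roughly the
+following `main.cf` fragment. A sketch only: the function name and the `hash:`
+map paths are assumptions, and the postscreen and TLS certificate settings are
+left out.
+
+```bash
+# render_backup_mx_main_cf: print the relay-only main.cf settings to stdout.
+# Hypothetical helper; the map file paths are placeholders, not verified.
+render_backup_mx_main_cf() {
+  cat <<EOF
+relay_domains = viktorbarzin.me
+mydestination =
+smtpd_relay_restrictions = permit_mynetworks, reject_unauth_destination
+relay_recipient_maps = hash:/etc/postfix/relay_recipients
+transport_maps = hash:/etc/postfix/transport
+maximal_queue_lifetime = 30d
+bounce_queue_lifetime = 1d
+delay_warning_time = 0
+message_size_limit = $((200 * 1024 * 1024))
+inet_protocols = ipv4
+smtp_tls_security_level = may
+EOF
+}
+```
+
+Computing `message_size_limit` from `200 * 1024 * 1024` keeps the value
+visibly tied to the primary's 200 MB limit rather than a bare number.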
+
+## TLS
+
+certbot standalone HTTP-01 for `mx2.viktorbarzin.me` (no Cloudflare API token
+on an internet-facing VM). Port 80 permanently open (see above); certbot renew
+timer. The MTA-STS follow-up (separate task; policy host currently dangling —
+below) must list `mx2.viktorbarzin.me` when implemented.
+
+## Primary-side drain enablement **[CH — this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]**
+
+The v2 exemptions targeted postscreen DNSBL (which is **off** on the primary —
+`ENABLE_DNSBL` unset) and rspamd SPF/DMARC scoring — but missed the three
+mechanisms that would actually break the drain (items 1–3 below; item 4 is
+kept only for future-proofing). All four are keyed on the VM's reserved /32
+(the PROXY-v2-recovered client IP):
+
+1. **`reject_unknown_client_hostname` bypass** — the primary sets
+   `POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1` (main.tf:89); an Oracle IP
+   without full FCrDNS (PTR needs an Oracle SR; limited on free accounts)
+   would be **450-deferred on every drain attempt → the queue never drains →
+   the backlog expires at day 30, silently since the VM cannot deliver a
+   DSN**. Fix: `check_client_access` permit for the VM /32 early in
+   `smtpd_client_restrictions`, and a matching permit at the sender
+   stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope
+   senders — drained self-addressed/bounced mail would 5xx). Attempt the
+   Oracle PTR anyway (belt and braces).
+2. **Anvil rate-limit exception** — `smtpd_client_message_rate_limit = 30`/min
+   keys on the VM's IP at drain; a >3,600-message backlog would throttle for
+   hours and false-fire the queue alert. Add the VM /32 to
+   `smtpd_client_event_limit_exceptions`.
+3. **rspamd: evaluate the original sender, never 5xx the drain stream** — via
+   the existing override.d ConfigMap pattern (same mount as
+   `dkim_signing.conf`): (a) configure rspamd's **`external_relay`** module
+   (ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the
+   *original* client IP parsed from the VM's Received header — this keeps
+   DMARC protection for the entire drain stream instead of v2's blanket
+   disable; (b) cap rspamd's **action at the VM /32 to tag/fold — never
+   milter-reject**: the primary's default reject tier (DMS default, active
+   since only dkim_signing is overridden today) would 5xx high-score spam at
+   DATA, forcing the VM to generate DSNs to forged senders = classic
+   backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in
+   the catch-all's Junk instead. Validate the external_relay ↔ settings-rule
+   interplay at gate O5 with a high-spam-score message.
+4. postscreen permit for the /32 (harmless; pregreet never trips a real
+   Postfix client and DNSBL is off — kept for future-proofing only).
+
+## Our-side changes (Terraform unless noted)
+
+1. **New stack `stacks/backup-mx/`** (Tier 1): OCI provider (creds from
+   Vault), VCN + subnet + security list + **reserved public IP** +
+   `VM.Standard.E2.1.Micro` + cloud-init (`templatefile`): **OS iptables
+   ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule
+   (persisted)**, postfix + config above, certbot, tailscale→headscale
+   enrollment (preauth key from Vault), node_exporter, postfix_exporter,
+   unattended-upgrades.
+2. **DNS** — `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: A
+   `mx2.viktorbarzin.me` → reserved IP (non-proxied), MX pref 20 → `mx2`.
+   **[CH] Live zone count verified: 195/200 → 197/200 after this change; only
+   3 slots remain and the MTA-STS follow-up needs 1–2 → plan the next
+   record-purge now, not at collision time.**
+3. **pfSense (live network device — approved as part of this plan)**: WAN NAT
+   rdr `TCP 2526 → 10.0.20.1:25` + firewall rule, source-restricted to the
+   reserved IP. **[CH] Scripted** (extend the existing
+   `scripts/pfsense-*-haproxy*.php` bootstrap-script family), not
+   hand-clicked — keeps the git-rebuildable parity the rest of the pfSense
+   mail config has. Config.xml rides the nightly backup.
+4. **Mailserver stack**: the four-layer drain enablement above (client+sender
+   `check_client_access` permits, anvil exception, rspamd external_relay +
+   action cap, postscreen permit) — all keyed to one /32, via the existing
+   `postfix_cf` / `user-patches.sh` / rspamd-override hook points (verified
+   present: main.tf:129-144, 222-281, 467-474).
+5. **Monitoring [CH — replaces v2's tailnet scraping, which had no transport:
+   no cluster→tailnet route exists and no existing target is scraped that
+   way]**: Prometheus scrapes `node_exporter`/`postfix_exporter` on the VM's
+   **public reserved IP**, allowed only from the homelab WAN /32 (Oracle SL +
+   VM firewall); blackbox TCP:25 from the cluster (`BackupMxDown`, warning);
+   MX-set drift assertion (both MX records present). Alerts:
+   `BackupMxQueueStuck` = **non-bounce** queue depth > 0 for 2 h while the
+   primary is healthy (gate on the existing `MailServerDown`/roundtrip
+   series, machine-readable — not prose); bounce residue is excluded by the
+   1-day bounce lifetime. Note: during a full homelab outage Prometheus
+   itself is down — queue growth is unobservable live under ANY transport;
+   what we actually watch is the post-recovery drain. A WAN-IP change stales
+   the Oracle allowlist → visible as ScrapeTargetDown (self-signaling).
+   **Probe semantics note**: once mx2 exists, the Brevo roundtrip probe's
+   mail fails over to mx2 on transient primary blips and arrives minutes late
+   via the drain — `EmailRoundtripFailing` may then mean "delayed via mx2",
+   not "lost"; note in the alert description and runbook.
+6. **Docs (same commit as implementation)**: rewrite `mailserver.md` §"No
+   Backup MX", new runbook `docs/runbooks/backup-mx.md` (`postqueue -p`,
+   forced drain `postqueue -f`, cert renewal, **OCI console break-glass**, VM
+   rebuild from stack, Oracle account facts incl. PAYG + home-region lock),
+   `vpn.md` headscale-version/OIDC staleness fix, monitoring rows.
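+
+A minimal sketch of how the `BackupMxDown` alert from the monitoring step
+could be expressed, assuming the rule group is built as an HCL object and
+rendered with `yamlencode`. The local name, the `blackbox-tcp` job label, the
+`mx2.viktorbarzin.me:25` target, and the 10-minute hold are illustrative
+assumptions; only the alert name, the TCP 25 probe, and the `warning`
+severity come from the plan. How the YAML reaches Prometheus is left out.
+
+```hcl
+locals {
+  # Hypothetical rule group for the backup MX (names are assumptions).
+  backup_mx_rules = {
+    groups = [{
+      name = "backup-mx"
+      rules = [{
+        alert = "BackupMxDown"
+        # probe_success is the blackbox_exporter result metric (1 = probe OK);
+        # the job and instance label values are assumptions.
+        expr = "probe_success{job=\"blackbox-tcp\",instance=\"mx2.viktorbarzin.me:25\"} == 0"
+        # Assumed hold time; the key is quoted so it is not read as a
+        # `for` expression.
+        "for" = "10m"
+        labels = {
+          severity = "warning"
+        }
+        annotations = {
+          summary = "Backup MX is not accepting connections on TCP 25"
+        }
+      }]
+    }]
+  }
+}
+
+# Render the object as Prometheus rule-file YAML.
+output "backup_mx_rules_yaml" {
+  value = yamlencode(local.backup_mx_rules)
+}
+```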
+
+### MTA-STS finding (unchanged; no action in this change)
+
+`_mta-sts` TXT is published but `mta-sts.viktorbarzin.me` has no record and
+nothing serves the policy — MTA-STS is inert today. When fixed, the policy
+MUST include `mx: mx2.viktorbarzin.me` (and budget its DNS records against the
+3 remaining zone slots).
+
+## Validation gates (in order; any failure → stop and report)
+
+| # | Gate | Method | Failure handling |
+|---|------|--------|------------------|
+| O1 | Oracle account (home region `eu-frankfurt-1`, **fixed forever at signup**), **PAYG conversion done**, E2.1.Micro capacity | Viktor signs up + converts; TF apply | A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor |
+| O2 | Inbound TCP 25 reachable from the internet (after the OS-iptables fix) | `nc -zv <reserved-ip> 25` from outside + recurring Uptime-Kuma TCP monitor (keeps proving it — Oracle publishes no commitment) | Stop; decision returns to Viktor |
+| O3 | Drain works: VM → `mail.viktorbarzin.me:2526` delivers end-to-end | Test message injected on the VM | Debug pfSense NAT / HAProxy path |
+| O4 | LE cert issued | certbot standalone | STARTTLS is opportunistic — non-blocking for go-live; fix before MTA-STS |
+| O5 | Live failover test — **hardened [CH]** | presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo **plus a high-spam-score message and a >10 MB message** → confirm queued (`postqueue -p`) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on the VM, roundtrip probe recovers | Debug or roll back (remove MX record) |
+
+## Failure modes
+
+Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP
+changes, short-retry senders. If pfSense is down the drain waits — Postfix
+retries until it heals.
+
+Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox
+access; **outages > 30 days lose queued mail silently (no DSN possible)**.
+Simultaneous Oracle+homelab outage = status quo ante (sender retries).
+
+Newly introduced, accepted:
+
+- **A pet outside the cluster** — deliberately cattle: rebuilt from TF +
+  cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a
+  backup target.
+- **Oracle free-tier caprice [CH — upgraded from v2's framing]**: Oracle has
+  silently cut Always-Free allowances and terminated over-limit instances
+  (June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe,
+  `BackupMxDown`, and the fact that outside an active outage the queue is
+  empty — a surprise reclamation loses nothing, only coverage until rebuilt.
+  Rollernet Basic ($30/yr) stays the documented fallback if OCI sours.
+- **Spam hygiene**: 4xx-only postscreen on the VM (pregreet + conservative
+  DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by
+  rspamd, never bounced.
+- Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant;
+  accepted).
+
+## Rollback
+
+Remove the MX + A records; wait for `postqueue -p` empty; `terraform destroy`
+on `backup-mx`; delete the pfSense NAT rule (scripted); drop the mailserver
+/32 exemptions. Order matters: MX record first.
+
+## Viktor's manual steps (everything else is mine)
+
+1. Create the Oracle Cloud account — **home region `eu-frankfurt-1`** (fixed
+   forever), card for identity, $0 charged.
+2. **Convert the tenancy to Pay-As-You-Go** (required — idle-reclamation
+   exemption; Always-Free stays $0).
+3. Hand me the tenancy OCID + a console user → I mint the API key, store
+   creds (Vault + Vaultwarden), and build the stack.
+4. Approve the (scripted) pfSense NAT rule when I reach that step.
--- a/docs/plans/2026-07-04-drone-logbook-design.md
+++ b/docs/plans/2026-07-04-drone-logbook-design.md
@ -0,0 +1,89 @@
+# Drone Logbook (Open DroneLog) — Design
+
+**Date:** 2026-07-04
+**Status:** Approved (Viktor, 2026-07-04)
+**Owner request:** "I have a DJI Mini 4 Pro. I'm interested in github.com/ViktorBarzin/drone-logbook" → self-host it in the cluster.
+
+## Goal
+
+Self-host [Open DroneLog](https://github.com/arpanghosh8453/open-dronelog) (upstream of the
+`ViktorBarzin/drone-logbook` fork) at **https://dronelog.viktorbarzin.me** so Viktor can import
+DJI Fly flight logs from his DJI Mini 4 Pro and analyze them privately: telemetry charts, 3D map
+replay, per-flight and lifetime stats. All data stays in the cluster (single DuckDB database).
+
+## Decisions (interview, 2026-07-04)
+
+| Question | Decision |
+|---|---|
+| Deployment form | Self-hosted Docker web app in k8s (not desktop app, not hosted webapp) |
+| Exposure | Public `dronelog.viktorbarzin.me`, **Authentik forward-auth** (`auth = "required"`) |
+| Log ingestion | **Both** manual web upload *and* a server-side auto-import drop folder from day one |
+| Image source | **Upstream** `ghcr.io/arpanghosh8453/open-dronelog:latest` — NOT the fork |
+| Fork disposition | Fork is 0 ahead / 372 behind, adds nothing; delete or park it. Only revive (sync + ADR-0002 GHA build) if Viktor starts modifying the code |
+
+## Architecture
+
+New Tier-1 stack `stacks/drone-logbook/`, modeled line-by-line on `stacks/freshrss/`
+(the closest existing shape: single upstream-image app, own data volume, Keel-updated):
+
+- **Namespace** `drone-logbook`, tier `4-aux`, label `keel.sh/enrolled=true` → Kyverno injects
+  Keel poll annotations → auto-upgrades as upstream releases (project is actively maintained).
+- **Deployment** (1 replica, `Recreate` — DuckDB is single-writer/embedded):
+  - image `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx frontend + Axum REST backend, port 80)
+  - memory request=limit **512Mi** (DuckDB import/analytics spikes), cpu request 25m, no cpu limit
+  - standard `KYVERNO_LIFECYCLE_V1` / `KEEL_IGNORE_IMAGE` / `KEEL_LIFECYCLE_V1` lifecycle ignores
+- **App data** `/data/drone-logbook` (DuckDB db, cached DJI decryption keys, uploaded originals):
+  **`proxmox-lvm-encrypted` block PVC** `drone-logbook-data-encrypted`, 2Gi, topolvm autoresize →
+  10Gi ceiling. Encrypted class because flight logs are GPS traces of home/travel — sensitive data
+  defaults to `proxmox-lvm-encrypted` per the storage decision rule (`.claude/CLAUDE.md`).
+  Embedded DBs stay off NFS (same rationale documented in the freshrss stack: NFS only for static files).
+- **Backup CronJob** `drone-logbook-backup` (mandatory for every proxmox-lvm app): daily 01:30
+  file copy of the data volume → NFS `/srv/nfs/drone-logbook-backup` (dated dirs, 30-day retention,
+  Pushgateway metrics), pod-affinity co-scheduled with the app pod (RWO volume). 01:30 sits outside
+  the 00:00/08:00/16:00 sync-import windows so the DuckDB file is quiescent; retained upload
+  originals make even a torn copy recoverable by re-import. `nfs-mirror` (02:00) ships it to sda →
+  Synology offsite. Vaultwarden pattern.
+- **Sync drop folder**: static NFS volume (`modules/kubernetes/nfs_volume`)
+  `192.168.1.127:/srv/nfs/drone-logbook/sync-logs`, mounted **read-only** at `/sync-logs`;
+  `SYNC_LOGS_PATH=/sync-logs`, `SYNC_INTERVAL="0 0 */8 * * *"` (every 8 h).
+  Any producer (Nextcloud sync, scp, a future phone pipeline) drops `.txt` logs there; the app
+  imports them automatically. `KEEP_UPLOADED_FILES=true` keeps re-importable originals in the PVC.
+- **Ingress** via `ingress_factory`: `name = "dronelog"`, `auth = "required"` (Authentik
+  forward-auth), `dns_type = "proxied"`. External Uptime Kuma HTTPS monitor comes automatically
+  with the ingress annotation. Homepage tile (group "Media & Entertainment", icon `mdi-quadcopter`).
+- **Secrets**: Vault KV `secret/drone-logbook` (`profile_creation_pass`) → ExternalSecret
+  (`vault-kv` ClusterSecretStore) → k8s secret `drone-logbook-secrets` → env
+  `PROFILE_CREATION_PASS`. Gates profile create/delete even for other Authentik-logged-in users.
+  No plan-time secret reads needed (no `data "kubernetes_secret"`).
+  No `DJI_API_KEY` — bundled default is fine at personal import volume; add later if rate-limited.
+
+## Operational notes
+
+- **DJI egress dependency**: importing a *new* log file requires the pod to reach DJI's servers
+  once (flight-log decryption key fetch; keys are then cached in the data dir). Remember this when
+  egress enforcement lands (Security wave 1, beads `code-8ywc`).
+- The web UI is desktop-first; mobile is functional but basic.
+- NFS host prerequisite: `/srv/nfs/drone-logbook/sync-logs` (root:www-data, 2777; same shape as
+  sibling dirs) and `/srv/nfs/drone-logbook-backup` created on 192.168.1.127 and recorded in
+  `secrets/nfs_directories.txt`. `/srv/nfs` is exported whole-tree, so no `/etc/exports`
+  (`scripts/pve-nfs-exports`) change.
+- Backup story = the daily app-level backup CronJob (above) + the host `daily-backup` LVM-snapshot
+  leg + original log files retained both in the drop folder and in the data volume
+  (`KEEP_UPLOADED_FILES=true`).
+
+## Alternatives considered
+
+- **Build from the fork** (`ghcr.io/viktorbarzin/...` via GHA, ADR-0002): rejected for now — fork
+  has zero custom commits; a build chain adds maintenance for no benefit. Revisit if code changes
+  are wanted.
+- **`auth = "app"` + app profile passwords** (would enable the `opendronelog-sync` native uploader
+  from anywhere): rejected — a single app password guarding GPS traces of home/travel on the open
+  internet is weaker than Authentik; the sync drop folder covers automated ingestion instead.
+- **Internal-only (.lan + VPN)**: rejected — Authentik-gated public matches the rest of the
+  homelab and works without VPN while traveling.
+- **NFS for the DuckDB data**: rejected — embedded-DB-on-NFS locking risk; freshrss precedent
+  keeps app DB data on proxmox-lvm.
+
+## Implementation
+
+See `2026-07-04-drone-logbook-plan.md`.
--- a/docs/plans/2026-07-04-drone-logbook-plan.md
+++ b/docs/plans/2026-07-04-drone-logbook-plan.md
@ -0,0 +1,542 @@
+# Drone Logbook (Open DroneLog) Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Deploy Open DroneLog (DJI flight-log analyzer) at https://dronelog.viktorbarzin.me — new Tier-1 stack `stacks/drone-logbook/`, upstream image, Authentik-gated, with a DuckDB data PVC and an NFS auto-import drop folder.
+
+**Architecture:** Single Deployment running `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx + Axum + DuckDB, port 80) in namespace `drone-logbook`; data on a `proxmox-lvm-encrypted` PVC (GPS logs = sensitive data), `/sync-logs` drop folder on static NFS, daily backup CronJob to `/srv/nfs/drone-logbook-backup` (vaultwarden pattern), `ingress_factory` with `auth = "required"`, Keel auto-upgrades via namespace enrollment. Modeled line-by-line on `stacks/freshrss/`. Design: `2026-07-04-drone-logbook-design.md`.
+
+**Tech Stack:** Terraform/Terragrunt (Tier-1 PG state), Vault KV + ESO, ingress_factory, nfs_volume module, Keel/Kyverno.
+
+Terraform is exempt from TDD (execution.md); each task ends with a concrete verification instead.
+
+---
+
+### Task 1: Vault secret
+
+**Files:** none (Vault KV only)
+
+- [ ] **Step 1.1: Create `secret/drone-logbook` with a generated profile-creation password**
+
+```bash
+vault kv put secret/drone-logbook profile_creation_pass="$(openssl rand -base64 24)"
+```
+
+- [ ] **Step 1.2: Verify**
+
+```bash
+vault kv get -field=profile_creation_pass secret/drone-logbook | wc -c
+```
+
+Expected: `32` (24 random bytes encode to 32 base64 characters; `-field` output has no trailing newline when piped). Never echo the value itself.
+
+### Task 2: NFS drop folder on 192.168.1.127
+
+**Files:**
+- Modify: `secrets/nfs_directories.txt` (git-crypt'd; **edit from the MAIN checkout only**, never the worktree; sorted list, add `drone-logbook-backup` and `drone-logbook/sync-logs`)
+
+- [ ] **Step 2.1: Create the directories** — world-writable + setgid like `vaultwarden-backup` (the `/srv/nfs` export root-squashes, so pod-root writes land as `nobody`):
+
+```bash
+ssh root@192.168.1.127 'mkdir -p /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && chown -R root:www-data /srv/nfs/drone-logbook /srv/nfs/drone-logbook-backup && chmod 2777 /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && ls -ld /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup'
+```
+
+Expected: `drwxrwsrwx ... root www-data ...` for both.
+No `/etc/exports` (`scripts/pve-nfs-exports`) change — `/srv/nfs` is exported whole-tree.
+
+- [ ] **Step 2.2: Record them in the declarative list (MAIN checkout, plaintext there)** — insert `drone-logbook-backup` and `drone-logbook/sync-logs` (after `diun`, before `etcd-backup`) in `~/code/infra/secrets/nfs_directories.txt`, then commit that single file to master:
+
+```bash
+git -C ~/code/infra add secrets/nfs_directories.txt
+git -C ~/code/infra commit -m "nfs_directories: add drone-logbook dirs
+
+Drop folder for the new drone-logbook stack's auto-import (SYNC_LOGS_PATH)
+plus its backup target. Both created on 192.168.1.127 as root:www-data 2777."
+git -C ~/code/infra push forgejo master
+```
+
+(Trivial single-file exception per execution.md; encrypted files cannot be edited from the worktree.)
+
+### Task 3: Stack files (in the `wizard/drone-logbook` worktree)
+
+**Files:**
+- Create: `stacks/drone-logbook/main.tf` (content below)
+- Create: `stacks/drone-logbook/terragrunt.hcl` (content below)
+- Create: `stacks/drone-logbook/secrets` → symlink to `../../secrets`
+- (`backend.tf`, `tiers.tf`, `cloudflare_provider.tf`, `providers.tf`, `.terraform.lock.hcl` are terragrunt-generated and **gitignored** — do NOT create or commit them; the tracked copies in old stacks like freshrss predate the ignore rule)
+
+- [ ] **Step 3.1: `terragrunt.hcl`**
+
+```hcl
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
+```
+
+- [ ] **Step 3.2: `main.tf`** — exact content:
+
+```hcl
+variable "tls_secret_name" {
+  type      = string
+  sensitive = true
+}
+variable "nfs_server" { type = string }
+
+# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog) — self-hosted
+# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the
+# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest.
+# Design: docs/plans/2026-07-04-drone-logbook-design.md
+resource "kubernetes_namespace" "drone_logbook" {
+  metadata {
+    name = "drone-logbook"
+    labels = {
+      tier               = local.tiers.aux
+      "keel.sh/enrolled" = "true"
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
+    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
+  }
+}
+
+resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
+  manifest = {
+    apiVersion = "external-secrets.io/v1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "drone-logbook-secrets"
+      namespace = "drone-logbook"
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "drone-logbook-secrets"
+      }
+      dataFrom = [{
+        extract = {
+          key = "drone-logbook"
+        }
+      }]
+    }
+  }
+  depends_on = [kubernetes_namespace.drone_logbook]
+}
+
+module "tls_secret" {
+  source          = "../../modules/kubernetes/setup_tls_secret"
+  namespace       = kubernetes_namespace.drone_logbook.metadata[0].name
+  tls_secret_name = var.tls_secret_name
+}
+
+# DuckDB database + cached DJI decryption keys + uploaded originals.
+# Embedded DB -> block storage, not NFS (same rationale as freshrss data).
+# Encrypted class: flight logs are GPS traces of home/travel (sensitive data
+# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md).
+resource "kubernetes_persistent_volume_claim" "data" {
+  wait_until_bound = false
+  metadata {
+    name      = "drone-logbook-data-encrypted"
+    namespace = kubernetes_namespace.drone_logbook.metadata[0].name
+    annotations = {
+      "resize.topolvm.io/threshold"     = "10%"
+      "resize.topolvm.io/increase"      = "100%"
+      "resize.topolvm.io/storage_limit" = "10Gi"
+    }
+  }
+  spec {
+    access_modes       = ["ReadWriteOnce"]
+    storage_class_name = "proxmox-lvm-encrypted"
+    resources {
+      requests = {
+        storage = "2Gi"
+      }
+    }
+  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and PVCs
+    # can't shrink; without this every apply tries to revert the size.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
+}
+
+# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands
+# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL.
+module "nfs_sync_logs" {
+  source     = "../../modules/kubernetes/nfs_volume"
+  name       = "drone-logbook-sync-logs"
+  namespace  = kubernetes_namespace.drone_logbook.metadata[0].name
+  nfs_server = var.nfs_server
+  nfs_path   = "/srv/nfs/drone-logbook/sync-logs"
+  storage    = "5Gi"
+}
+
+resource "kubernetes_deployment" "drone_logbook" {
+  metadata {
+    name      = "drone-logbook"
+    namespace = kubernetes_namespace.drone_logbook.metadata[0].name
+    labels = {
+      app                             = "drone-logbook"
+      "kubernetes.io/cluster-service" = "true"
+      tier                            = local.tiers.aux
+    }
+  }
+  spec {
+    replicas = 1
+    strategy {
+      # DuckDB is single-writer; never overlap two pods on the same volume
+      type = "Recreate"
+    }
+    selector {
+      match_labels = {
+        app = "drone-logbook"
+      }
+    }
+    template {
+      metadata {
+        labels = {
+          app                             = "drone-logbook"
+          "kubernetes.io/cluster-service" = "true"
+        }
+      }
+      spec {
+        container {
+          name  = "drone-logbook"
+          image = "ghcr.io/arpanghosh8453/open-dronelog:latest"
+          env {
+            name  = "RUST_LOG"
+            value = "info"
+          }
+          env {
+            # keep re-importable originals under /data/drone-logbook/uploaded
+            name  = "KEEP_UPLOADED_FILES"
+            value = "true"
+          }
+          env {
+            name  = "SYNC_LOGS_PATH"
+            value = "/sync-logs"
+          }
+          env {
+            # 6-field cron (sec min hour dom mon dow): scan drop folder every 8h
+            name  = "SYNC_INTERVAL"
+            value = "0 0 */8 * * *"
+          }
+          env {
+            name = "PROFILE_CREATION_PASS"
+            value_from {
+              secret_key_ref {
+                name = "drone-logbook-secrets"
+                key  = "profile_creation_pass"
+              }
+            }
+          }
+          volume_mount {
+            name       = "data"
+            mount_path = "/data/drone-logbook"
+          }
+          volume_mount {
+            name       = "sync-logs"
+            mount_path = "/sync-logs"
+            read_only  = true
+          }
+          port {
+            name           = "http"
+            container_port = 80
+            protocol       = "TCP"
+          }
+          resources {
+            requests = {
+              cpu    = "25m"
+              memory = "512Mi"
+            }
+            limits = {
+              memory = "512Mi"
+            }
+          }
+        }
+        volume {
+          name = "data"
+          persistent_volume_claim {
+            claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
+          }
+        }
+        volume {
+          name = "sync-logs"
+          persistent_volume_claim {
+            claim_name = module.nfs_sync_logs.claim_name
+          }
+        }
+      }
+    }
+  }
+  depends_on = [kubernetes_manifest.external_secret]
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["kubernetes.io/change-cause"],
+      metadata[0].annotations["deployment.kubernetes.io/revision"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+    ]
+  }
+}
+
+resource "kubernetes_service" "drone_logbook" {
+  metadata {
+    name      = "drone-logbook"
+    namespace = kubernetes_namespace.drone_logbook.metadata[0].name
+    labels = {
+      "app" = "drone-logbook"
+    }
+  }
+
+  spec {
+    selector = {
+      app = "drone-logbook"
+    }
+    port {
+      port        = "80"
+      target_port = "80"
+    }
+  }
+}
+
+# -----------------------------------------------------------------------------
+# Backup — required for every proxmox-lvm(-encrypted) app: daily copy of the
+# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror ->
+# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import
+# windows, so the DuckDB file is quiescent; uploaded originals make even a
+# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the
+# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern.
+# -----------------------------------------------------------------------------
+
+module "nfs_backup" {
+  source     = "../../modules/kubernetes/nfs_volume"
+  name       = "drone-logbook-backup-host"
+  namespace  = kubernetes_namespace.drone_logbook.metadata[0].name
+  nfs_server = var.nfs_server
+  nfs_path   = "/srv/nfs/drone-logbook-backup"
+}
+
+resource "kubernetes_cron_job_v1" "backup" {
+  metadata {
+    name      = "drone-logbook-backup"
+    namespace = kubernetes_namespace.drone_logbook.metadata[0].name
+  }
+  spec {
+    concurrency_policy            = "Replace"
+    failed_jobs_history_limit     = 5
+    schedule                      = "30 1 * * *"
+    starting_deadline_seconds     = 300
+    successful_jobs_history_limit = 3
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 3
+        ttl_seconds_after_finished = 10
+        template {
+          metadata {}
+          spec {
+            affinity {
+              pod_affinity {
+                required_during_scheduling_ignored_during_execution {
+                  label_selector {
+                    match_labels = {
+                      app = "drone-logbook"
+                    }
+                  }
+                  topology_key = "kubernetes.io/hostname"
+                }
+              }
+            }
+            container {
+              name  = "drone-logbook-backup"
+              image = "docker.io/library/alpine"
+              command = ["/bin/sh", "-c", <<-EOT
+                set -euxo pipefail
+                _t0=$(date +%s)
+                now=$(date +"%Y_%m_%d_%H_%M")
+                mkdir -p /backup/$now
+                cp -a /data/. /backup/$now/
+                # Rotate — 30 day retention
+                find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} +
+                _dur=$(($(date +%s) - _t0))
+                _out_bytes=$(du -sb /backup/$now | awk '{print $1}')
+                wget -qO- --post-data "backup_duration_seconds $${_dur}
+                backup_output_bytes $${_out_bytes}
+                backup_last_success_timestamp $(date +%s)
+                " "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true
+              EOT
+              ]
+              volume_mount {
+                name       = "data"
+                mount_path = "/data"
+                read_only  = true
+              }
+              volume_mount {
+                name       = "backup"
+                mount_path = "/backup"
+              }
+            }
+            volume {
+              name = "data"
+              persistent_volume_claim {
+                claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
+              }
+            }
+            volume {
+              name = "backup"
+              persistent_volume_claim {
+                claim_name = module.nfs_backup.claim_name
+              }
+            }
+            dns_config {
+              option {
+                name  = "ndots"
+                value = "2"
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
+
+# https://dronelog.viktorbarzin.me
+module "ingress" {
+  source          = "../../modules/kubernetes/ingress_factory"
+  auth            = "required" # Authentik forward-auth — flight logs are GPS traces of home/travel
+  dns_type        = "proxied"
+  namespace       = kubernetes_namespace.drone_logbook.metadata[0].name
+  name            = "dronelog"
+  service_name    = "drone-logbook"
+  tls_secret_name = var.tls_secret_name
+  extra_annotations = {
+    "gethomepage.dev/enabled"      = "true"
+    "gethomepage.dev/name"         = "Drone Logbook"
+    "gethomepage.dev/description"  = "DJI flight log analyzer"
+    "gethomepage.dev/icon"         = "mdi-quadcopter"
+    "gethomepage.dev/group"        = "Media & Entertainment"
+    "gethomepage.dev/pod-selector" = ""
+  }
+}
+```
+
+- [ ] **Step 3.3: Boilerplate**
+
+```bash
+WT=~/code/infra/.worktrees/drone-logbook   # worktree root; later steps reference $WT
+ln -s ../../secrets "$WT/stacks/drone-logbook/secrets"
+```
+
+- [ ] **Step 3.4: Format check**
+
+```bash
+terraform fmt -check -diff $WT/stacks/drone-logbook/ || terraform fmt $WT/stacks/drone-logbook/
+```
+
+Expected: no diff (or auto-fixed).
+
+- [ ] **Step 3.5: Commit on the branch (files by name, git-crypt filter flags per execution.md)**
+
+```bash
+git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \
+  add docs/plans/2026-07-04-drone-logbook-design.md docs/plans/2026-07-04-drone-logbook-plan.md \
+      stacks/drone-logbook/main.tf stacks/drone-logbook/terragrunt.hcl stacks/drone-logbook/secrets \
+      .claude/reference/service-catalog.md
+git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \
+  commit -m "drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me
+
+Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro
+(fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog).
+Upstream ghcr image with Keel auto-upgrade, DuckDB data on a proxmox-lvm-encrypted PVC,
+NFS /sync-logs drop folder auto-imported every 8h, Authentik-gated ingress,
+PROFILE_CREATION_PASS from Vault via ESO. Design + plan in docs/plans/."
+```
+
+### Task 4: Land and apply
+
+- [ ] **Step 4.1: Presence claim** (CI apply mutates shared infra)
+
+```bash
+~/code/scripts/presence claim infra:drone-logbook --purpose "deploy new drone-logbook stack (Open DroneLog) via CI apply"
+```
+
+- [ ] **Step 4.2: Merge latest master into the branch, push to master**
+
+```bash
+git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false fetch forgejo
+git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false merge forgejo/master
+git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master
+```
+
+Non-fast-forward → another agent landed first: fetch, merge, push again. Branch-protection rejection → fall back to PR via Forgejo API (token = password in `~/.git-credentials`).
+
+- [ ] **Step 4.3: Watch the CI apply to completion** — Woodpecker pipeline on the infra repo (`ci.viktorbarzin.me`), then confirm live:
+
+```bash
+kubectl get ns drone-logbook && kubectl -n drone-logbook get deploy,pvc,pods,externalsecret,cronjob
+kubectl -n drone-logbook rollout status deploy/drone-logbook --timeout=300s
+```
+
+Expected: namespace present, ExternalSecret `SecretSynced`, data PVC `Bound` (the NFS PVCs bind on first pod/job use), CronJob `drone-logbook-backup` scheduled `30 1 * * *`, pod `Running 1/1`.
+
+- [ ] **Step 4.4: Cleanup worktree + branch; release presence**
+
+```bash
+git -C ~/code/infra worktree remove .worktrees/drone-logbook
+git -C ~/code/infra branch -d wizard/drone-logbook
+git -C ~/code/infra pull --ff-only   # only if main checkout clean/quiescent
+~/code/scripts/presence release infra:drone-logbook
+```
+
+### Task 5: End-to-end verification
+
+- [ ] **Step 5.1: Ingress + Authentik gate**
+
+```bash
+curl -sI https://dronelog.viktorbarzin.me | head -5
+```
+
+Expected: `302` redirect into Authentik (NOT `200`, NOT `404`).
+
+- [ ] **Step 5.2: App alive behind the gate** (bypass ingress via port-forward, read-only debug)
+
+```bash
+kubectl -n drone-logbook port-forward svc/drone-logbook 18080:80 &
+sleep 2 && curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:18080/; kill %1   # always stop the port-forward, even if curl fails
+```
+
+Expected: `200`.
+
+- [ ] **Step 5.3: Sync folder visible in-pod**
+
+```bash
+kubectl -n drone-logbook exec deploy/drone-logbook -- ls -ld /sync-logs /data/drone-logbook
+```
+
+Expected: both directories listed; `/sync-logs` read-only mount.
+
+- [ ] **Step 5.4: Monitor + homepage** — Uptime Kuma external monitor for `dronelog.viktorbarzin.me` auto-created (ingress annotation); homepage tile under "Media & Entertainment".
+
+- [ ] **Step 5.5: Functional import** — Viktor uploads a real Mini 4 Pro `.txt` log via the web UI (or drops it in `/srv/nfs/drone-logbook/sync-logs`); confirms flight appears with charts/map. Requires pod egress to DJI once per new log (decryption key). If an upstream sample log is available, the agent may pre-verify import via the REST API through the port-forward.
--- a/docs/plans/2026-07-04-immich-frame-lan-only-design.md
+++ b/docs/plans/2026-07-04-immich-frame-lan-only-design.md
@ -0,0 +1,125 @@
+# immich-frame: LAN-only access, Portals untouched (2026-07-04)
+
+## Goal
+
+Strangers must no longer be able to view `highlights-immich.viktorbarzin.me`
+(Viktor's London Portal Plus frame) or `highlights-immich-emo.viktorbarzin.me`
+(Emo's Sofia Portal Mini frame) — pages or ImmichFrame API. Both were
+`auth = "none"`, Cloudflare-proxied, fully public.
+
+Who keeps access (per Viktor, this session): the two Portals plus **any
+household device on the Sofia, London, or Valchedrym home networks**. No
+public access, no tailnet requirement. Hard constraint: the Portal app is a
+WebView with the URL **baked in at APK build time** (`portal-immich-frame`,
+`-PframeUrl`), so the exact URLs must keep loading from where the Portals sit
+— zero app rebuilds, zero device touches, zero router changes.
+
+## Design
+
+Two cooperating pieces — the gate and the reachability pointer:
+
+1. **The gate — `home-lans-only` Traefik middleware** (traefik stack, next to
+   `local-only`): `ipAllowList` of `192.168.1.0/24` (Sofia LAN), `10.0.0.0/8`
+   (VLANs, K8s pods `10.10.0.0/16`, services `10.96.0.0/12`, WG tunnel
+   `10.3.2.0/24`), `192.168.8.0/24` (London LAN), `192.168.9.0/24` (London
+   GUEST net — post-rollout discovery: the Portal Plus actually leases here,
+   `Portal-75AE8F9C2A8A` = `192.168.9.198`, added same day), `192.168.0.0/24`
+   (Valchedrym LAN), `fc00::/7`, `fe80::/10`. Attached to both frame
+   ingresses via `extra_middlewares`. Everyone else gets a Traefik 403 —
+   including direct-to-WAN-IP requests carrying the right SNI, which DNS
+   changes alone cannot stop. A **separate** middleware rather than a widened
+   `local-only`, because widening would silently grant the remote LANs access
+   to the 9 admin surfaces using it (Prometheus, iDRAC, Loki, …).
+
+2. **The pointer — `dns_type = "internal"`** (new `ingress_factory` tier,
+   Viktor's idea): a **non-proxied public A record → `10.0.20.203`** (module
+   var `internal_lb_ip`). Outsiders resolve it but get an unroutable RFC1918
+   address; every household resolver path delivers a working answer with no
+   config anywhere: Sofia LAN already gets the internal CNAME from Technitium,
+   London/Valchedrym resolve the public record via any upstream and
+   policy-route `10.0.0.0/8` down the WireGuard tunnel. IPv4-only (spokes
+   route no internal v6 range).
+
+Interlock (the reason both flip together): with a *proxied* record, public
+traffic arrives from cloudflared **pod IPs inside 10/8** and would sail
+through the allowlist. `internal` removes the Cloudflare path entirely (CF
+edge stops serving the hostname), so every request reaches Traefik with its
+real source IP (ETP=Local). Verified: no wildcard `*.viktorbarzin.me` record
+exists to resurrect public resolution.
+
+`auth` stays `"none"` — there is still no *user* auth by design (kiosk
+WebView; forward-auth would 302 the device to a login it can't complete, and
+Emo's Google-only account can't log in inside a WebView at all); the
+convention comment now names the ipAllowList as the gate.
+
+### Resulting flows
+
+| Client | Path | Result |
+|---|---|---|
+| Emo's Portal Mini (Sofia LAN) | Technitium CNAME → `.203` direct (unchanged) | allowed (`192.168.1.x`) |
+| Viktor's Portal Plus (London GUEST net) | public A → `10.0.20.203` → WG tunnel | allowed (`192.168.9.x`) |
+| Household browsers (any of the 3 LANs) | same as above | allowed |
+| In-cluster checks (`homelab browser`, blackbox) | CoreDNS → Technitium → `.203` | allowed (pod IP in 10/8) |
+| Stranger, resolves hostname | gets `10.0.20.203` | unroutable |
+| Stranger, hits WAN IP with SNI | pfSense NAT → Traefik (real source IP) | **403** |
+| Stranger, via Cloudflare | no proxied record | CF edge won't serve the host |
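The allow/deny column of the table can be reproduced with a small sketch. It assumes the middleware does plain CIDR membership on the request's source IP; the `HOME_LANS` list is copied from the design text, while `is_allowed` is a hypothetical helper for illustration, not Traefik code.

```python
import ipaddress

# Ranges copied from the home-lans-only description in this design.
HOME_LANS = [
    "192.168.1.0/24",  # Sofia LAN
    "10.0.0.0/8",      # VLANs, K8s pods, services, WG tunnel
    "192.168.8.0/24",  # London LAN
    "192.168.9.0/24",  # London GUEST net (Portal Plus leases here)
    "192.168.0.0/24",  # Valchedrym LAN
    "fc00::/7",
    "fe80::/10",
]
NETWORKS = [ipaddress.ip_network(cidr) for cidr in HOME_LANS]


def is_allowed(source_ip: str) -> bool:
    """Hypothetical model of the gate: True if source_ip is in any listed range."""
    addr = ipaddress.ip_address(source_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so both families can share one list.
    return any(addr in net for net in NETWORKS)
```

Under this model `192.168.9.198` (the Portal Plus) passes and a public address such as `203.0.113.7` does not. Any pod IP in `10.10.0.0/16` also passes, which is the interlock described above: a proxied record would deliver public traffic from cloudflared pod IPs straight through the list.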
+
+### Rejected alternatives
+
+- **ImmichFrame `AuthenticationSecret`** (supported upstream: web input field
+  or `?authsecret=` param + bearer API): real auth from anywhere, but family
+  browsers would face a secret prompt (fails "household devices just work"),
+  the secret leaks into URLs/analytics/APK, and robust rollout needs APK
+  rebuild + USB-adb sideload on both Portals (the Sofia one is high-friction).
+- **Authentik forward-auth / `auth = "public"`**: WebView can't complete SSO
+  (Google blocks WebView logins; session expiry silently bricks an appliance);
+  the anonymous outpost is an audit trail, not a gate.
+- **Remove DNS + London router AdGuardHome rewrites**: works, but adds an
+  out-of-band, un-IaC'd router dependency the internal-IP record makes
+  unnecessary. Kept as documented fallback if resolver-side private-IP
+  filtering ever appears in the London path.
+
+## Pre-verified facts (2026-07-04)
+
+- London Flint 2 DNS chain returns RFC1918 answers unfiltered
+  (`nslookup 10.0.20.203.nip.io 127.0.0.1` on the router → `10.0.20.203`;
+  dnsmasq `rebind_protection '0'`, no AdGuardHome rebind filtering).
+- Technitium already CNAMEs both hostnames → apex → `10.0.20.203`
+  (`technitium-ingress-dns-sync` is ingress-driven, not DNS-record-driven, so
+  the internal answer survives the Cloudflare record swap).
+- Pod CIDR `10.10.0.0/16`, service CIDR `10.96.0.0/12` — inside `10.0.0.0/8`.
+- No public wildcard record in the zone.
+
+## Blast radius & cleanups
+
+- `external_monitor = false` set explicitly on both ingresses: the
+  external-monitor-sync default opt-in would otherwise keep the now-doomed
+  `[External] highlights-immich*` uptime-kuma monitors alive and red. Verify
+  the sync drops them post-apply.
+- rybbit CF worker: `highlights-immich` removed from `SITE_IDS` (`index.js`)
+  and `wrangler.toml` routes — off Cloudflare the route can never fire.
+  Requires a `wrangler deploy` to take effect (route removal is hygiene, not
+  functional).
+- Homepage dashboard link keeps working from LANs (hostname unchanged).
+- Docs updated in the same change: `.claude/CLAUDE.md` (DNS tier +
+  external-monitor mechanism), `AGENTS.md`, `docs/architecture/networking.md`
+  (Internal-IP domains category). The `portal-immich-frame` repo's glossary
+  ("public, login-less URL") updated separately in that repo.
+
+## Failure-mode delta
+
+London frame now depends on the WG tunnel instead of Cloudflare+cloudflared
+(the app self-heals with 5s retries; tunnel-flap modes documented in
+`docs/architecture/vpn.md`). After a Traefik LB renumber, update
+`internal_lb_ip` in the module alongside the split-horizon apex record.
+Cutover window: cached proxied answers keep working ≤ ~5 min TTL, then the
+WebView's own retry picks up the new path.
+
+## Verification & rollback
+
+Verify: public dig → `10.0.20.203` (both hosts); Technitium dig → `.203`;
+curl from devvm (10/8) → 200; external vantage (WebFetch/cloud) → unreachable
+or 403; middleware attached on both ingresses; Emo's frame renders via
+`homelab browser`; London Portal image fetches visible in Traefik access logs
+from `192.168.9.x` (the GUEST net it leases on). Rollback: `git revert` + apply traefik/immich; records
+and middleware chain restore (`allow_overwrite = true` re-adopts the records).
--- a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
+++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
@ -129,3 +129,40 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
  storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
  correct pairing. A famous tool that "does OOM" still has to be proven to fire
  under *your* configuration.
+
+## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed
+
+The soft-cap layer of this design was falsified in production on 2026-07-02
+(~15:42–16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide
+alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside
+t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With
+`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked
+every allocating task of the cgroup in `mem_cgroup_handle_over_high`
+(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`)
+— including the `t3 serve` event loop (~0.5G RSS, pure collateral). The accept
+queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104]
+Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`,
+and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by
+hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G→1.4G
+and the service recovered in seconds with no restart).
+
+The Verification bullet above — a soft-capped balloon "throttled to a crawl,
+making no progress and **harming nothing**" — holds only when the hog is alone
+in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl
+IS the harm: a hog that stabilises below `MemoryMax` never triggers the local
+OOM the design counted on, so the band converts "runaway dies" into "everyone
+in the cgroup stalls forever".
+
+**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work
+cgroup definitions** — `scripts/t3-serve@.service`, the `user-.slice.d`
+drop-in, and `docker.slice` (`setup-devvm.sh` §10a/§10c). A runaway now runs
+unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately
+(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills
+the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom — the layers
+the stress tests actually validated — are unchanged. Applied live via
+`daemon-reload` + runtime `set-property` on the running cgroups; no session
+restarts.
+
+Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max` — it is
+an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill
+beats throttle-and-pray for multi-tenant interactive services.
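The lesson can be restated as a toy decision model. This is a sketch of the post-mortem's reasoning only, not of the kernel's reclaim algorithm; `containment_outcome` and its outcome labels are hypothetical, and the swap-enabled branch is an assumption that anonymous pages become reclaimable once swap exists.

```python
import math

GIB = 1024 ** 3


def containment_outcome(plateau, memory_high, memory_max, swap_max):
    """Toy model: what happens to a cgroup whose hog plateaus at `plateau` bytes.

    memory_high may be math.inf to model MemoryHigh=infinity.
    """
    if plateau >= memory_max:
        return "oom-kill"  # hard cap reached: cgroup OOM fires
    if plateau > memory_high and swap_max == 0:
        # Inside the band with nothing reclaimable: every allocating task
        # in the cgroup is throttled and no OOM ever triggers.
        return "stall"
    if plateau > memory_high:
        return "throttled-reclaimable"  # assumption: swap lets reclaim progress
    return "runs"
```

Fed the incident's numbers (12.35G plateau, 12G high, 16G max, no swap) the model returns `stall`; with `memory_high=math.inf` the same hog just `runs`, and only a process that actually reaches 16G gets the `oom-kill`.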
--- a/docs/runbooks/paperless-mail-ingest.md
+++ b/docs/runbooks/paperless-mail-ingest.md
@ -0,0 +1,135 @@
+# Paperless-ngx Mail Ingest (docs@viktorbarzin.me)
+
+Last updated: 2026-07-03 (initial build)
+
+Forward any email with document attachments to **`docs@viktorbarzin.me`** and
+paperless-ngx ingests the attachments, owned by the paperless account mapped
+from the **sender** (From) address. Built entirely from existing parts: a
+docker-mailserver mailbox + Dovecot sieve, and paperless-ngx's native mail
+consumer (the same machinery as the `utility:` rules).
+
+## Flow
+
+```
+family member forwards email ──> MX ──> docker-mailserver
+    │  postfix virtual: docs@ has an explicit self-alias (extra/aliases.txt),
+    │  so the @domain catch-all (→ spam@, swept by TripIt) does NOT apply
+    ▼
+Dovecot LMTP delivery to docs@
+    │  per-user sieve (docs@viktorbarzin.me.dovecot.sieve): sender NOT in
+    │  allowlist → discard (decision 2026-07-03: unmatched = ignore & delete)
+    ▼
+docs@ INBOX ── paperless-ngx mail task (every 10 min, PAPERLESS_EMAIL_TASK_CRON
+    │          default) applies mail rules in order: filter_from = <sender>
+    │          → consume attachments (ALL parts incl. inline — see design
+    │          notes: Apple Mail marks real PDFs inline), owner = mapped user,
+    │          tag = email-ingest, title = mail subject
+    ▼
+consumed mail is MOVED to the "Processed" IMAP folder (audit trail);
+INBOX stays empty in steady state
+```
+
+## Sender → paperless account map (as built)
+
+| Sender (From)            | Paperless user | Rule            |
+|--------------------------|----------------|-----------------|
+| me@viktorbarzin.me       | root (id 3)    | forward: Viktor (me@) |
+| vbarzin@gmail.com        | root (id 3)    | forward: Viktor (gmail) |
+| viktorbarzin@meta.com    | root (id 3)    | forward: Viktor (meta) |
+| ancaelena98@gmail.com    | anca (id 4)    | forward: Anca   |
+| emil.barzin@gmail.com    | emo (id 7)     | forward: Emo    |
+
+The map lives in **two places by design** — keep them in sync:
+
+1. **Delivery gate (infra, Terraform):**
+   `stacks/mailserver/modules/mailserver/extra/docs-at-viktorbarzin.me.dovecot.sieve`
+   — senders not listed here are discarded at delivery (spam control + the
+   "ignore and delete unmatched" behaviour; paperless cannot express
+   "delete without ingesting", so this must happen before the mailbox).
+2. **Owner map (paperless DB, via API/UI):** one mail rule per sender on the
+   `docs@viktorbarzin.me` mail account. DB-state like workflows — NOT
+   Terraform.
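As a mental model of how the two copies of the map cooperate, the sketch below folds the delivery gate and the owner lookup into one function. The addresses and users come from the table above; `route` is a hypothetical helper, and the case-insensitive match is an assumption about how the sieve and the paperless `filter_from` compare addresses.

```python
from email.utils import parseaddr

# Sender -> paperless user, copied from the table in this runbook.
SENDER_TO_USER = {
    "me@viktorbarzin.me": "root",
    "vbarzin@gmail.com": "root",
    "viktorbarzin@meta.com": "root",
    "ancaelena98@gmail.com": "anca",
    "emil.barzin@gmail.com": "emo",
}


def route(from_header):
    """Return the owning paperless user, or None if the mail would be discarded."""
    _, addr = parseaddr(from_header)  # strips any display name
    # Assumption: both gates compare addresses case-insensitively.
    return SENDER_TO_USER.get(addr.lower())
```

A `None` result corresponds to the sieve discarding the mail at delivery, and a username to the mail rule whose owner receives the documents. If the two real copies drift apart, the sieve wins: a sender missing from the allowlist never reaches the mailbox, whatever the paperless rules say.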
+
+## Add a family member / sender
+
+1. Add the address to the sieve allowlist file above; commit; apply the
+   `mailserver` stack (normal apply is enough — the sieve CM key is not under
+   `ignore_changes`; Reloader restarts the pod).
+2. Clone an existing `forward:` mail rule in the paperless admin UI
+   (Mail → Rules) or via API, changing `filter_from` and the rule **owner**
+   (documents are owned by the rule owner — `assign_owner_from_rule=true`).
+   Keep: action = Move to `Processed`, attachment type = **process all files
+   including inline** (`attachment_type=2` — NOT attachments-only, see design
+   notes), consumption scope = attachments only, tag `email-ingest`, order
+   after the existing rules.
+
+## Operations
+
+- **Trigger a fetch immediately** (instead of waiting ≤10 min):
+  `kubectl -n paperless-ngx exec deploy/paperless-ngx -c paperless-ngx -- s6-setuidgid paperless python3 manage.py mail_fetcher`
+  The `s6-setuidgid paperless` is **required**: `kubectl exec` runs as root, and a
+  root-run fetcher downloads attachments root-owned into the scratch dir, which
+  the celery consumer (uid 1000) then can't read — `PermissionError` on
+  `/tmp/paperless/paperless-mail-*/...`, consume task FAILURE (hit during the
+  2026-07-03 build E2E). The mail correctly stays in INBOX for retry (the move
+  action is a chord callback on successful consumption). Recover: `rm -rf
+  /tmp/paperless/paperless-mail-*` (as root) and let the next scheduled fetch
+  re-process.
+- **Mailbox credentials:** Vault `secret/platform` → `mailserver_accounts`
+  JSON, key `docs@viktorbarzin.me` (also used by the paperless mail account).
+- **Inspect the mailbox:** log in over IMAPS (port 993), e.g. with a
+  `python3 -c` one-liner, to `mailserver.mailserver.svc.cluster.local:993`
+  (in-cluster, from a pod) or `mail.viktorbarzin.me:993` (externally / devvm).
+- **Paperless-side logs:** `kubectl -n paperless-ngx logs deploy/paperless-ngx | grep -i mail`
+  (also Loki, ns `paperless-ngx`). Rule/account state: `GET /api/mail_rules/`,
+  `GET /api/mail_accounts/` with the admin token
+  (k8s secret `paperless-ngx-secrets`, field `api_token`).
+- **Account/mailbox provisioning:** adding/rotating anything in
+  `mailserver_accounts` requires the ConfigMap replace workaround —
+  `scripts/tg apply mailserver -- -replace=module.mailserver.kubernetes_config_map.mailserver_config`
+  — because `postfix-accounts.cf` is under `ignore_changes`
+  (non-deterministic bcrypt; see the module comment).
+
+## Design notes / caveats
+
+- **Why not the catch-all?** Mail to unknown `@viktorbarzin.me` addresses
+  lands in `spam@`, which the TripIt `ingest-plans` CronJob sweeps every
+  15 min: it marks everything `\Seen`, LLM-parses mail from linked senders and
+  replies with ack/failure emails. Forwarded bank statements would get
+  "couldn't parse a trip" replies. `docs@` being a real mailbox bypasses that
+  path entirely; TripIt, the `smoke-test@` roundtrip probe, and `dmarc@` are
+  untouched.
+- **Spoofing:** the sender match is on the From header. Rspamd verifies
+  SPF/DKIM/DMARC on inbound mail, but gmail.com publishes `p=none`, so a
+  crafted spoof could ingest documents into a family member's account. Accepted
+  risk (worst case: unwanted documents appear, visible + deletable in
+  paperless).
+- **Not PDF-only:** any attachment type paperless supports is consumed
+  (PDF, images, Office via the existing tika+gotenberg pipeline).
+- **Inline attachments ARE processed (`attachment_type=2`, flipped
+  2026-07-03):** the rules originally used attachments-only (1) to skip
+  signature logos, but the very first real forward (Apple Mail, Viktor's
+  client) attached the invoice PDF with `Content-Disposition: inline` —
+  paperless matched the rule, consumed nothing, and recorded
+  `PROCESSED_WO_CONSUMPTION` (which, like any ProcessedMail row, blocks that
+  UID from ever being re-processed — delete the row via `manage.py shell` to
+  retry). Trade-off: signature/inline images in forwards may be ingested as
+  junk docs (tagged `email-ingest`, easy to spot). If that gets noisy, add
+  `filter_attachment_filename_exclude` patterns to the rules using the
+  actually-observed junk filenames — do NOT flip back to attachments-only.
+- **No dedicated alerting** (deliberate, 2026-07-03): mail-task errors surface
+  in paperless logs; the mailserver inbound path is covered by
+  `email-roundtrip-monitor`. Revisit if forwards start silently failing.
+- **Workflows:** the global `payslip-webhook` + `claude-mcp-readers
+  auto-permission` workflows fire for mail-ingested docs like any other
+  consumption source (verified pre-build; payslip receiver does its own
+  filtering).
+
+## Rollback
+
+1. Disable/delete the 5 `forward:` mail rules + the `docs@` mail account
+   (paperless admin UI or API).
+2. Revert the infra commit (aliases.txt entry, sieve file, CM key + mount).
+3. Remove `docs@viktorbarzin.me` from Vault `mailserver_accounts`, then apply
+   with the `-replace` workaround above. Mail to docs@ then falls back to the
+   catch-all (spam@) like any unknown address.
--- a/docs/runbooks/t3-drop-attribution.md
+++ b/docs/runbooks/t3-drop-attribution.md
@ -109,10 +109,17 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
 node_memory_SwapFree_bytes{instance="devvm"}
 ```

-Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
-`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` —
-a runaway agent now OOMs alone inside the cgroup instead of taking the box
-(and the WS server) with it.
+Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`):
+per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and
+`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog
+plateauing between high and max never OOMs and the kernel high-throttle stalls
+the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on
+2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch
+`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`,
+`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable).
+A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling
+the WS server with it. Post-mortem addendum:
+`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`.

 ## 4. Known root causes (2026-06-10 investigation)

--- a/docs/runbooks/valia-sites.md
+++ b/docs/runbooks/valia-sites.md
@ -0,0 +1,98 @@
+# Valia sites — add / update / retire
+
+Off-infra static sites authored by Valia (ADR-0018, CONTEXT.md "Valia site").
+Serving: Cloudflare Pages. Freshness: the `valia-sites-sync` CronJob
+(`valia-sites` ns) mirrors each Content folder every 10 minutes and deploys
+only when the folder's manifest hash changed. Registry: `local.sites` in
+`stacks/valia-sites/main.tf` — one entry per site drives everything (Pages
+project, custom domain, public CNAME, internal split-horizon CNAME, sync).
+
+Current sites: `bridge` (ОбУ „Отец Паисий“ — "мост"), `stem95su` (95. СУ STEM
+board).
+
+## Add a site
+
+1. Valia shares the Drive folder with **vbarzin@gmail.com** (viewer is enough —
+   the pipeline is strictly read-only towards Drive).
+2. Get the folder id from its URL (`drive.google.com/drive/folders/<ID>`).
+3. Pick the **English** subdomain name (Viktor's call — CONTEXT.md naming rule).
+4. Add one entry to `local.sites` in `stacks/valia-sites/main.tf`:
+
+   ```hcl
+   <name> = {
+     folder_id  = "<ID>"
+     src_path   = ""            # or "sub/folder" if servable files live deeper
+     entry_file = "index.html"  # or whatever her main HTML file is called
+     manage_dns = true
+   }
+   ```
+
+5. Commit + push; CI applies. Within ~10 min the sync deploys content and the
+   site serves at `https://<name>.viktorbarzin.me` (custom-domain TLS takes
+   ~5–10 min extra on first attach — CF returns 522 for the hostname until
+   then). Internal LAN/VLAN/pod resolution appears when the hourly
+   `technitium-ingress-dns-sync` next runs — trigger it early with:
+   `kubectl create job --from=cronjob/technitium-ingress-dns-sync valia-dns-now -n technitium`
+
+## Content rules (what Valia's folder must look like)
+
+- The **entry file** must exist — the sync stages a copy as `index.html` at
+  deploy time, so `/` works; the original filename keeps working too (deep
+  links survive). If the folder is empty or the entry file is missing, the
+  sync **skips the site and leaves it as-is** (never wipes a live site).
+- Google-native files (Docs/Sheets) are **ignored** (`--drive-skip-gdocs`) —
+  only real files (`.html`, images, …) deploy. Gemini's HTML exports are fine.
+- Per-file limit 25 MB (Cloudflare Pages), 20k files max — far beyond a
+  1-page site.
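The deploy decision (deploy only on a changed manifest hash, and skip rather than wipe when the entry file is missing) could look roughly like the sketch below. It is not the sync script: `manifest_hash` and `should_deploy` are hypothetical names, and hashing sorted paths plus content digests is just one plausible way to build the manifest.

```python
import hashlib


def manifest_hash(files):
    """files: mapping of relative path -> content bytes (hypothetical input shape)."""
    h = hashlib.sha256()
    for path in sorted(files):  # sorted so the hash is order-independent
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(hashlib.sha256(files[path]).digest())
    return h.hexdigest()


def should_deploy(files, entry_file, last_hash):
    """Deploy only if the folder is servable AND its manifest changed."""
    if not files or entry_file not in files:
        return False  # guard: never wipe a live site
    return manifest_hash(files) != last_hash
```

In this model, forcing a redeploy amounts to forgetting `last_hash`, which mirrors deleting the site's key from the `valia-sites-state` ConfigMap.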
+
+## Update a site
+
+Nothing to do: Valia edits the folder, the site follows within ~10 minutes.
+Force it early: `kubectl create job --from=cronjob/valia-sites-sync sync-now -n valia-sites`
+
+## Rename / retire a site
+
+Rename = retire + add (Pages projects can't be renamed). Retire:
+
+1. Delete the entry from `local.sites`; commit + push. TF destroys the public
+   CNAME + custom domain + Pages project; the internal record is removed by
+   the next `technitium-ingress-dns-sync` run (its deletion pass drops any
+   internal `*.pages.dev` CNAME that left the `valia-sites-dns` ConfigMap —
+   scoped so it can never touch non-Pages records).
+2. That's all — no manual DNS cleanup (the pre-ADR-0018 add-only gotcha is
+   fixed by the deletion pass).
+
+## Failure modes / debugging
+
+- **Visibility is failed-Job-only by choice** (ADR-0018): no alerts, no
+  notifications. Check: `kubectl get jobs -n valia-sites | tail`, logs of the
+  last `valia-sites-sync-*` pod.
+- **Drive auth broken** (`FATAL … Drive list failed`): the shared
+  `secret/valia-sites.rclone_conf` token died. The GCP OAuth app
+  (`home-lab-1700868541205`) must stay published to "Production" or refresh
+  tokens expire weekly (same constraint as the old stem95su conf, which this
+  one was copied from). Re-mint and `vault kv patch secret/valia-sites
+  rclone_conf=@…`.
+- **Wrangler auth broken**: `secret/valia-sites.cloudflare_pages_token` is a
+  SCOPED token (Pages Read+Write on the account, id
+  `355d2c9d11579bdad1e9498dafca30d5`) — re-mint via
+  `POST /user/tokens` with the Global API Key (`secret/platform`), patch
+  Vault. Do NOT put the Global API Key in the pod.
+- **Site serves stale content**: check the state CM
+  (`kubectl get cm valia-sites-state -n valia-sites -o yaml`) — deleting a
+  site's key forces a redeploy on the next run.
+- **`GUARD … skipping`** in logs: Valia's folder is empty or the entry file
+  was renamed or removed, so the site deliberately kept its last content. Fix
+  the folder or update `entry_file`.
+
+## History
+
+- stem95su served in-cluster (nginx + NFS + its own rclone CronJob) until
+  2026-07-03, when it was cut over to this pattern and the old stack retired
+  (ADR-0018). The blocking 42.9 MB `stem_video.mp4` was compressed to 21.4 MB
+  (same 1080p, ~2.5 Mbps H.264) and replaced in Valia's folder with Viktor's
+  explicit one-time OK. `secret/stem95su` is superseded by
+  `secret/valia-sites`; `/srv/nfs/stem-site` on the PVE host is a harmless
+  leftover.
+- bridge started as a hand-deployed wrangler experiment (2026-07-03, memory
+  id 7085) and was adopted into the stack the same day.
--- a/docs/runbooks/vault-token-renew-devvm.md
+++ b/docs/runbooks/vault-token-renew-devvm.md
@ -82,33 +82,48 @@ tail -5 ~/.local/state/vault-token-renew.log              # recent results
 A healthy log line looks like:
 `<ts> OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h).

-## Drift guard & recovery
+After an OIDC login you'll instead see, at the next nightly run:
+`<ts> HEALED: re-minted periodic token from foreign dn=oidc-… (revoked N stale periodic token(s))`
+— that's the self-heal working as designed.
+
+## Drift guard & self-heal

 `~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
 overwrites it. Two confirmed clobber vectors:

 1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
-   can't push past the OIDC role's 7-day `token_max_ttl`).
+   can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs
+   prescribe this login before applies, so it recurs — it went unnoticed for
+   weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires
+   weekly".
 2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
   writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
-   **cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for
-   two days — reads worked, writes silently 403'd.
+   **cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days.

-To stop the renewer from silently keeping a foreign token alive, it runs a
-**drift guard** first: it refuses to renew unless the token is
-`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and
-exits non-zero (the systemd unit goes `failed`) rather than renewing someone
-else's token. Symptom in the log:
+Since 2026-07-03 the renewer **self-heals**
+(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token
+it attempts the re-mint **with the clobbering token's own authority** and lets
+Vault's authz decide:

-`<ts> DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...`
+- **Admin-capable clobber (OIDC login)** → re-mints the periodic token,
+  sanity-checks it against the drift guard, atomically replaces
+  `~/.vault-token`, revokes stale `token-devvm-wizard` leftovers
+  (anti-sprawl), logs
+  `HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))`
+  and exits 0. The clobbering token is NOT revoked — it may still back a live
+  login session; it ages out on its own.
+- **Weak clobber (read-only k8s token)** → the mint is denied; logs
+  `DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it`
+  and exits non-zero (unit `failed`). Deliberately loud: this signals a
+  misbehaving agent flow — exactly the 2026-06-05 case.

-**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the
-[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does
-**not** auto-recover (a deliberate scope choice — version-only, no self-heal);
-recovery is the manual re-mint above.
+**Manual recovery** is only needed for the weak-clobber case (the DRIFT log
+line still contains the exact command) — run the
+[mint/re-mint](#mint--re-mint-the-token) block.
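The three outcomes can be summarised as a small decision function. This is a sketch, not the renewer's code: `renew_action` is a hypothetical name, and `mint_succeeds` stands in for the result of actually attempting the re-mint with the foreign token, since the real script lets Vault's authz decide.

```python
def renew_action(display_name, policies, mint_succeeds):
    """Hypothetical model of the nightly run's decision.

    display_name / policies: as returned by a token lookup.
    mint_succeeds: whether a re-mint attempted with this token is permitted.
    """
    ours = display_name == "token-devvm-wizard" and "vault-admin" in policies
    if ours:
        return "renew"  # drift guard passes: plain renewal
    if mint_succeeds:
        return "heal"   # admin-capable clobber: re-mint, swap file, sweep stale tokens
    return "fail"       # weak clobber: log DRIFT, exit non-zero
```

Only the `fail` branch needs the manual re-mint; `heal` is what produces the `HEALED:` log line after an OIDC login.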

 ## Tests

-`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision
-and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
-case). Run: `bash infra/scripts/test-vault-token-renew.sh`.
+`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision,
+the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
+case), and the self-heal's revoke filter (which stale periodic tokens a heal
+may sweep). Run: `bash infra/scripts/test-vault-token-renew.sh`.
--- a/modules/kubernetes/ingress_factory/main.tf
+++ b/modules/kubernetes/ingress_factory/main.tf
@ -127,20 +127,29 @@ variable "anti_ai_scraping" {
 variable "dns_type" {
  type        = string
  default     = "none"
-  description = "Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to public IP), or 'none'"
+  description = <<-EOT
+    Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to
+    public IP), 'internal' (A to the internal Traefik LB IP — resolvable from
+    any resolver but only ROUTABLE from home LANs / WG sites / VPN; the record
+    is a reachability pointer, NOT a gate: pair it with an ipAllowList via
+    extra_middlewares, e.g. traefik-home-lans-only@kubernetescrd, because
+    direct-to-WAN-IP requests with the right SNI still hit Traefik), or 'none'.
+  EOT
  validation {
-    condition     = contains(["proxied", "non-proxied", "none"], var.dns_type)
-    error_message = "dns_type must be 'proxied', 'non-proxied', or 'none'."
+    condition     = contains(["proxied", "non-proxied", "internal", "none"], var.dns_type)
+    error_message = "dns_type must be 'proxied', 'non-proxied', 'internal', or 'none'."
  }
 }

 # Uptime Kuma external monitor: when true, annotate the ingress so the
 # external-monitor-sync CronJob creates a `[External] <name>` monitor pointing
-# at https://<host>. Null means "follow dns_type" — enabled when proxied.
+# at https://<host>. Null means "follow dns_type" — enabled when the ingress
+# has a PUBLIC DNS record (proxied or non-proxied; 'internal' records are not
+# externally reachable, so no external monitor).
 variable "external_monitor" {
  type        = bool
  default     = null
-  description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type == 'proxied')."
+  description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type is 'proxied' or 'non-proxied')."
 }

 variable "external_monitor_name" {
@ -171,6 +180,15 @@ variable "public_ipv6" {
  default = "2001:470:6e:43d::2"
 }

+# Internal Traefik LB IP used by dns_type = "internal" records. Tracks the
+# dedicated MetalLB IP from stacks/traefik (ETP=Local). A future LB renumber
+# must update this default alongside the split-horizon apex record — see
+# docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*.
+variable "internal_lb_ip" {
+  type    = string
+  default = "10.0.20.203"
+}
+
 variable "homepage_group" {
  type    = string
  default = null # auto-detect from namespace
@ -201,8 +219,10 @@ locals {
  )

  # External monitor enabled by default when the ingress has a public DNS
-  # record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
-  effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type != "none")
+  # record (either CF-proxied or direct A/AAAA). 'internal' records resolve
+  # publicly but are unroutable from outside, so they get no external monitor.
+  # Explicit bool overrides.
+  effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type == "proxied" || var.dns_type == "non-proxied")

  # Emit the annotation when effective is true (positive signal), or when the
  # caller explicitly set external_monitor=false (opt-out). When the caller
@ -424,3 +444,19 @@ resource "cloudflare_record" "non_proxied_aaaa" {
  zone_id         = var.cloudflare_zone_id
  allow_overwrite = true
 }
+
+# 'internal': a publicly-resolvable A record carrying the INTERNAL Traefik LB
+# IP. Outsiders resolve it but can't route to it; home-LAN/WG-site/VPN clients
+# reach Traefik directly (the WG spokes policy-route 10.0.0.0/8 through the
+# tunnel), so kiosk devices with baked-in URLs need no DNS overrides anywhere.
+# IPv4-only on purpose: the spokes route no internal IPv6 range.
+resource "cloudflare_record" "internal_a" {
+  count           = var.dns_type == "internal" ? 1 : 0
+  name            = local.dns_name
+  content         = var.internal_lb_ip
+  proxied         = false
+  ttl             = 1
+  type            = "A"
+  zone_id         = var.cloudflare_zone_id
+  allow_overwrite = true
+}
--- a/scripts/t3-serve@.service
+++ b/scripts/t3-serve@.service
@ -21,12 +21,19 @@ WorkingDirectory=/home/%i
 ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
 Restart=on-failure
 RestartSec=5
-# Memory containment (2026-06-10): agent children live in this cgroup; a
-# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm —
-# every >20s stall fires the t3 client watchdog (visible "disconnects") —
-# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally,
-# and forbid swap so stalls can't smear into minutes-long freezes.
-MemoryHigh=12G
+# Memory containment (2026-06-10, amended 2026-07-02): agent children live in
+# this cgroup; a runaway agent (10.8G anon on a 23G host) swap-thrashed the
+# whole devvm — every >20s stall fires the t3 client watchdog (visible
+# "disconnects") — then global-OOMed. Cap the cgroup so a runaway OOMs early
+# and locally, and forbid swap so stalls can't smear into minutes-long freezes.
+# MemoryHigh is DELIBERATELY infinity — do not add a soft band below MemoryMax:
+# with swap=0 a hog that plateaus between high and max is unreclaimable but
+# never OOMs, and the kernel's high-throttle stalls EVERY task in the cgroup
+# (the t3 event loop included) indefinitely. A 12.3G agent ugrep livelocked
+# this unit for ~50min on 2026-07-02 exactly this way. Straight-to-OOM at
+# MemoryMax is the containment; OOMPolicy=continue below keeps the server up.
+# See docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md addendum.
+MemoryHigh=infinity
 MemoryMax=16G
 MemorySwapMax=0
 # Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10
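
Reviewer note: the new comment argues that, with swap disabled, a hog settling between `MemoryHigh` and `MemoryMax` is throttled rather than killed. A minimal sketch of that classification follows, assuming a hypothetical helper `mem_zone` and MiB values chosen for illustration (neither is part of the unit file):

```shell
#!/usr/bin/env bash
# Illustrative only: mem_zone is a hypothetical helper, not part of t3-serve@.service.
# Usage: mem_zone <usage_mib> <high_mib|infinity> <max_mib>
# Prints the zone a cgroup's memory usage falls in.
mem_zone() {
  local usage="$1" high="$2" max="$3"
  if [ "$usage" -ge "$max" ]; then
    echo "oom"          # reached MemoryMax: the cgroup OOM killer acts
  elif [ "$high" != "infinity" ] && [ "$usage" -gt "$high" ]; then
    echo "throttled"    # above MemoryHigh, below MemoryMax: tasks are stalled
  else
    echo "ok"           # no soft limit crossed
  fi
}

# The 2026-07-02 case: a ~12.3G hog (12595 MiB) under a 16G hard cap.
mem_zone 12595 12288 16384      # old settings: MemoryHigh=12G
mem_zone 12595 infinity 16384   # new settings: MemoryHigh=infinity
```

Under the old settings the hog lands in the throttled band and never reaches the hard cap; with `MemoryHigh=infinity` the band does not exist, so the only outcomes are "ok" or a local OOM kill.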
--- a/scripts/test-vault-token-renew.sh
+++ b/scripts/test-vault-token-renew.sh
@ -1,10 +1,11 @@
 #!/usr/bin/env bash
-# Unit tests for the pure drift-guard functions in vault-token-renew.sh.
-# Sources the script (vtr_main is guarded) and exercises the decision logic that
-# decides whether ~/.vault-token is OUR periodic admin token (renew) or a foreign
-# token that clobbered the file (refuse, fail loud). This is exactly the logic
-# whose ABSENCE let the 2026-06-05 woodpecker-token clobber be silently renewed
-# for two days. Run: bash infra/scripts/test-vault-token-renew.sh
+# Unit tests for the pure functions in vault-token-renew.sh.
+# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard
+# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign
+# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker
+# clobber be silently renewed for two days, and (b) the self-heal's revoke
+# filter — which stale token-devvm-wizard tokens a heal may sweep.
+# Run: bash infra/scripts/test-vault-token-renew.sh
 set -uo pipefail
 DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=/dev/null
@ -53,5 +54,21 @@ ok "ours: parse+decide renews"        vtr_drift_ok "$(vtr_display_name "$LOOKUP_
 no "woodpecker: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_WP")"   "$(vtr_policies_csv "$LOOKUP_WP")"
 no "oidc: parse+decide refused"       vtr_drift_ok "$(vtr_display_name "$LOOKUP_OIDC")" "$(vtr_policies_csv "$LOOKUP_OIDC")"

+# --- vtr_accessor: parse accessor out of lookup JSON ---
+LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}'
+eq "accessor parsed"          "acc-new" "$(vtr_accessor "$LOOKUP_NEW")"
+eq "accessor absent -> empty" ""        "$(vtr_accessor '{"data":{"display_name":"x"}}')"
+
+# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard
+# --- tokens are swept; the just-minted token, foreign tokens, and anything with an
+# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe).
+STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}'
+ok "older periodic token is stale"      vtr_is_stale_periodic "$STALE_OURS" "acc-new"
+no "the just-minted token is kept"      vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new"
+no "foreign oidc token never swept"     vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new"
+no "woodpecker token never swept"       vtr_is_stale_periodic "$LOOKUP_WP" "acc-new"
+no "missing accessor never swept"       vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new"
+no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" ""
+
 printf '\n%d passed, %d failed\n' "$pass" "$fail"
 (( fail == 0 ))
--- a/scripts/vault-token-renew.sh
+++ b/scripts/vault-token-renew.sh
@ -45,6 +45,94 @@ vtr_drift_ok() {
  printf ',%s,' "$pols" | grep -q ",$REQUIRED_POLICY," || return 1
 }

+# vtr_accessor <lookup-json> -> the token accessor (empty if absent).
+vtr_accessor() {
+  printf '%s' "$1" | jq -r '.data.accessor // ""'
+}
+
+# vtr_is_stale_periodic <lookup-json> <keep-accessor> -> 0 if this lookup
+# describes one of OUR periodic tokens (display name matches) that is NOT the
+# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise.
+# Name-only on purpose (no policy check): anything named token-devvm-wizard
+# that isn't the current token is garbage from a previous mint. An empty
+# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know
+# which token is current).
+vtr_is_stale_periodic() {
+  local dn acc
+  [ -n "${2:-}" ] || return 1
+  dn=$(vtr_display_name "$1")
+  acc=$(vtr_accessor "$1")
+  [ "$dn" = "$EXPECTED_DN" ] || return 1
+  [ -n "$acc" ] || return 1
+  [ "$acc" != "$2" ]
+}
+
+# vtr_heal <foreign-dn> <log-file> -> 0 if ~/.vault-token was re-minted back to
+# our periodic admin token using the foreign token's own authority, 1 if the
+# heal was denied or failed (caller exits non-zero; the unit goes failed).
+#
+# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md):
+# an OIDC login — which the infra docs prescribe before applies — clobbers
+# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed
+# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the
+# clobbering token itself and let Vault's authz decide — a read-only clobber
+# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud
+# failure, because it signals a misbehaving flow that someone should look at.
+vtr_heal() {
+  local foreign_dn="$1" log="$2"
+  local errf new_token new_info new_dn new_pols new_acc tmp
+  errf=$(mktemp)
+  if ! new_token=$(vault token create -orphan -period=768h \
+        -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
+        -field=token 2>"$errf") || [ -z "$new_token" ]; then
+    printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
+      "$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log"
+    rm -f "$errf"
+    return 1
+  fi
+  rm -f "$errf"
+
+  # Sanity: the minted token must itself pass the drift guard before it may
+  # replace ~/.vault-token.
+  if ! new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then
+    printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \
+      "$(date -Is)" "$new_info" >>"$log"
+    return 1
+  fi
+  new_dn=$(vtr_display_name "$new_info")
+  new_pols=$(vtr_policies_csv "$new_info")
+  if ! vtr_drift_ok "$new_dn" "$new_pols"; then
+    printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \
+      "$(date -Is)" "$new_dn" "$new_pols" >>"$log"
+    return 1
+  fi
+
+  # Atomic replace: mktemp files are 0600 from birth; same-filesystem mv.
+  tmp=$(mktemp "$HOME/.vault-token.XXXXXX") || return 1
+  printf '%s' "$new_token" >"$tmp" || { rm -f "$tmp"; return 1; }
+  mv "$tmp" "$HOME/.vault-token" || { rm -f "$tmp"; return 1; }
+
+  # Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would
+  # otherwise strand the prior periodic ADMIN token server-side for up to 32d.
+  # The clobbering foreign token is deliberately NOT revoked: it may still back
+  # the user's live login session, and it ages out on its own (7d for OIDC).
+  local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0
+  new_acc=$(vtr_accessor "$new_info")
+  if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then
+    while IFS= read -r a; do
+      [ -n "$a" ] || continue
+      a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue
+      if vtr_is_stale_periodic "$a_info" "$new_acc"; then
+        VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1))
+      fi
+    done < <(printf '%s' "$accessors" | jq -r '.[]')
+    sweep="revoked $revoked stale periodic token(s)"
+  fi
+
+  printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \
+    "$(date -Is)" "$foreign_dn" "$sweep" >>"$log"
+}
+
 vtr_main() {
  set -euo pipefail
  export PATH="/usr/local/bin:/usr/bin:/bin:${PATH:-}"
@ -61,16 +149,19 @@ vtr_main() {
  dn=$(vtr_display_name "$info")
  pols=$(vtr_policies_csv "$info")

-  # Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive.
-  # On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token
-  # with a read-only woodpecker token, and this script then silently renewed THAT
-  # for two days — masking the loss of write access. So before renewing, confirm
-  # the token is our periodic admin token; if it has drifted, fail loudly (systemd
-  # marks the unit failed) instead of keeping someone else's token alive.
+  # Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not
+  # keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was
+  # silently renewed for two days, masking lost write access). But detect-only
+  # drift proved worse in practice: an OIDC login — which the infra docs
+  # prescribe before applies — clobbers this file too, and the resulting DRIFT
+  # failures went unnoticed for weeks while access degraded to a 7-day token
+  # (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal):
+  # re-mint the periodic token with the clobbering token's own authority.
+  # Vault's authz keeps the old guarantee — a token that couldn't legitimately
+  # hold vault-admin is denied the mint, and we still fail loud.
  if ! vtr_drift_ok "$dn" "$pols"; then
-    printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
-      "$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log"
-    exit 1
+    vtr_heal "$dn" "$log" || exit 1
+    exit 0
  fi

  # `vault token renew` with no argument renews the calling token (renew-self).
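
Reviewer note: the atomic-replace step in `vtr_heal` is the usual temp-file-then-rename pattern. A minimal standalone sketch, assuming a hypothetical helper named `atomic_write` (not in the script). It checks each step explicitly, because bash ignores `set -e` inside a function that is invoked on the left-hand side of `||`:

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of vault-token-renew.sh.
# Usage: atomic_write <target-path> <content>
atomic_write() {
  local target="$1" content="$2" tmp
  # Create the temp file next to the target so mv is a same-filesystem rename.
  # mktemp creates the file with mode 0600.
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  # Write the content; on failure remove the temp file and report the error.
  printf '%s' "$content" >"$tmp" || { rm -f "$tmp"; return 1; }
  # Rename over the target; readers see either the old or the new file.
  mv "$tmp" "$target" || { rm -f "$tmp"; return 1; }
}

# Example run against a throwaway directory.
demo_dir=$(mktemp -d)
atomic_write "$demo_dir/token" "example-token-value" && echo "written"
rm -rf "$demo_dir"
```

Keeping the temp file in the target's directory is the design choice that matters: a rename within one filesystem is atomic, whereas a temp file under `/tmp` could turn `mv` into a copy.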
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@ -244,9 +244,15 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
 #     virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22).
 #     t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped
 #     user-<uid>.slice (all ssh/tmux work). Design — per user, on BOTH trees:
-#     MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard,
-#     MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at
-#     the ceiling instead), plus fair-share CPU/IO weights.
+#     MemoryMax=16G hard + MemorySwapMax=0 (work never touches disk swap → no
+#     thrash; a runaway is cgroup-OOM-killed locally at the ceiling), plus
+#     fair-share CPU/IO weights.
+#     NO MemoryHigh soft band (removed 2026-07-02; was 12G "throttle to a crawl"):
+#     with swap=0, a hog that PLATEAUS between high and max is unreclaimable but
+#     never OOMs — the kernel parks every task of the cgroup in
+#     mem_cgroup_handle_over_high and the whole tree stalls indefinitely. A 12.3G
+#     agent ugrep livelocked t3-serve@wizard (t3 down ~50min) exactly this way.
+#     Cap-and-kill, never throttle-and-pray — see the post-mortem addendum.
 #     BACKSTOP = earlyoom, NOT systemd-oomd. We first shipped systemd-oomd but it is
 #     INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim
 #     (pgscan rising), and a no-swap anon workload never reclaims — verified live, a
@ -260,12 +266,16 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
 # 10a) per-user caps + fair-share weights on EVERY user-<uid>.slice (ssh/tmux)
 install -d -m 0755 /etc/systemd/system/user-.slice.d
 cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF'
-# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22).
-# Applies to EACH user-<uid>.slice = all of one user's ssh/tmux work. Mirrors the
-# t3-serve@.service caps so a user is bounded in whichever surface they work in.
+# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22;
+# MemoryHigh dropped 2026-07-02). Applies to EACH user-<uid>.slice = all of one
+# user's ssh/tmux work. Mirrors the t3-serve@.service caps so a user is bounded
+# in whichever surface they work in. MemoryHigh stays infinity: with swap=0 a
+# hog plateauing in a high..max band livelocks the entire slice (every ssh/tmux
+# session of that user) instead of dying — straight-to-OOM at MemoryMax is the
+# containment (see post-mortem addendum 2026-07-02).
 [Slice]
 MemoryAccounting=yes
-MemoryHigh=12G
+MemoryHigh=infinity
 MemoryMax=16G
 MemorySwapMax=0
 CPUAccounting=yes
@ -294,12 +304,14 @@ cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF'
 # All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so
 # they share one bounded budget and a runaway container is capped at MemoryMax
 # (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice.
-# setup-devvm.sh §10, 2026-06-22.
+# setup-devvm.sh §10, 2026-06-22; MemoryHigh dropped 2026-07-02 — a container
+# plateauing in the high..max band would throttle-livelock EVERY container in
+# the slice (see post-mortem addendum); MemoryMax OOM is the containment.
 [Unit]
 Description=Docker containers slice (capped)
 [Slice]
 MemoryAccounting=yes
-MemoryHigh=6G
+MemoryHigh=infinity
 MemoryMax=8G
 MemorySwapMax=0
 CPUAccounting=yes
--- a/secrets/nfs_directories.txt
+++ b/secrets/nfs_directories.txt
--- a/stacks/cloudflared/modules/cloudflared/cloudflare.tf
+++ b/stacks/cloudflared/modules/cloudflared/cloudflare.tf
@ -235,6 +235,12 @@ resource "cloudflare_record" "keyserver" {
  zone_id  = var.cloudflare_zone_id
 }

+# bridge.viktorbarzin.me (Cloudflare Pages, "мост" school site) moved to
+# stacks/valia-sites (ADR-0018) — all Valia-site records live there now.
+# State handoff was a manual `tg state rm` (2026-07-03): the CI terraform
+# (<1.7) rejects removed{} blocks even at the stack root, so declarative
+# forget wasn't available. valia-sites imported the live record by id.
+
 # Enable HTTP/3 (QUIC) for Cloudflare-proxied domains
 resource "cloudflare_zone_settings_override" "http3" {
  zone_id = var.cloudflare_zone_id
--- a/stacks/dawarich/main.tf
+++ b/stacks/dawarich/main.tf
@ -16,7 +16,7 @@ resource "kubernetes_namespace" "dawarich" {
    name = "dawarich"
    labels = {
      "istio-injection" : "disabled"
-      tier = local.tiers.edge
+      tier               = local.tiers.edge
      "keel.sh/enrolled" = "true"
    }
  }
@ -330,7 +330,7 @@ resource "kubernetes_deployment" "dawarich" {
  }
  lifecycle {
    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].dns_config,         # KYVERNO_LIFECYCLE_V1
      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
      metadata[0].annotations["keel.sh/policy"],
      metadata[0].annotations["keel.sh/trigger"],
@ -458,6 +458,13 @@ module "ingress" {
  namespace       = kubernetes_namespace.dawarich.metadata[0].name
  name            = "dawarich"
  tls_secret_name = var.tls_secret_name
+  # Rails serves all its fingerprinted assets itself and the map view adds an
+  # API burst per page load — the default 10/50 limiter 429s the asset tail
+  # from a single client IP (and risks dropping OwnTracks/mobile ingestion
+  # POSTs on the same host). Dedicated 100/1000 limiter defined in
+  # stacks/traefik/modules/traefik/middleware.tf.
+  skip_default_rate_limit = true
+  extra_middlewares       = ["traefik-dawarich-rate-limit@kubernetescrd"]
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Dawarich"
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@ -1511,6 +1511,34 @@ resource "null_resource" "pg_instagram_poster_db" {
  }
 }

+# Create tasks database for the tasks PWA (Reminders-style front-end over
+# Nextcloud CalDAV; FastAPI + SvelteKit SPA — see ~/code/tasks). Stores
+# Connected Accounts (Fernet-encrypted Nextcloud app passwords) + sync state.
+# Role password is managed by Vault Database Secrets Engine (static role
+# `pg-tasks`, 7d rotation). Tables are created by alembic on app startup.
+resource "null_resource" "pg_tasks_db" {
+  depends_on = [null_resource.pg_cluster]
+
+  triggers = {
+    db_name  = "tasks"
+    username = "tasks"
+  }
+
+  provisioner "local-exec" {
+    command = <<-EOT
+      PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
+      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
+        bash -c '
+          psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'tasks'"'"'" | grep -q 1 || \
+            psql -U postgres -c "CREATE ROLE tasks WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
+          psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'tasks'"'"'" | grep -q 1 || \
+            psql -U postgres -c "CREATE DATABASE tasks OWNER tasks"
+          psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE tasks TO tasks"
+        '
+    EOT
+  }
+}
+
 # Old PostgreSQL deployment — kept commented for rollback reference
 # resource "kubernetes_deployment" "postgres" {
 #   metadata {
--- a/stacks/drone-logbook/main.tf
+++ b/stacks/drone-logbook/main.tf
@ -0,0 +1,360 @@
+variable "tls_secret_name" {
+  type      = string
+  sensitive = true
+}
+variable "nfs_server" { type = string }
+
+# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog) — self-hosted
+# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the
+# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest.
+# Design: docs/plans/2026-07-04-drone-logbook-design.md
+resource "kubernetes_namespace" "drone_logbook" {
+  metadata {
+    name = "drone-logbook"
+    labels = {
+      tier               = local.tiers.aux
+      "keel.sh/enrolled" = "true"
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
+    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
+  }
+}
+
+resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
+  manifest = {
+    apiVersion = "external-secrets.io/v1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "drone-logbook-secrets"
+      namespace = "drone-logbook"
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "drone-logbook-secrets"
+      }
+      dataFrom = [{
+        extract = {
+          key = "drone-logbook"
+        }
+      }]
+    }
+  }
+  depends_on = [kubernetes_namespace.drone_logbook]
+}
+
+module "tls_secret" {
+  source          = "../../modules/kubernetes/setup_tls_secret"
+  namespace       = kubernetes_namespace.drone_logbook.metadata[0].name
+  tls_secret_name = var.tls_secret_name
+}
+
+# DuckDB database + cached DJI decryption keys + uploaded originals.
+# Embedded DB -> block storage, not NFS (same rationale as freshrss data).
+# Encrypted class: flight logs are GPS traces of home/travel (sensitive data
+# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md).
+resource "kubernetes_persistent_volume_claim" "data" {
+  wait_until_bound = false
+  metadata {
+    name      = "drone-logbook-data-encrypted"
+    namespace = kubernetes_namespace.drone_logbook.metadata[0].name
+    annotations = {
+      "resize.topolvm.io/threshold"     = "10%"
+      "resize.topolvm.io/increase"      = "100%"
+      "resize.topolvm.io/storage_limit" = "10Gi"
+    }
+  }
+  spec {
+    access_modes       = ["ReadWriteOnce"]
+    storage_class_name = "proxmox-lvm-encrypted"
+    resources {
+      requests = {
+        storage = "2Gi"
+      }
+    }
+  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and PVCs
+    # can't shrink; without this every apply tries to revert the size.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
+}
+
+# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands
+# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL.
+module "nfs_sync_logs" {
+  source     = "../../modules/kubernetes/nfs_volume"
+  name       = "drone-logbook-sync-logs"
+  namespace  = kubernetes_namespace.drone_logbook.metadata[0].name
+  nfs_server = var.nfs_server
+  nfs_path   = "/srv/nfs/drone-logbook/sync-logs"
+  storage    = "5Gi"
+}
+
+resource "kubernetes_deployment" "drone_logbook" {
+  metadata {
+    name      = "drone-logbook"
+    namespace = kubernetes_namespace.drone_logbook.metadata[0].name
+    labels = {
+      app                             = "drone-logbook"
+      "kubernetes.io/cluster-service" = "true"
+      tier                            = local.tiers.aux
+    }
+  }
+  spec {
+    replicas = 1
+    strategy {
+      # DuckDB is single-writer; never overlap two pods on the same volume
+      type = "Recreate"
+    }
+    selector {
+      match_labels = {
+        app = "drone-logbook"
+      }
+    }
+    template {
+      metadata {
+        labels = {
+          app                             = "drone-logbook"
+          "kubernetes.io/cluster-service" = "true"
+        }
+      }
+      spec {
+        container {
+          name  = "drone-logbook"
+          image = "ghcr.io/arpanghosh8453/open-dronelog:latest"
+          env {
+            name  = "RUST_LOG"
+            value = "info"
+          }
+          env {
+            # keep re-importable originals under /data/drone-logbook/uploaded
+            name  = "KEEP_UPLOADED_FILES"
+            value = "true"
+          }
+          env {
+            name  = "SYNC_LOGS_PATH"
+            value = "/sync-logs"
+          }
+          env {
+            # 6-field cron (sec min hour dom mon dow): scan drop folder every 8h
+            name  = "SYNC_INTERVAL"
+            value = "0 0 */8 * * *"
+          }
+          env {
+            name = "PROFILE_CREATION_PASS"
+            value_from {
+              secret_key_ref {
+                name = "drone-logbook-secrets"
+                key  = "profile_creation_pass"
+              }
+            }
+          }
+          volume_mount {
+            name       = "data"
+            mount_path = "/data/drone-logbook"
+          }
+          volume_mount {
+            name       = "sync-logs"
+            mount_path = "/sync-logs"
+            read_only  = true
+          }
+          port {
+            name           = "http"
+            container_port = 80
+            protocol       = "TCP"
+          }
+          resources {
+            requests = {
+              cpu    = "25m"
+              memory = "512Mi"
+            }
+            limits = {
+              memory = "512Mi"
+            }
+          }
+        }
+        volume {
+          name = "data"
+          persistent_volume_claim {
+            claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
+          }
+        }
+        volume {
+          name = "sync-logs"
+          persistent_volume_claim {
+            claim_name = module.nfs_sync_logs.claim_name
+          }
+        }
+      }
+    }
+  }
+  depends_on = [kubernetes_manifest.external_secret]
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["kubernetes.io/change-cause"],
+      metadata[0].annotations["deployment.kubernetes.io/revision"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+    ]
+  }
+}
+
+resource "kubernetes_service" "drone_logbook" {
+  metadata {
+    name      = "drone-logbook"
+    namespace = kubernetes_namespace.drone_logbook.metadata[0].name
+    labels = {
+      "app" = "drone-logbook"
+    }
+  }
+
+  spec {
+    selector = {
+      app = "drone-logbook"
+    }
+    port {
+      port        = "80"
+      target_port = "80"
+    }
+  }
+}
+
+# -----------------------------------------------------------------------------
+# Backup — required for every proxmox-lvm(-encrypted) app: daily copy of the
+# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror ->
+# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import
+# windows, so the DuckDB file is quiescent; uploaded originals make even a
+# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the
+# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern.
+# -----------------------------------------------------------------------------
+
+module "nfs_backup" {
+  source     = "../../modules/kubernetes/nfs_volume"
+  name       = "drone-logbook-backup-host"
+  namespace  = kubernetes_namespace.drone_logbook.metadata[0].name
+  nfs_server = var.nfs_server
+  nfs_path   = "/srv/nfs/drone-logbook-backup"
+}
+
+resource "kubernetes_cron_job_v1" "backup" {
+  metadata {
+    name      = "drone-logbook-backup"
+    namespace = kubernetes_namespace.drone_logbook.metadata[0].name
+  }
+  spec {
+    concurrency_policy            = "Replace"
+    failed_jobs_history_limit     = 5
+    schedule                      = "30 1 * * *"
+    starting_deadline_seconds     = 300
+    successful_jobs_history_limit = 3
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 3
+        ttl_seconds_after_finished = 10
+        template {
+          metadata {}
+          spec {
+            affinity {
+              pod_affinity {
+                required_during_scheduling_ignored_during_execution {
+                  label_selector {
+                    match_labels = {
+                      app = "drone-logbook"
+                    }
+                  }
+                  topology_key = "kubernetes.io/hostname"
+                }
+              }
+            }
+            container {
+              name  = "drone-logbook-backup"
+              image = "docker.io/library/alpine"
+              command = ["/bin/sh", "-c", <<-EOT
+                set -euxo pipefail
+                _t0=$(date +%s)
+                now=$(date +"%Y_%m_%d_%H_%M")
+                mkdir -p /backup/$now
+                cp -a /data/. /backup/$now/
+                # Rotate — 30 day retention
+                find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} +
+                _dur=$(($(date +%s) - _t0))
+                _out_bytes=$(du -sb /backup/$now | awk '{print $1}')
+                wget -qO- --post-data "backup_duration_seconds $${_dur}
+                backup_output_bytes $${_out_bytes}
+                backup_last_success_timestamp $(date +%s)
+                " "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true
+              EOT
+              ]
+              volume_mount {
+                name       = "data"
+                mount_path = "/data"
+                read_only  = true
+              }
+              volume_mount {
+                name       = "backup"
+                mount_path = "/backup"
+              }
+            }
+            volume {
+              name = "data"
+              persistent_volume_claim {
+                claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
+              }
+            }
+            volume {
+              name = "backup"
+              persistent_volume_claim {
+                claim_name = module.nfs_backup.claim_name
+              }
+            }
+            dns_config {
+              option {
+                name  = "ndots"
+                value = "2"
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
+
+# https://dronelog.viktorbarzin.me
+module "ingress" {
+  source          = "../../modules/kubernetes/ingress_factory"
+  auth            = "required" # Authentik forward-auth — flight logs are GPS traces of home/travel
+  dns_type        = "proxied"
+  namespace       = kubernetes_namespace.drone_logbook.metadata[0].name
+  name            = "dronelog"
+  service_name    = "drone-logbook"
+  tls_secret_name = var.tls_secret_name
+  extra_annotations = {
+    "gethomepage.dev/enabled"      = "true"
+    "gethomepage.dev/name"         = "Drone Logbook"
+    "gethomepage.dev/description"  = "DJI flight log analyzer"
+    "gethomepage.dev/icon"         = "mdi-quadcopter"
+    "gethomepage.dev/group"        = "Media & Entertainment"
+    "gethomepage.dev/pod-selector" = ""
+  }
+}
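
Reviewer note: the backup CronJob posts three metrics to the Pushgateway as a plain-text body. The sketch below builds that body on its own so it can be inspected without a cluster. The function name `backup_metrics_body` and the sample values are assumptions for illustration; the metric names are the ones used in the job:

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; the CronJob inlines this in its wget call.
# Usage: backup_metrics_body <duration_seconds> <output_bytes> <unix_timestamp>
backup_metrics_body() {
  local dur="$1" bytes="$2" ts="$3"
  # One "name value" pair per line. The Prometheus text format expects every
  # line, including the last one, to end with a newline.
  printf 'backup_duration_seconds %s\nbackup_output_bytes %s\nbackup_last_success_timestamp %s\n' \
    "$dur" "$bytes" "$ts"
}

# Sample values only; the real job derives them from date and du.
backup_metrics_body 12 4096 1751592600
```

Passing the timestamp in as an argument keeps the helper deterministic, which makes the payload easy to check before pointing it at a live Pushgateway.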
--- a/stacks/drone-logbook/secrets
+++ b/stacks/drone-logbook/secrets
@ -0,0 +1 @@
+../../secrets
--- a/stacks/drone-logbook/terragrunt.hcl
+++ b/stacks/drone-logbook/terragrunt.hcl
@ -0,0 +1,8 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
--- a/stacks/excalidraw/main.tf
+++ b/stacks/excalidraw/main.tf
@ -10,7 +10,7 @@ resource "kubernetes_namespace" "excalidraw" {
    name = "excalidraw"
    labels = {
      "istio-injection" : "disabled"
-      tier = local.tiers.aux
+      tier               = local.tiers.aux
      "keel.sh/enrolled" = "true"
    }
  }
@ -45,6 +45,15 @@ resource "kubernetes_deployment" "excalidraw" {
      app  = "excalidraw"
      tier = local.tiers.aux
    }
+    # Keel rolls new ghcr:latest digests (k8s-portal pattern). Values here are
+    # recreate-correct seeds only — the keys are in ignore_changes below, so
+    # the live annotations win on an existing deployment.
+    annotations = {
+      "keel.sh/policy"       = "force"
+      "keel.sh/trigger"      = "poll"
+      "keel.sh/match-tag"    = "true"
+      "keel.sh/pollSchedule" = "@every 5m"
+    }
  }
  spec {
    replicas = 1
@ -67,9 +76,19 @@ resource "kubernetes_deployment" "excalidraw" {
        }
      }
      spec {
+        # GHCR pull secret: the ghcr-credentials Secret in this namespace is
+        # cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy
+        # (allowlisted private-ghcr namespaces only — ADR-0002). Source of
+        # truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf.
+        image_pull_secrets {
+          name = "ghcr-credentials"
+        }
        container {
-          image             = "viktorbarzin/excalidraw-library:v4"
-          image_pull_policy = "IfNotPresent"
+          # ADR-0002: GHA-built (.github/workflows/build-excalidraw.yml),
+          # PRIVATE ghcr; Keel rolls new :latest digests. DockerHub
+          # viktorbarzin/excalidraw-library:v4 is the frozen rollback image.
+          image             = "ghcr.io/viktorbarzin/excalidraw-library:latest"
+          image_pull_policy = "Always"
          name              = "excalidraw"
          port {
            container_port = 8080
@ -107,7 +126,7 @@ resource "kubernetes_deployment" "excalidraw" {
  }
  lifecycle {
    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].dns_config,         # KYVERNO_LIFECYCLE_V1
      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
      metadata[0].annotations["keel.sh/policy"],
      metadata[0].annotations["keel.sh/trigger"],
--- a/stacks/excalidraw/project/README.md
+++ b/stacks/excalidraw/project/README.md
@ -4,18 +4,28 @@ A self-hosted Excalidraw library with per-user drawing storage and management.

 ## Features

- Dashboard to manage all your drawings
+- Dashboard to manage all your drawings (create, open, rename, delete)
 - Per-user storage (via Authentik SSO headers)
- Create, edit, and delete drawings
+- Rename drawings from the dashboard or by clicking the drawing name in the editor
+- Native Excalidraw export via the editor's hamburger menu: "Save to..."
+  (.excalidraw file) and "Export image..." (PNG / SVG / clipboard)
+- Autosave (2s debounce) + manual save (Ctrl+S or menu "Save now")
 - Persistent storage via NFS

 ## Docker Image

 ```
-viktorbarzin/excalidraw-library:v4
+ghcr.io/viktorbarzin/excalidraw-library:latest
 ```

-Available on Docker Hub: https://hub.docker.com/r/viktorbarzin/excalidraw-library
+Built by GitHub Actions (`.github/workflows/build-excalidraw.yml` in the infra
+repo, ADR-0002) on every master push touching `stacks/excalidraw/project/**`;
+tags `:latest` + `:<git-sha>`. The package is PRIVATE — cluster pulls use the
+Kyverno-synced `ghcr-credentials` secret. Keel polls `:latest` and rolls the
+deployment on digest change.
+
+The legacy manually-built DockerHub image `viktorbarzin/excalidraw-library:v4`
+is frozen as the rollback target; nothing pushes to it anymore.

 ## Configuration

@ -39,54 +49,13 @@ Mount a persistent volume to the `DATA_DIR` path. Drawings are stored as `.excal
    └── my-diagram.excalidraw
 ```

+The filename (without extension) is both the drawing ID and its display name;
+renaming a drawing renames the file (`os.Rename`, mtime preserved).
+
 ## Deployment

-### Docker
-
-```bash
-docker run -d \
-  --name excalidraw-rooms \
-  -p 8080:8080 \
-  -v /path/to/storage:/data \
-  viktorbarzin/excalidraw-library:v4
-```
-
-### Kubernetes
-
-```yaml
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: excalidraw
-spec:
-  replicas: 1
-  selector:
-    matchLabels:
-      app: excalidraw
-  template:
-    metadata:
-      labels:
-        app: excalidraw
-    spec:
-      containers:
-        - name: excalidraw
-          image: viktorbarzin/excalidraw-library:v4
-          ports:
-            - containerPort: 8080
-          env:
-            - name: DATA_DIR
-              value: /data
-            - name: PORT
-              value: "8080"
-          volumeMounts:
-            - name: data
-              mountPath: /data
-      volumes:
-        - name: data
-          nfs:
-            server: 192.168.1.127
-            path: /srv/nfs/excalidraw
-```
+Deployed by the `stacks/excalidraw` Terraform stack (namespace `excalidraw`,
+service `draw`, ingress `draw.viktorbarzin.me` with `auth = "required"`).

 ### With Authentik SSO

@ -96,23 +65,7 @@ The application reads user identity from Authentik headers:
 - `X-Authentik-Email` - Displayed in UI
 - `X-Authentik-Name` - Displayed in UI

-Configure your ingress to pass these headers:
-
-```yaml
-annotations:
-  nginx.ingress.kubernetes.io/auth-response-headers: "X-authentik-username,X-authentik-email,X-authentik-name"
-```
-
-## Building
-
-```bash
-# Build the Docker image
-docker build -t excalidraw-library .
-
-# Or build locally
-go build -o excalidraw-library .
-./excalidraw-library
-```
+Requests without `X-Authentik-Username` fall back to the `anonymous` user.

 ## API Endpoints

@ -122,10 +75,25 @@ go build -o excalidraw-library .
 | GET | `/api/drawings` | List all drawings for current user |
 | GET | `/api/drawings/:id` | Get drawing data |
 | PUT | `/api/drawings/:id` | Save drawing |
+| PATCH | `/api/drawings/:id` | Rename drawing — body `{"name": "<new-name>"}`; returns `{"status":"renamed","id":"<new-id>"}`; 409 if the target name exists |
 | DELETE | `/api/drawings/:id` | Delete drawing |
 | GET | `/api/user` | Get current user info |
 | GET | `/draw/:id` | Open drawing in editor |

+On rename, the new name is sanitized server-side to `[a-zA-Z0-9-_]` (other
+characters become `-`; a trailing `.excalidraw` is stripped; a name with nothing
+but `-` left is rejected with 400). Existing IDs are accepted as-is for backward compatibility with API clients.
+
+## Development
+
+```bash
+# Run tests
+go test ./...
+
+# Run locally
+DATA_DIR=/tmp/excalidraw-data go run .
+```
+
 ## License

 MIT
--- a/stacks/excalidraw/project/main.go
+++ b/stacks/excalidraw/project/main.go
@ -9,6 +9,7 @@ import (
 	"net/http"
 	"os"
 	"path/filepath"
+	"regexp"
 	"sort"
 	"strings"
 	"time"
@ -63,6 +64,21 @@ func getUsername(r *http.Request) string {
 	return username
 }

+var invalidNameChars = regexp.MustCompile(`[^a-zA-Z0-9-_]`)
+
+// sanitizeName normalizes a user-supplied drawing name into a safe file ID
+// (same charset the dashboard applies on create). Returns "" if nothing
+// meaningful remains.
+func sanitizeName(name string) string {
+	name = strings.TrimSpace(name)
+	name = strings.TrimSuffix(name, ".excalidraw")
+	name = invalidNameChars.ReplaceAllString(name, "-")
+	if strings.Trim(name, "-") == "" {
+		return ""
+	}
+	return name
+}
+
 // getUserDataDir returns the data directory for a specific user and ensures it exists
 func getUserDataDir(username string) string {
 	userDir := filepath.Join(dataDir, username)
@ -168,6 +184,41 @@ func handleDrawing(w http.ResponseWriter, r *http.Request) {
 		w.Header().Set("Content-Type", "application/json")
 		json.NewEncoder(w).Encode(map[string]string{"status": "saved", "id": id})

+	case http.MethodPatch:
+		var req struct {
+			Name string `json:"name"`
+		}
+		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+			http.Error(w, "Invalid JSON body", http.StatusBadRequest)
+			return
+		}
+		newID := sanitizeName(req.Name)
+		if newID == "" {
+			http.Error(w, "Invalid name", http.StatusBadRequest)
+			return
+		}
+		if _, err := os.Stat(filePath); err != nil {
+			if os.IsNotExist(err) {
+				http.Error(w, "Drawing not found", http.StatusNotFound)
+			} else {
+				http.Error(w, err.Error(), http.StatusInternalServerError)
+			}
+			return
+		}
+		if newID != id {
+			newPath := filepath.Join(userDataDir, newID+".excalidraw")
+			if _, err := os.Stat(newPath); err == nil {
+				http.Error(w, "A drawing with that name already exists", http.StatusConflict)
+				return
+			}
+			if err := os.Rename(filePath, newPath); err != nil {
+				http.Error(w, err.Error(), http.StatusInternalServerError)
+				return
+			}
+		}
+		w.Header().Set("Content-Type", "application/json")
+		json.NewEncoder(w).Encode(map[string]string{"status": "renamed", "id": newID})
+
 	case http.MethodDelete:
 		if err := os.Remove(filePath); err != nil {
 			if os.IsNotExist(err) {
@ -264,6 +315,8 @@ const dashboardHTML = `<!DOCTYPE html>
        .btn:hover { background: #5b4cdb; }
        .btn-danger { background: #e74c3c; }
        .btn-danger:hover { background: #c0392b; }
+        .btn-secondary { background: #3d3d5c; }
+        .btn-secondary:hover { background: #4a4a70; }
        .btn-small { padding: 0.4rem 0.8rem; font-size: 0.85rem; }
        .drawings { display: grid; gap: 1rem; }
        .drawing {
@ -342,11 +395,11 @@ const dashboardHTML = `<!DOCTYPE html>

    <div id="modal" class="modal">
        <div class="modal-content">
-            <h2>New Drawing</h2>
+            <h2 id="modal-title">New Drawing</h2>
            <input type="text" id="drawingName" placeholder="Drawing name..." autofocus>
            <div class="modal-actions">
                <button class="btn" style="background:#444" onclick="hideModal()">Cancel</button>
-                <button class="btn" onclick="createDrawing()">Create</button>
+                <button class="btn" id="modal-confirm" onclick="confirmModal()">Create</button>
            </div>
        </div>
    </div>
@ -369,31 +422,63 @@ const dashboardHTML = `<!DOCTYPE html>
            }
        }

+        function drawingRow(d) {
+            var row = document.createElement('div');
+            row.className = 'drawing';
+
+            var info = document.createElement('div');
+            info.className = 'drawing-info';
+            var nameLink = document.createElement('a');
+            nameLink.className = 'drawing-name';
+            nameLink.href = '/draw/' + encodeURIComponent(d.id);
+            nameLink.textContent = d.name;
+            var meta = document.createElement('div');
+            meta.className = 'drawing-meta';
+            meta.textContent = 'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' +
+                new Date(d.modified).toLocaleTimeString() + ' - ' + formatSize(d.size);
+            info.appendChild(nameLink);
+            info.appendChild(meta);
+
+            var actions = document.createElement('div');
+            actions.className = 'drawing-actions';
+            var open = document.createElement('a');
+            open.className = 'btn btn-small';
+            open.href = '/draw/' + encodeURIComponent(d.id);
+            open.textContent = 'Open';
+            var rename = document.createElement('button');
+            rename.className = 'btn btn-small btn-secondary';
+            rename.textContent = 'Rename';
+            rename.onclick = function() { showRenameModal(d.id); };
+            var del = document.createElement('button');
+            del.className = 'btn btn-small btn-danger';
+            del.textContent = 'Delete';
+            del.onclick = function() { deleteDrawing(d.id); };
+            actions.appendChild(open);
+            actions.appendChild(rename);
+            actions.appendChild(del);
+
+            row.appendChild(info);
+            row.appendChild(actions);
+            return row;
+        }
+
        async function loadDrawings() {
            const resp = await fetch('/api/drawings');
            const drawings = await resp.json();
            const container = document.getElementById('drawings');
+            container.replaceChildren();

            if (!drawings || drawings.length === 0) {
-                container.innerHTML = '<div class="empty">No drawings yet. Create your first one!</div>';
+                var empty = document.createElement('div');
+                empty.className = 'empty';
+                empty.textContent = 'No drawings yet. Create your first one!';
+                container.appendChild(empty);
                return;
            }

-            container.innerHTML = drawings.map(function(d) {
-                return '<div class="drawing">' +
-                    '<div class="drawing-info">' +
-                    '<a href="/draw/' + d.id + '" class="drawing-name">' + d.name + '</a>' +
-                    '<div class="drawing-meta">' +
-                    'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' + new Date(d.modified).toLocaleTimeString() +
-                    ' - ' + formatSize(d.size) +
-                    '</div>' +
-                    '</div>' +
-                    '<div class="drawing-actions">' +
-                    '<a href="/draw/' + d.id + '" class="btn btn-small">Open</a>' +
-                    '<button class="btn btn-small btn-danger" onclick="deleteDrawing(\'' + d.id + '\')">Delete</button>' +
-                    '</div>' +
-                    '</div>';
-            }).join('');
+            drawings.forEach(function(d) {
+                container.appendChild(drawingRow(d));
+            });
        }

        function formatSize(bytes) {
@ -402,18 +487,64 @@ const dashboardHTML = `<!DOCTYPE html>
            return (bytes / (1024 * 1024)).toFixed(1) + ' MB';
        }

-        function showNewModal() {
+        var modalAction = null; // invoked with the input value on confirm
+
+        function showModal(title, confirmLabel, initialValue, action) {
+            document.getElementById('modal-title').textContent = title;
+            document.getElementById('modal-confirm').textContent = confirmLabel;
+            var input = document.getElementById('drawingName');
+            input.value = initialValue || '';
+            modalAction = action;
            document.getElementById('modal').classList.add('active');
-            document.getElementById('drawingName').focus();
+            input.focus();
+            input.select();
+        }
+
+        function showNewModal() {
+            showModal('New Drawing', 'Create', '', createDrawing);
+        }
+
+        function showRenameModal(id) {
+            showModal('Rename Drawing', 'Rename', id, function(value) {
+                renameDrawing(id, value);
+            });
        }

        function hideModal() {
            document.getElementById('modal').classList.remove('active');
            document.getElementById('drawingName').value = '';
+            modalAction = null;
        }

-        async function createDrawing() {
-            var name = document.getElementById('drawingName').value.trim();
+        function confirmModal() {
+            if (modalAction) modalAction(document.getElementById('drawingName').value);
+        }
+
+        async function renameDrawing(id, newName) {
+            newName = (newName || '').trim();
+            if (!newName || newName === id) {
+                hideModal();
+                return;
+            }
+            var resp = await fetch('/api/drawings/' + encodeURIComponent(id), {
+                method: 'PATCH',
+                headers: { 'Content-Type': 'application/json' },
+                body: JSON.stringify({ name: newName })
+            });
+            if (resp.status === 409) {
+                alert('A drawing with that name already exists.');
+                return; // keep the modal open so the user can pick another name
+            }
+            if (!resp.ok) {
+                alert('Rename failed: ' + await resp.text());
+                return;
+            }
+            hideModal();
+            loadDrawings();
+        }
+
+        async function createDrawing(name) {
+            name = (name || '').trim();
            if (!name) {
                name = 'drawing-' + Date.now();
            }
@ -446,7 +577,7 @@ const dashboardHTML = `<!DOCTYPE html>
        }

        document.getElementById('drawingName').addEventListener('keypress', function(e) {
-            if (e.key === 'Enter') createDrawing();
+            if (e.key === 'Enter') confirmModal();
        });

        document.getElementById('modal').addEventListener('click', function(e) {
--- a/stacks/excalidraw/project/main_test.go
+++ b/stacks/excalidraw/project/main_test.go
@ -0,0 +1,249 @@
+package main
+
+import (
+	"encoding/json"
+	"net/http"
+	"net/http/httptest"
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+)
+
+const testDrawing = `{"type":"excalidraw","version":2,"source":"excalidraw-library","elements":[{"id":"e1"}],"appState":{"viewBackgroundColor":"#ffffff"}}`
+
+func setupDataDir(t *testing.T) {
+	t.Helper()
+	dataDir = t.TempDir()
+}
+
+// doDrawing sends a request to handleDrawing for the given user and returns the recorder.
+func doDrawing(t *testing.T, method, id, body, user string) *httptest.ResponseRecorder {
+	t.Helper()
+	var reader *strings.Reader
+	if body == "" {
+		reader = strings.NewReader("")
+	} else {
+		reader = strings.NewReader(body)
+	}
+	req := httptest.NewRequest(method, "/api/drawings/"+id, reader)
+	if user != "" {
+		req.Header.Set("X-Authentik-Username", user)
+	}
+	w := httptest.NewRecorder()
+	handleDrawing(w, req)
+	return w
+}
+
+func listDrawings(t *testing.T, user string) []Drawing {
+	t.Helper()
+	req := httptest.NewRequest(http.MethodGet, "/api/drawings", nil)
+	if user != "" {
+		req.Header.Set("X-Authentik-Username", user)
+	}
+	w := httptest.NewRecorder()
+	handleListDrawings(w, req)
+	if w.Code != http.StatusOK {
+		t.Fatalf("list: expected 200, got %d", w.Code)
+	}
+	var drawings []Drawing
+	if err := json.Unmarshal(w.Body.Bytes(), &drawings); err != nil {
+		t.Fatalf("list: bad JSON: %v", err)
+	}
+	return drawings
+}
+
+func TestPutGetRoundtrip(t *testing.T) {
+	setupDataDir(t)
+	if w := doDrawing(t, http.MethodPut, "foo", testDrawing, "alice"); w.Code != http.StatusOK {
+		t.Fatalf("PUT: expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	w := doDrawing(t, http.MethodGet, "foo", "", "alice")
+	if w.Code != http.StatusOK {
+		t.Fatalf("GET: expected 200, got %d", w.Code)
+	}
+	if w.Body.String() != testDrawing {
+		t.Errorf("GET: content mismatch: %s", w.Body.String())
+	}
+}
+
+func TestGetMissing(t *testing.T) {
+	setupDataDir(t)
+	if w := doDrawing(t, http.MethodGet, "nope", "", "alice"); w.Code != http.StatusNotFound {
+		t.Fatalf("expected 404, got %d", w.Code)
+	}
+}
+
+func TestListDrawings(t *testing.T) {
+	setupDataDir(t)
+	doDrawing(t, http.MethodPut, "one", testDrawing, "alice")
+	doDrawing(t, http.MethodPut, "two", testDrawing, "alice")
+	drawings := listDrawings(t, "alice")
+	if len(drawings) != 2 {
+		t.Fatalf("expected 2 drawings, got %d", len(drawings))
+	}
+	ids := map[string]bool{drawings[0].ID: true, drawings[1].ID: true}
+	if !ids["one"] || !ids["two"] {
+		t.Errorf("unexpected ids: %v", ids)
+	}
+	for _, d := range drawings {
+		if d.Name != d.ID {
+			t.Errorf("name should equal id: %+v", d)
+		}
+	}
+}
+
+func TestDelete(t *testing.T) {
+	setupDataDir(t)
+	doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
+	if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusOK {
+		t.Fatalf("DELETE: expected 200, got %d", w.Code)
+	}
+	if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound {
+		t.Fatalf("GET after delete: expected 404, got %d", w.Code)
+	}
+	if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusNotFound {
+		t.Fatalf("second DELETE: expected 404, got %d", w.Code)
+	}
+}
+
+func TestPerUserIsolation(t *testing.T) {
+	setupDataDir(t)
+	doDrawing(t, http.MethodPut, "secret", testDrawing, "alice")
+	if w := doDrawing(t, http.MethodGet, "secret", "", "bob"); w.Code != http.StatusNotFound {
+		t.Fatalf("bob should not see alice's drawing, got %d", w.Code)
+	}
+	if drawings := listDrawings(t, "bob"); len(drawings) != 0 {
+		t.Fatalf("bob's list should be empty, got %d", len(drawings))
+	}
+}
+
+// --- rename (PATCH) ---
+
+func renameReq(t *testing.T, id, newName, user string) *httptest.ResponseRecorder {
+	t.Helper()
+	return doDrawing(t, http.MethodPatch, id, `{"name":`+jsonQuote(newName)+`}`, user)
+}
+
+// jsonQuote returns s as a JSON string literal so special characters stay valid in the request body.
+func jsonQuote(s string) string {
+	b, _ := json.Marshal(s)
+	return string(b)
+}
+
+func TestRenameSuccess(t *testing.T) {
+	setupDataDir(t)
+	doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
+	w := renameReq(t, "foo", "bar", "alice")
+	if w.Code != http.StatusOK {
+		t.Fatalf("PATCH: expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	var resp map[string]string
+	if err := json.Unmarshal(w.Body.Bytes(), &resp); err != nil {
+		t.Fatalf("PATCH: bad JSON: %v", err)
+	}
+	if resp["id"] != "bar" || resp["status"] != "renamed" {
+		t.Errorf("unexpected response: %v", resp)
+	}
+	if w := doDrawing(t, http.MethodGet, "bar", "", "alice"); w.Code != http.StatusOK || w.Body.String() != testDrawing {
+		t.Errorf("GET new id: code=%d content=%q", w.Code, w.Body.String())
+	}
+	if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound {
+		t.Errorf("GET old id: expected 404, got %d", w.Code)
+	}
+}
+
+func TestRenameConflict(t *testing.T) {
+	setupDataDir(t)
+	doDrawing(t, http.MethodPut, "a", testDrawing, "alice")
+	doDrawing(t, http.MethodPut, "b", testDrawing, "alice")
+	if w := renameReq(t, "a", "b", "alice"); w.Code != http.StatusConflict {
+		t.Fatalf("expected 409, got %d", w.Code)
+	}
+	// both drawings intact
+	for _, id := range []string{"a", "b"} {
+		if w := doDrawing(t, http.MethodGet, id, "", "alice"); w.Code != http.StatusOK {
+			t.Errorf("drawing %q should be intact, got %d", id, w.Code)
+		}
+	}
+}
+
+func TestRenameMissing(t *testing.T) {
+	setupDataDir(t)
+	if w := renameReq(t, "nope", "new", "alice"); w.Code != http.StatusNotFound {
+		t.Fatalf("expected 404, got %d", w.Code)
+	}
+}
+
+func TestRenameSameName(t *testing.T) {
+	setupDataDir(t)
+	doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
+	w := renameReq(t, "foo", "foo", "alice")
+	if w.Code != http.StatusOK {
+		t.Fatalf("same-name rename: expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusOK {
+		t.Errorf("drawing should be intact, got %d", w.Code)
+	}
+}
+
+func TestRenameInvalidNames(t *testing.T) {
+	setupDataDir(t)
+	doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
+	for _, name := range []string{"", "   ", "../..", "---"} {
+		if w := renameReq(t, "foo", name, "alice"); w.Code != http.StatusBadRequest {
+			t.Errorf("rename to %q: expected 400, got %d", name, w.Code)
+		}
+	}
+	// malformed body
+	if w := doDrawing(t, http.MethodPatch, "foo", `{not json`, "alice"); w.Code != http.StatusBadRequest {
+		t.Errorf("malformed body: expected 400, got %d", w.Code)
+	}
+}
+
+func TestRenameSanitization(t *testing.T) {
+	setupDataDir(t)
+	cases := []struct{ in, want string }{
+		{"My Drawing!", "My-Drawing-"},
+		{"net diag.excalidraw", "net-diag"}, // .excalidraw suffix stripped, not mangled
+		{"a/b\\c", "a-b-c"},
+	}
+	for _, c := range cases {
+		doDrawing(t, http.MethodPut, "src", testDrawing, "alice")
+		w := renameReq(t, "src", c.in, "alice")
+		if w.Code != http.StatusOK {
+			t.Errorf("rename to %q: expected 200, got %d: %s", c.in, w.Code, w.Body.String())
+			continue
+		}
+		var resp map[string]string
+		json.Unmarshal(w.Body.Bytes(), &resp)
+		if resp["id"] != c.want {
+			t.Errorf("rename to %q: expected id %q, got %q", c.in, c.want, resp["id"])
+		}
+		// file must be inside the user dir under the sanitized name
+		if _, err := os.Stat(filepath.Join(dataDir, "alice", c.want+".excalidraw")); err != nil {
+			t.Errorf("rename to %q: expected file %q on disk: %v", c.in, c.want, err)
+		}
+		doDrawing(t, http.MethodDelete, resp["id"], "", "alice")
+	}
+}
+
+func TestRenameTraversalStaysInUserDir(t *testing.T) {
+	setupDataDir(t)
+	doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
+	w := renameReq(t, "foo", "../../../etc/passwd", "alice")
+	if w.Code == http.StatusOK {
+		var resp map[string]string
+		json.Unmarshal(w.Body.Bytes(), &resp)
+		if strings.Contains(resp["id"], "/") || strings.Contains(resp["id"], "..") {
+			t.Fatalf("traversal characters survived: %q", resp["id"])
+		}
+		if _, err := os.Stat(filepath.Join(dataDir, "alice", resp["id"]+".excalidraw")); err != nil {
+			t.Fatalf("renamed file escaped user dir: %v", err)
+		}
+	}
+	// nothing outside the data dir
+	if _, err := os.Stat(filepath.Join(dataDir, "..", "etc")); err == nil {
+		t.Fatal("file escaped the data dir")
+	}
+}
--- a/stacks/excalidraw/project/static/editor.html
+++ b/stacks/excalidraw/project/static/editor.html
@ -8,41 +8,41 @@
        * { margin: 0; padding: 0; }
        html, body { width: 100%; height: 100%; overflow: hidden; }
        #root { width: 100%; height: 100%; }
-        .toolbar {
-            position: fixed;
-            top: 10px;
-            left: 10px;
-            z-index: 1000;
+        .top-right-ui {
            display: flex;
+            align-items: center;
            gap: 8px;
-            background: rgba(255,255,255,0.95);
+            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
+        }
+        .top-right-ui a, .top-right-ui button {
+            display: inline-flex;
+            align-items: center;
+            gap: 6px;
            padding: 8px 12px;
+            border: 1px solid transparent;
            border-radius: 8px;
-            box-shadow: 0 2px 8px rgba(0,0,0,0.15);
-        }
-        .toolbar button, .toolbar a {
-            padding: 6px 14px;
-            border: none;
-            border-radius: 6px;
            cursor: pointer;
-            font-size: 14px;
-            background: #6c5ce7;
-            color: white;
+            font-size: 13px;
            text-decoration: none;
-            display: inline-block;
+            box-shadow: 0 1px 4px rgba(0,0,0,0.12);
+            max-width: 40vw;
+            white-space: nowrap;
+            overflow: hidden;
+            text-overflow: ellipsis;
        }
-        .toolbar button:hover, .toolbar a:hover { background: #5b4cdb; }
-        .toolbar .secondary { background: #ddd; color: #333; }
-        .toolbar .secondary:hover { background: #ccc; }
-        .toolbar .title {
-            font-weight: 600;
-            padding: 6px 0;
-            color: #333;
+        .top-right-ui.theme-light a, .top-right-ui.theme-light button {
+            background: #ffffff;
+            color: #1b1b1f;
        }
+        .top-right-ui.theme-dark a, .top-right-ui.theme-dark button {
+            background: #232329;
+            color: #e9ecef;
+        }
+        .top-right-ui button:hover, .top-right-ui a:hover { border-color: #a29bfe; }
        .status {
            position: fixed;
            bottom: 10px;
-            right: 10px;
+            right: 60px;
            padding: 6px 12px;
            background: rgba(0,0,0,0.7);
            color: white;
@ -51,6 +51,7 @@
            z-index: 1000;
            opacity: 0;
            transition: opacity 0.3s;
+            pointer-events: none;
        }
        .status.show { opacity: 1; }
        .loading {
@ -67,11 +68,6 @@
    </style>
 </head>
 <body>
-    <div class="toolbar">
-        <a href="/" class="secondary">Back to Library</a>
-        <span class="title" id="title">Loading...</span>
-        <button onclick="saveDrawing()">Save</button>
-    </div>
    <div id="root">
        <div class="loading">
            <div>Loading Excalidraw...</div>
@ -81,16 +77,33 @@
    <div id="status" class="status">Saved</div>

    <script>
+        // Replaces #root with an error panel (safe DOM methods, no innerHTML).
+        function showFatal(title, detail) {
+            var root = document.getElementById('root');
+            root.replaceChildren();
+            var panel = document.createElement('div');
+            panel.className = 'loading error';
+            var titleEl = document.createElement('div');
+            titleEl.textContent = title;
+            panel.appendChild(titleEl);
+            if (detail) {
+                var detailEl = document.createElement('div');
+                detailEl.style.fontSize = '0.9rem';
+                detailEl.textContent = detail;
+                panel.appendChild(detailEl);
+            }
+            root.appendChild(panel);
+        }
+
        // Get drawing ID from URL path: /draw/{id}
        var pathParts = window.location.pathname.split('/');
        var drawingId = pathParts[pathParts.length - 1] || pathParts[pathParts.length - 2];

        if (!drawingId) {
-            document.getElementById('root').innerHTML = '<div class="loading error">No drawing ID specified</div>';
+            showFatal('No drawing ID specified');
            throw new Error('No drawing ID');
        }

-        document.getElementById('title').textContent = drawingId;
        document.title = drawingId + ' - Excalidraw';

        var excalidrawAPI = null;
@ -159,6 +172,46 @@
            autoSaveTimeout = setTimeout(saveDrawing, 2000);
        }

+        // Renames the current drawing via the API. Returns the new ID, or null
+        // if the rename was cancelled or failed.
+        async function renameCurrentDrawing() {
+            var newName = window.prompt('Rename drawing', drawingId);
+            if (newName === null) return null;
+            newName = newName.trim();
+            if (!newName || newName === drawingId) return null;
+
+            // A pending autosave would resurrect the old file after the rename.
+            clearTimeout(autoSaveTimeout);
+
+            var resp;
+            try {
+                resp = await fetch('/api/drawings/' + drawingId, {
+                    method: 'PATCH',
+                    headers: { 'Content-Type': 'application/json' },
+                    body: JSON.stringify({ name: newName })
+                });
+            } catch (e) {
+                showStatus('Rename failed!');
+                return null;
+            }
+            if (resp.status === 409) {
+                window.alert('A drawing with that name already exists.');
+                return null;
+            }
+            if (!resp.ok) {
+                window.alert('Rename failed: ' + (await resp.text()));
+                return null;
+            }
+            var result = await resp.json();
+            drawingId = result.id;
+            document.title = drawingId + ' - Excalidraw';
+            window.history.replaceState(null, '', '/draw/' + encodeURIComponent(drawingId));
+            showStatus('Renamed');
+            // Flush any unsaved changes to the new file.
+            saveDrawing();
+            return drawingId;
+        }
+
        // Load scripts dynamically
        function loadScript(src) {
            return new Promise(function(resolve, reject) {
@ -197,33 +250,76 @@

                updateLoadStatus('Rendering Excalidraw...');

-                // Create Excalidraw component
+                var e = React.createElement;
+                var MainMenu = ExcalidrawLib.MainMenu;
+
+                // Native default menu items, existence-guarded so a library
+                // update that drops one degrades gracefully.
+                function defaultItem(name) {
+                    var C = MainMenu && MainMenu.DefaultItems && MainMenu.DefaultItems[name];
+                    return C ? e(C, { key: name }) : null;
+                }
+
                function App() {
-                    return React.createElement(ExcalidrawLib.Excalidraw, {
+                    var nameState = React.useState(drawingId);
+                    var name = nameState[0], setName = nameState[1];
+
+                    function onRename() {
+                        renameCurrentDrawing().then(function(newId) {
+                            if (newId) setName(newId);
+                        });
+                    }
+
+                    // The menu is where the native export features live:
+                    // Export = "Save to..." (.excalidraw), SaveAsImage =
+                    // "Export image..." (PNG / SVG / clipboard).
+                    var menu = MainMenu ? e(MainMenu, { key: 'menu' },
+                        e(MainMenu.Item, { key: 'back', onSelect: function() { window.location.href = '/'; } }, 'Back to Library'),
+                        e(MainMenu.Item, { key: 'save', onSelect: saveDrawing }, 'Save now'),
+                        e(MainMenu.Item, { key: 'rename', onSelect: onRename }, 'Rename drawing…'),
+                        MainMenu.Separator ? e(MainMenu.Separator, { key: 'sep1' }) : null,
+                        defaultItem('LoadScene'),
+                        defaultItem('Export'),
+                        defaultItem('SaveAsImage'),
+                        MainMenu.Separator ? e(MainMenu.Separator, { key: 'sep2' }) : null,
+                        defaultItem('ClearCanvas'),
+                        defaultItem('ToggleTheme'),
+                        defaultItem('ChangeCanvasBackground'),
+                        defaultItem('Help')
+                    ) : null;
+
+                    return e(ExcalidrawLib.Excalidraw, {
                        initialData: initialData ? {
                            elements: initialData.elements || [],
                            appState: initialData.appState || {}
                        } : undefined,
+                        UIOptions: { canvasActions: { toggleTheme: true } },
                        excalidrawAPI: function(api) {
                            excalidrawAPI = api;
                            console.log('Excalidraw API ready');
                        },
-                        onChange: onChange
-                    });
+                        onChange: onChange,
+                        renderTopRightUI: function(isMobile, appState) {
+                            return e('div', { className: 'top-right-ui theme-' + (appState.theme || 'light') },
+                                e('a', { key: 'home', href: '/', title: 'Back to Library' }, '← Library'),
+                                e('button', {
+                                    key: 'name',
+                                    title: 'Click to rename',
+                                    onClick: onRename
+                                }, name + ' ✎')
+                            );
+                        }
+                    }, menu);
                }

                var root = ReactDOM.createRoot(document.getElementById('root'));
-                root.render(React.createElement(App));
+                root.render(e(App));

                console.log('Excalidraw rendered successfully');

-            } catch (e) {
-                console.error('Init error:', e);
-                document.getElementById('root').innerHTML =
-                    '<div class="loading error">' +
-                    '<div>Failed to load Excalidraw</div>' +
-                    '<div style="font-size:0.9rem">' + e.message + '</div>' +
-                    '</div>';
+            } catch (err) {
+                console.error('Init error:', err);
+                showFatal('Failed to load Excalidraw', err.message);
            }
        }

--- a/stacks/excalidraw/rbac.tf
+++ b/stacks/excalidraw/rbac.tf
@ -0,0 +1,49 @@
+# emo's Claude → Excalidraw upload RBAC.
+#
+# emo's agent uploads drawings with `kubectl -n excalidraw port-forward svc/draw`
+# + `PUT /api/drawings/<name>` carrying the X-Authentik-Username header (the
+# documented recipe in emo's ~/.claude/CLAUDE.md — the app sits behind Authentik
+# forward-auth, so direct curl gets redirected). His hands-off credential is the
+# chrome-service/emo-browser ServiceAccount kubeconfig (stacks/chrome-service/rbac.tf);
+# its cluster-wide grant (oidc-power-user-readonly) is read-only, so pods/portforward
+# must be granted per namespace. This is the excalidraw-namespace grant
+# (Viktor's call, 2026-07-02; same pattern as the chrome-service one).
+#
+# TRADE-OFF (accepted): port-forward into this namespace bypasses the Authentik
+# ingress and the drawings API trusts the X-Authentik-Username header, so the SA
+# can read/write ANY user's drawings, not only emo's. The namespace runs nothing
+# but the drawings app, and the same class of trade-off was already accepted for
+# the shared browser (CDP reach into Viktor's sessions).
+
+resource "kubernetes_role" "portforward" {
+  metadata {
+    name      = "excalidraw-portforward"
+    namespace = kubernetes_namespace.excalidraw.metadata[0].name
+  }
+  rule {
+    api_groups = [""]
+    resources  = ["pods/portforward"]
+    verbs      = ["create"]
+  }
+}
+
+resource "kubernetes_role_binding" "emo_browser_portforward" {
+  metadata {
+    name      = "emo-browser-portforward"
+    namespace = kubernetes_namespace.excalidraw.metadata[0].name
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.portforward.metadata[0].name
+  }
+  subject {
+    kind = "ServiceAccount"
+    # Defined in stacks/chrome-service/rbac.tf — referenced by name across
+    # stacks, same as that file references the oidc-power-user-readonly
+    # ClusterRole. get/list on pods+services (needed to resolve svc/draw) comes
+    # from the SA's cluster-read binding there.
+    name      = "emo-browser"
+    namespace = "chrome-service"
+  }
+}
--- a/stacks/f1-stream/main.tf
+++ b/stacks/f1-stream/main.tf
@ -166,6 +166,33 @@ resource "kubernetes_deployment" "f1-stream" {
            name  = "DISCORD_CHANNELS"
            value = var.discord_f1_channel_ids
          }
+          # Replays feature (app repo ADR-0002). optional=true so the pod still
+          # starts before the Reddit app credentials exist; the app treats missing
+          # creds as "replays off" (logs "Replays pipeline disabled"). The
+          # ExternalSecret above uses dataFrom.extract on the Vault "f1-stream"
+          # key, so adding reddit_client_id / reddit_client_secret there auto-syncs
+          # them into this Secret — no ExternalSecret change needed, just a pod
+          # restart to pick them up.
+          env {
+            name = "REDDIT_CLIENT_ID"
+            value_from {
+              secret_key_ref {
+                name     = "f1-stream-secrets"
+                key      = "reddit_client_id"
+                optional = true
+              }
+            }
+          }
+          env {
+            name = "REDDIT_CLIENT_SECRET"
+            value_from {
+              secret_key_ref {
+                name     = "f1-stream-secrets"
+                key      = "reddit_client_secret"
+                optional = true
+              }
+            }
+          }
          # Verifier connects to in-cluster headed Chromium pool — see
          # stacks/chrome-service/. Falls back to in-process headless if unset.
          # 2026-06-04: migrated WS (:3000 / path-token) → CDP (:9222 /
--- a/stacks/frigate/main.tf
+++ b/stacks/frigate/main.tf
@ -117,8 +117,9 @@ resource "kubernetes_deployment" "frigate" {
            limits = {
              memory           = "10Gi"
              "nvidia.com/gpu" = "1"
-              # GPU VRAM budget (ADR-0016): detector + ffmpeg decode (~1.9 GiB).
-              "viktorbarzin.me/gpumem" = "2000"
+              # GPU VRAM budget (ADR-0016): detector + ffmpeg decode (~1.9 GiB),
+              # +~250 MiB NVDEC headroom for the vermont-garage camera (ADR-0017).
+              "viktorbarzin.me/gpumem" = "2300"
            }
          }
          env {
--- a/stacks/immich/frame-emo.tf
+++ b/stacks/immich/frame-emo.tf
@ -34,7 +34,7 @@ resource "kubernetes_config_map" "frame_config_emo" {
    Accounts:
        - ImmichServerUrl: http://immich.viktorbarzin.me
          ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]}
-          ImagesFromDays: 730
+          ImagesFromDays: 365
    EOF
  }
 }
@ -73,7 +73,9 @@ resource "kubernetes_deployment" "immich-frame-emo" {
      }
      spec {
        container {
-          image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
+          # immich_v3: upstream compat tag for Immich v3 — see frame.tf for the
+          # full story; repin to a versioned tag once upstream releases v3 support.
+          image = "ghcr.io/immichframe/immichframe:immich_v3"
          name  = "immich-frame-emo"
          resources {
            requests = {
@ -142,14 +144,21 @@ resource "kubernetes_service" "immich-frame-emo" {

 module "ingress_emo" {
  source = "../../modules/kubernetes/ingress_factory"
-  # Photo-frame kiosk display on Emo's Portal — headless browser pulling images
-  # via an Immich API key (no user login). Forward-auth would 302 the device to
-  # Authentik with no way to complete login.
-  # auth = "none": photo-frame kiosk; headless browser with API key; no user login.
-  auth            = "none"
-  dns_type        = "proxied"
-  namespace       = "immich"
-  name            = "highlights-immich-emo"
-  tls_secret_name = var.tls_secret_name
-  service_name    = "immich-frame-emo"
+  # Photo-frame kiosk display on Emo's Portal Mini (Sofia LAN) — WebView
+  # pulling images via an Immich API key; no user login possible on the
+  # device. Same LAN-only gating as frame.tf: home-lans-only ipAllowList +
+  # dns_type "internal" (Emo's Portal already resolves this host internally
+  # via Technitium; the public internal-IP record covers any resolver).
+  # LAN-only design: docs/plans/2026-07-04-immich-frame-lan-only-design.md.
+  # auth = "none": kiosk WebView, no user auth by design; gated by the home-lans-only ipAllowList instead.
+  auth              = "none"
+  dns_type          = "internal"
+  extra_middlewares = ["traefik-home-lans-only@kubernetescrd"]
+  # Not externally reachable — explicit opt-out so external-monitor-sync
+  # drops the old [External] monitor instead of default-opting it back in.
+  external_monitor = false
+  namespace        = "immich"
+  name             = "highlights-immich-emo"
+  tls_secret_name  = var.tls_secret_name
+  service_name     = "immich-frame-emo"
 }
--- a/stacks/immich/frame.tf
+++ b/stacks/immich/frame.tf
@ -69,7 +69,11 @@ resource "kubernetes_deployment" "immich-frame" {
      }
      spec {
        container {
-          image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
+          # immich_v3 is the upstream compat tag for Immich v3 servers — every
+          # versioned release (≤ v1.0.33.0) crashes deserializing v3 API
+          # responses (immichFrame/immichFrame#653). Pin back to a vX.Y.Z.W tag
+          # once a stable release ships v3 support (upstream PR #654).
+          image = "ghcr.io/immichframe/immichframe:immich_v3"
          name  = "immich-frame"
          resources {
            requests = {
@ -138,14 +142,23 @@ resource "kubernetes_service" "immich-frame" {

 module "ingress" {
  source = "../../modules/kubernetes/ingress_factory"
-  # Photo-frame kiosk display — runs in headless browser mode on a TV/frame
-  # device and pulls images via an Immich API key (no user login). Forward-auth
-  # would 302 the device to Authentik with no way to complete login.
-  # auth = "none": Photo-frame kiosk display — headless browser with API key; no user login; forward-auth breaks device automation.
-  auth            = "none"
-  dns_type        = "proxied"
-  namespace       = "immich"
-  name            = "highlights-immich"
-  tls_secret_name = var.tls_secret_name
-  service_name    = "immich-frame"
+  # Photo-frame kiosk display (Viktor's London Portal Plus WebView) — pulls
+  # images via an Immich API key; no user login possible on the device, so
+  # forward-auth would 302 it to Authentik with no way to complete login.
+  # The GATE is network-level: the home-lans-only ipAllowList (Sofia/London/
+  # Valchedrym LANs + 10/8) 403s everyone else, and dns_type "internal"
+  # publishes the Traefik LB IP publicly so the Portal's baked-in URL resolves
+  # from any resolver yet routes only via the home LANs / WG tunnel.
+  # LAN-only design: docs/plans/2026-07-04-immich-frame-lan-only-design.md.
+  # auth = "none": kiosk WebView, no user auth by design; gated by the home-lans-only ipAllowList instead.
+  auth              = "none"
+  dns_type          = "internal"
+  extra_middlewares = ["traefik-home-lans-only@kubernetescrd"]
+  # Not externally reachable — explicit opt-out so external-monitor-sync
+  # drops the old [External] monitor instead of default-opting it back in.
+  external_monitor = false
+  namespace        = "immich"
+  name             = "highlights-immich"
+  tls_secret_name  = var.tls_secret_name
+  service_name     = "immich-frame"
 }
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@ -15,7 +15,7 @@ locals {
 variable "immich_version" {
  type = string
  # Change me to upgrade
-  default = "v2.7.5"
+  default = "v3.0.0"
 }
 variable "proxmox_host" { type = string }
 variable "redis_host" { type = string }
@ -492,7 +492,7 @@ resource "kubernetes_deployment" "immich-postgres" {
      }
      spec {
        container {
-          image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
+          image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
          name  = "immich-postgresql"
          port {
            container_port = 5432
@ -882,7 +882,7 @@ resource "kubernetes_cron_job_v1" "clip-index-prewarm" {
            restart_policy = "Never"
            container {
              name  = "prewarm"
-              image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
+              image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
              # command overrides the postgres entrypoint → runs psql directly.
              command = [
                "psql", "-v", "ON_ERROR_STOP=1", "-c",
@ -964,7 +964,7 @@ resource "kubernetes_cron_job_v1" "immich-search-probe" {
            }
            init_container {
              name  = "measure"
-              image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
+              image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
              command = ["/bin/bash", "-c", <<-EOT
                set -uo pipefail
                OUT=/shared/metrics.prom
--- a/stacks/kyverno/modules/kyverno/ghcr-credentials.tf
+++ b/stacks/kyverno/modules/kyverno/ghcr-credentials.tf
@ -43,6 +43,11 @@ locals {
    # ghcr.io/passionprojectsanca/book-plotter (built by GHA in Anca's repo,
    # under her own org's ghcr). The deployment references the cloned secret.
    "plotting-book",
+    # excalidraw: infra-owned image migrated from manual DockerHub pushes to
+    # PRIVATE ghcr.io/viktorbarzin/excalidraw-library (ADR-0002, built by
+    # .github/workflows/build-excalidraw.yml). The deployment references the
+    # cloned secret.
+    "excalidraw",
  ]
 }

--- a/stacks/mailserver/modules/mailserver/extra/aliases.txt
+++ b/stacks/mailserver/modules/mailserver/extra/aliases.txt
@ -19,3 +19,12 @@ plans@viktorbarzin.me spam@viktorbarzin.me
 # to trips@, or every verification/recovery send is rejected (550 sender). Also
 # routes any inbound trips@ to spam@.
 trips@viktorbarzin.me spam@viktorbarzin.me
+
+# docs@ -> docs@: explicit self-alias for the paperless-ngx ingest MAILBOX
+# (a real account in secret/platform.mailserver_accounts). Without this the
+# @domain catch-all above (Vault-side aliases) rewrites docs@ to spam@ and the
+# mail lands in the TripIt-swept catch-all mailbox instead. Same pattern as
+# me@ -> me@. Delivery-time sender allowlist:
+# docs-at-viktorbarzin.me.dovecot.sieve (mounted as docs@viktorbarzin.me.dovecot.sieve).
+# Runbook: docs/runbooks/paperless-mail-ingest.md
+docs@viktorbarzin.me docs@viktorbarzin.me
--- a/stacks/mailserver/modules/mailserver/extra/docs-at-viktorbarzin.me.dovecot.sieve
+++ b/stacks/mailserver/modules/mailserver/extra/docs-at-viktorbarzin.me.dovecot.sieve
@ -0,0 +1,17 @@
+# Sender allowlist for the paperless-ngx ingest mailbox docs@viktorbarzin.me.
+# Family members forward document emails here; paperless-ngx polls the INBOX
+# over IMAP and maps each sender to a paperless account (1 mail rule per
+# sender). Decision (Viktor, 2026-07-03): mail from any OTHER sender is
+# ignored and deleted — discarded here at LMTP delivery, before paperless
+# ever sees it. This also keeps spam to the guessable address out entirely.
+#
+# Keep this list in sync with the paperless mail rules (the sender -> owner
+# map). Add-a-sender procedure: docs/runbooks/paperless-mail-ingest.md
+if not address :is "from" ["me@viktorbarzin.me",
+                           "vbarzin@gmail.com",
+                           "viktorbarzin@meta.com",
+                           "ancaelena98@gmail.com",
+                           "emil.barzin@gmail.com"] {
+    discard;
+    stop;
+}
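The allowlist decision in the sieve script above can be sketched in Python for illustration. This is a minimal model, not part of the commit: the `should_discard` helper and the `ALLOWED_SENDERS` name are hypothetical, it looks at a single From address only, and it lowercases the address to mirror Sieve's default case-insensitive comparator (`i;ascii-casemap`).

```python
from email.utils import parseaddr

# Sender list copied from the sieve script above; the names below are
# hypothetical and exist only for this illustration.
ALLOWED_SENDERS = {
    "me@viktorbarzin.me",
    "vbarzin@gmail.com",
    "viktorbarzin@meta.com",
    "ancaelena98@gmail.com",
    "emil.barzin@gmail.com",
}


def should_discard(from_header: str) -> bool:
    """Return True when the sieve rule would discard the message.

    Assumption: only the first address of the From header is checked,
    and matching is case-insensitive like Sieve's default comparator.
    """
    _display_name, address = parseaddr(from_header)
    return address.lower() not in ALLOWED_SENDERS


if __name__ == "__main__":
    for header in ("Viktor <me@viktorbarzin.me>", "Spammer <x@example.com>"):
        action = "discard" if should_discard(header) else "keep"
        print(f"{header}: {action}")
```

A check like this could help when adding a sender: the same address has to appear both in the sieve list and in the paperless mail rules, as the script's comment notes.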
--- a/stacks/mailserver/modules/mailserver/main.tf
+++ b/stacks/mailserver/modules/mailserver/main.tf
@ -14,10 +14,15 @@ variable "nfs_server" { type = string }
 locals {
  _account_set   = keys(var.mailserver_accounts)
  _virtual_lines = split("\n", format("%s%s", var.postfix_account_aliases, file("${path.module}/extra/aliases.txt")))
+  # NOTE: the length guard must live in a ternary, not a leading `&&` operand.
+  # Terraform only short-circuits && / || from v1.12; on the older terraform
+  # pinned in the infra-ci image, `split(" ", line)[1]` was still evaluated
+  # for blank/comment lines and failed the whole plan with "Invalid index"
+  # (first hit by CI pipeline #469, 2026-07-03). A conditional expression is
+  # lazy on every terraform version.
  postfix_virtual = join("\n", [
    for line in local._virtual_lines : line
-    if !(
-      length(split(" ", line)) == 2 &&
+    if length(split(" ", line)) != 2 ? true : !(
      contains(local._account_set, split(" ", line)[0]) &&
      contains(local._account_set, split(" ", line)[1]) &&
      split(" ", line)[0] != split(" ", line)[1]
@ -110,6 +115,12 @@ resource "kubernetes_config_map" "mailserver_config" {
    "postfix-main.cf"     = var.postfix_cf
    "postfix-virtual.cf"  = local.postfix_virtual

+    # Per-user Dovecot sieve for the paperless-ngx ingest mailbox: DMS installs
+    # any /tmp/docker-mailserver/<login>.dovecot.sieve at startup. ConfigMap
+    # keys can't contain '@', so the key is sanitized ("-at-") and the
+    # volume_mount below restores the real filename.
+    "docs-at-viktorbarzin.me.dovecot.sieve" = file("${path.module}/extra/docs-at-viktorbarzin.me.dovecot.sieve")
+
    KeyTable      = "mail._domainkey.viktorbarzin.me viktorbarzin.me:mail:/etc/opendkim/keys/viktorbarzin.me-mail.key\n"
    SigningTable  = "*@viktorbarzin.me mail._domainkey.viktorbarzin.me\n"
    TrustedHosts  = "127.0.0.1\nlocalhost\n"
@ -404,6 +415,12 @@ resource "kubernetes_deployment" "mailserver" {
            sub_path   = "postfix-virtual.cf"
            read_only  = true
          }
+          volume_mount {
+            name       = "config"
+            mount_path = "/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve"
+            sub_path   = "docs-at-viktorbarzin.me.dovecot.sieve"
+            read_only  = true
+          }
          volume_mount {
            name       = "config"
            mount_path = "/tmp/docker-mailserver/fetchmail.cf"
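The `postfix_virtual` filter in the locals hunk of this file can be modelled in Python to show which lines survive. This is a sketch of the expression as it reads in the diff, not the Terraform itself: `filter_virtual` and the sample data are hypothetical. The length check runs first, so blank and comment lines are kept without their second field ever being indexed, which is the job the ternary guard does in the Terraform expression.

```python
def filter_virtual(lines, accounts):
    """Keep every line except 'a b' pairs where both sides are real
    accounts and a != b (as the Terraform condition reads)."""
    kept = []
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 2:
            # Blank lines, comments, multi-word lines: keep untouched and
            # never index parts[1] (the lazy branch of the ternary).
            kept.append(line)
            continue
        source, target = parts
        if source in accounts and target in accounts and source != target:
            continue  # account -> different account: dropped
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    accounts = {"docs@viktorbarzin.me", "me@viktorbarzin.me"}
    lines = [
        "",
        "# a comment line",
        "docs@viktorbarzin.me docs@viktorbarzin.me",
        "me@viktorbarzin.me docs@viktorbarzin.me",
        "trips@viktorbarzin.me spam@viktorbarzin.me",
    ]
    print(filter_virtual(lines, accounts))
```

In this model the self-alias `docs@ -> docs@` is kept because source and target are equal, while an alias between two different real accounts is filtered out.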
--- a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
+++ b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
@ -60,6 +60,10 @@ locals {
    # t3 dispatch probe surface (auth="none" path carve-out on /probe): WS echo
    # + healthz for the t3-probe drop-attribution client (stacks/t3code).
    "t3-probe-ws" = "https://t3.viktorbarzin.me/probe/healthz"
+    # tasks PWA icons + manifest (auth="none" path carve-out, stacks/tasks
+    # module.ingress_icons): macOS/iOS/Android icon fetchers carry no session
+    # cookies, so an Authentik 302 here breaks Add-to-Dock icons.
+    "tasks-icons" = "https://tasks.viktorbarzin.me/apple-touch-icon.png"
    # NOTE: openclaw task-webhook (auth="none") is intentionally NOT probed — it
    # has no public DNS record (NXDOMAIN, external_monitor=false), so there is no
    # externally GET-able URL to probe. Its carve-out is internal-only.
--- a/stacks/rybbit/worker/index.js
+++ b/stacks/rybbit/worker/index.js
@ -18,7 +18,6 @@ const SITE_IDS = {
  "stacks.viktorbarzin.me": "b38fda4285df",
  "f1.viktorbarzin.me": "7e69786f66d5",
  "frigate.viktorbarzin.me": "0d4044069ff5",
-  "highlights-immich.viktorbarzin.me": "602167601c6b",
  "immich.viktorbarzin.me": "35eedb7a3d2b",
  "mail.viktorbarzin.me": "082f164faa7d",
  "navidrome.viktorbarzin.me": "8a3844ff75ba",
--- a/stacks/rybbit/worker/wrangler.toml
+++ b/stacks/rybbit/worker/wrangler.toml
@ -28,7 +28,6 @@ routes = [
  { pattern = "stacks.viktorbarzin.me/*",               zone_name = "viktorbarzin.me" },
  { pattern = "f1.viktorbarzin.me/*",                   zone_name = "viktorbarzin.me" },
  { pattern = "frigate.viktorbarzin.me/*",              zone_name = "viktorbarzin.me" },
-  { pattern = "highlights-immich.viktorbarzin.me/*",    zone_name = "viktorbarzin.me" },
  { pattern = "immich.viktorbarzin.me/*",               zone_name = "viktorbarzin.me" },
  { pattern = "mail.viktorbarzin.me/*",                 zone_name = "viktorbarzin.me" },
  { pattern = "navidrome.viktorbarzin.me/*",            zone_name = "viktorbarzin.me" },
--- a/stacks/stem95su/gdrive-sync.tf
+++ b/stacks/stem95su/gdrive-sync.tf
@ -1,122 +0,0 @@
-# Automatic Google Drive -> site sync (added 2026-06-09; supersedes the
-# earlier on-demand-only model now that content is actively maintained).
-#
-# A CronJob mirrors the READ-ONLY Drive folder "claude" (servable content in
-# subfolder "stem claude/files/") onto the NFS content volume every 10 min via
-# rclone. rclone is delta-aware: an unchanged run lists ~33 files' metadata and
-# transfers nothing, so the schedule is cheap (not a 24MB re-download). nginx
-# keeps serving the same volume read-only; updates appear within ~5s (actimeo).
-#
-# Drive is treated strictly READ-ONLY: scope=drive.readonly and rclone only ever
-# reads the remote (sync gdrive: -> /data), never writes back.
-#
-# TOKEN LONGEVITY: the GCP OAuth app (project home-lab-1700868541205) MUST be
-# published to "Production" or its refresh token expires ~weekly and this job
-# fails. After publishing, re-mint the token and refresh
-# `secret/stem95su.rclone_conf`. A failed run surfaces as a failed Job.
-
-resource "kubernetes_manifest" "rclone_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
-  manifest = {
-    apiVersion = "external-secrets.io/v1"
-    kind       = "ExternalSecret"
-    metadata = {
-      name      = "stem95su-rclone"
-      namespace = kubernetes_namespace.stem95su.metadata[0].name
-    }
-    spec = {
-      refreshInterval = "1h"
-      secretStoreRef = {
-        name = "vault-kv"
-        kind = "ClusterSecretStore"
-      }
-      target = { name = "stem95su-rclone" }
-      data = [{
-        secretKey = "rclone.conf"
-        remoteRef = {
-          key      = "stem95su"
-          property = "rclone_conf"
-        }
-      }]
-    }
-  }
-  depends_on = [kubernetes_namespace.stem95su]
-}
-
-resource "kubernetes_cron_job_v1" "gdrive_sync" {
-  metadata {
-    name      = "stem95su-gdrive-sync"
-    namespace = kubernetes_namespace.stem95su.metadata[0].name
-    labels    = { run = "stem95su", component = "gdrive-sync" }
-  }
-  spec {
-    schedule                      = "*/10 * * * *"
-    concurrency_policy            = "Forbid"
-    successful_jobs_history_limit = 2
-    failed_jobs_history_limit     = 3
-    job_template {
-      metadata {}
-      spec {
-        backoff_limit              = 1
-        ttl_seconds_after_finished = 86400
-        template {
-          metadata { labels = { run = "stem95su", component = "gdrive-sync" } }
-          spec {
-            restart_policy = "OnFailure"
-            container {
-              name  = "rclone"
-              image = "docker.io/rclone/rclone:1.74.3"
-              # Mirror Drive folder -> /data. Guard: hard-fail on auth/list error
-              # (so an expired token is visible); skip quietly if the source is
-              # empty / missing the dashboard (never wipe the live site);
-              # --max-delete caps catastrophic deletes from a partial listing.
-              command = ["/bin/sh", "-c", <<-EOT
-                set -eu
-                cp /config/rclone.conf /tmp/rc.conf
-                SRC="gdrive:stem claude/files"
-                LIST=$(rclone --config /tmp/rc.conf lsf "$SRC" --files-only) || { echo "FATAL: Drive list failed (auth/network)"; exit 1; }
-                N=$(printf '%s\n' "$LIST" | grep -c . || true)
-                if [ "$N" -lt 1 ] || ! printf '%s\n' "$LIST" | grep -qx "stem_board.html"; then
-                  echo "GUARD: source N=$N / stem_board.html missing -- skipping, site untouched"; exit 0
-                fi
-                echo "source OK ($N files) -- mirroring to /data"
-                rclone --config /tmp/rc.conf sync "$SRC" /data --exclude ".DS_Store" --fast-list --transfers 4 --max-delete 25 -v
-              EOT
-              ]
-              resources {
-                requests = { cpu = "10m", memory = "64Mi" }
-                limits   = { memory = "192Mi" }
-              }
-              volume_mount {
-                name       = "rclone-config"
-                mount_path = "/config"
-                read_only  = true
-              }
-              volume_mount {
-                name       = "content"
-                mount_path = "/data"
-              }
-            }
-            volume {
-              name = "rclone-config"
-              secret { secret_name = "stem95su-rclone" }
-            }
-            volume {
-              name = "content"
-              persistent_volume_claim {
-                claim_name = module.nfs_content.claim_name
-              }
-            }
-          }
-        }
-      }
-    }
-  }
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
-  }
-  depends_on = [kubernetes_manifest.rclone_external_secret]
-}
--- a/stacks/stem95su/main.tf
+++ b/stacks/stem95su/main.tf
@ -1,173 +1,9 @@
-# STEM educational platform for 95. СУ „Проф. Иван Шишманов" (Sofia).
-# Public, open static site at stem95su.viktorbarzin.me. Self-contained HTML
-# pages + media authored externally (Gemini exports), served by a stock nginx
-# straight off the PVE host NFS — NOT baked into an image, so content can be
-# updated out-of-band (Nextcloud "PVE NFS Pool" or rsync to /srv/nfs/stem-site)
-# without a rebuild. Auto-backed-up offsite by the existing nfs-mirror job.
-
-resource "kubernetes_namespace" "stem95su" {
-  metadata {
-    name = "stem95su"
-    labels = {
-      "istio-injection" : "disabled"
-      tier = local.tiers.aux
-    }
-  }
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
-    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
-  }
-}
-
-module "tls_secret" {
-  source          = "../../modules/kubernetes/setup_tls_secret"
-  namespace       = kubernetes_namespace.stem95su.metadata[0].name
-  tls_secret_name = var.tls_secret_name
-}
-
-# Content lives on the PVE host NFS. NOTE: the nfs_volume module creates only
-# the K8s PV+PVC — the export subdir (/srv/nfs/stem-site) must already exist on
-# 192.168.1.127 or the pod fails to mount (mount.nfs exit 32). It is created
-# during deploy and re-created on demand if ever lost.
-module "nfs_content" {
-  source       = "../../modules/kubernetes/nfs_volume"
-  name         = "stem95su-content"
-  namespace    = kubernetes_namespace.stem95su.metadata[0].name
-  nfs_server   = var.nfs_server
-  nfs_path     = "/srv/nfs/stem-site"
-  storage      = "1Gi"
-  access_modes = ["ReadWriteMany"]
-}
-
-# Minimal nginx server block: serve the static dir, with the dashboard
-# (stem_board.html) as the directory index so "/" loads the platform home.
-# All other pages/assets are reached by their exact filenames (the dashboard
-# links to them by name — those must not be renamed).
-resource "kubernetes_config_map" "nginx_conf" {
-  metadata {
-    name      = "stem95su-nginx-conf"
-    namespace = kubernetes_namespace.stem95su.metadata[0].name
-  }
-  data = {
-    "default.conf" = <<-EOT
-      server {
-          listen       80;
-          server_name  _;
-          root   /usr/share/nginx/html;
-          index  stem_board.html index.html;
-      }
-    EOT
-  }
-}
-
-resource "kubernetes_deployment" "stem95su" {
-  metadata {
-    name      = "stem95su"
-    namespace = kubernetes_namespace.stem95su.metadata[0].name
-    labels = {
-      run  = "stem95su"
-      tier = local.tiers.aux
-    }
-  }
-  spec {
-    replicas = 1
-    selector {
-      match_labels = {
-        run = "stem95su"
-      }
-    }
-    template {
-      metadata {
-        labels = {
-          run = "stem95su"
-        }
-      }
-      spec {
-        container {
-          image = "nginx:1.28-alpine"
-          name  = "nginx"
-          resources {
-            limits = {
-              memory = "64Mi"
-            }
-            requests = {
-              cpu    = "10m"
-              memory = "64Mi"
-            }
-          }
-          port {
-            container_port = 80
-          }
-          volume_mount {
-            name       = "content"
-            mount_path = "/usr/share/nginx/html"
-            read_only  = true
-          }
-          volume_mount {
-            name       = "nginx-conf"
-            mount_path = "/etc/nginx/conf.d"
-            read_only  = true
-          }
-          readiness_probe {
-            http_get {
-              path = "/"
-              port = 80
-            }
-            initial_delay_seconds = 3
-            period_seconds        = 10
-          }
-        }
-        volume {
-          name = "content"
-          persistent_volume_claim {
-            claim_name = module.nfs_content.claim_name
-          }
-        }
-        volume {
-          name = "nginx-conf"
-          config_map {
-            name = kubernetes_config_map.nginx_conf.metadata[0].name
-          }
-        }
-      }
-    }
-  }
-  lifecycle {
-    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
-    ]
-  }
-}
-
-resource "kubernetes_service" "stem95su" {
-  metadata {
-    name      = "stem95su"
-    namespace = kubernetes_namespace.stem95su.metadata[0].name
-    labels = {
-      run = "stem95su"
-    }
-  }
-  spec {
-    selector = {
-      run = "stem95su"
-    }
-    port {
-      name        = "http"
-      port        = "80"
-      target_port = "80"
-    }
-  }
-}
-
-module "ingress" {
-  source = "../../modules/kubernetes/ingress_factory"
-  # auth = "none": public static educational site for 95. СУ, open to the internet by design — CrowdSec + ai-bot-block gate bots; no login.
-  auth            = "none"
-  namespace       = kubernetes_namespace.stem95su.metadata[0].name
-  name            = "stem95su"
-  service_name    = kubernetes_service.stem95su.metadata[0].name
-  port            = "80"
-  host            = "stem95su"
-  dns_type        = "proxied"
-  tls_secret_name = var.tls_secret_name
-}
+# stem95su moved OFF-INFRA to Cloudflare Pages (ADR-0018 cutover, 2026-07-03) —
+# registry entry `stem95su` in stacks/valia-sites; runbook
+# docs/runbooks/valia-sites.md. This stack intentionally declares NOTHING:
+# the apply that landed this file destroyed the old in-cluster serving
+# (nginx + NFS content PVC + ingress + per-site gdrive-sync CronJob +
+# namespace). Directory kept only so the destroy could run through CI —
+# safe to delete the dir + its PG state schema in a later cleanup.
+# Harmless leftovers (manual cleanup if ever wanted): /srv/nfs/stem-site on
+# the PVE host, and Vault secret/stem95su (superseded by secret/valia-sites).
--- a/stacks/stem95su/variables.tf
+++ b/stacks/stem95su/variables.tf
@ -1,9 +0,0 @@
-variable "tls_secret_name" {
-  type      = string
-  sensitive = true
-}
-
-variable "nfs_server" {
-  type    = string
-  default = "192.168.1.127"
-}
--- a/stacks/tasks/imports.tf
+++ b/stacks/tasks/imports.tf
@ -0,0 +1,53 @@
+# One-shot adoption of the live tasks-stack resources that exist in-cluster but
+# were never persisted to Terraform state: pipeline 477 (2026-07-03, the stack's
+# first apply) died mid-`[tasks] apply` — after creating the resources, before
+# the pg backend write — so `tasks.states` stayed empty and every later apply
+# would create-fail with `namespaces "tasks" already exists` (same class as the
+# monitoring alert-digest adoption in stacks/monitoring/imports.tf). Importing
+# reconciles them into state so `terraform apply` UPDATES instead of failing to
+# create. These blocks are idempotent (a no-op once the resources are in state)
+# and may be removed after the next green apply. Defs: main.tf.
+# (module.ingress_icons is deliberately NOT here — it does not exist live yet;
+# the same apply creates it.)
+
+import {
+  to = kubernetes_namespace.tasks
+  id = "tasks"
+}
+
+import {
+  to = kubernetes_manifest.external_secret
+  id = "apiVersion=external-secrets.io/v1,kind=ExternalSecret,namespace=tasks,name=tasks-secrets"
+}
+
+import {
+  to = kubernetes_manifest.db_external_secret
+  id = "apiVersion=external-secrets.io/v1,kind=ExternalSecret,namespace=tasks,name=tasks-db-creds"
+}
+
+import {
+  to = kubernetes_deployment.tasks
+  id = "tasks/tasks"
+}
+
+import {
+  to = kubernetes_service.tasks
+  id = "tasks/tasks"
+}
+
+import {
+  to = kubernetes_network_policy_v1.tasks_ingress
+  id = "tasks/tasks-ingress"
+}
+
+import {
+  to = module.ingress.kubernetes_ingress_v1.proxied-ingress
+  id = "tasks/tasks"
+}
+
+# Cloudflare record ID looked up via the API (zone fd2c5dd4… / record for
+# tasks.viktorbarzin.me, CNAME → the cfargotunnel target, proxied).
+import {
+  to = module.ingress.cloudflare_record.proxied[0]
+  id = "fd2c5dd4efe8fe38958944e74d0ced6d/a8e6901a074c5255d09700d93eaaf705"
+}
--- a/stacks/tasks/main.tf
+++ b/stacks/tasks/main.tf
@ -0,0 +1,378 @@
+variable "image_tag" {
+  type        = string
+  default     = "latest"
+  description = "tasks image tag. Running tag is set by the Woodpecker deploy (kubectl set image)."
+}
+
+variable "postgresql_host" { type = string }
+
+variable "tls_secret_name" {
+  type      = string
+  sensitive = true
+}
+
+locals {
+  namespace = "tasks"
+  # ADR-0002: built on GHA from the public GitHub mirror, pushed to ghcr
+  # (public package — anonymous pulls). Running tag is managed by the
+  # Woodpecker deploy (kubectl set image); the image ref below is
+  # ignore_changes'd (KEEL_IGNORE_IMAGE), so this base only matters on
+  # (re)create.
+  image = "ghcr.io/viktorbarzin/tasks:${var.image_tag}"
+  labels = {
+    app = "tasks"
+  }
+}
+
+resource "kubernetes_namespace" "tasks" {
+  metadata {
+    name = local.namespace
+    labels = {
+      tier              = local.tiers.aux
+      "istio-injection" = "disabled"
+      # Opt into Keel auto-update (inject-keel-annotations ClusterPolicy).
+      "keel.sh/enrolled" = "true"
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label.
+    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
+  }
+}
+
+# App secrets — seed these in Vault before applying:
+#   secret/tasks
+#     fernet_key — Fernet key encrypting the per-user Nextcloud app passwords
+#                  stored in the Connected Accounts table (tasks ADR-0002).
+#
+# DB: CNPG database `tasks` (created in dbaas, null_resource.pg_tasks_db);
+# role password managed via the Vault database engine — see
+# static-creds/pg-tasks. Alembic runs migrations on app startup (no init
+# container needed).
+resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
+  manifest = {
+    apiVersion = "external-secrets.io/v1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "tasks-secrets"
+      namespace = local.namespace
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "tasks-secrets"
+        template = {
+          metadata = {
+            annotations = {
+              "reloader.stakater.com/match" = "true"
+            }
+          }
+        }
+      }
+      data = [
+        { secretKey = "TASKS_FERNET_KEY", remoteRef = { key = "tasks", property = "fernet_key" } },
+      ]
+    }
+  }
+  depends_on = [kubernetes_namespace.tasks]
+}
+
+# DB credentials from Vault database engine (7-day rotation).
+# Builds the asyncpg DSN consumed by the FastAPI app as TASKS_DB_DSN.
+# Pre-req in dbaas: CNPG cluster has DB `tasks`, role `tasks`, and Vault
+# role `static-creds/pg-tasks`.
+resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
+  manifest = {
+    apiVersion = "external-secrets.io/v1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "tasks-db-creds"
+      namespace = local.namespace
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-database"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "tasks-db-creds"
+        template = {
+          metadata = {
+            annotations = {
+              "reloader.stakater.com/match" = "true"
+            }
+          }
+          data = {
+            TASKS_DB_DSN = "postgresql+asyncpg://tasks:{{ .password }}@${var.postgresql_host}:5432/tasks"
+            DB_PASSWORD  = "{{ .password }}"
+          }
+        }
+      }
+      data = [{
+        secretKey = "password"
+        remoteRef = {
+          key      = "static-creds/pg-tasks"
+          property = "password"
+        }
+      }]
+    }
+  }
+  depends_on = [kubernetes_namespace.tasks]
+}
+
+resource "kubernetes_deployment" "tasks" {
+  metadata {
+    name      = "tasks"
+    namespace = kubernetes_namespace.tasks.metadata[0].name
+    labels = merge(local.labels, {
+      tier = local.tiers.aux
+    })
+    annotations = {
+      # Reloader restarts the pod when tasks-secrets / tasks-db-creds change
+      # (both carry reloader.stakater.com/match=true) — required because the
+      # DB password rotates every 7 days and is read only at startup.
+      "reloader.stakater.com/search" = "true"
+    }
+  }
+
+  spec {
+    # Single leader: the CalDAV sync engine wants one writer per user's
+    # sync-token cursor; the SPA is served by the same process.
+    replicas = 1
+    strategy {
+      type = "Recreate"
+    }
+
+    selector {
+      match_labels = local.labels
+    }
+
+    template {
+      metadata {
+        labels = local.labels
+        annotations = {
+          # Prometheus scrapes the service-endpoints (annotations live on the
+          # Service below); the pod annotations here let the kubernetes-pods
+          # SD job also discover /metrics directly.
+          "prometheus.io/scrape" = "true"
+          "prometheus.io/path"   = "/metrics"
+          "prometheus.io/port"   = "8000"
+        }
+      }
+
+      spec {
+        image_pull_secrets {
+          name = "registry-credentials"
+        }
+
+        container {
+          name  = "tasks"
+          image = local.image
+
+          port {
+            container_port = 8000
+          }
+
+          # TASKS_FERNET_KEY via tasks-secrets; TASKS_DB_DSN via tasks-db-creds.
+          env_from {
+            secret_ref { name = "tasks-secrets" }
+          }
+          env_from {
+            secret_ref { name = "tasks-db-creds" }
+          }
+
+          # Wall-clock zone for all-day due dates (DUE;VALUE=DATE) and the
+          # Today/Scheduled smart views.
+          env {
+            name  = "TASKS_LOCAL_TZ"
+            value = "Europe/Sofia"
+          }
+          # SECURITY INVARIANT — DEV_USER must NEVER be set here. It is the
+          # dev-only identity fallback: when present the backend treats every
+          # request as that user, bypassing the Authentik forward-auth
+          # identity (X-authentik-username) entirely. Production identity
+          # comes ONLY from the header Traefik/Authentik injects.
+
+          readiness_probe {
+            http_get {
+              path = "/healthz"
+              port = 8000
+            }
+            initial_delay_seconds = 5
+            period_seconds        = 10
+          }
+          liveness_probe {
+            http_get {
+              path = "/healthz"
+              port = 8000
+            }
+            initial_delay_seconds = 30
+            period_seconds        = 30
+          }
+
+          resources {
+            requests = { cpu = "100m", memory = "384Mi" }
+            limits   = { memory = "384Mi" }
+          }
+        }
+      }
+    }
+  }
+
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Woodpecker deploy sets the running tag
+      metadata[0].annotations["kubernetes.io/change-cause"],
+      metadata[0].annotations["deployment.kubernetes.io/revision"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+    ]
+  }
+
+  depends_on = [
+    kubernetes_manifest.external_secret,
+    kubernetes_manifest.db_external_secret,
+  ]
+}
+
+resource "kubernetes_service" "tasks" {
+  metadata {
+    name      = "tasks"
+    namespace = kubernetes_namespace.tasks.metadata[0].name
+    labels    = local.labels
+    annotations = {
+      # Prometheus kubernetes-service-endpoints SD scrapes /metrics here.
+      "prometheus.io/scrape" = "true"
+      "prometheus.io/path"   = "/metrics"
+      "prometheus.io/port"   = "8000"
+    }
+  }
+
+  spec {
+    type     = "ClusterIP"
+    selector = local.labels
+
+    port {
+      name        = "http"
+      port        = 8000
+      target_port = 8000
+    }
+  }
+}
+
+# Kyverno ClusterPolicy `sync-tls-secret` auto-clones the wildcard TLS
+# secret into every namespace, so we don't need a setup_tls_secret module.
+
+module "ingress" {
+  source = "../../modules/kubernetes/ingress_factory"
+  # auth = "required": Authentik forward-auth gates EVERY request — the app
+  # has no login of its own and blindly trusts the X-authentik-username
+  # header the outpost injects, so Authentik is the only thing standing
+  # between strangers and everyone's tasks. Do NOT relax this tier (tasks
+  # design decision #3; pairs with the NetworkPolicy below, SEC-1).
+  auth            = "required"
+  dns_type        = "proxied"
+  namespace       = kubernetes_namespace.tasks.metadata[0].name
+  name            = "tasks"
+  port            = 8000
+  tls_secret_name = var.tls_secret_name
+}
+
+# Carve-out for the PWA icon assets + web manifest. macOS Safari's
+# "Add to Dock" (and every other OS icon fetcher: iOS Add-to-Home-Screen,
+# Android install prompt) fetches these in a cookie-less context — behind
+# forward-auth it got the Authentik 302 and fell back to a letter monogram.
+# Traefik prioritises these longer path prefixes over the main "/" router,
+# so ONLY these five static files bypass Authentik; the SPA shell and /api
+# stay gated by the main ingress above (and the app itself 401s /api
+# without the identity header). Guarded against regression by the
+# tasks-icons entry in the Authentik walling-off probe
+# (stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf).
+module "ingress_icons" {
+  source = "../../modules/kubernetes/ingress_factory"
+  # auth = "none": public static icons + manifest, no user data; required for
+  # OS icon fetchers (Safari Add-to-Dock etc.) that carry no session and
+  # cannot complete the Authentik redirect dance.
+  auth         = "none"
+  namespace    = kubernetes_namespace.tasks.metadata[0].name
+  name         = "tasks-icons"
+  service_name = kubernetes_service.tasks.metadata[0].name
+  port         = 8000
+  ingress_path = [
+    "/apple-touch-icon.png",
+    "/favicon.png",
+    "/pwa-192x192.png",
+    "/pwa-512x512.png",
+    "/manifest.webmanifest",
+  ]
+  full_host        = "tasks.viktorbarzin.me" # MUST match the main ingress host; otherwise the factory derives tasks-icons.viktorbarzin.me and the carve-out never matches.
+  dns_type         = "none"                  # host record already owned by the main tasks ingress
+  tls_secret_name  = var.tls_secret_name
+  anti_ai_scraping = false # Four static icons + a manifest; nothing for scrapers to mine.
+  homepage_enabled = false # path carve-out, not its own dashboard tile
+}
+
+# --- NetworkPolicy: scoped pod ingress (security-review finding SEC-1). ---
+# The app trusts X-authentik-username unconditionally, so its ENTIRE auth
+# model depends on requests only ever arriving through Traefik (where the
+# Authentik forward-auth middleware sets that header). Any pod that could
+# reach the pod IP directly could spoof the header and read/write anyone's
+# tasks — hence ingress is restricted to:
+#   - TCP/8000 from the traefik namespace (user traffic, post-forward-auth);
+#   - TCP/8000 from the monitoring namespace (Prometheus /metrics scrape).
+# The cluster has no default-deny, so this NP only takes effect inside the
+# tasks ns — pods elsewhere remain unaffected. (Same shape as
+# chrome-service's chrome-service-ws-ingress.)
+resource "kubernetes_network_policy_v1" "tasks_ingress" {
+  metadata {
+    name      = "tasks-ingress"
+    namespace = kubernetes_namespace.tasks.metadata[0].name
+  }
+  spec {
+    pod_selector {
+      match_labels = local.labels
+    }
+    policy_types = ["Ingress"]
+    ingress {
+      from {
+        namespace_selector {
+          match_labels = {
+            "kubernetes.io/metadata.name" = "traefik"
+          }
+        }
+      }
+      ports {
+        port     = "8000"
+        protocol = "TCP"
+      }
+    }
+    ingress {
+      from {
+        namespace_selector {
+          match_labels = {
+            "kubernetes.io/metadata.name" = "monitoring"
+          }
+        }
+      }
+      ports {
+        port     = "8000"
+        protocol = "TCP"
+      }
+    }
+  }
+}
--- a/stacks/tasks/terragrunt.hcl
+++ b/stacks/tasks/terragrunt.hcl
@ -0,0 +1,23 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
+
+dependency "vault" {
+  config_path  = "../vault"
+  skip_outputs = true
+}
+
+dependency "external-secrets" {
+  config_path  = "../external-secrets"
+  skip_outputs = true
+}
+
+inputs = {
+  # Override per-deploy in CI / commit.
+  image_tag = "latest"
+}
--- a/stacks/technitium/modules/technitium/main.tf
+++ b/stacks/technitium/modules/technitium/main.tf
@ -873,6 +873,14 @@ resource "kubernetes_cluster_role" "ingress_dns_sync" {
    resources  = ["services"]
    verbs      = ["get", "list"]
  }
+  # Read the Valia-sites internal-DNS feed (written by stacks/valia-sites,
+  # ADR-0018) so the sync can reconcile off-infra Pages CNAMEs declaratively.
+  rule {
+    api_groups     = [""]
+    resources      = ["configmaps"]
+    resource_names = ["valia-sites-dns"]
+    verbs          = ["get"]
+  }
 }

 resource "kubernetes_cluster_role_binding" "ingress_dns_sync" {
@ -1002,6 +1010,42 @@ resource "kubernetes_cron_job_v1" "technitium_ingress_dns_sync" {
                  echo "mail-auth: MX present"
                fi

+                # Valia sites (ADR-0018) — off-infra Cloudflare Pages sites.
+                # The internal zone is authoritative (superset rule above), so
+                # these public-only names must exist here or every internal
+                # client NXDOMAINs on them. Reconciled DECLARATIVELY from the
+                # ConfigMap valia-sites-dns (written by stacks/valia-sites):
+                # ensure/update every entry, and DELETE stale records that
+                # left the map (site retired/renamed). Deletion is scoped to
+                # CNAMEs targeting *.pages.dev — nothing else is ever touched.
+                # Targets resolve upstream to CF edge IPs; no hairpin involved.
+                VALIA=$$(kubectl get configmap valia-sites-dns -n technitium -o go-template='{{range $$k, $$v := .data}}{{$$k}} {{$$v}}{{"\n"}}{{end}}' 2>/dev/null || true)
+                if [ -n "$$VALIA" ]; then
+                  printf '%s\n' "$$VALIA" | while read -r VNAME VTARGET; do
+                    [ -z "$$VNAME" ] && continue
+                    CUR=$$(curl -sf "$$TECH_API/api/zones/records/get?token=$$TOKEN&zone=$$ZONE&domain=$$VNAME.$$ZONE" | grep -o '"cname":"[^"]*"' | head -1 | cut -d'"' -f4)
+                    if [ "$$CUR" = "$$VTARGET" ]; then
+                      echo "valia: $$VNAME.$$ZONE ok"
+                      continue
+                    fi
+                    if [ -n "$$CUR" ]; then
+                      curl -sf -G "$$TECH_API/api/zones/records/delete" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$VNAME.$$ZONE" --data-urlencode "type=CNAME" --data-urlencode "cname=$$CUR" > /dev/null || true
+                    fi
+                    R=$$(curl -sf -G "$$TECH_API/api/zones/records/add" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$VNAME.$$ZONE" --data-urlencode "type=CNAME" --data-urlencode "cname=$$VTARGET" --data-urlencode "ttl=3600") || true
+                    echo "$$R" | grep -q '"status":"ok"' && echo "valia: set $$VNAME.$$ZONE -> $$VTARGET" || echo "valia: FAILED $$VNAME.$$ZONE -- $$R"
+                  done
+                  # Deletion pass: zone CNAMEs targeting *.pages.dev that are
+                  # no longer in the map. ZONE_DUMP predates this run's adds,
+                  # but just-set names are in $VALIA so they're never deleted.
+                  printf '%s' "$$ZONE_DUMP" | tr ',' '\n' | awk -F'"' '/"name":/{n=$$4} /"cname":/{print n" "$$4}' | grep '\.pages\.dev *$$' | while read -r RNAME RTARGET; do
+                    SHORT=$${RNAME%%.$$ZONE}
+                    printf '%s\n' "$$VALIA" | grep -q "^$$SHORT " && continue
+                    curl -sf -G "$$TECH_API/api/zones/records/delete" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$RNAME" --data-urlencode "type=CNAME" --data-urlencode "cname=$$RTARGET" > /dev/null && echo "valia: removed stale $$RNAME -> $$RTARGET"
+                  done
+                else
+                  echo "valia: CM valia-sites-dns absent/unreadable -- skipping Pages CNAMEs this run"
+                fi
+
                # Pin the .lan ingress anchor A record to the LIVE Traefik LB IP.
                # *.viktorbarzin.lan ingress hosts CNAME to ingress.viktorbarzin.lan,
                # so a Traefik LB IP move that misses the .lan zone silently breaks
--- a/stacks/traefik/modules/traefik/middleware.tf
+++ b/stacks/traefik/modules/traefik/middleware.tf
@ -119,6 +119,41 @@ resource "kubernetes_manifest" "middleware_local_only" {
  depends_on = [helm_release.traefik]
 }

+# IP allowlist for household access across ALL home sites: Sofia LAN + the
+# WireGuard spoke LANs (London, Valchedrym) + 10/8 (VLANs, K8s pods/services,
+# WG tunnel IPs). Deliberately a SEPARATE middleware from `local-only` —
+# widening local-only would grant the remote LANs access to the admin surfaces
+# that use it (Prometheus, iDRAC, Loki, …). Use for family-facing services
+# (e.g. the immich-frame kiosks) that every household device may open but the
+# public internet must not. Pair with ingress_factory `dns_type = "internal"`:
+# a Cloudflare-proxied record would deliver public traffic from cloudflared
+# POD IPs (inside 10/8) and silently bypass this allowlist.
+resource "kubernetes_manifest" "middleware_home_lans_only" {
+  manifest = {
+    apiVersion = "traefik.io/v1alpha1"
+    kind       = "Middleware"
+    metadata = {
+      name      = "home-lans-only"
+      namespace = kubernetes_namespace.traefik.metadata[0].name
+    }
+    spec = {
+      ipAllowList = {
+        sourceRange = [
+          "192.168.1.0/24", # Sofia LAN (hub site)
+          "10.0.0.0/8",     # VLANs, K8s pod/svc CIDRs, WG tunnel subnet
+          "192.168.8.0/24", # London LAN (via WG tunnel)
+          "192.168.9.0/24", # London GUEST net — the Portal Plus actually leases here (Portal-75AE8F9C2A8A = 192.168.9.198)
+          "192.168.0.0/24", # Valchedrym LAN (via WG tunnel)
+          "fc00::/7",
+          "fe80::/10",
+        ]
+      }
+    }
+  }
+
+  depends_on = [helm_release.traefik]
+}
+
 # HTTPS redirect middleware
 resource "kubernetes_manifest" "middleware_redirect_https" {
  manifest = {
@ -368,6 +403,33 @@ resource "kubernetes_manifest" "middleware_authentik_rate_limit" {
  depends_on = [helm_release.traefik]
 }

+# Dawarich-specific rate limit. The Rails app serves all its fingerprinted
+# assets itself (JS/CSS chunks, SVG store badges, favicons, webmanifest) and
+# the map view adds a points/API burst on load — a single page load from one
+# client IP blows past the default 10/50 limiter and 429s the asset tail
+# (seventh instance of the burst pattern, after ha-sofia, ActualBudget, noVNC,
+# tripit, health and authentik). Background location ingestion (OwnTracks
+# bridge + mobile api_key POSTs) rides the same host, so 429s here also risk
+# dropped pings. Burst absorbs a couple of full page loads back-to-back.
+resource "kubernetes_manifest" "middleware_dawarich_rate_limit" {
+  manifest = {
+    apiVersion = "traefik.io/v1alpha1"
+    kind       = "Middleware"
+    metadata = {
+      name      = "dawarich-rate-limit"
+      namespace = kubernetes_namespace.traefik.metadata[0].name
+    }
+    spec = {
+      rateLimit = {
+        average = 100
+        burst   = 1000
+      }
+    }
+  }
+
+  depends_on = [helm_release.traefik]
+}
+
 # Compress responses to clients at the entrypoint level (outermost).
 # Applied at websecure entrypoint so all responses get compressed.
 # Uses includedContentTypes (whitelist) instead of excludedContentTypes:
--- a/stacks/tripit/main.tf
+++ b/stacks/tripit/main.tf
@ -175,6 +175,12 @@ locals {
    STORY_SOURCE_MODE        = "web"
    SCRIPT_WRITER_MODE       = "chat"
    PLACE_RESOLVER_MODE      = "wikipedia"
+    # Saved Place preview photos (tripit ADR-0035/0040): the Wikipedia lead-image
+    # fetcher behind manual-add-time photos and the backfill sweep. Same fake-
+    # default gap as the resolver above — never set, so prod silently ran the
+    # fake and hand-added places (and any backfill) would store placeholder
+    # PNGs instead of real photos.
+    PLACE_PHOTO_PROVIDER = "wikipedia"
  }
 }

--- a/stacks/valia-sites/main.tf
+++ b/stacks/valia-sites/main.tf
@ -0,0 +1,368 @@
+# Valia sites (ADR-0018): small static sites authored by Valia in Google Drive,
+# served OFF-INFRA on Cloudflare Pages, mirrored by the in-cluster CronJob below
+# every 10 minutes. Registering a new site = one entry in local.sites (plus
+# Valia sharing the folder with vbarzin@gmail.com). Full runbook:
+# docs/runbooks/valia-sites.md
+#
+# Per site this stack fans out:
+#   - cloudflare_pages_project + custom domain <name>.viktorbarzin.me
+#   - public proxied CNAME <name> -> <project>.pages.dev   (manage_dns gate)
+#   - internal split-horizon CNAME via ConfigMap valia-sites-dns consumed by
+#     the technitium-ingress-dns-sync script (declarative: add/update/REMOVE)
+#   - a slot in the shared sync CronJob (rclone mirror -> wrangler deploy)
+
+locals {
+  cloudflare_account_id = "02e035473cfc4834fb10c5d35470d8b4" # vbarzin@gmail.com's account (not a secret)
+
+  # THE site registry. Keys are the public subdomain (English, Viktor picks —
+  # CONTEXT.md "Valia site"). folder_id = the Drive folder Valia shared (the
+  # Content folder); src_path = subfolder holding servable files ("" = root);
+  # entry_file = what / must serve (staged as index.html at deploy time).
+  # manage_dns = false parks a site's public CNAME + internal record while the
+  # name is still owned elsewhere (used for the stem95su ingress cutover).
+  sites = {
+    bridge = {
+      folder_id  = "1YWwAtSTsJD9HOzckGRIFXigWqCgYSGEa" # "мост" — ОбУ „Отец Паисий“
+      src_path   = ""
+      entry_file = "index.html"
+      manage_dns = true
+    }
+    stem95su = {
+      folder_id  = "1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_" # "claude" — 95. СУ STEM board
+      src_path   = "stem claude/files"
+      entry_file = "stem_board.html"
+      manage_dns = true
+    }
+  }
+
+  dns_managed_sites = { for k, v in local.sites : k => v if v.manage_dns }
+}
+
+# ---------------------------------------------------------------------------
+# Cloudflare Pages: project + custom domain per site
+# ---------------------------------------------------------------------------
+
+resource "cloudflare_pages_project" "site" {
+  for_each          = local.sites
+  account_id        = local.cloudflare_account_id
+  name              = each.key
+  production_branch = "main"
+}
+
+# bridge was created by hand (wrangler) on 2026-07-03 — adopt, don't recreate.
+import {
+  to = cloudflare_pages_project.site["bridge"]
+  id = "02e035473cfc4834fb10c5d35470d8b4/bridge"
+}
+
+resource "cloudflare_pages_domain" "site" {
+  for_each     = local.sites
+  account_id   = local.cloudflare_account_id
+  project_name = cloudflare_pages_project.site[each.key].name
+  domain       = "${each.key}.viktorbarzin.me"
+}
+
+import {
+  to = cloudflare_pages_domain.site["bridge"]
+  id = "02e035473cfc4834fb10c5d35470d8b4/bridge/bridge.viktorbarzin.me"
+}
+
+# Public proxied CNAME. Gated on manage_dns: a site whose name is still served
+# by an in-cluster ingress keeps its ingress_factory record until cutover
+# (two records can't share one name).
+resource "cloudflare_record" "site" {
+  for_each = local.dns_managed_sites
+  zone_id  = var.cloudflare_zone_id
+  name     = each.key
+  content  = cloudflare_pages_project.site[each.key].subdomain
+  type     = "CNAME"
+  proxied  = true
+  ttl      = 1
+}
+
+# bridge's record predates this stack (created 2026-07-03 in stacks/cloudflared,
+# handed off via removed{} there) — adopt by id.
+import {
+  to = cloudflare_record.site["bridge"]
+  id = "fd2c5dd4efe8fe38958944e74d0ced6d/ff4fb6f4900744d4b22de50d3fdd219b"
+}
+
+# ---------------------------------------------------------------------------
+# Internal split-horizon DNS feed (docs/architecture/dns.md "superset rule"):
+# the technitium-ingress-dns-sync script reads this CM and reconciles internal
+# CNAMEs for every entry — including deleting stale *.pages.dev records when
+# an entry disappears (site retired/renamed).
+# ---------------------------------------------------------------------------
+
+resource "kubernetes_config_map" "valia_sites_dns" {
+  metadata {
+    name      = "valia-sites-dns"
+    namespace = "technitium"
+    labels    = { "app.kubernetes.io/managed-by" = "valia-sites" }
+  }
+  data = { for k, v in local.dns_managed_sites : k => cloudflare_pages_project.site[k].subdomain }
+}
+
+# ---------------------------------------------------------------------------
+# The shared sync CronJob
+# ---------------------------------------------------------------------------
+
+resource "kubernetes_namespace" "valia_sites" {
+  metadata {
+    name = "valia-sites"
+    labels = {
+      "istio-injection" = "disabled"
+      tier              = local.tiers.aux
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
+    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
+  }
+}
+
+# Secrets: shared drive.readonly rclone conf + the SCOPED CF Pages token
+# (Pages Read/Write only — the Global API Key never enters a pod).
+resource "kubernetes_manifest" "sync_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
+  manifest = {
+    apiVersion = "external-secrets.io/v1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "valia-sites-sync"
+      namespace = kubernetes_namespace.valia_sites.metadata[0].name
+    }
+    spec = {
+      refreshInterval = "1h"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = { name = "valia-sites-sync" }
+      data = [
+        {
+          secretKey = "rclone.conf"
+          remoteRef = { key = "valia-sites", property = "rclone_conf" }
+        },
+        {
+          secretKey = "CLOUDFLARE_API_TOKEN"
+          remoteRef = { key = "valia-sites", property = "cloudflare_pages_token" }
+        },
+        {
+          secretKey = "CLOUDFLARE_ACCOUNT_ID"
+          remoteRef = { key = "valia-sites", property = "account_id" }
+        },
+      ]
+    }
+  }
+  depends_on = [kubernetes_namespace.valia_sites]
+}
+
+# Site registry rendered for the job (folder ids aren't secrets).
+resource "kubernetes_config_map" "sync_config" {
+  metadata {
+    name      = "valia-sites-config"
+    namespace = kubernetes_namespace.valia_sites.metadata[0].name
+  }
+  data = {
+    "sites.json" = jsonencode(local.sites)
+  }
+}
+
+# Last-deployed manifest hash per site — written by the job (merge-patch), so
+# TF must never fight it over data.
+resource "kubernetes_config_map" "sync_state" {
+  metadata {
+    name      = "valia-sites-state"
+    namespace = kubernetes_namespace.valia_sites.metadata[0].name
+  }
+  data = {}
+  lifecycle {
+    ignore_changes = [data]
+  }
+}
+
+resource "kubernetes_service_account" "sync" {
+  metadata {
+    name      = "valia-sites-sync"
+    namespace = kubernetes_namespace.valia_sites.metadata[0].name
+  }
+}
+
+resource "kubernetes_role" "sync_state" {
+  metadata {
+    name      = "valia-sites-sync-state"
+    namespace = kubernetes_namespace.valia_sites.metadata[0].name
+  }
+  rule {
+    api_groups     = [""]
+    resources      = ["configmaps"]
+    resource_names = ["valia-sites-state"]
+    verbs          = ["get", "patch"]
+  }
+}
+
+resource "kubernetes_role_binding" "sync_state" {
+  metadata {
+    name      = "valia-sites-sync-state"
+    namespace = kubernetes_namespace.valia_sites.metadata[0].name
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.sync_state.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = kubernetes_service_account.sync.metadata[0].name
+    namespace = kubernetes_namespace.valia_sites.metadata[0].name
+  }
+}
+
+resource "kubernetes_cron_job_v1" "sync" {
+  metadata {
+    name      = "valia-sites-sync"
+    namespace = kubernetes_namespace.valia_sites.metadata[0].name
+    labels    = { app = "valia-sites", component = "sync" }
+  }
+  spec {
+    schedule                      = "*/10 * * * *"
+    concurrency_policy            = "Forbid"
+    successful_jobs_history_limit = 2
+    failed_jobs_history_limit     = 3
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 1
+        ttl_seconds_after_finished = 86400
+        template {
+          metadata { labels = { app = "valia-sites", component = "sync" } }
+          spec {
+            restart_policy       = "OnFailure"
+            service_account_name = kubernetes_service_account.sync.metadata[0].name
+            container {
+              name  = "sync"
+              image = "ghcr.io/viktorbarzin/valia-sites-sync:latest"
+              # Guards mirror stem95su's proven set: hard-fail on Drive
+              # list/auth errors (visible as a failed Job — the chosen
+              # visibility, ADR-0018), skip quietly when a folder is empty or
+              # missing its entry file (never wipe a live site), capped
+              # deletes. Deploy ONLY on remote-manifest change: CF Pages caps
+              # monthly deployments on the free tier, so 144 no-op deploys/day
+              # per site (one every 10 minutes) is not an option.
+              command = ["/bin/sh", "-c", <<-EOT
+                set -u
+                cp /config/rclone.conf /tmp/rc.conf
+                APISERVER="https://kubernetes.default.svc"
+                SA=/var/run/secrets/kubernetes.io/serviceaccount
+                KTOKEN=$$(cat $$SA/token); NS=$$(cat $$SA/namespace)
+                STATE_URL="$$APISERVER/api/v1/namespaces/$$NS/configmaps/valia-sites-state"
+                FAILED=0
+                for SITE in $$(jq -r 'keys[]' /sites/sites.json); do
+                  FOLDER=$$(jq -r --arg s "$$SITE" '.[$$s].folder_id' /sites/sites.json)
+                  SRC_PATH=$$(jq -r --arg s "$$SITE" '.[$$s].src_path' /sites/sites.json)
+                  ENTRY=$$(jq -r --arg s "$$SITE" '.[$$s].entry_file' /sites/sites.json)
+                  RC="rclone --config /tmp/rc.conf --drive-root-folder-id=$$FOLDER --drive-skip-gdocs"
+                  # 1. Remote manifest (path+size+hash) — metadata only, no download.
+                  MANIFEST=$$($$RC lsf "gdrive:$$SRC_PATH" -R --files-only --format phs 2>/tmp/lsf.err) || {
+                    echo "FATAL [$$SITE]: Drive list failed (auth/network):"; cat /tmp/lsf.err; FAILED=1; continue; }
+                  N=$$(printf '%s\n' "$$MANIFEST" | grep -c . || true)
+                  if [ "$$N" -lt 1 ] || ! printf '%s\n' "$$MANIFEST" | cut -d';' -f1 | grep -qxF "$$ENTRY"; then
+                    echo "GUARD [$$SITE]: N=$$N / $$ENTRY missing -- skipping, site untouched"; continue
+                  fi
+                  # Cloudflare Pages hard-caps each file at 25 MiB (26214400
+                  # bytes). Deploying without an oversize file would silently
+                  # break the pages that reference it, so skip the whole site
+                  # instead (last deployed content keeps serving) and say so loudly.
+                  OVERSIZE=$$(printf '%s\n' "$$MANIFEST" | awk -F';' '$$3 > 26214400 {print $$1" ("$$3" B)"}')
+                  if [ -n "$$OVERSIZE" ]; then
+                    echo "GUARD [$$SITE]: file(s) exceed the 25MB Pages limit -- skipping, site untouched:"; echo "$$OVERSIZE"; continue
+                  fi
+                  HASH=$$(printf '%s' "$$MANIFEST" | sha256sum | cut -d' ' -f1)
+                  LAST=$$(curl -sf --cacert $$SA/ca.crt -H "Authorization: Bearer $$KTOKEN" "$$STATE_URL" | jq -r --arg s "$$SITE" '.data[$$s] // ""')
+                  if [ "$$HASH" = "$$LAST" ]; then echo "OK [$$SITE]: unchanged"; continue; fi
+                  # 2. Content changed — pull and deploy.
+                  $$RC sync "gdrive:$$SRC_PATH" "/work/$$SITE" --exclude ".DS_Store" --fast-list --transfers 4 --max-delete 25 -v || {
+                    echo "FATAL [$$SITE]: rclone sync failed"; FAILED=1; continue; }
+                  if [ "$$ENTRY" != "index.html" ]; then
+                    cp "/work/$$SITE/$$ENTRY" "/work/$$SITE/index.html"
+                  fi
+                  wrangler pages deploy "/work/$$SITE" --project-name="$$SITE" --branch=main --commit-dirty=true || {
+                    echo "FATAL [$$SITE]: wrangler deploy failed"; FAILED=1; continue; }
+                  curl -sf --cacert $$SA/ca.crt -H "Authorization: Bearer $$KTOKEN" \
+                    -X PATCH -H "Content-Type: application/merge-patch+json" \
+                    -d "{\"data\":{\"$$SITE\":\"$$HASH\"}}" "$$STATE_URL" > /dev/null || {
+                    echo "WARN [$$SITE]: state patch failed (will redeploy next run)"; FAILED=1; }
+                  echo "DEPLOYED [$$SITE]: $$HASH"
+                done
+                exit $$FAILED
+              EOT
+              ]
+              env {
+                name = "CLOUDFLARE_API_TOKEN"
+                value_from {
+                  secret_key_ref {
+                    name = "valia-sites-sync"
+                    key  = "CLOUDFLARE_API_TOKEN"
+                  }
+                }
+              }
+              env {
+                name = "CLOUDFLARE_ACCOUNT_ID"
+                value_from {
+                  secret_key_ref {
+                    name = "valia-sites-sync"
+                    key  = "CLOUDFLARE_ACCOUNT_ID"
+                  }
+                }
+              }
+              resources {
+                requests = { cpu = "25m", memory = "128Mi" }
+                limits   = { memory = "512Mi" }
+              }
+              volume_mount {
+                name       = "rclone-config"
+                mount_path = "/config"
+                read_only  = true
+              }
+              volume_mount {
+                name       = "sites-config"
+                mount_path = "/sites"
+                read_only  = true
+              }
+              volume_mount {
+                name       = "work"
+                mount_path = "/work"
+              }
+            }
+            volume {
+              name = "rclone-config"
+              secret {
+                secret_name = "valia-sites-sync"
+                items {
+                  key  = "rclone.conf"
+                  path = "rclone.conf"
+                }
+              }
+            }
+            volume {
+              name = "sites-config"
+              config_map { name = kubernetes_config_map.sync_config.metadata[0].name }
+            }
+            volume {
+              name = "work"
+              empty_dir {}
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+  depends_on = [kubernetes_manifest.sync_external_secret]
+}
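The guard and change-detection steps in the CronJob command above can be exercised outside the cluster. The following is a minimal sketch in plain POSIX sh, assuming manifest lines in the `path;hash;size` layout produced by `rclone lsf --format phs` with the default `;` separator. The function names `has_entry`, `oversize` and `manifest_hash` are hypothetical and do not exist in the repo. The sketch uses single `$` because it runs directly in a shell rather than through the Kubernetes `command` field. Matching the entry name as a fixed string and sorting the manifest before hashing are choices of this sketch, not a statement of what the deployed script does.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; names are not from the repo.
# Each manifest line is assumed to be "path;hash;size".

# has_entry MANIFEST ENTRY
# Succeeds only if ENTRY appears verbatim as a path (fixed-string, whole-line match).
has_entry() {
  printf '%s\n' "$1" | cut -d';' -f1 | grep -qxF -- "$2"
}

# oversize MANIFEST LIMIT_BYTES
# Prints "path (N B)" for every file whose size exceeds LIMIT_BYTES.
oversize() {
  printf '%s\n' "$1" | awk -F';' -v max="$2" '$3 + 0 > max + 0 { print $1 " (" $3 " B)" }'
}

# manifest_hash MANIFEST
# SHA-256 over the byte-sorted manifest, so listing order does not affect the result.
manifest_hash() {
  printf '%s\n' "$1" | LC_ALL=C sort | sha256sum | cut -d' ' -f1
}
```

A site would be deployed only when `has_entry` succeeds, `oversize` prints nothing, and `manifest_hash` differs from the value stored for that site.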
--- a/stacks/valia-sites/sync-image/Dockerfile
+++ b/stacks/valia-sites/sync-image/Dockerfile
@ -0,0 +1,15 @@
+# valia-sites-sync: everything the 10-min Content-folder mirror needs, baked in
+# (no runtime installs — CronJob pods must not apk/npm on every start).
+# rclone pinned to match the proven stem95su version; wrangler pinned to major 4.
+FROM node:22-alpine
+
+RUN apk add --no-cache curl unzip ca-certificates jq \
+    && curl -fsSL https://downloads.rclone.org/v1.74.3/rclone-v1.74.3-linux-amd64.zip -o /tmp/rclone.zip \
+    && unzip -j /tmp/rclone.zip '*/rclone' -d /usr/local/bin \
+    && chmod +x /usr/local/bin/rclone \
+    && rm /tmp/rclone.zip \
+    && npm install -g wrangler@4 \
+    && npm cache clean --force
+
+# wrangler writes config/cache under $HOME; the CronJob runs as non-root node (uid 1000)
+ENV HOME=/tmp
--- a/stacks/valia-sites/terragrunt.hcl
+++ b/stacks/valia-sites/terragrunt.hcl
@ -0,0 +1,8 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
--- a/stacks/valia-sites/variables.tf
+++ b/stacks/valia-sites/variables.tf
@ -0,0 +1,3 @@
+variable "cloudflare_zone_id" {
+  type = string
+}
--- a/stacks/vault/main.tf
+++ b/stacks/vault/main.tf
@ -675,6 +675,7 @@ resource "vault_database_secret_backend_connection" "postgresql" {
    "pg-nextcloud-todos",
    "pg-technitium",
    "pg-goldmane-edges",
+    "pg-tasks",
  ]

  postgresql {
@ -903,6 +904,17 @@ resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
  rotation_period = 604800
 }

+# tasks PWA (Reminders-style front-end over Nextcloud CalDAV) — 7-day rotation
+# for the `tasks` CNPG role. Consumed by stacks/tasks via a vault-database
+# ExternalSecret -> TASKS_DB_DSN (remoteRef static-creds/pg-tasks).
+resource "vault_database_secret_backend_static_role" "pg_tasks" {
+  backend         = vault_mount.database.path
+  db_name         = vault_database_secret_backend_connection.postgresql.name
+  name            = "pg-tasks"
+  username        = "tasks"
+  rotation_period = 604800
+}
+
 # =============================================================================
 # Kubernetes Secrets Engine — Dynamic K8s Credentials
 # =============================================================================
--- a/state/stacks/vault/terraform.tfstate.enc
+++ b/state/stacks/vault/terraform.tfstate.enc