Compare commits

..

1 commit

Author SHA1 Message Date
43a5d2cc27 immich(frame-emo): show photos from the last 365 days (was 730)
Emil asked his Sofia Portal Mini photo-frame to show only the past
year of photos rolling from today, instead of the last two years.
Changes ImagesFromDays 730 -> 365 in the frame-emo Settings.yml.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 07:49:12 +00:00
78 changed files with 2815 additions and 9164 deletions

File diff suppressed because one or more lines are too long


@@ -81,7 +81,7 @@
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server (may be merged into ebooks stack) | audiobookshelf |
| paperless-ngx | Document management. Mail ingest: forward document emails to `docs@viktorbarzin.me` — sender maps 1:1 to a paperless account (runbook `paperless-mail-ingest.md`) | paperless-ngx |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| aiostreams | Stremio stream aggregator (Real-Debrid + Torrentio/Comet/StremThru Torz/Knaben; **MediaFusion removed 2026-06-07** — broken upstream `500`). `auth=app` (own UUID+password); stream-probe tests **both series+movie paths** with per-source breakdown (`aiostreams_streams_{comet,torrentio,stremthru_torz,knaben}`) + `aiostreams_error_streams` + `aiostreams_movie_stream_count`, success gated on Comet (workhorse) being alive; weekly NFS config + Stremio-account-collection backups to `/srv/nfs/aiostreams-backup/`. PG-backed user config (Comet timeout bumped 5s→10s 2026-06-07). | servarr/aiostreams |
@@ -99,7 +99,6 @@
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge. Open native self-signup (Turnstile captcha + email confirm) + Authentik & GitHub OAuth sign-in; see `docs/runbooks/forgejo-open-signups.md` | forgejo |
| freshrss | RSS reader | freshrss |
| drone-logbook | DJI flight-log analyzer (Open DroneLog, upstream image) — dronelog.viktorbarzin.me | drone-logbook |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
@@ -121,9 +120,7 @@
| status-page | Status page | status-page |
| plotting-book | Book plotting/world-building app | plotting-book |
| tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). `auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit |
| tasks | Reminders-style tasks PWA over Nextcloud CalDAV (FastAPI + SvelteKit SPA same-origin, single container; code `~/code/tasks`, design `tasks/docs/2026-07-03-tasks-pwa-design.md`). Nextcloud stays the source of truth (VTODOs); the app is the front-end Apple Reminders stopped being. CNPG (`tasks` db, Vault static role `pg-tasks`) stores Connected Accounts — per-user Nextcloud app passwords Fernet-encrypted with `fernet_key` from `secret/tasks`. `auth=required` (Authentik forward-auth; identity = `X-authentik-username`, NO app-level login — `DEV_USER` must never be set in prod) at tasks.viktorbarzin.me (proxied). Exception: the five PWA icon/manifest files (`/apple-touch-icon.png`, `/favicon.png`, `/pwa-192x192.png`, `/pwa-512x512.png`, `/manifest.webmanifest`) are a path-scoped `auth=none` carve-out (`module.ingress_icons`) so cookie-less OS icon fetchers (macOS Safari Add-to-Dock, mobile home-screen installs) get the real icon instead of the Authentik 302; guarded by the `tasks-icons` walloff-probe target. NetworkPolicy `tasks-ingress` (SEC-1) restricts pod ingress to traefik + monitoring namespaces so the trusted header can't be spoofed pod-to-pod. GHA → public ghcr `tasks` → Woodpecker deploy (ADR-0002). | tasks |
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me — **a Valia site on Cloudflare Pages since 2026-07-03** (ADR-0018): registry entry in `stacks/valia-sites`, synced from Drive folder "claude" every 10 min, deploy-on-change. The old in-cluster stack (nginx off PVE NFS + per-site rclone CronJob) is RETIRED — stacks/stem95su is a tombstone; `secret/stem95su` superseded by `secret/valia-sites`; `stem_video.mp4` was compressed 42.9→21.4MB (25MB Pages cap) with Viktor's OK. See docs/runbooks/valia-sites.md. | — |
| valia-sites | **Valia-site registry + sync** (ADR-0018): all sites authored by Valia serve OFF-INFRA on Cloudflare Pages (`bridge` + `stem95su` live). One map entry in `stacks/valia-sites/main.tf` per site fans out Pages project + custom domain + public CNAME + internal split-horizon CNAME (ConfigMap `valia-sites-dns` → technitium sync, declarative incl. removal). CronJob `valia-sites-sync` (`*/10`, image ghcr `valia-sites-sync`) mirrors each Drive Content folder (rclone `drive.readonly`, stem95su-style guards + 25MB Pages-cap guard) and wrangler-deploys ONLY on manifest change (free-tier deploy cap). Secrets `secret/valia-sites` (shared rclone conf + SCOPED CF Pages token — Global API Key never in pods). Failed-Job-only visibility by choice. Runbook: docs/runbooks/valia-sites.md. | valia-sites |
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su |
| trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek |
## Cloudflare Domains
@@ -133,7 +130,7 @@
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox, phpipam, tripit, t3, stem95su, tasks
travel, netbox, phpipam, tripit, t3, stem95su
```
### Non-Proxied (Direct DNS)


@@ -1,42 +0,0 @@
name: Build excalidraw-library
# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind
# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls
# ghcr:latest and rolls the deployment. Replaces the manual DockerHub pushes
# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image).
on:
push:
branches: [master]
paths:
- 'stacks/excalidraw/project/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version: '1.21'
- run: go test ./...
working-directory: stacks/excalidraw/project
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/excalidraw/project
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/excalidraw-library:latest
ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }}


@@ -1,39 +0,0 @@
name: Build valia-sites-sync
# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public).
# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob.
# Rebuilds are rare (tool pins only change deliberately) → dispatch + path.
# Security note: no untrusted event inputs are interpolated anywhere (only
# github.actor / github.sha / GITHUB_TOKEN — same shape as the other
# build-*.yml workflows in this repo).
on:
push:
branches: [master]
paths:
- 'stacks/valia-sites/sync-image/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/valia-sites/sync-image
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/valia-sites-sync:latest
ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }}


@@ -95,7 +95,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
## Key Paths
- `stacks/<service>/main.tf` — service definition
- `stacks/platform/modules/<service>/` — core infra modules
- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"`, `"non-proxied"`, or `"internal"` — a public A record carrying the internal Traefik LB IP for household-only services; pair with the `home-lans-only` ipAllowList middleware, never with `"proxied"`)
- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"` or `"non-proxied"`)
- `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount)
- `config.tfvars` — non-secret configuration (plaintext)
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)


@@ -118,14 +118,6 @@ _Avoid_: "external", "outside".
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
_Avoid_: bare "lan", "private", "intranet".
**Segment**:
One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q.
_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept).
**CCTV segment**:
The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017).
_Avoid_: "camera VLAN", "CCTV LAN".
**Ingress auth**:
The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
_Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
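
The fail-closed default described in the **Ingress auth** entry can be illustrated with a small Go sketch. This is not the `ingress_factory` implementation (that module is Terraform); the type `AuthMode` and the function `ParseAuth` are hypothetical names introduced only for illustration, and rejecting unknown values with an error is an assumption rather than documented behaviour. Only the four mode names and the `required` default come from the glossary.

```go
package main

import "fmt"

// AuthMode is a hypothetical stand-in for the `auth` parameter:
// a discrete mode, not a ranked tier.
type AuthMode string

const (
	AuthRequired AuthMode = "required" // Authentik forward-auth gates every request
	AuthApp      AuthMode = "app"      // the backend owns its login
	AuthPublic   AuthMode = "public"   // anonymous Authentik binding, audit only
	AuthNone     AuthMode = "none"     // Anubis-fronted content or native-client API
)

// ParseAuth maps a raw setting to a mode. An unset value falls back to
// "required" (fail-closed). Unknown values are rejected; that part is an
// assumption of this sketch.
func ParseAuth(s string) (AuthMode, error) {
	switch m := AuthMode(s); m {
	case "":
		return AuthRequired, nil
	case AuthRequired, AuthApp, AuthPublic, AuthNone:
		return m, nil
	default:
		return AuthRequired, fmt.Errorf("unknown auth mode %q", s)
	}
}

func main() {
	for _, in := range []string{"", "app", "tier-1"} {
		m, err := ParseAuth(in)
		fmt.Printf("%q -> %s (err: %v)\n", in, m, err)
	}
}
```

Even on the error path the sketch returns `required`, so a caller that ignores the error still ends up with the most restrictive mode.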
@@ -237,20 +229,6 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**.
**Anubis**:
A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
### Externally-authored sites
**Valia site**:
A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `<name>.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`.
_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**.
**Content folder**:
The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site.
_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root).
**Entry file**:
The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring.
_Avoid_: asking Valia to rename her files to fit hosting conventions.
## Relationships
- A **Service** is defined by exactly one **Stack**, **flat** or wrapping a **Stack-local module**, which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
@@ -262,7 +240,6 @@ _Avoid_: asking Valia to rename her files to fit hosting conventions.
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
- Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra.
## Example dialogue


@@ -1 +1 @@
v0.12.0
v0.11.0


@@ -30,21 +30,11 @@ func memoryCommands() []Command {
}
}
// printMemories renders a {memories:[…]} response as one line per memory, or raw JSON.
// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
func printMemories(raw []byte, jsonOut bool) error {
fmt.Print(renderMemories(raw, jsonOut))
return nil
}
// renderMemories formats each memory as a single line with its FULL content
// (newlines flattened to spaces). Content is deliberately never truncated: the
// old 240-rune preview cut memories mid-sentence, misled agents into believing
// no full-content read-back existed, and made blind `update --content` from
// the preview silently destroy the stored tail. Full passthrough also can't
// produce invalid UTF-8 (the old mid-rune cut crashed the recall hook).
func renderMemories(raw []byte, jsonOut bool) string {
if jsonOut {
return string(raw) + "\n"
fmt.Println(string(raw))
return nil
}
var r struct {
Memories []struct {
@@ -56,20 +46,36 @@ func renderMemories(raw []byte, jsonOut bool) string {
} `json:"memories"`
}
if err := json.Unmarshal(raw, &r); err != nil {
return string(raw) + "\n"
fmt.Println(string(raw))
return nil
}
if len(r.Memories) == 0 {
return "(no memories)\n"
fmt.Println("(no memories)")
return nil
}
var b strings.Builder
for _, m := range r.Memories {
c := strings.ReplaceAll(m.Content, "\n", " ")
fmt.Fprintf(&b, "#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
if m.Tags != "" {
fmt.Fprintf(&b, " tags: %s\n", m.Tags)
fmt.Printf(" tags: %s\n", m.Tags)
}
}
return b.String()
return nil
}
// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
// hook error" for Cyrillic-language users.
func truncatePreview(s string, maxRunes int) string {
r := []rune(s)
if len(r) <= maxRunes {
return s
}
return string(r[:maxRunes]) + "…"
}
func memoryRecall(args []string) error {


@@ -8,53 +8,25 @@ import (
"unicode/utf8"
)
func TestRenderMemoriesFullContent(t *testing.T) {
// The pretty view must NOT truncate content: the old 240-rune preview cut
// memories mid-sentence, misled agents into thinking no full-content
// read-back existed, and made blind `update --content` from the preview
// destroy the stored tail. Full passthrough also removes the mid-rune-cut
// invalid-UTF-8 class by construction — nothing is ever sliced.
long := strings.Repeat("я", 300) + strings.Repeat("a", 300)
raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
{"id": 7, "content": long, "category": "facts", "tags": "t1,t2", "importance": 0.7},
}})
got := renderMemories(raw, false)
if !strings.Contains(got, long) {
t.Fatalf("content was truncated: %q", got)
}
if strings.Contains(got, "…") {
t.Fatalf("ellipsis in output — truncation still active: %q", got)
}
func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
// Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits
// invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must
// cut on a rune boundary and always stay valid UTF-8.
long := strings.Repeat("я", 300) // 300 runes / 600 bytes
got := truncatePreview(long, 240)
if !utf8.ValidString(got) {
t.Fatalf("invalid UTF-8 in output: %q", got)
t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
}
if !strings.Contains(got, "#7 [facts] (0.70) ") || !strings.Contains(got, "tags: t1,t2") {
t.Fatalf("line format broken: %q", got)
if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
}
}
func TestRenderMemoriesFlattensNewlinesToOneLine(t *testing.T) {
// Consumers (the recall hook, terminal skims) rely on one memory per line;
// multi-line content is flattened, never split across lines.
raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
{"id": 1, "content": "line one\nline two\nline three", "category": "facts", "importance": 0.5},
}})
got := renderMemories(raw, false)
if !strings.Contains(got, "line one line two line three") {
t.Fatalf("newlines not flattened: %q", got)
// Short multibyte strings pass through untouched (no ellipsis).
if got := truncatePreview("кратко", 240); got != "кратко" {
t.Fatalf("short string altered: %q", got)
}
}
func TestRenderMemoriesEdgeCases(t *testing.T) {
if got := renderMemories([]byte(`{"memories":[]}`), false); got != "(no memories)\n" {
t.Fatalf("empty list: %q", got)
}
// --json and unparseable responses pass through raw.
if got := renderMemories([]byte(`{"x":1}`), true); got != "{\"x\":1}\n" {
t.Fatalf("json passthrough: %q", got)
}
if got := renderMemories([]byte(`not json`), false); got != "not json\n" {
t.Fatalf("unparseable passthrough: %q", got)
// ASCII boundary still works.
if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
t.Fatalf("ascii truncation wrong: %q", got)
}
}

Binary file not shown.


@@ -1,126 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="820" viewBox="0 0 1600 820" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
<!-- ADR-0017: PHYSICAL cabling only — no VLANs, no flows. Solid = cable in
place today · dashed = camera-day work · ~~~ = radio. Palette: neutral
grays + blue for copper runs (reference dataviz palette text tokens). -->
<defs>
<marker id="dot" viewBox="0 0 8 8" refX="4" refY="4" markerWidth="5" markerHeight="5">
<circle cx="4" cy="4" r="3" fill="#52514e"/>
</marker>
</defs>
<rect width="1600" height="820" fill="#fcfcfb"/>
<text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — physical cabling (single-switch, rev 3)</text>
<text x="40" y="66" font-size="15" fill="#52514e">wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio</text>
<!-- ═════════ APARTMENT ═════════ -->
<rect x="40" y="100" width="330" height="330" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="56" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">APARTMENT</text>
<text x="70" y="158" font-size="13" fill="#52514e">☁ ISP (internet)</text>
<path d="M120,166 L120,196" fill="none" stroke="#52514e" stroke-width="2"/>
<rect x="64" y="198" width="220" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="80" y="222" font-size="14.5" font-weight="700" fill="#0b0b0b">AX6000 router</text>
<text x="80" y="242" font-size="12" fill="#52514e">192.168.1.1 · WAN←ISP · 8×LAN</text>
<rect x="64" y="290" width="220" height="52" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="80" y="312" font-size="14" font-weight="700" fill="#0b0b0b">Synology NAS · .13</text>
<text x="80" y="330" font-size="12" fill="#52514e">on an AX6000 LAN port</text>
<path d="M174,262 L174,290" fill="none" stroke="#2a78d6" stroke-width="2"/>
<text x="70" y="376" font-size="12.5" fill="#52514e">📶 wifi clients (phones, laptops)</text>
<path d="M110,262 C104,272 106,278 100,286 C106,294 104,300 100,308 C106,316 104,322 100,330 C106,338 104,344 100,352 C104,358 102,362 98,366" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<!-- in-wall run apartment -> garage -->
<path d="M284,230 C450,230 540,228 616,228" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<text x="330" y="218" font-size="12.5" font-weight="700" fill="#2a78d6">in-wall run → garage</text>
<!-- ═════════ GARAGE — RACK ═════════ -->
<rect x="560" y="100" width="640" height="680" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="576" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE — RACK</text>
<!-- switch -->
<rect x="600" y="150" width="560" height="150" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
<text x="616" y="176" font-size="14.5" font-weight="700" fill="#0b0b0b">TL-SG105PE · 5-port gigabit PoE switch</text>
<text x="616" y="194" font-size="12" fill="#52514e">mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare)</text>
<g font-size="11.5" text-anchor="middle">
<rect x="616" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="664" y="227" font-weight="700" fill="#0b0b0b">P1</text>
<text x="664" y="242" fill="#52514e">← apartment</text>
<rect x="722" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="770" y="227" font-weight="700" fill="#0b0b0b">P2</text>
<text x="770" y="242" fill="#52514e">← 4G router</text>
<rect x="828" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="876" y="227" font-weight="700" fill="#0b0b0b">P3</text>
<text x="876" y="242" fill="#52514e">← UPS mgmt</text>
<rect x="934" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="982" y="227" font-weight="700" fill="#0b0b0b">P4 ⚡PoE</text>
<text x="982" y="242" fill="#52514e">← camera</text>
<rect x="1040" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="1088" y="227" font-weight="700" fill="#0b0b0b">P5</text>
<text x="1088" y="242" fill="#52514e">← R730 eno1</text>
</g>
<text x="616" y="284" font-size="12" fill="#52514e">every cable below re-plugs old-switch → PE on camera day (≈3 min)</text>
<!-- 4G router -->
<rect x="600" y="360" width="250" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="616" y="384" font-size="14" font-weight="700" fill="#0b0b0b">4G router · 192.168.1.7</text>
<text x="616" y="403" font-size="12" fill="#52514e">~cellular uplink (out-of-band)</text>
<path d="M770,300 L770,360" fill="none" stroke="#2a78d6" stroke-width="2"/>
<path d="M856,392 C866,386 864,380 874,376 C866,370 868,364 876,360" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<text x="884" y="380" font-size="12" fill="#52514e">📡 cellular</text>
<!-- UPS -->
<rect x="600" y="452" width="250" height="56" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="616" y="476" font-size="14" font-weight="700" fill="#0b0b0b">UPS (Huawei)</text>
<text x="616" y="494" font-size="12" fill="#52514e">network mgmt card</text>
<path d="M876,300 C876,340 800,410 720,452" fill="none" stroke="#2a78d6" stroke-width="2"/>
<!-- R730 -->
<rect x="600" y="540" width="560" height="220" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
<text x="616" y="566" font-size="14.5" font-weight="700" fill="#0b0b0b">Dell R730 · PVE host · 192.168.1.127</text>
<g font-size="11.5">
<rect x="616" y="582" width="128" height="38" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
<text x="628" y="598" font-weight="700" fill="#0b0b0b">eno1 · LAN1</text>
<text x="628" y="613" fill="#52514e">← switch P5 · 1GbE</text>
<rect x="756" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="768" y="598" font-weight="700" fill="#52514e">eno2 · LAN2</text>
<text x="768" y="613" fill="#8a8984">dark · fallback leg</text>
<rect x="896" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
<text x="908" y="598" fill="#8a8984">eno3 / eno4</text>
<text x="908" y="613" fill="#8a8984">free, uncabled</text>
<rect x="1036" y="582" width="108" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
<text x="1048" y="598" fill="#8a8984">iDRAC · .4</text>
<text x="1048" y="613" fill="#8a8984">shared-LOM/eno1</text>
</g>
<text x="616" y="648" font-size="12" fill="#52514e">no other network cables — everything else on this host is VIRTUAL:</text>
<text x="616" y="668" font-size="12" fill="#52514e">pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM …</text>
<text x="616" y="696" font-size="12" fill="#8a8984">(power: host + switch fed from the UPS — power wiring not drawn)</text>
<path d="M1088,300 C1088,420 720,500 680,582" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<text x="1100" y="330" font-size="12.5" font-weight="700" fill="#2a78d6">LAN1 cable</text>
<!-- ═════════ GARAGE ENTRANCE ═════════ -->
<rect x="1280" y="100" width="280" height="200" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="1296" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
<rect x="1304" y="150" width="232" height="110" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="1320" y="176" font-size="14" font-weight="700" fill="#0b0b0b">vermont-garage camera</text>
<text x="1320" y="196" font-size="12" fill="#52514e">HiLook IPC-T241H-C · 10.0.30.70</text>
<text x="1320" y="214" font-size="12" fill="#52514e">powered over the data cable (PoE)</text>
<text x="1320" y="232" font-size="12" fill="#52514e">outdoor · armored conduit</text>
<path d="M982,210 C982,150 1140,140 1304,180" fill="none" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
<text x="1080" y="136" font-size="12.5" font-weight="700" fill="#52514e">single cat6 in conduit · data + PoE power (camera day)</text>
<!-- legend -->
<g transform="translate(40,780)" font-size="12.5">
<line x1="0" y1="-4" x2="44" y2="-4" stroke="#2a78d6" stroke-width="2.5"/>
<text x="52" y="0" fill="#0b0b0b">copper, in place</text>
<line x1="190" y1="-4" x2="234" y2="-4" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
<text x="242" y="0" fill="#0b0b0b">camera-day cable / dark port</text>
<path d="M450,-4 C456,-10 454,-14 460,-18" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
<text x="470" y="0" fill="#0b0b0b">radio (wifi / cellular)</text>
<text x="650" y="0" fill="#52514e">total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3</text>
</g>
</svg>



@@ -1,99 +0,0 @@
# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable
Status: accepted (2026-07-02, rev 3 — single-switch)
![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg)
![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg)
The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook
IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is
physically exposed outside the apartment, so anything plugged into that cable
must land in a segment that can reach nothing. The original design doc
(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk
to pfSense" — but nothing in this network terminates dot1q on pfSense; the
site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean
untagged pfSense interface per segment.
**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old
garage TL-SG105E (Viktor prefers not to run two switches; the retired unit
becomes a cold spare and its 192.168.1.6 mgmt IP passes to the PE). Five ports,
all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged
VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1`
carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable.
pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site
idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged
vNIC; pfSense still terminates no dot1q itself). The earlier dedicated
`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving
net3 back to vmbr2 restores pure physical isolation in one `qm set`).
This narrows the earlier 802.1Q objection rather than contradicting it: the
rejection assumed *unmanaged* switches, where any LAN device could inject
tagged frames; with the managed PE as the only device on eno1, VLAN-30
membership is {camera port, trunk port} only, so tag-30 ingress from every
other port — and from the exposed camera cable — is dropped or contained.
Cameras are untrusted: default-deny on dCCTV with a single
NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8)
may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static
route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the
10.0.20.0/22 trusted source-IP allowlist.
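The allowlist exclusion can be checked mechanically. This is a minimal sketch using Python's standard `ipaddress` module; the constant and function names are illustrative assumptions, not taken from the pfSense or cluster config.

```python
import ipaddress

# Subnets as stated in this ADR; the names below are illustrative only.
TRUSTED_ALLOWLIST = ipaddress.ip_network("10.0.20.0/22")
CCTV_SEGMENT = ipaddress.ip_network("10.0.30.0/24")


def is_trusted(addr):
    """Return True when addr falls inside the trusted source-IP allowlist."""
    return ipaddress.ip_address(addr) in TRUSTED_ALLOWLIST


# The camera segment must not overlap the trusted range at all.
print(CCTV_SEGMENT.overlaps(TRUSTED_ALLOWLIST))
# The camera's reserved address is outside the allowlist.
print(is_trusted("10.0.30.70"))
# A k8s-side address (the go2rtc LB) is inside it.
print(is_trusted("10.0.20.204"))
```

The /22 spans 10.0.20.0 through 10.0.23.255, so the camera subnet sits well clear of it.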
## Traffic on the trunk — how one cable carries two networks
The LAN1 cable is shared, but the two networks on it diverge at `vmbr0`
(the vlan-aware bridge on the PVE host), and only ONE of them ever touches
pfSense:
- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it
between the trunk, the host's own IP (192.168.1.127) and pfSense `net0`
where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home
LAN's gateway is and remains the AX6000; home-LAN traffic never transits
pfSense. Consequently a pfSense (or R730 VM-level) outage does not affect
the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave
the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the
4G router survives the whole rack being down.
- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers
VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera
segment's gateway, firewall and sole exit. "Camera → AX6000 → internet"
is impossible by construction, not merely by firewall rule.
- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed
out of its WAN toward the AX6000. Load-wise the trunk gained only the
camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic.
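The port/VLAN split described above can be modelled as a small membership table. This is a hypothetical sketch, assuming the switch applies ingress filtering by VLAN membership as the ADR states; the port labels, `classify`, and `egress_ports` are illustrative and not a TP-Link API.

```python
# Hypothetical model of the switch's 802.1Q table as described in this ADR.
VLAN_MEMBERS = {
    1: {"P1": "untagged", "P2": "untagged", "P3": "untagged", "P5": "untagged"},
    30: {"P4": "untagged", "P5": "tagged"},
}
# Port VLAN ID: the VLAN assigned to untagged frames arriving on each port.
PVID = {"P1": 1, "P2": 1, "P3": 1, "P4": 30, "P5": 1}


def classify(port, tag=None):
    """Return the VLAN a frame lands in, or None if ingress drops it."""
    vid = PVID[port] if tag is None else tag
    if port not in VLAN_MEMBERS.get(vid, {}):
        return None  # ingress filtering: port is not a member of this VLAN
    return vid


def egress_ports(port, tag=None):
    """Return {port: tagging mode} for every port the frame may leave on."""
    vid = classify(port, tag)
    if vid is None:
        return {}
    return {p: mode for p, mode in VLAN_MEMBERS[vid].items() if p != port}


print(egress_ports("P4"))          # camera traffic
print(egress_ports("P1", tag=30))  # injected tag-30 frame on a home-LAN port
```

Under this model camera frames can only leave on the trunk (tagged 30), and a tag-30 frame injected on P1, P2, or P3 goes nowhere.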
![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg)
*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)*
## Considered options
- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan
read this way) — rejected: any LAN device could inject tagged frames into
vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is
undefined. Rev 3 adopts the tagged path ONLY because the managed PE now
polices VLAN-30 membership at the single entry point to eno1; no bridge
reconfiguration was needed (vmbr0 was already vlan-aware).
- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role**
(rev 1/2 as-built) — superseded by rev 3: it forced either a second switch
(6 connections vs 5 ports once the PE also replaced the old switch) or new
hardware. Strongest isolation of all options; kept dormant as the fallback.
- **AX6000 as the camera gateway** — rejected earlier in the design (consumer
router, no inter-VLAN firewall).
## Consequences
- The switch is now single-point and load-bearing for everything in the rack
(apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV) AND its VLAN
table + mgmt password are part of the isolation boundary — the Easy Smart
mgmt UI answers on every port, so the password is the gate between a
compromised camera and the switch config. All 5 ports are consumed: the
next camera forces an 8-port PoE upgrade (the wiring plan already fits it).
- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical
leg); eno3/eno4 remain free.
- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6
(Kea reservation by MAC).
- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a
port-VLAN split (conflated the two devices); rev 2 split into two switches
after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3
consolidated back to one switch — the PE replacing the SG105E — per
Viktor's preference, moving CCTV onto a managed tagged trunk.
- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra
NVDEC stream.

View file

@ -1,178 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="880" viewBox="0 0 1600 880" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
<!-- ADR-0017 rev 3 dCCTV topology (single switch, VLAN-30 trunk on LAN1).
Colors: reference dataviz palette (light mode). blue #2a78d6 = home LAN ·
violet #4a3aa7 = dCCTV · aqua #1baf7a = dKubernetes ·
yellow #eda100 = dManagementsVms · green #008300 allow · red #e34948 deny -->
<defs>
<marker id="arrGreen" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#008300"/>
</marker>
<marker id="arrRed" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#e34948"/>
</marker>
<marker id="arrGray" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
<path d="M0,0 L10,5 L0,10 z" fill="#52514e"/>
</marker>
</defs>
<rect width="1600" height="880" fill="#fcfcfb"/>
<text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable</text>
<text x="40" y="66" font-size="15" fill="#52514e">Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1</text>
<!-- camera -> everything else (denied) -->
<path d="M240,168 C520,104 900,104 1148,140" fill="none" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
<g transform="translate(560,111)">
<circle r="11" fill="#fcfcfb" stroke="#e34948" stroke-width="2.5"/>
<path d="M-5,-5 L5,5 M5,-5 L-5,5" stroke="#e34948" stroke-width="2.5"/>
</g>
<text x="588" y="100" font-size="13.5" font-weight="700" fill="#e34948">DENY · camera → LAN / other segments / internet (default deny on dCCTV)</text>
<!-- GARAGE ENTRANCE -->
<rect x="40" y="128" width="240" height="180" rx="10" fill="#4a3aa7" fill-opacity="0.06" stroke="#4a3aa7" stroke-opacity="0.35"/>
<text x="56" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
<rect x="64" y="170" width="192" height="112" rx="8" fill="#ffffff" stroke="#4a3aa7" stroke-width="2"/>
<text x="80" y="196" font-size="15" font-weight="700" fill="#0b0b0b">vermont-garage</text>
<text x="80" y="216" font-size="12.5" fill="#52514e">HiLook IPC-T241H-C · pure IR</text>
<text x="80" y="234" font-size="12.5" fill="#52514e">10.0.30.70 (Kea reservation)</text>
<text x="80" y="252" font-size="12.5" fill="#52514e">DNS: garage-cam.viktorbarzin.lan</text>
<text x="80" y="270" font-size="12.5" fill="#52514e">PoE from switch · cloud/P2P off</text>
<path d="M256,284 C330,330 412,368 417,430" fill="none" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5" marker-end="url(#arrGray)"/>
<text x="330" y="322" font-size="12" fill="#52514e">cat6 in conduit · PoE → P4</text>
<!-- RACK zone: single switch -->
<rect x="40" y="360" width="560" height="265" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="56" y="384" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">RACK — GARAGE · ONE SWITCH</text>
<rect x="64" y="396" width="512" height="176" rx="8" fill="#4a3aa7" fill-opacity="0.04" stroke="#4a3aa7" stroke-width="2"/>
<text x="80" y="420" font-size="15" font-weight="700" fill="#0b0b0b">TL-SG105PE <tspan font-size="12.5" font-weight="400" fill="#52514e">replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used</tspan></text>
<g font-size="11.5" text-anchor="middle">
<rect x="80" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="124" y="454" font-weight="700" fill="#0b0b0b">P1 · V1</text>
<text x="124" y="470" fill="#52514e">apartment</text>
<text x="124" y="484" fill="#52514e">uplink</text>
<rect x="178" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="222" y="454" font-weight="700" fill="#0b0b0b">P2 · V1</text>
<text x="222" y="470" fill="#52514e">4G router</text>
<text x="222" y="484" fill="#52514e">192.168.1.7</text>
<rect x="276" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="320" y="454" font-weight="700" fill="#0b0b0b">P3 · V1</text>
<text x="320" y="470" fill="#52514e">UPS mgmt</text>
<rect x="374" y="436" width="88" height="56" rx="6" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="418" y="454" font-weight="700" fill="#0b0b0b">P4 · V30</text>
<text x="418" y="470" fill="#52514e">camera</text>
<text x="418" y="484" fill="#52514e">PoE ON</text>
<rect x="472" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.10" stroke="#4a3aa7" stroke-width="2" stroke-dasharray="0"/>
<text x="516" y="454" font-weight="700" fill="#0b0b0b">P5 · trunk</text>
<text x="516" y="470" fill="#52514e">V1 untagged</text>
<text x="516" y="484" fill="#4a3aa7">+ V30 tagged</text>
</g>
<text x="80" y="516" font-size="12" fill="#52514e">802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged}</text>
<text x="80" y="534" font-size="12" fill="#52514e">tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path</text>
<text x="80" y="558" font-size="12" fill="#8a8984">old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports</text>
<!-- trunk: two parallel lines to eno1 -->
<path d="M560,458 C630,458 640,428 692,420" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
<path d="M560,466 C632,466 644,436 692,428" fill="none" stroke="#4a3aa7" stroke-width="2.5"/>
<text x="588" y="404" font-size="12" font-weight="700" fill="#0b0b0b">LAN1 cable</text>
<!-- R730 / PVE zone -->
<rect x="680" y="330" width="880" height="440" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
<text x="696" y="356" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK)</text>
<g font-size="12">
<rect x="700" y="400" width="150" height="46" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="712" y="419" font-weight="700" fill="#0b0b0b">eno1 → vmbr0</text>
<text x="712" y="436" fill="#52514e">untag V1 + tag 30</text>
<rect x="700" y="471" width="150" height="46" rx="6" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
<text x="712" y="490" font-weight="700" fill="#52514e">eno2 → vmbr2</text>
<text x="712" y="507" fill="#8a8984">dormant fallback leg</text>
<rect x="700" y="542" width="150" height="46" rx="6" fill="#0b0b0b" fill-opacity="0.04" stroke="#8a8984"/>
<text x="712" y="561" font-weight="700" fill="#0b0b0b">vmbr1</text>
<text x="712" y="578" fill="#52514e">internal · tags 10/20</text>
</g>
<!-- pfSense VM -->
<rect x="890" y="388" width="300" height="230" rx="8" fill="#ffffff" stroke="#8a8984"/>
<text x="906" y="414" font-size="15" font-weight="700" fill="#0b0b0b">pfSense (VM 101)</text>
<text x="906" y="432" font-size="12" fill="#52514e">gateway + firewall for every segment</text>
<g font-size="12">
<rect x="906" y="444" width="268" height="34" rx="5" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="916" y="465" fill="#0b0b0b">net0 · WAN <tspan fill="#52514e">192.168.1.2 · vmbr0 untagged</tspan></text>
<rect x="906" y="484" width="268" height="34" rx="5" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
<text x="916" y="505" fill="#0b0b0b">net1 · dManagementsVms <tspan fill="#52514e">10.0.10.1</tspan></text>
<rect x="906" y="524" width="268" height="34" rx="5" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
<text x="916" y="545" fill="#0b0b0b">net2 · dKubernetes <tspan fill="#52514e">10.0.20.1</tspan></text>
<rect x="906" y="564" width="268" height="34" rx="5" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="916" y="585" fill="#0b0b0b">net3 · dCCTV <tspan fill="#52514e">10.0.30.1/24 · vmbr0 tag 30</tspan></text>
</g>
<path d="M850,415 L890,458" fill="none" stroke="#2a78d6" stroke-width="1.6" opacity="0.6"/>
<path d="M850,430 L890,581" fill="none" stroke="#4a3aa7" stroke-width="2"/>
<path d="M850,565 L890,501" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
<path d="M850,565 L890,541" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
<!-- k8s VMs -->
<rect x="1240" y="388" width="290" height="230" rx="8" fill="#1baf7a" fill-opacity="0.07" stroke="#1baf7a"/>
<text x="1256" y="414" font-size="15" font-weight="700" fill="#0b0b0b">k8s VMs · 10.0.20.0/24</text>
<text x="1256" y="434" font-size="12.5" fill="#52514e">vmbr1 tag 20 · pod egress SNATs</text>
<text x="1256" y="450" font-size="12.5" fill="#52514e">to node IPs</text>
<rect x="1256" y="464" width="258" height="66" rx="6" fill="#ffffff" stroke="#1baf7a"/>
<text x="1268" y="486" font-size="13.5" font-weight="700" fill="#0b0b0b">Frigate · k8s-node1 (T4)</text>
<text x="1268" y="504" font-size="12" fill="#52514e">detect sub / record main</text>
<text x="1268" y="520" font-size="12" fill="#52514e">gpumem budget 2300 MiB</text>
<rect x="1256" y="540" width="258" height="52" rx="6" fill="#ffffff" stroke="#1baf7a"/>
<text x="1268" y="562" font-size="13.5" font-weight="700" fill="#0b0b0b">go2rtc LB 10.0.20.204</text>
<text x="1268" y="580" font-size="12" fill="#52514e">restream → HA live view (MSE/HLS)</text>
<!-- HOME LAN zone -->
<rect x="1148" y="128" width="412" height="180" rx="10" fill="#2a78d6" fill-opacity="0.06" stroke="#2a78d6" stroke-opacity="0.4"/>
<text x="1164" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">HOME LAN 192.168.1.0/24</text>
<rect x="1164" y="168" width="180" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1176" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">AX6000 · .1</text>
<text x="1176" y="208" font-size="11.5" fill="#52514e">+ route 10.0.30.0/24 → .2</text>
<rect x="1164" y="236" width="180" height="52" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1176" y="258" font-size="13.5" font-weight="700" fill="#0b0b0b">ha-sofia · .8</text>
<text x="1176" y="275" font-size="11.5" fill="#52514e">Frigate card + hikvision_next</text>
<rect x="1360" y="168" width="184" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
<text x="1372" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">apartment clients</text>
<text x="1372" y="208" font-size="11.5" fill="#52514e">laptops, phones</text>
<rect x="1360" y="236" width="184" height="52" rx="6" fill="#ffffff" stroke="#52514e" stroke-dasharray="5,4"/>
<text x="1372" y="256" font-size="11.5" font-weight="700" fill="#52514e">CAMERA DAY: static route</text>
<text x="1372" y="272" font-size="11.5" fill="#52514e">10.0.30.0/24 via 192.168.1.2</text>
<path d="M1254,308 C1150,352 950,372 790,400" fill="none" stroke="#2a78d6" stroke-width="2" opacity="0.6"/>
<text x="1010" y="374" font-size="12" fill="#2a78d6">apartment uplink · switch P1 · trunk · eno1</text>
<!-- FLOWS -->
<path d="M1256,497 C1010,690 330,730 120,650 C40,618 40,380 96,286" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="620" y="700" font-size="13.5" font-weight="700" fill="#008300">ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all)</text>
<path d="M1164,262 C820,282 470,268 302,176 C286,167 278,166 270,172" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="484" y="216" font-size="13.5" font-weight="700" fill="#008300">ALLOW · ha-sofia → camera :80 ISAPI + :554</text>
<text x="484" y="234" font-size="12" fill="#52514e">enters pfSense WAN · reply-to off · needs the AX6000 route</text>
<path d="M280,232 C660,200 860,320 936,386" fill="none" stroke="#008300" stroke-width="2" opacity="0.85" marker-end="url(#arrGreen)"/>
<text x="740" y="322" font-size="12.5" font-weight="700" fill="#008300">ALLOW · camera → 10.0.30.1:123 (NTP)</text>
<!-- LEGEND -->
<g transform="translate(40,800)" font-size="12.5">
<rect x="0" y="0" width="18" height="18" rx="4" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
<text x="26" y="14" fill="#0b0b0b">home LAN / VLAN 1</text>
<rect x="200" y="0" width="18" height="18" rx="4" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
<text x="226" y="14" fill="#0b0b0b">CCTV / VLAN 30 / dCCTV 10.0.30.0/24</text>
<rect x="500" y="0" width="18" height="18" rx="4" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
<text x="526" y="14" fill="#0b0b0b">dKubernetes</text>
<rect x="640" y="0" width="18" height="18" rx="4" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
<text x="666" y="14" fill="#0b0b0b">dManagementsVms</text>
<line x1="820" y1="9" x2="860" y2="9" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
<text x="870" y="14" fill="#0b0b0b">allowed flow</text>
<line x1="980" y1="9" x2="1020" y2="9" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
<text x="1030" y="14" fill="#0b0b0b">denied</text>
<line x1="1100" y1="9" x2="1140" y2="9" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5"/>
<text x="1150" y="14" fill="#0b0b0b">camera-day step</text>
<text x="1320" y="14" fill="#52514e">ADR-0017 · rev 3</text>
</g>
</svg>


View file

@ -1,47 +0,0 @@
# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster
Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she
shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`)
and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare
Pages** under `<english-name>.viktorbarzin.me`, kept fresh by **one shared in-cluster
CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes
(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The
existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync)
migrates onto this and is retired.
Why off-infra serving: these are her sites, shown to teachers/parents — they must survive
homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster
site down). With Pages, a homelab outage degrades to "content frozen until we're back",
never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/
Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA
secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never
wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The
deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an
accident.
## Considered options
- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no
Cloudflare Pages dependency — but her sites share the homelab's fate and each site
spends cluster resources to serve static files a free CDN serves better.
- **Pages for new sites only**: less work now, two patterns and two runbooks forever.
- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but
Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault.
## Consequences
- Registration is one entry in the `sites` map (name, Content folder, optional Entry
file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config
together. Names are English, picked by Viktor (most → bridge set the precedent).
- The internal split-horizon zone learns Valia sites from a ConfigMap the
`technitium-ingress-dns-sync` script consumes — declaratively, including **removal**
(the previous static-CNAME approach was add-only; a retired site left a stale record).
- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on
the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs
deployed.
- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no
per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't
update" reports, consistent with the alert-noise-reduction posture. Revisit if a
silent stall actually bites.
- If the homelab is down, content updates pause; the sites keep serving last-deployed
content. Accepted degradation.
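The deploy-on-change arithmetic and one possible change check can be sketched as follows. This is an assumption-laden illustration: the ADR does not say how the CronJob detects changes, so the content-digest approach and the names `tree_digest` and `should_deploy` are hypothetical.

```python
import hashlib
from pathlib import Path

# A 10-minute cadence over a 30-day month, if every run deployed.
RUNS_PER_MONTH = (60 // 10) * 24 * 30


def tree_digest(root):
    """Hash relative paths and file bytes in sorted order for a stable digest."""
    h = hashlib.sha256()
    base = Path(root)
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        h.update(path.relative_to(base).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def should_deploy(root, last_digest):
    """Return (changed, digest); deploy only when the mirrored tree changed."""
    digest = tree_digest(root)
    return digest != last_digest, digest


print(RUNS_PER_MONTH)
```

Persisting the returned digest between runs lets an unchanged mirror skip the upload entirely, which is what keeps the deployment count tied to real edits rather than to the schedule.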

View file

@ -1,97 +0,0 @@
# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
inbound overhaul, with sender-MTA retry (15 days, sender-dependent) as the only
outage protection — a documented "No Backup MX" decision made after ForwardEmail's
forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
Routing proved pass-through-only. Viktor now wants inbound mail to survive
homelab outages **without loss** (2026-07-04): delayed delivery is fine,
mid-outage reading is not required, and the budget is **$0** — a hard
constraint that eliminated every managed option (see below).
We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
public IP, MX preference 20; primary untouched at 1). It accepts everything
for the domain (catch-all — every RCPT is valid; reputation may only ever
4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
deliver a DSN, its only egress is the drain), and drains to the primary over
**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
mid-outage break-glass since headscale itself lives in the cluster); TLS via
certbot HTTP-01 (port 80 permanently open — LE validation is
multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
On the primary, the drain stream (one /32) is enabled at the layers that
actually bite — `check_client_access` permits past
`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
exception, and rspamd `external_relay` (score against the *original* sender
IP) with the reject action capped to tag/fold so drained spam can never force
the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
reachability (recurring probe — Oracle publishes no commitment), drain
end-to-end, and a live failover test that includes a high-spam-score message
and a message over 10 MB. Two independent adversarial reviews (2026-07-04) shaped this
final form. Design:
[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
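The "defer, never reject" rule for the backup MX can be expressed as an invariant. This is a minimal sketch with hypothetical verdict names; it is not Postfix or postscreen configuration, and the reply texts are placeholders.

```python
# Illustrative only: verdict names are hypothetical, not Postfix settings.
ACCEPT = (250, "2.0.0 queued for the primary")
DEFER = (451, "4.7.1 try again later")

VERDICTS = ("clean", "unknown", "dnsbl_listed", "pregreet")


def backup_mx_reply(reputation):
    """Map a sender-reputation verdict to an SMTP reply for our own domain."""
    if reputation in ("dnsbl_listed", "pregreet"):
        return DEFER  # temporary failure: the sender retries later
    return ACCEPT  # catch-all: every recipient in the domain is valid


# Invariant from the ADR: no verdict may ever produce a permanent (5xx) reply.
for verdict in VERDICTS:
    code, _ = backup_mx_reply(verdict)
    print(verdict, code)
```

A 4xx reply leaves the message in the sender's queue, so a wrong reputation call costs delay rather than mail.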
## Considered options
- **Roller Network free Secondary MX** — v1 of this decision, killed at the
validation gates the same day: free tier caps at 200 relayed messages or
10 MB per rolling 7 days, and overage suspends the domain for 48 h
answering **SMTP 5xx** (permanent bounces) — since spammers target backup
MXes even while the primary is up, background spam alone can hold it
suspended, making it *worse than no backup MX*. Free accounts are also
being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
the documented fallback if the OCI route sours.)
- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
1224 h, barely beating sender retry); filtering black-box; not free.
- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
plan is a 6-month credit; Azure has no always-free VM and blocks 25;
Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
is the only standing free option.
- **Harden-only** (5xx-misconfig guards + paging) — does not address
multi-day outages or short-retry senders; deferred as a complementary
track.
## Consequences
- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
Terraform + cloud-init, patched by unattended-upgrades, scraped by the
cluster's Prometheus (exporters on the reserved public IP, allowlisted to
the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
besides). Never a backup target itself.
- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
free allowance in June 2026 and terminated over-limit instances, and
publishes no commitment that inbound 25 stays open. Mitigations:
**Pay-As-You-Go conversion is a required prerequisite** (exempts idle
reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
the queue being empty outside outages (a surprise reclamation loses
coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
once.
- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
the original IP via `external_relay`), and content scoring stay on — spam
arriving via the backup is tagged and folded to Junk, never bounced. The VM
is deliberately NOT in the primary's `mynetworks` (a compromised VM must
not relay through us).
- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
VM. Stated and accepted (6× better than the status quo).
- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
off-premises; accepted (same class as Brevo holding outbound today).
- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
host found dangling during design — inert today; must list `mx2` when
fixed) needs 12 more → schedule the next record purge proactively.
- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
`vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
failure semantics change (a "failing" probe may now mean "delayed via mx2,
drains shortly" — noted in alert description).

View file

@ -329,12 +329,6 @@ Two independent grants make up "browser access" for a user:
the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
a token by deleting its `<user>-browser-token` Secret).
Because the SA is the user's DEFAULT kubectl credential, other per-namespace
port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf`
grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's
agent can upload drawings via the port-forward + `X-Authentik-Username` recipe
in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too.
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM

View file

@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin.
| Visibility | Packages | Pull mechanism |
|------------|----------|----------------|
| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
@ -188,8 +188,6 @@ reconciled — the workflows were added to the GitHub lineage via PR):
| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) |
| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) |
**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is

View file

@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
## NodeLocal DNSCache
@ -368,7 +368,6 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
| TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
| TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
| A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
| CNAME (CF Pages) | 2 | `<project>.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` |
### Proxied vs Non-Proxied
@ -514,7 +513,6 @@ For external `.viktorbarzin.me` records:
1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)
## Incident History


@ -161,17 +161,6 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail
DB: MySQL (mysql.dbaas.svc.cluster.local)
```
### Paperless ingest mailbox (docs@)
`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in
`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that
paperless-ngx polls over IMAP; family members forward document emails to it
and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve
(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap,
mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`)
discards mail from non-allowlisted senders at delivery. Full flow, sender map,
and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md).
## DNS Records
All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`.
@ -311,21 +300,6 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External
## Troubleshooting
### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin)
Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`:
`postfix/cleanup: warning: tcp:localhost:10001 lookup error` +
`sender_canonical_maps map lookup problem ... message not accepted, try again later`.
Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`)
came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it
`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. Postfix then
tempfails every message (inbound AND submission); senders retry so nothing is
lost, and the roundtrip probe alerts within the hour.
Fix: `supervisorctl restart postsrsd` inside the container; if the fresh
process spins again (it did once), `kubectl -n mailserver delete pod` for a
full re-init — that healed it. Root cause not pinned down (one-off bad init;
postsrsd 1.10).
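The listener check described above can be scripted. This is a minimal sketch, not something that exists in the repo: `srs_ports_bound` is a hypothetical helper that reads `ss -ltn` output on stdin and succeeds only when both SRS ports (10001 and 10002) have a LISTEN socket. It assumes the usual `ss` column layout, with the local address in the fourth column.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): reads `ss -ltn` output on stdin.
# Returns 0 when BOTH postsrsd ports have a listening socket, 1 otherwise.
srs_ports_bound() {
  local ss_output port
  ss_output=$(cat)
  for port in 10001 10002; do
    # Match only LISTEN rows whose local address ends in ":<port>".
    printf '%s\n' "$ss_output" |
      awk -v p=":$port" '$1 == "LISTEN" && $4 ~ (p "$") { found = 1 } END { exit !found }' ||
      return 1
  done
}
```

Inside the container this could be used as `ss -ltn | srs_ports_bound || supervisorctl restart postsrsd`; if the restarted process spins again, the pod delete described above remains the fallback.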
### Inbound mail not arriving
1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside


@ -1,10 +1,10 @@
# Networking Architecture
Last updated: 2026-07-02 (dCCTV segment added — dedicated pfSense leg for the garage camera, ADR-0017)
Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed)
## Overview
The homelab network is built on three isolated segments behind pfSense (management VLAN 10, Kubernetes VLAN 20, and the physically-legged dCCTV camera segment — see ADR-0017) with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
## Architecture Diagram
@ -24,14 +24,9 @@ graph TB
CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik]
subgraph "Proxmox Host (eno1, eno2)"
subgraph "Proxmox Host (eno1)"
vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
vmbr1[vmbr1 Internal<br/>VLAN-aware]
vmbr2[vmbr2 Bridge<br/>eno2 → TL-SG105PE]
subgraph "dCCTV - 10.0.30.0/24<br/>ADR-0017"
Camera[vermont-garage<br/>10.0.30.70]
end
subgraph "VLAN 10 - Management<br/>10.0.10.0/24"
Proxmox[Proxmox Host<br/>10.0.10.1]
@ -76,9 +71,6 @@ graph TB
vmbr1 -.VLAN 20.- Tech
vmbr1 -.VLAN 20.- Master
vmbr1 -.VLAN 20.- Node1
vmbr2 -.physical link.- eno2
vmbr2 -.untagged.- Camera
vmbr2 -.pfSense net3 = dCCTV 10.0.30.1.- pfSense
```
## Components
@ -89,7 +81,6 @@ graph TB
| phpIPAM | v1.7.0 | phpipam.viktorbarzin.me | IP address management, device inventory, DNS sync |
| vmbr0 | Linux bridge | 192.168.1.127/24 | Physical bridge on eno1, uplink to LAN |
| vmbr1 | Linux bridge (VLAN-aware) | Internal | VLAN trunk for VM isolation |
| vmbr2 | Linux bridge | Physical (eno2) | DORMANT fallback leg for dCCTV (ADR-0017 rev 3) — live dCCTV rides vmbr0 tag 30 over the LAN1 trunk |
| Technitium DNS | Container | 10.0.20.201 (LB) / 10.96.0.53 (ClusterIP) | Internal DNS (viktorbarzin.lan) + full recursive resolver |
| Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
| Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
@ -99,22 +90,6 @@ graph TB
| MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
| Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |
## CCTV Segment (dCCTV) — as-built 2026-07-02
Isolated camera segment for owned cameras at the Sofia site (first: `vermont-garage`, HiLook IPC-T241H-C at the garage entrance). Decision + rejected alternatives: `docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md`.
**Physical path (rev 3, single switch)**: camera → TL-SG105PE PoE port (untagged VLAN 30) → trunk port (home LAN untagged + CCTV **tagged 30**) → the existing LAN1 cable → R730 `eno1` → `vmbr0` (vlan-aware) → pfSense `net3`/vtnet3 = `vmbr0 tag=30` = interface **dCCTV `10.0.30.1/24`**. The TL-SG105PE **replaces** the old garage TL-SG105E (retired to cold spare) and carries everything: apartment uplink, 4G router `192.168.1.7`, UPS mgmt (VLAN 1), camera (VLAN 30), trunk (all 5 ports used). VLAN-30 membership is {camera port, trunk port} only, so tagged injection from other ports is dropped. `eno2`/`vmbr2` remain dormant as the fallback physical leg (rev 2).
**Addressing**: Kea DHCP pool `10.0.30.100-199`; devices get MAC reservations (camera `10.0.30.70`; the PE switch mgmt inherits the retired switch's `192.168.1.6` on the home LAN). Kea DDNS auto-registers names in Technitium; `phpipam-pfsense-import` picks up leases hourly.
**Firewall** (all on pfSense):
- dCCTV in: pass `udp OPT4-net → 10.0.30.1:123` (NTP) — everything else hits the interface's default deny. Cameras cannot reach LAN, other segments, or the internet.
- WAN in (home LAN side): pass `192.168.1.8` (ha-sofia) → `10.0.30.70:80` (ISAPI/hikvision_next) and `:554` (RTSP), reply-to disabled on both.
- dKubernetes is allow-all, so cluster Frigate/go2rtc pulls RTSP with no extra rule (pod egress SNATs to node IPs).
- Home-LAN clients need the **AX6000 static route** `10.0.30.0/24 via 192.168.1.2` (camera-day step) to reach the camera UI.
**Consumers**: cluster Frigate (`/srv/nfs/frigate/config/config.yml` — NOT Terraform) pulls `rtsp://10.0.30.70:554` main+sub as `vermont-garage`; HA integrates via Frigate plus direct hikvision_next for tamper events.
## IPAM & DNS Auto-Registration
Devices are automatically discovered, named, and registered in DNS without manual intervention.
@ -232,8 +207,6 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up
- blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, travel, netbox
- **Non-proxied domains** (grey cloud, direct IP resolution):
- mail, wg, headscale, immich, calibre, vaultwarden, and other services requiring direct connections
- **Internal-IP domains** (grey cloud, A → `10.0.20.203` Traefik LB, `ingress_factory` `dns_type = "internal"`):
- highlights-immich, highlights-immich-emo — publicly *resolvable* but only *routable* from home LANs / WG sites / VPN (spokes policy-route `10.0.0.0/8` down the tunnel, so kiosk devices with baked-in URLs need no per-site DNS overrides). The record is reachability, not a gate — enforcement is the `home-lans-only` Traefik ipAllowList (Sofia/London/Valchedrym LANs + 10/8) on the ingress. See `docs/plans/2026-07-04-immich-frame-lan-only-design.md`.
- CNAME records for proxied domains point to Cloudflared tunnel FQDNs
### Ingress Flow
@ -288,7 +261,7 @@ Traefik chain:
1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients), tripit (`tripit-rate-limit`, 100/1000, photo-tab thumbnail bursts), health (`health-rate-limit`, 100/1000, SPA shell + API burst per page), and dawarich (`dawarich-rate-limit`, 100/1000 — the Rails app self-serves all fingerprinted assets and the map adds an API burst per load; the default burst 429'd the asset tail and risked dropping OwnTracks/mobile location POSTs on the same host).
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
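The burst figures above can be sanity-checked with simple arithmetic. The sketch below assumes a token-bucket model in which the bucket starts full at `burst` tokens and a page load fires all its requests at effectively the same instant, so refill at the `average` rate is ignored. `burst_rejections` is a hypothetical helper for illustration, not part of the Traefik configuration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): number of requests that would get
# a 429 when `parallel` requests from one client IP arrive at once against a
# full bucket of size `burst`. Refill during the burst is ignored.
burst_rejections() {
  local burst="$1" parallel="$2"
  local over=$(( parallel - burst ))
  if [ "$over" -gt 0 ]; then
    echo "$over"
  else
    echo 0
  fi
}

burst_rejections 50 70    # default rate-limit, ~70 parallel requests -> 20
burst_rejections 300 70   # actualbudget-rate-limit -> 0
burst_rejections 1000 70  # authentik-rate-limit -> 0
```

Under these assumptions the default burst of 50 rejects the last 20 of ~70 parallel requests, which matches the "429'd the tail" symptom described for ActualBudget and authentik.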
Additional middleware:
@ -579,7 +552,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, and tripit/health/authentik/dawarich each 100/1000 (SPA or asset-heavy page loads bursting past the default from one client IP).
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).
### Large Downloads or Uploads Truncate / Fail Partway


@ -1,103 +0,0 @@
# Vault Token Renewer Self-Heal Design
**Date**: 2026-07-03
**Status**: Approved (brainstorm complete; implementation pending)
**Owner**: wizard@devvm
**Supersedes**: the "version-only, no self-heal" scope choice recorded in
`docs/runbooks/vault-token-renew-devvm.md` (2026-06-07)
## Problem
`wizard@devvm` holds a maintenance-free periodic Vault token
(`token-devvm-wizard`, `period=768h`, renewed daily by the
`vault-token-renew` user timer) precisely so no weekly re-login is needed.
But `~/.vault-token` is the Vault CLI's default token sink, so any
`vault login -method=oidc` — which the infra docs themselves instruct before
applies — overwrites it with a 7-day OIDC token. The renewer's drift guard
(deliberately detect-only) then refuses to renew the foreign token and fails
the unit daily, into a log nobody watches.
Observed consequence: a self-perpetuating weekly-expiry loop. The OIDC token
expires after 7 days → Vault 403s → the natural response is another
`vault login -method=oidc` → clobbers again. Drift persisted unnoticed
2026-06-18 → 06-26 and 2026-06-29 → 07-03 (memory #7121); Viktor experienced
it as "the token expires maybe once a week".
**Goal**: `vault login -method=oidc` becomes harmless on devvm. The renewer
converts any admin-capable clobber back into the permanent periodic token,
unattended. (Chosen over "never log in" doc-fixes and over instant path-unit
healing — see Alternatives.)
## Decisions
| # | Decision | Notes |
|---|----------|-------|
| 1 | Heal in the existing renewer's drift branch, at its nightly run | ~20-line diff to an already-tested script; no new units. A few-hours window holding the 7-day OIDC token is harmless (heal window 24h ≪ 7d TTL) |
| 2 | Heal = *attempt* re-mint using the foreign token itself; let Vault's 403 decide | No policy-list guessing — identity-vs-token-policies burned us before (memory #4211). OIDC tokens carry `vault-admin` via `identity_policies`, so the create succeeds |
| 3 | Weak foreign token (create denied) → keep today's loud DRIFT failure | A read-only clobber (e.g. the 2026-06-05 `kubernetes-woodpecker-default` incident) signals a misbehaving agent flow; auto-papering over it would hide the offender. Log gains a "heal denied — investigate what wrote it" suffix |
| 4 | Do NOT revoke the clobbering OIDC token | It may still back the user's live login session; it ages out in 7 days on its own |
| 5 | After a successful heal, revoke stale `token-devvm-wizard` accessors | Anti-sprawl: each heal would otherwise strand the previous periodic **admin** token server-side for up to 32 days. Walk `auth/token/accessors`, revoke every `display_name=token-devvm-wizard` except the just-minted one. Runs only on heal (rare), never on the happy path |
| 6 | Minted-token sanity check before writing the file | Look up the new token; require `display_name=token-devvm-wizard`. Write via temp file + `mv` + `chmod 600` so a failed mint can never truncate `~/.vault-token` |
| 7 | Keep timer cadence (daily) and all happy-path behavior unchanged | |
| 8 | No notification plumbing in this change | devvm alerting is tracked separately (beads `code-aslh`). Heal events are logged; heal-denied/FAIL still fail the unit |
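Decision 6's write path can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: `write_token_atomically` is a hypothetical name, and the temp file is created next to the destination so that `mv` is a same-filesystem rename.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): replace <dest> with <token> without
# ever leaving a truncated or half-written file behind.
write_token_atomically() {
  local token="$1" dest="$2" tmp
  # Refuse to write an empty token: a failed mint must not clobber the file.
  [ -n "$token" ] || return 1
  # mktemp creates the file with mode 0600; same directory as the destination.
  tmp=$(mktemp "${dest}.XXXXXX") || return 1
  printf '%s' "$token" >"$tmp" || { rm -f "$tmp"; return 1; }
  chmod 600 "$tmp"
  # rename(2) on one filesystem is atomic: readers see old or new, never partial.
  mv "$tmp" "$dest"
}
```

A caller would use it as `write_token_atomically "$new_token" "$HOME/.vault-token"`, and only after the minted token has passed the sanity check.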
## Behavior matrix
| Token found in `~/.vault-token` | Before | After |
|---|---|---|
| Our periodic token | renew-self, log `OK` | unchanged |
| Foreign, admin-capable (OIDC login) | log `DRIFT`, exit 1 | re-mint periodic token with it, sanity-check, atomic write, revoke stale periodic accessors, log `HEALED: re-minted from foreign dn=<dn> (revoked N stale)`, exit 0 |
| Foreign, weak (read-only k8s clobber) | log `DRIFT`, exit 1 | log `DRIFT … heal denied — foreign token lacks create authority; investigate what wrote it`, exit 1 |
| Vault unreachable / lookup fails | log `FAIL`, exit 1 | unchanged |
Re-mint command (identical to the manual recovery the DRIFT log already
prescribes):
```
vault token create -orphan -period=768h \
-policy=vault-admin -policy=sops-admin -display-name=devvm-wizard
```
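The "After" column of the behavior matrix reduces to a small decision function. The sketch below is illustrative only: `vtr_expected_outcome` and its three string arguments are hypothetical and stand in for the real lookup, drift-guard, and mint results.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): maps the three observations the
# renewer makes onto the log keyword and exit code from the behavior matrix.
#   $1 lookup: "ok" if the token lookup succeeded, anything else otherwise
#   $2 owner:  "ours" if the token passes the drift guard, else "foreign"
#   $3 mint:   "allowed" if the foreign token could create the periodic token
vtr_expected_outcome() {
  local lookup="$1" owner="${2:-}" mint="${3:-}"
  if [ "$lookup" != "ok" ]; then
    echo "FAIL exit=1"       # Vault unreachable or lookup failed
  elif [ "$owner" = "ours" ]; then
    echo "OK exit=0"         # happy path: renew-self
  elif [ "$mint" = "allowed" ]; then
    echo "HEALED exit=0"     # admin-capable clobber, re-minted
  else
    echo "DRIFT exit=1"      # weak clobber, heal denied
  fi
}
```

Note that the mint result is only consulted for a foreign token, which mirrors decision 2: Vault's own authorization, not a policy list, decides whether a heal happens.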
## Testing
- **Unit** (`scripts/test-vault-token-renew.sh`, existing source-the-functions
harness): new pure functions for (a) the stale-accessor revoke filter
(match on `display_name`, exclude the current accessor) and (b) the
minted-token sanity predicate; regression cases for the existing drift
predicate stay green.
- **Live, post-deploy** (on devvm):
1. Mint a fake 1h admin token (`-display-name=fake-oidc`,
`-policy=vault-admin -policy=sops-admin`), write to `~/.vault-token`,
start the service → expect `HEALED`, file holds `token-devvm-wizard`.
2. Mint a fake 10m no-privilege token (`-policy=default`), write it, start
the service → expect `DRIFT … heal denied`, unit `failed`; restore real
token.
3. Revoke both fakes; one-off sweep of stale periodic accessors left by the
June 26 / July 3 manual re-mints.
## Docs & rollout
- Same commit rewrites the runbook's "Drift guard & recovery" section:
self-heal is the recovery for admin-capable clobbers; manual re-mint remains
only for weak clobbers (or a dead token with no admin-capable replacement in
the file).
- `vault login -method=oidc` instructions across the docs stay as-is — the
login is now harmless by design.
- Deploy per the runbook's manual model: `install -m 0755` to
`~/.local/bin/vault-token-renew`. Units unchanged — no daemon-reload.
- After landing: update memories #4204/#4211 (gotcha now self-healing).
## Alternatives considered
- **Instant heal** (systemd path unit + protected source-copy of the token):
strictly more capable (seconds-latency, heals weak clobbers too, zero
re-minting), but 2 new units + a second secret file + inotify re-trigger
edge cases — machinery disproportionate to the residual risk. Revisit only
if the few-hour heal window ever bites.
- **Vault CLI `token_helper` interception**: right interception point in
theory, but a helper bug breaks every `vault` CLI call, Terraform reads
`~/.vault-token` natively anyway, and it adds latency inside login. Rejected.
- **Docs-only ("never log in")**: rejected by user — the login should keep
working, not become forbidden knowledge.
- **Raise the OIDC role's 7-day `token_max_ttl`**: shared role, affects every
OIDC user; rejected previously for the same reason (memory #4205).


@ -1,443 +0,0 @@
# Vault Token Renewer Self-Heal Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make `vault login -method=oidc` harmless on devvm — the nightly renewer re-mints the permanent periodic token from any admin-capable clobber of `~/.vault-token`, unattended.
**Architecture:** Extend the drift branch of `scripts/vault-token-renew.sh` (deployed to `~/.local/bin/vault-token-renew`, driven by an existing systemd user timer). On drift, *attempt* the re-mint with the clobbering token itself and let Vault's 403 be the authority; sanity-check the minted token, replace the file atomically, then revoke stale `token-devvm-wizard` leftovers. Weak clobbers keep today's loud failure. Design: `docs/plans/2026-07-03-vault-token-self-heal-design.md`.
**Tech Stack:** bash + jq + vault CLI; existing test harness `scripts/test-vault-token-renew.sh` (sources the script, `vtr_main` is guarded).
**Working copy:** everything below runs in the worktree
`~/code/infra/.worktrees/vault-token-self-heal` on branch `wizard/vault-token-self-heal`.
Per repo policy, EVERY git command in this git-crypt repo worktree carries:
`-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false`
(abbreviated as `$GCFLAGS` below; define once per shell:
`GCFLAGS="-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false"`
and use it unquoted: `git $GCFLAGS <verb> …`).
---
### Task 1: Unit tests for the two new pure functions (RED)
**Files:**
- Modify: `scripts/test-vault-token-renew.sh` (append before the final `printf`/exit lines)
- [ ] **Step 1: Append the failing tests**
Insert this block immediately after the existing "parse + decide end-to-end" section (after the line `no "oidc: parse+decide refused" …`, before the final `printf '\n%d passed…'`):
```bash
# --- vtr_accessor: parse accessor out of lookup JSON ---
LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}'
eq "accessor parsed" "acc-new" "$(vtr_accessor "$LOOKUP_NEW")"
eq "accessor absent -> empty" "" "$(vtr_accessor '{"data":{"display_name":"x"}}')"
# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard
# --- tokens are swept; the just-minted token, foreign tokens, and anything with an
# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe).
STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}'
ok "older periodic token is stale" vtr_is_stale_periodic "$STALE_OURS" "acc-new"
no "the just-minted token is kept" vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new"
no "foreign oidc token never swept" vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new"
no "woodpecker token never swept" vtr_is_stale_periodic "$LOOKUP_WP" "acc-new"
no "missing accessor never swept" vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new"
no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" ""
```
(`LOOKUP_OIDC` / `LOOKUP_WP` and the `ok`/`no`/`eq` helpers already exist in the file.)
- [ ] **Step 2: Run tests, verify they fail**
Run: `bash scripts/test-vault-token-renew.sh`
Expected: FAILs / `command not found` for `vtr_accessor` and `vtr_is_stale_periodic`; the 17 pre-existing tests stay green.
### Task 2: Implement the pure functions (GREEN)
**Files:**
- Modify: `scripts/vault-token-renew.sh` (insert after `vtr_drift_ok()`, before `vtr_main()`)
- [ ] **Step 1: Add the two functions**
```bash
# vtr_accessor <lookup-json> -> the token accessor (empty if absent).
vtr_accessor() {
  printf '%s' "$1" | jq -r '.data.accessor // ""'
}
# vtr_is_stale_periodic <lookup-json> <keep-accessor> -> 0 if this lookup
# describes one of OUR periodic tokens (display name matches) that is NOT the
# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise.
# Name-only on purpose (no policy check): anything named token-devvm-wizard
# that isn't the current token is garbage from a previous mint. An empty
# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know
# which token is current).
vtr_is_stale_periodic() {
  local dn acc
  [ -n "${2:-}" ] || return 1
  dn=$(vtr_display_name "$1")
  acc=$(vtr_accessor "$1")
  [ "$dn" = "$EXPECTED_DN" ] || return 1
  [ -n "$acc" ] || return 1
  [ "$acc" != "$2" ]
}
```
- [ ] **Step 2: Run tests, verify all pass**
Run: `bash scripts/test-vault-token-renew.sh`
Expected: `25 passed, 0 failed`, exit 0.
- [ ] **Step 3: Commit**
```bash
cd ~/code/infra/.worktrees/vault-token-self-heal
git $GCFLAGS add scripts/vault-token-renew.sh scripts/test-vault-token-renew.sh
git $GCFLAGS commit -m "vault-token-renew: pure helpers for the self-heal revoke filter
vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic
decides which old token-devvm-wizard tokens a heal may revoke (never the
just-minted one, never foreign tokens, nothing when the keeper is unknown).
TDD red-green for the heal branch that lands next."
```
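To see the revoke filter's rules in isolation, here is a minimal sketch that applies the same three conditions to plain text instead of lookup JSON. It is not the script's code: `select_stale_accessors` is a hypothetical helper, the `<accessor> <display_name>` line format is an assumption made to avoid the `jq` dependency, and `oidc-example` in the usage note is a made-up display name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): reads "<accessor> <display_name>"
# lines on stdin and prints the accessors a heal would be allowed to revoke.
#   $1 keep: accessor of the just-minted token (never revoked)
select_stale_accessors() {
  local keep="$1"
  # Fail-safe: with no known keeper, drain the input and select nothing.
  if [ -z "$keep" ]; then
    cat >/dev/null
    return 0
  fi
  # NF == 2 skips entries with a missing accessor; the name must match ours
  # and the accessor must differ from the one being kept.
  awk -v keep="$keep" -v dn="token-devvm-wizard" \
    'NF == 2 && $2 == dn && $1 != keep { print $1 }'
}
```

Fed the lines `acc-old token-devvm-wizard`, `acc-new token-devvm-wizard`, and `acc-x oidc-example` with `acc-new` as the keeper, it prints only `acc-old`, which matches the cases the unit tests in Task 1 assert.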
### Task 3: The heal branch (`vtr_heal` + `vtr_main` wiring)
**Files:**
- Modify: `scripts/vault-token-renew.sh`
- [ ] **Step 1: Add `vtr_heal` after `vtr_is_stale_periodic()`, before `vtr_main()`**
```bash
# vtr_heal <foreign-dn> <log-file> -> 0 if ~/.vault-token was re-minted back to
# our periodic admin token using the foreign token's own authority, 1 if the
# heal was denied or failed (caller exits non-zero; the unit goes failed).
#
# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md):
# an OIDC login — which the infra docs prescribe before applies — clobbers
# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed
# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the
# clobbering token itself and let Vault's authz decide — a read-only clobber
# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud
# failure, because it signals a misbehaving flow that someone should look at.
vtr_heal() {
  local foreign_dn="$1" log="$2"
  local errf new_token new_info new_dn new_pols new_acc tmp
  errf=$(mktemp)
  if ! new_token=$(vault token create -orphan -period=768h \
      -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
      -field=token 2>"$errf") || [ -z "$new_token" ]; then
    printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
      "$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log"
    rm -f "$errf"
    return 1
  fi
  rm -f "$errf"
  # Sanity: the minted token must itself pass the drift guard before it may
  # replace ~/.vault-token.
  if ! new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then
    printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \
      "$(date -Is)" "$new_info" >>"$log"
    return 1
  fi
  new_dn=$(vtr_display_name "$new_info")
  new_pols=$(vtr_policies_csv "$new_info")
  if ! vtr_drift_ok "$new_dn" "$new_pols"; then
    printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \
      "$(date -Is)" "$new_dn" "$new_pols" >>"$log"
    return 1
  fi
  # Atomic replace: mktemp files are 0600 from birth; same-filesystem mv.
  tmp=$(mktemp "$HOME/.vault-token.XXXXXX")
  printf '%s' "$new_token" >"$tmp"
  mv "$tmp" "$HOME/.vault-token"
  # Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would
  # otherwise strand the prior periodic ADMIN token server-side for up to 32d.
  # The clobbering foreign token is deliberately NOT revoked: it may still back
  # the user's live login session, and it ages out on its own (7d for OIDC).
  local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0
  new_acc=$(vtr_accessor "$new_info")
  if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then
    while IFS= read -r a; do
      [ -n "$a" ] || continue
      a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue
      if vtr_is_stale_periodic "$a_info" "$new_acc"; then
        VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1))
      fi
    done < <(printf '%s' "$accessors" | jq -r '.[]')
    sweep="revoked $revoked stale periodic token(s)"
  fi
  printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \
    "$(date -Is)" "$foreign_dn" "$sweep" >>"$log"
}
```
- [ ] **Step 2: Rewire the drift branch in `vtr_main`**
Replace this exact block (comment + if):
```bash
# Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive.
# On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token
# with a read-only woodpecker token, and this script then silently renewed THAT
# for two days — masking the loss of write access. So before renewing, confirm
# the token is our periodic admin token; if it has drifted, fail loudly (systemd
# marks the unit failed) instead of keeping someone else's token alive.
if ! vtr_drift_ok "$dn" "$pols"; then
printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
"$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log"
exit 1
fi
```
with:
```bash
# Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not
# keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was
# silently renewed for two days, masking lost write access). But detect-only
# drift proved worse in practice: an OIDC login — which the infra docs
# prescribe before applies — clobbers this file too, and the resulting DRIFT
# failures went unnoticed for weeks while access degraded to a 7-day token
# (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal):
# re-mint the periodic token with the clobbering token's own authority.
# Vault's authz keeps the old guarantee — a token that couldn't legitimately
# hold vault-admin is denied the mint, and we still fail loud.
if ! vtr_drift_ok "$dn" "$pols"; then
vtr_heal "$dn" "$log" || exit 1
exit 0
fi
```
- [ ] **Step 3: Syntax + lint + regression check**
Run: `bash -n scripts/vault-token-renew.sh && bash scripts/test-vault-token-renew.sh; command -v shellcheck >/dev/null && shellcheck scripts/vault-token-renew.sh`
Expected: syntax OK, `25 passed, 0 failed`; shellcheck (if installed) reports nothing new.
- [ ] **Step 4: Commit**
```bash
git $GCFLAGS add scripts/vault-token-renew.sh
git $GCFLAGS commit -m "vault-token-renew: self-heal the periodic token on admin-capable clobber
Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC
login the docs prescribe kept clobbering ~/.vault-token with a 7-day token,
and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry
loop, twice in June). On drift the renewer now re-mints the periodic token
with the clobbering token's own authority (Vault's 403 is the judge — no
policy guessing), sanity-checks it, replaces the file atomically, and
revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still
fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md"
```
### Task 4: Docs — runbook + test-file header
**Files:**
- Modify: `docs/runbooks/vault-token-renew-devvm.md` (the `## Drift guard & recovery` section + the healthy-log-line note + `## Tests`)
- Modify: `scripts/test-vault-token-renew.sh` (header comment only)
- [ ] **Step 1: Replace the runbook's `## Drift guard & recovery` section with:**
```markdown
## Drift guard & self-heal
`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
overwrites it. Two confirmed clobber vectors:
1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs
prescribe this login before applies, so it recurs — it went unnoticed for
weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires
weekly".
2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
**cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days.
Since 2026-07-03 the renewer **self-heals**
(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token
it attempts the re-mint **with the clobbering token's own authority** and lets
Vault's authz decide:
- **Admin-capable clobber (OIDC login)** → re-mints the periodic token,
sanity-checks it against the drift guard, atomically replaces
`~/.vault-token`, revokes stale `token-devvm-wizard` leftovers
(anti-sprawl), logs
`HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))`
and exits 0. The clobbering token is NOT revoked — it may still back a live
login session; it ages out on its own.
- **Weak clobber (read-only k8s token)** → the mint is denied; logs
`DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it`
and exits non-zero (unit `failed`). Deliberately loud: this signals a
misbehaving agent flow — exactly the 2026-06-05 case.
**Manual recovery** is only needed for the weak-clobber case (the DRIFT log
line still contains the exact command) — run the
[mint/re-mint](#mint--re-mint-the-token) block.
```
- [ ] **Step 2: In the runbook's `## Health check` section**, after the "A healthy log line looks like…" sentence, add:
```markdown
After an OIDC login you'll instead see, at the next nightly run:
`<ts> HEALED: re-minted periodic token from foreign dn="oidc-…" (revoked N stale periodic token(s))` — that's the self-heal working as designed.
```
- [ ] **Step 3: In the runbook's `## Tests` section**, replace the first sentence with:
```markdown
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision,
the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
case), and the self-heal's revoke filter (which stale periodic tokens a heal
may sweep).
```
- [ ] **Step 4: Update the test file's header comment** (lines 2-7) to:
```bash
# Unit tests for the pure functions in vault-token-renew.sh.
# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard
# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign
# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker
# clobber be silently renewed for two days, and (b) the self-heal's revoke
# filter — which stale token-devvm-wizard tokens a heal may sweep.
# Run: bash infra/scripts/test-vault-token-renew.sh
```
- [ ] **Step 5: Run tests once more, then commit**
Run: `bash scripts/test-vault-token-renew.sh`
Expected: `25 passed, 0 failed`.
```bash
git $GCFLAGS add docs/runbooks/vault-token-renew-devvm.md scripts/test-vault-token-renew.sh
git $GCFLAGS commit -m "vault-token-renew runbook: document the self-heal behavior
Drift guard section rewritten: admin-capable clobbers now self-heal at the
nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure;
manual re-mint is only the weak-clobber recovery now."
```
### Task 5: Deploy + live verification (on devvm, as wizard)
**Files:** none (host deploy + live checks)
- [ ] **Step 1: Install from the worktree**
```bash
install -m 0755 ~/code/infra/.worktrees/vault-token-self-heal/scripts/vault-token-renew.sh ~/.local/bin/vault-token-renew
```
(Units unchanged — no `daemon-reload` needed.)
- [ ] **Step 2: Live case 1 — admin-capable clobber heals**
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
export XDG_RUNTIME_DIR=/run/user/$(id -u)
FAKE_ADMIN=$(vault token create -ttl=1h -policy=vault-admin -policy=sops-admin -display-name=fake-oidc -field=token)
printf '%s' "$FAKE_ADMIN" > ~/.vault-token
systemctl --user start vault-token-renew.service; echo "exit=$?"
tail -1 ~/.local/state/vault-token-renew.log
vault token lookup | grep -E 'display_name|period'
```
Expected: `exit=0`; log line `HEALED: re-minted periodic token from foreign dn="token-fake-oidc" (revoked N stale periodic token(s))` with N ≥ 1 (the pre-clobber periodic token is itself swept as stale — by design — along with any strays from the June 26 / July 3 manual re-mints); lookup shows `display_name token-devvm-wizard`, `period 768h`. Note: `FAKE_ADMIN` is a child of the swept old token, so the cascade revokes it too — no cleanup needed.
- [ ] **Step 3: Verify exactly ONE periodic token remains server-side**
```bash
for a in $(vault list -format=json auth/token/accessors | jq -r '.[]'); do
  vault token lookup -format=json -accessor "$a" 2>/dev/null \
    | jq -r 'select(.data.display_name=="token-devvm-wizard") | .data.accessor'
done
```
Expected: exactly one line, matching `vault token lookup -format=json | jq -r .data.accessor`.
- [ ] **Step 4: Live case 2 — weak clobber stays a loud failure**
```bash
GOOD=$(cat ~/.vault-token)
FAKE_WEAK=$(vault token create -ttl=10m -policy=default -display-name=fake-weak -field=token)
printf '%s' "$FAKE_WEAK" > ~/.vault-token
systemctl --user start vault-token-renew.service; echo "exit=$?"
systemctl --user is-failed vault-token-renew.service
tail -1 ~/.local/state/vault-token-renew.log
printf '%s' "$GOOD" > ~/.vault-token && chmod 600 ~/.vault-token
vault token revoke "$FAKE_WEAK" >/dev/null
```
Expected: `exit=1` (start reports the oneshot failure), `is-failed` prints `failed`, log line `DRIFT: ~/.vault-token is dn="token-fake-weak" — heal denied, foreign token lacks create authority (… permission denied …); investigate what wrote it. Manual re-mint: …`.
- [ ] **Step 5: Happy path still green**
```bash
systemctl --user start vault-token-renew.service; echo "exit=$?"
tail -1 ~/.local/state/vault-token-renew.log
```
Expected: `exit=0`, log `OK renewed (dn=token-devvm-wizard ttl=2764800s)`.
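The three live outcomes above (healed, loud drift, plain renewal) can be told apart mechanically from the last log line. The following is a minimal sketch, not part of `vault-token-renew.sh`: `classify_renew_line` is a hypothetical helper, and it matches only the log fragments quoted in this plan.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not in the renewer script): map one renewer
# log line to the outcome it represents, using the fragments quoted in Task 5.
classify_renew_line() {
  local line=$1
  case "$line" in
    *"HEALED: re-minted periodic token"*) echo healed ;;   # admin-capable clobber
    *"DRIFT:"*)                           echo drift ;;    # weak clobber, loud failure
    *"OK renewed"*)                       echo ok ;;       # happy path
    *)                                    echo unknown ;;
  esac
}

# Usage on the devvm (log path as used in the steps above):
#   classify_renew_line "$(tail -1 ~/.local/state/vault-token-renew.log)"
```

Matching on fixed fragments keeps the check independent of the timestamp prefix and of the `N` in the revoke count.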
### Task 6: Land on master + cleanup
- [ ] **Step 1: Merge latest master into the branch, re-verify, push**
```bash
cd ~/code/infra/.worktrees/vault-token-self-heal
git $GCFLAGS fetch forgejo
git $GCFLAGS merge forgejo/master
bash scripts/test-vault-token-renew.sh
git $GCFLAGS push forgejo HEAD:master
```
Expected: clean merge (or already up to date), `25 passed, 0 failed`, push accepted. Non-fast-forward → fetch, merge, push again.
- [ ] **Step 2: Watch CI to completion**
The push fires the infra Woodpecker `default.yml` (terragrunt apply for changed stacks). This change touches only `scripts/` + `docs/` → expect a fast success / no-op apply. Check (Forgejo-forge infra repo = Woodpecker repo id 82):
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
vault kv get -format=json secret/ci/global | jq -r '.data.data | keys[]' # find the woodpecker admin token key
WP_TOKEN=$(vault kv get -field=<that-key> secret/ci/global)
curl -s -H "Authorization: Bearer $WP_TOKEN" 'https://ci.viktorbarzin.me/api/repos/82/pipelines?perPage=1' | jq '.[0] | {number, status, commit: .commit[0:8]}'
```
Expected: the pipeline for the pushed commit reaches `status: "success"` (poll until terminal). If it fails, fix before proceeding.
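The "poll until terminal" instruction can be sketched as a bounded loop around the same `curl` call. This is an illustration under assumptions: `is_terminal_status` and `poll_pipeline` are hypothetical names, and the list of terminal states is a guess at Woodpecker's pipeline statuses that should be checked against the deployed version.

```shell
#!/usr/bin/env bash
# Assumption: these are the states after which a pipeline no longer changes.
is_terminal_status() {
  case "$1" in
    success|failure|error|killed|declined|skipped) return 0 ;;
    *) return 1 ;;
  esac
}

# poll_pipeline URL TOKEN [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Returns 0 on success, 1 on any other terminal state, 2 on timeout.
poll_pipeline() {
  local url=$1 token=$2 interval=${3:-15} attempts=${4:-80} status
  while [ "$attempts" -gt 0 ]; do
    status=$(curl -fsS -H "Authorization: Bearer $token" "$url" | jq -r '.[0].status')
    echo "status=$status"
    if is_terminal_status "$status"; then
      if [ "$status" = success ]; then return 0; else return 1; fi
    fi
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  echo "gave up waiting for a terminal status" >&2
  return 2
}

# Usage:
#   poll_pipeline 'https://ci.viktorbarzin.me/api/repos/82/pipelines?perPage=1' "$WP_TOKEN"
```

The attempt cap keeps a stuck or unreachable CI from blocking the cleanup step indefinitely.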
- [ ] **Step 3: Remove worktree + branch, reconcile main checkout**
```bash
git -C ~/code/infra $GCFLAGS worktree remove .worktrees/vault-token-self-heal
git -C ~/code/infra $GCFLAGS branch -d wizard/vault-token-self-heal
git -C ~/code/infra status --porcelain # expect clean before pulling
git -C ~/code/infra $GCFLAGS pull --ff-only forgejo master
```
Expected: worktree gone, branch deleted (already merged), main checkout fast-forwards to the landed commit.
### Task 7: Memory + wrap-up
- [ ] **Step 1: Update the stale memories** (they say the drift guard is detect-only / recovery is manual):
```bash
homelab memory recall "vault periodic token renewer drift" # confirm ids 4204, 4211, 7121 still say detect-only
homelab memory update 4211 "<original gotcha content, amended: since 2026-07-03 the renewer SELF-HEALS admin-capable clobbers at its nightly run (re-mints the periodic token with the clobbering token's authority + revokes stale token-devvm-wizard leftovers; weak clobbers still fail loudly). An OIDC login on devvm is now harmless. Design: infra docs/plans/2026-07-03-vault-token-self-heal-design.md>"
homelab memory update 7121 "<original content, amended: PLAYBOOK OBSOLETE for admin clobbers (self-heal shipped 2026-07-03); manual re-mint only needed for weak/read-only clobbers>"
```
(Fetch each memory's current text first and preserve it — amend, don't replace wholesale.)
- [ ] **Step 2: End-of-task extraction** — dispatch the standard M.3 memory-mining subagent per `~/.claude/rules/execution.md`, then give the final summary.
---
## Plan self-review (done at write time)
- **Spec coverage**: heal-on-admin-clobber (T3), loud-fail-on-weak (T3 + live T5.4), no-revoke-foreign (T3 comment + design decision 4), anti-sprawl sweep + fail-safe filter (T2/T3, live T5.3), minted-token sanity + atomic write (T3), unit tests (T1/T2), runbook (T4), deploy + live sim (T5), memory updates (T7). ✓
- **Placeholders**: `<that-key>` in T6.2 is a deliberate discovery step (key name verified live from Vault, not invented). No other TBDs. ✓
- **Name consistency**: `vtr_accessor`, `vtr_is_stale_periodic`, `vtr_heal`, `EXPECTED_DN` match across tasks; test count 17→25 consistent (8 new cases). ✓


@ -1,335 +0,0 @@
# Backup MX — self-hosted store-and-forward relay on Oracle Always-Free — design
Date: 2026-07-04 (v3 — post-challenge; v2 Oracle pivot same day) · Status: design,
pre-implementation · ADR: [0019](../adr/0019-backup-mx-self-hosted-oracle-relay.md)
v3 incorporates two independent adversarial-challenge reviews (same day). Their
material corrections are marked **[CH]** throughout — the largest: the v2 drain
path would never have drained (primary-side smtpd rejects), monitoring-over-
tailnet was fiction (no cluster→tailnet route exists), and the VM's bounce
model was wrong (it can never deliver a DSN).
## Goal
Inbound mail for `viktorbarzin.me` must survive homelab outages without loss.
Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is
acceptable; budget is $0** (hard constraint — reaffirmed after the Rollernet
gates failed). A store-and-forward backup MX queues mail while the homelab is
down and re-delivers when it returns.
Out of scope, explicitly:
- Reading new mail *during* an outage.
- Outbound mail during outages.
- The "primary up but hard-bouncing 5xx" misconfig class — a backup MX is
never consulted when the primary answers. Separate hardening/alerting track.
Known residual limit (state it plainly): an outage **longer than 30 days**
loses the queued mail *silently* — the VM cannot emit a bounce to anyone
(egress 25 blocked), so no sender ever learns. Accepted; 30 days is already
6× the sender-retry status quo.
## v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04)
v1 selected Roller Network's free Secondary MX. The validation gates killed it
before any DNS change:
- **G2 FAILED**: the [free-accounts policy](https://rollernet.us/policy/free-accounts.html)
caps free mail service at **200 relayed messages or 10 MB per rolling 7
days**; overage → domain suspended **48 h answering SMTP 5xx** (permanent
bounces), repeatable. Spammers deliberately target backup MXes even while
the primary is up, so background spam alone can hold the domain suspended —
worse than no backup MX.
- **G1 SHAKY**: same policy page says free accounts are being discontinued.
- **G3 PASSED** (for posterity): `mail{,2}.rollernet.us` present valid LE
certs over STARTTLS.
- Signup is Cloudflare-Turnstile-gated — moot given G1/G2.
Viktor's decision: stay free → self-host on Oracle Always-Free. **[CH]** The
external challenger re-searched the free landscape (DNSExit, KisoLabs,
DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed:
no credible free managed backup-MX or free VM with a usable port-25 story
exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and
is US-regions-only (wrong continent).
## Decision
A minimal **Postfix store-and-forward relay** (`mx2.viktorbarzin.me`) on an
Oracle Cloud **Always-Free** compute instance, published as a lower-preference
MX. It accepts mail for `viktorbarzin.me` when the primary is unreachable,
queues up to 30 days, and drains to the primary when it returns. No mailboxes,
no third-party terms — the queue-lifetime and reject-behavior knobs are ours.
## Architecture
```
                         ┌── pri 1  mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
sender MTA ──► MX lookup ┤                                        ▲
                         └── pri 20 mx2.viktorbarzin.me           │ drain: smtp to
                             (Oracle VM, Postfix relay,           │ mail.viktorbarzin.me:2526
                              queue ≤ 30 days) ───────────────────┘ (pfSense WAN NAT rdr
                                                                     2526 → 10.0.20.1:25,
                                                                     existing HAProxy frontend)
```
- **Normal operation**: senders use pri 1; the VM idles (spammers targeting
the backup + transient-blip retries get relayed onward immediately).
- **Outage**: senders fall back to pri 20 → VM accepts + queues → Postfix
retries the primary on its native schedule → queue drains after recovery
through the standard external ingress path (PROXY v2 → :2525 → rspamd →
Dovecot).
- **Custom drain port**: Oracle blocks **egress TCP 25** tenancy-wide
(post-2021; exemptions unreliable) — the VM cannot reach
`mail.viktorbarzin.me:25`. One pfSense WAN NAT rule `TCP 2526 →
10.0.20.1:25` reuses the existing HAProxy frontend unchanged. **[CH]
Verified against the runbook**: the frontend binds `*:25` on pfSense (not
strictly 10.0.20.1), rdr dst-port rewrite is the existing production
pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides
with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 **to**
the VM is unaffected by Oracle's egress-only block per practitioner
evidence (iRedMail/mailcow on OCI: receive works, send doesn't) — **to be
proven at gate O2 before any DNS change** (Oracle publishes no positive
commitment).
## Oracle account & instance
- **Account**: Viktor creates it (human signup; card for identity, $0
charged). **Home region is fixed at signup and Always-Free compute exists
only there — choose `eu-frankfurt-1` deliberately; there is no
try-another-region fallback without a new account. [CH]**
- **[CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation**:
Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days; an
idle Postfix box qualifies) and demonstrably changes free-tier terms without
notice, enforcing by termination (June 2026: A1 allowance silently halved,
over-limit instances shut down). PAYG keeps Always-Free resources free and
exempts them from idle reclamation.
- **Shape**: `VM.Standard.E2.1.Micro` (x86, 1/8 OCPU burst, 1 GB RAM; 2
always-free instances allowed; ample for queue-only Postfix — and untouched
by the 2026 A1 cuts). ARM A1 fallback is **unreliable** (halved quota,
chronic Frankfurt capacity) — treat E2.1.Micro availability as the gate.
- **[CH] Reserved public IP is mandatory** (`oci_core_public_ip`, reserved):
an ephemeral IP rotates on stop/start and would silently break all four
IP-keyed controls at once (pfSense NAT source-restriction, the primary's
smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape
allowlist) — discovered only at the next outage's drain.
- **OS**: Ubuntu 24.04. **[CH] OCI Ubuntu images ship an OS-level iptables
ruleset (`/etc/iptables/rules.v4`) that ACCEPTs 22 and REJECTs everything
else, independent of security lists** — cloud-init must insert ACCEPT rules
for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2
fails on day 1 with a correct security list.
- **Credentials**: OCI API key for Terraform → Vault `secret/viktor`
(`oci_*`); web login → Vaultwarden item `Oracle Cloud (backup MX)`.
## Networking & security posture
- **Ingress on the VM**: TCP 25 world-open (the service). **[CH] TCP 80
world-open permanently** — Let's Encrypt validation is multi-perspective
with no published source IPs, so it cannot be source-scoped, and a
"open-only-during-renewal" toggle is unspecified automation whose realistic
failure mode is an expired cert at day ~90. Nothing listens on 80 outside
certbot's seconds-long renewal windows; connection-refused surface is
negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32
(176.12.22.76) in both the Oracle security list and the VM firewall.
- **No public SSH**: management rides the headscale tailnet — cloud-init
enrolls via a **preauth key for a dedicated non-OIDC headscale user** with
node tag `tag:backup-mx` (headscale 0.28.0 file-mode ACL, content in Vault
`secret/headscale`, key `headscale_acl`); SSH bound to the tailnet interface.
ACL grant: `group:admin → tag:backup-mx:22` (cluster pods are NOT tailnet
members — see monitoring). **[CH] Outage caveat**: headscale's control
plane + DERP live in the cluster, so mid-outage tailnet reachability is
cached-netmap best-effort — the runbook documents the **OCI instance
console connection as break-glass** management. (Also fix `vpn.md`'s stale
"0.23.x / OIDC-only" claims while in there.)
- **VM compromise blast radius**: plaintext of outage-queued mail + a relay
surface contained by `relay_domains = viktorbarzin.me` only, no submission
ports, no SASL, no local delivery. The VM is deliberately NOT added to the
primary's `mynetworks` (that would let a compromised VM relay arbitrary
mail *through* the primary) — per-stage exemptions instead, below.
## Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene)
- `relay_domains = viktorbarzin.me`; `mydestination =` (empty).
- **[CH]** `smtpd_relay_restrictions = permit_mynetworks,
reject_unauth_destination` — explicit 5xx for foreign-domain RCPTs (the
default tail is `defer_unauth_destination`, whose 4xx invites every relay
probe to retry forever).
- **[CH]** `relay_recipient_maps` explicitly set to the wildcard form
(`@viktorbarzin.me OK`) — documents accept-all-recipients as a decision
(the domain is catch-all; every RCPT is valid by definition).
- `transport_maps`: `viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526`.
- `maximal_queue_lifetime = 30d`. **[CH]** `bounce_queue_lifetime = 1d` and
`delay_warning_time = 0` — this host can never deliver a DSN to anyone
(egress 25 blocked; its only egress is 2526 to the primary), so undeliverable
bounces must be discarded quickly or they rot in the queue for a month and
permanently poison the queue-depth alert.
- **[CH]** `message_size_limit = 209715200` — exactly the primary's 200 MB
(`POSTFIX_MESSAGE_SIZE_LIMIT`, mailserver main.tf:88). The stock 10 MB
default would 552-reject large legitimate mail during outages — the exact
loss mode this project exists to prevent. Equal, never higher (higher
recreates drain-time rejects).
- **[CH] postscreen on the VM in 4xx-only posture**: pregreet test ON
(fire-and-forget bots don't retry; real MTAs do — the whole design already
rests on sender retry, so 4xx filtering is loss-free by construction),
optionally `postscreen_dnsbl_action = defer` with a conservative threshold.
v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly banned)
with 4xx tempfail (harmless); without any hygiene the backup is a 24/7
spam backdoor since spammers deliberately deliver to the highest-numbered
MX. Zero 5xx from reputation, ever.
- `inet_protocols = ipv4` **[CH]** — the primary publishes an AAAA (HE
tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted
v6 attempt per delivery.
- `smtpd_tls_cert_file` = LE cert for `mx2.viktorbarzin.me` (opportunistic
STARTTLS inbound; `smtp_tls_security_level = may` on the drain leg).
- Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day
accumulation for a personal domain.
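The scalar settings from the list above can be collected in one place and turned into `postconf -e` commands. This is a dry-run sketch only: `relay_settings` and `print_postconf_commands` are hypothetical names, the map-backed parameters (`transport_maps`, `relay_recipient_maps`) and the postscreen settings are left out because they need files and tuning this design does not spell out, and how cloud-init actually applies the config is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit the relay's scalar main.cf settings, one per line.
# Values are the ones stated in this design; nothing else is configured here.
relay_settings() {
  local size_limit=$((200 * 1024 * 1024))   # 200 MB, equal to the primary's limit
  cat <<EOF
relay_domains = viktorbarzin.me
mydestination =
smtpd_relay_restrictions = permit_mynetworks, reject_unauth_destination
maximal_queue_lifetime = 30d
bounce_queue_lifetime = 1d
delay_warning_time = 0
message_size_limit = ${size_limit}
inet_protocols = ipv4
smtp_tls_security_level = may
EOF
}

# Dry run: print the postconf invocations instead of executing them.
print_postconf_commands() {
  relay_settings | while IFS= read -r setting; do
    printf "postconf -e '%s'\n" "$setting"
  done
}

# Usage:
#   print_postconf_commands          # review
#   print_postconf_commands | sh     # apply on the VM, if the review looks right
```

Computing the size limit from `200 * 1024 * 1024` rather than pasting the literal makes the "equal to the primary, never higher" rule easier to audit.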
## TLS
certbot standalone HTTP-01 for `mx2.viktorbarzin.me` (no Cloudflare API token
on an internet-facing VM). Port 80 permanently open (see above); certbot renew
timer. The MTA-STS follow-up (separate task; policy host currently dangling —
below) must list `mx2.viktorbarzin.me` when implemented.
## Primary-side drain enablement **[CH — this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]**
The v2 exemptions targeted postscreen DNSBL (which is **off** on the primary —
`ENABLE_DNSBL` unset) and rspamd SPF/DMARC scoring — but missed the three
mechanisms that would actually break the drain. All are keyed on the VM's
reserved /32 (the PROXY-v2-recovered client IP):
1. **`reject_unknown_client_hostname` bypass** — the primary sets
`POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1` (main.tf:89); an Oracle IP
without full FCrDNS (PTR needs an Oracle SR; limited on free accounts)
would be **450-deferred on every drain attempt → the queue never drains →
mass-bounces at day 30**. Fix: `check_client_access` permit for the VM /32
early in `smtpd_client_restrictions`, and a matching permit at the sender
stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope
senders — drained self-addressed/bounced mail would 5xx). Attempt the
Oracle PTR anyway (belt and braces).
2. **Anvil rate-limit exception**: `smtpd_client_message_rate_limit = 30`/min
keys on the VM's IP at drain; a >3,600-message backlog would throttle for
hours and false-fire the queue alert. Add the VM /32 to
`smtpd_client_event_limit_exceptions`.
3. **rspamd: evaluate the original sender, never 5xx the drain stream** — via
the existing override.d ConfigMap pattern (same mount as
`dkim_signing.conf`): (a) configure rspamd's **`external_relay`** module
(ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the
*original* client IP parsed from the VM's Received header — this keeps
DMARC protection for the entire drain stream instead of v2's blanket
disable; (b) cap rspamd's **action at the VM /32 to tag/fold — never
milter-reject**: the primary's default reject tier (DMS default, active
since only dkim_signing is overridden today) would 5xx high-score spam at
DATA, forcing the VM to generate DSNs to forged senders = classic
backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in
the catch-all's Junk instead. Validate the external_relay ↔ settings-rule
interplay at gate O5 with a high-spam-score message.
4. postscreen permit for the /32 (harmless; pregreet never trips a real
Postfix client and DNSBL is off — kept for future-proofing only).
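The anvil item's "throttle for hours" claim follows from simple arithmetic, sketched below. `drain_minutes` is a hypothetical helper; it assumes the per-client message rate limit is the only bottleneck, which is optimistic, since deferred messages also wait out the VM's retry backoff.

```shell
#!/usr/bin/env bash
# drain_minutes BACKLOG RATE_PER_MINUTE
# Lower bound on drain time in minutes (rounded up) if the rate limit applied.
drain_minutes() {
  local backlog=$1 rate=$2
  echo $(( (backlog + rate - 1) / rate ))
}

drain_minutes 3600 30   # prints 120, i.e. two hours for a 3,600-message backlog
```

With the VM /32 in `smtpd_client_event_limit_exceptions`, this bound no longer applies and the drain is limited only by delivery concurrency.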
## Our-side changes (Terraform unless noted)
1. **New stack `stacks/backup-mx/`** (Tier 1): OCI provider (creds from
Vault), VCN + subnet + security list + **reserved public IP** +
`VM.Standard.E2.1.Micro` + cloud-init (`templatefile`): **OS iptables
ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule
(persisted)**, postfix + config above, certbot, tailscale→headscale
enrollment (preauth key from Vault), node_exporter, postfix_exporter,
unattended-upgrades.
2. **DNS** (`stacks/cloudflared/modules/cloudflared/cloudflare.tf`): A
`mx2.viktorbarzin.me` → reserved IP (non-proxied), MX pref 20 → `mx2`.
**[CH] Live zone count verified: 195/200 → 197/200 after this change; only
3 slots remain and the MTA-STS follow-up needs 1-2 → plan the next
record-purge now, not at collision time.**
3. **pfSense (live network device — approved as part of this plan)**: WAN NAT
rdr `TCP 2526 → 10.0.20.1:25` + firewall rule, source-restricted to the
reserved IP. **[CH] Scripted** (extend the existing
`scripts/pfsense-*-haproxy*.php` bootstrap-script family), not
hand-clicked — keeps the git-rebuildable parity the rest of the pfSense
mail config has. Config.xml rides the nightly backup.
4. **Mailserver stack**: the four-layer drain enablement above (client+sender
`check_client_access` permits, anvil exception, rspamd external_relay +
action cap, postscreen permit) — all keyed to one /32, via the existing
`postfix_cf` / `user-patches.sh` / rspamd-override hook points (verified
present: main.tf:129-144, 222-281, 467-474).
5. **Monitoring [CH — replaces v2's tailnet scraping, which had no transport:
no cluster→tailnet route exists and no existing target is scraped that
way]**: Prometheus scrapes `node_exporter`/`postfix_exporter` on the VM's
**public reserved IP**, allowed only from the homelab WAN /32 (Oracle SL +
VM firewall); blackbox TCP:25 from the cluster (`BackupMxDown`, warning);
MX-set drift assertion (both MX records present). Alerts:
`BackupMxQueueStuck` = **non-bounce** queue depth > 0 for 2 h while the
primary is healthy (gate on the existing `MailServerDown`/roundtrip
series, machine-readable — not prose); bounce residue is excluded by the
1-day bounce lifetime. Note: during a full homelab outage Prometheus
itself is down — queue growth is unobservable live under ANY transport;
what we actually watch is the post-recovery drain. A WAN-IP change stales
the Oracle allowlist → visible as ScrapeTargetDown (self-signaling).
**Probe semantics note**: once mx2 exists, the Brevo roundtrip probe's
mail fails over to mx2 on transient primary blips and arrives minutes late
via the drain — `EmailRoundtripFailing` may then mean "delayed via mx2",
not "lost"; note in the alert description and runbook.
6. **Docs (same commit as implementation)**: rewrite `mailserver.md` §"No
Backup MX", new runbook `docs/runbooks/backup-mx.md` (`postqueue -p`,
forced drain `postqueue -f`, cert renewal, **OCI console break-glass**, VM
rebuild from stack, Oracle account facts incl. PAYG + home-region lock),
`vpn.md` headscale-version/OIDC staleness fix, monitoring rows.
### MTA-STS finding (unchanged; no action in this change)
`_mta-sts` TXT is published but `mta-sts.viktorbarzin.me` has no record and
nothing serves the policy — MTA-STS is inert today. When fixed, the policy
MUST include `mx: mx2.viktorbarzin.me` (and budget its DNS records against the
3 remaining zone slots).
## Validation gates (in order; any failure → stop and report)
| # | Gate | Method | Failure handling |
|---|------|--------|------------------|
| O1 | Oracle account (home region `eu-frankfurt-1`, **fixed forever at signup**), **PAYG conversion done**, E2.1.Micro capacity | Viktor signs up + converts; TF apply | A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor |
| O2 | Inbound TCP 25 reachable from the internet (after the OS-iptables fix) | `nc -zv <reserved-ip> 25` from outside + recurring Uptime-Kuma TCP monitor (keeps proving it — Oracle publishes no commitment) | Stop; decision returns to Viktor |
| O3 | Drain works: VM → `mail.viktorbarzin.me:2526` delivers end-to-end | Test message injected on the VM | Debug pfSense NAT / HAProxy path |
| O4 | LE cert issued | certbot standalone | STARTTLS is opportunistic — non-blocking for go-live; fix before MTA-STS |
| O5 | Live failover test — **hardened [CH]** | presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo **plus a high-spam-score message and a >10 MB message** → confirm queued (`postqueue -p`) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on the VM, roundtrip probe recovers | Debug or roll back (remove MX record) |
## Failure modes
Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP
changes, short-retry senders. If pfSense is down the drain waits — Postfix
retries until it heals.
Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox
access; **outages > 30 days lose queued mail silently (no DSN possible)**.
Simultaneous Oracle+homelab outage = status quo ante (sender retries).
Newly introduced, accepted:
- **A pet outside the cluster** — deliberately cattle: rebuilt from TF +
cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a
backup target.
- **Oracle free-tier caprice [CH — upgraded from v2's framing]**: Oracle has
silently cut Always-Free allowances and terminated over-limit instances
(June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe,
`BackupMxDown`, and the fact that outside an active outage the queue is
empty — a surprise reclamation loses nothing, only coverage until rebuilt.
Rollernet Basic ($30/yr) stays the documented fallback if OCI sours.
- **Spam hygiene**: 4xx-only postscreen on the VM (pregreet + conservative
DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by
rspamd, never bounced.
- Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant;
accepted).
## Rollback
Remove the MX + A records; wait for `postqueue -p` empty; `terraform destroy`
on `backup-mx`; delete the pfSense NAT rule (scripted); drop the mailserver
/32 exemptions. Order matters: MX record first.
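The "wait for `postqueue -p` empty" step can be sketched as a small wait loop. The function names are hypothetical; the sketch relies on `postqueue -p` printing `Mail queue is empty` when nothing is queued.

```shell
#!/usr/bin/env bash
# queue_is_empty TEXT: true when the given `postqueue -p` output reports an empty queue.
queue_is_empty() {
  case "$1" in
    *"Mail queue is empty"*) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_empty_queue [INTERVAL_SECONDS]: block until the local queue has drained.
wait_for_empty_queue() {
  local interval=${1:-60}
  until queue_is_empty "$(postqueue -p)"; do
    sleep "$interval"
  done
}

# Usage on the VM, after the MX and A records are removed:
#   wait_for_empty_queue 60 && echo "safe to destroy the backup-mx stack"
```

Running this only after the MX record is gone matters: while the record is still published, new mail can keep arriving and the queue may never report empty.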
## Viktor's manual steps (everything else is mine)
1. Create the Oracle Cloud account — **home region `eu-frankfurt-1`** (fixed
forever), card for identity, $0 charged.
2. **Convert the tenancy to Pay-As-You-Go** (required — idle-reclamation
exemption; Always-Free stays $0).
3. Hand me the tenancy OCID + a console user → I mint the API key, store
creds (Vault + Vaultwarden), and build the stack.
4. Approve the (scripted) pfSense NAT rule when I reach that step.


@ -1,89 +0,0 @@
# Drone Logbook (Open DroneLog) — Design
**Date:** 2026-07-04
**Status:** Approved (Viktor, 2026-07-04)
**Owner request:** "I have a DJI Mini 4 Pro. I'm interested in github.com/ViktorBarzin/drone-logbook" → self-host it in the cluster.
## Goal
Self-host [Open DroneLog](https://github.com/arpanghosh8453/open-dronelog) (upstream of the
`ViktorBarzin/drone-logbook` fork) at **https://dronelog.viktorbarzin.me** so Viktor can import
DJI Fly flight logs from his DJI Mini 4 Pro and analyze them privately: telemetry charts, 3D map
replay, per-flight and lifetime stats. All data stays in the cluster (single DuckDB database).
## Decisions (interview, 2026-07-04)
| Question | Decision |
|---|---|
| Deployment form | Self-hosted Docker web app in k8s (not desktop app, not hosted webapp) |
| Exposure | Public `dronelog.viktorbarzin.me`, **Authentik forward-auth** (`auth = "required"`) |
| Log ingestion | **Both** manual web upload *and* a server-side auto-import drop folder from day one |
| Image source | **Upstream** `ghcr.io/arpanghosh8453/open-dronelog:latest` — NOT the fork |
| Fork disposition | Fork is 0 ahead / 372 behind, adds nothing; delete or park it. Only revive (sync + ADR-0002 GHA build) if Viktor starts modifying the code |
## Architecture
New Tier-1 stack `stacks/drone-logbook/`, modeled line-by-line on `stacks/freshrss/`
(the closest existing shape: single upstream-image app, own data volume, Keel-updated):
- **Namespace** `drone-logbook`, tier `4-aux`, label `keel.sh/enrolled=true` → Kyverno injects
Keel poll annotations → auto-upgrades as upstream releases (project is actively maintained).
- **Deployment** (1 replica, `Recreate` — DuckDB is single-writer/embedded):
- image `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx frontend + Axum REST backend, port 80)
- memory request=limit **512Mi** (DuckDB import/analytics spikes), cpu request 25m, no cpu limit
- standard `KYVERNO_LIFECYCLE_V1` / `KEEL_IGNORE_IMAGE` / `KEEL_LIFECYCLE_V1` lifecycle ignores
- **App data** `/data/drone-logbook` (DuckDB db, cached DJI decryption keys, uploaded originals):
**`proxmox-lvm-encrypted` block PVC** `drone-logbook-data-encrypted`, 2Gi, topolvm autoresize →
10Gi ceiling. Encrypted class because flight logs are GPS traces of home/travel — sensitive data
defaults to `proxmox-lvm-encrypted` per the storage decision rule (`.claude/CLAUDE.md`).
Embedded DBs stay off NFS (same rationale documented in the freshrss stack: NFS only for static files).
- **Backup CronJob** `drone-logbook-backup` (mandatory for every proxmox-lvm app): daily 01:30
file copy of the data volume → NFS `/srv/nfs/drone-logbook-backup` (dated dirs, 30-day retention,
Pushgateway metrics), pod-affinity co-scheduled with the app pod (RWO volume). 01:30 sits outside
the 00:00/08:00/16:00 sync-import windows so the DuckDB file is quiescent; retained upload
originals make even a torn copy recoverable by re-import. `nfs-mirror` (02:00) ships it to sda →
Synology offsite. Vaultwarden pattern.
- **Sync drop folder**: static NFS volume (`modules/kubernetes/nfs_volume`)
`192.168.1.127:/srv/nfs/drone-logbook/sync-logs`, mounted **read-only** at `/sync-logs`;
`SYNC_LOGS_PATH=/sync-logs`, `SYNC_INTERVAL="0 0 */8 * * *"` (every 8 h).
Any producer (Nextcloud sync, scp, a future phone pipeline) drops `.txt` logs there; the app
imports them automatically. `KEEP_UPLOADED_FILES=true` keeps re-importable originals in the PVC.
- **Ingress** via `ingress_factory`: `name = "dronelog"`, `auth = "required"` (Authentik
forward-auth), `dns_type = "proxied"`. External Uptime Kuma HTTPS monitor comes automatically
with the ingress annotation. Homepage tile (group "Media & Entertainment", icon `mdi-quadcopter`).
- **Secrets**: Vault KV `secret/drone-logbook` (`profile_creation_pass`) → ExternalSecret
(`vault-kv` ClusterSecretStore) → k8s secret `drone-logbook-secrets` → env
`PROFILE_CREATION_PASS`. Gates profile create/delete even for other Authentik-logged-in users.
No plan-time secret reads needed (no `data "kubernetes_secret"`).
No `DJI_API_KEY` — bundled default is fine at personal import volume; add later if rate-limited.
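
The growth path behind "2Gi, topolvm autoresize → 10Gi ceiling" can be worked through as plain arithmetic. This is a minimal sketch, assuming the autoresizer grows the claim by a percentage of its current size and clamps at the ceiling; the 100% step is the value used in the plan's PVC annotations, and `next_size_gi` is a hypothetical helper, not part of topolvm.

```shell
# Hypothetical helper: next PVC size in GiB after one autoresize step.
# Assumption: new size = current + current * increase%, clamped to the limit.
next_size_gi() {
  cur="$1"; increase_pct="$2"; limit="$3"
  new=$(( cur + cur * increase_pct / 100 ))
  if [ "$new" -gt "$limit" ]; then
    new="$limit"
  fi
  echo "$new"
}

# Walk from the initial 2Gi request to the 10Gi ceiling.
size=2
while [ "$size" -lt 10 ]; do
  size=$(next_size_gi "$size" 100 10)
  echo "resized to ${size}Gi"
done
```

Under these assumptions the volume reaches its ceiling in three resizes (4Gi, 8Gi, 10Gi), which is why the Terraform side has to ignore changes to the requested size.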
## Operational notes
- **DJI egress dependency**: importing a *new* log file requires the pod to reach DJI's servers
once (flight-log decryption key fetch; keys are then cached in the data dir). Remember this when
egress enforcement lands (Security wave 1, beads `code-8ywc`).
- The web UI is desktop-first; mobile is functional but basic.
- NFS host prerequisite: `/srv/nfs/drone-logbook/sync-logs` (root:www-data, 2775 — same shape as
sibling dirs) and `/srv/nfs/drone-logbook-backup` created on 192.168.1.127 and recorded in
`secrets/nfs_directories.txt`. `/srv/nfs` is exported whole-tree, so no `/etc/exports`
(`scripts/pve-nfs-exports`) change.
- Backup story = the daily app-level backup CronJob (above) + the host `daily-backup` LVM-snapshot
leg + original log files retained both in the drop folder and in the data volume
(`KEEP_UPLOADED_FILES=true`).
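
The dated-directory layout and 30-day retention of the app-level backup can be tried out safely in a scratch directory. The sketch below is illustrative only: `rotate_backups` and `demo_root` are hypothetical names, and the real job runs against the NFS backup path rather than a temp dir.

```shell
# Hypothetical helper mirroring the backup layout: one dated directory per
# run, directories older than the retention window removed.
rotate_backups() {
  backup_root="$1"; retention_days="$2"
  find "$backup_root" -maxdepth 1 -mindepth 1 -type d \
    -mtime +"$retention_days" -exec rm -rf {} +
}

# Demo in a throwaway directory instead of the real NFS path.
demo_root=$(mktemp -d)
mkdir "$demo_root/2000_01_01_01_30" "$demo_root/$(date +%Y_%m_%d_%H_%M)"
touch -t 200001010130 "$demo_root/2000_01_01_01_30"   # backdate the old run
rotate_backups "$demo_root" 30
ls "$demo_root"   # only the freshly dated directory remains
```

Retention keys off directory mtime, so a restore test that touches an old directory would also reset its age.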
## Alternatives considered
- **Build from the fork** (`ghcr.io/viktorbarzin/...` via GHA, ADR-0002): rejected for now — fork
has zero custom commits; a build chain adds maintenance for no benefit. Revisit if code changes
are wanted.
- **`auth = "app"` + app profile passwords** (would enable the `opendronelog-sync` native uploader
from anywhere): rejected — a single app password guarding GPS traces of home/travel on the open
internet is weaker than Authentik; the sync drop folder covers automated ingestion instead.
- **Internal-only (.lan + VPN)**: rejected — Authentik-gated public matches the rest of the
homelab and works without VPN while traveling.
- **NFS for the DuckDB data**: rejected — embedded-DB-on-NFS locking risk; freshrss precedent
keeps app DB data on proxmox-lvm.
## Implementation
See `2026-07-04-drone-logbook-plan.md`.

@ -1,542 +0,0 @@
# Drone Logbook (Open DroneLog) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Deploy Open DroneLog (DJI flight-log analyzer) at https://dronelog.viktorbarzin.me — new Tier-1 stack `stacks/drone-logbook/`, upstream image, Authentik-gated, with a DuckDB data PVC and an NFS auto-import drop folder.
**Architecture:** Single Deployment running `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx + Axum + DuckDB, port 80) in namespace `drone-logbook`; data on a `proxmox-lvm-encrypted` PVC (GPS logs = sensitive data), `/sync-logs` drop folder on static NFS, daily backup CronJob to `/srv/nfs/drone-logbook-backup` (vaultwarden pattern), `ingress_factory` with `auth = "required"`, Keel auto-upgrades via namespace enrollment. Modeled line-by-line on `stacks/freshrss/`. Design: `2026-07-04-drone-logbook-design.md`.
**Tech Stack:** Terraform/Terragrunt (Tier-1 PG state), Vault KV + ESO, ingress_factory, nfs_volume module, Keel/Kyverno.
Terraform is exempt from TDD (execution.md); each task ends with a concrete verification instead.
---
### Task 1: Vault secret
**Files:** none (Vault KV only)
- [ ] **Step 1.1: Create `secret/drone-logbook` with a generated profile-creation password**
```bash
vault kv put secret/drone-logbook profile_creation_pass="$(openssl rand -base64 24)"
```
- [ ] **Step 1.2: Verify**
```bash
vault kv get -field=profile_creation_pass secret/drone-logbook | wc -c
```
Expected: `32` (32 base64 chars; `33` if the CLI appends a trailing newline). Never echo the value itself.
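
A length check that does not depend on whether a trailing newline is present can be sketched as follows. `secret_length_ok` is a hypothetical helper, and a dummy 32-character string stands in for the real Vault read.

```shell
# Hypothetical helper: succeed if stdin carries at least MIN characters,
# ignoring newlines, without ever printing the secret itself.
secret_length_ok() {
  min="$1"
  n=$(( $(tr -d '\n' | wc -c) ))
  [ "$n" -ge "$min" ]
}

# Stand-in for the Vault read; a 32-character dummy replaces the real value.
printf '%s\n' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ012345' | secret_length_ok 32 \
  && echo "length ok"
```

Against the real secret the pipeline would be `vault kv get -field=profile_creation_pass secret/drone-logbook | secret_length_ok 32`.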
### Task 2: NFS drop folder on 192.168.1.127
**Files:**
- Modify: `secrets/nfs_directories.txt` (git-crypt'd; **edit from the MAIN checkout only**, never the worktree; sorted list, add `drone-logbook-backup` and `drone-logbook/sync-logs`)
- [ ] **Step 2.1: Create the directories** — world-writable + setgid like `vaultwarden-backup` (the `/srv/nfs` export root-squashes, so pod-root writes land as `nobody`):
```bash
ssh root@192.168.1.127 'mkdir -p /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && chown -R root:www-data /srv/nfs/drone-logbook /srv/nfs/drone-logbook-backup && chmod 2777 /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && ls -ld /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup'
```
Expected: `drwxrwsrwx ... root www-data ...` for both.
No `/etc/exports` (`scripts/pve-nfs-exports`) change — `/srv/nfs` is exported whole-tree.
- [ ] **Step 2.2: Record them in the declarative list (MAIN checkout, plaintext there)** — insert `drone-logbook-backup` and `drone-logbook/sync-logs` (after `diun`, before `etcd-backup`) in `~/code/infra/secrets/nfs_directories.txt`, then commit that single file to master:
```bash
git -C ~/code/infra add secrets/nfs_directories.txt
git -C ~/code/infra commit -m "nfs_directories: add drone-logbook sync-logs + backup dirs
Drop folder for the new drone-logbook stack's auto-import (SYNC_LOGS_PATH)
plus the backup CronJob target. Created on 192.168.1.127 root:www-data 2777."
git -C ~/code/infra push forgejo master
```
(Trivial single-file exception per execution.md; encrypted files cannot be edited from the worktree.)
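
Keeping the list sorted when adding the two entries can be done mechanically rather than by eye. This is a sketch on a scratch file, assuming byte-order (`LC_ALL=C`) sorting, which may or may not match how the real list is ordered; `add_sorted_entries` and `demo_list` are hypothetical names.

```shell
# Hypothetical helper: merge new entries into a sorted, de-duplicated list.
# LC_ALL=C pins byte-order sorting; the real list's collation is an assumption.
add_sorted_entries() {
  list_file="$1"; shift
  printf '%s\n' "$@" | LC_ALL=C sort -u -o "$list_file" "$list_file" -
}

# Demo on a scratch copy, not the git-crypt'd file.
demo_list=$(mktemp)
printf '%s\n' diun etcd-backup > "$demo_list"
add_sorted_entries "$demo_list" drone-logbook/sync-logs drone-logbook-backup
cat "$demo_list"
LC_ALL=C sort -c "$demo_list" && echo "list is sorted"
```

Both new entries land between `diun` and `etcd-backup`, matching the placement described in Step 2.2.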
### Task 3: Stack files (in the `wizard/drone-logbook` worktree)
**Files:**
- Create: `stacks/drone-logbook/main.tf` (content below)
- Create: `stacks/drone-logbook/terragrunt.hcl` (content below)
- Create: `stacks/drone-logbook/secrets` → symlink to `../../secrets`
- (`backend.tf`, `tiers.tf`, `cloudflare_provider.tf`, `providers.tf`, `.terraform.lock.hcl` are terragrunt-generated and **gitignored** — do NOT create or commit them; the tracked copies in old stacks like freshrss predate the ignore rule)
- [ ] **Step 3.1: `terragrunt.hcl`**
```hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}
```
- [ ] **Step 3.2: `main.tf`** — exact content:
```hcl
variable "tls_secret_name" {
type = string
sensitive = true
}
variable "nfs_server" { type = string }
# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog) — self-hosted
# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the
# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest.
# Design: docs/plans/2026-07-04-drone-logbook-design.md
resource "kubernetes_namespace" "drone_logbook" {
metadata {
name = "drone-logbook"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "drone-logbook-secrets"
namespace = "drone-logbook"
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "drone-logbook-secrets"
}
dataFrom = [{
extract = {
key = "drone-logbook"
}
}]
}
}
depends_on = [kubernetes_namespace.drone_logbook]
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
tls_secret_name = var.tls_secret_name
}
# DuckDB database + cached DJI decryption keys + uploaded originals.
# Embedded DB -> block storage, not NFS (same rationale as freshrss data).
# Encrypted class: flight logs are GPS traces of home/travel (sensitive data
# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md).
resource "kubernetes_persistent_volume_claim" "data" {
wait_until_bound = false
metadata {
name = "drone-logbook-data-encrypted"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "2Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and PVCs
# can't shrink; without this every apply tries to revert the size.
ignore_changes = [spec[0].resources[0].requests]
}
}
# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands
# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL.
module "nfs_sync_logs" {
source = "../../modules/kubernetes/nfs_volume"
name = "drone-logbook-sync-logs"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/drone-logbook/sync-logs"
storage = "5Gi"
}
resource "kubernetes_deployment" "drone_logbook" {
metadata {
name = "drone-logbook"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
labels = {
app = "drone-logbook"
"kubernetes.io/cluster-service" = "true"
tier = local.tiers.aux
}
}
spec {
replicas = 1
strategy {
# DuckDB is single-writer; never overlap two pods on the same volume
type = "Recreate"
}
selector {
match_labels = {
app = "drone-logbook"
}
}
template {
metadata {
labels = {
app = "drone-logbook"
"kubernetes.io/cluster-service" = "true"
}
}
spec {
container {
name = "drone-logbook"
image = "ghcr.io/arpanghosh8453/open-dronelog:latest"
env {
name = "RUST_LOG"
value = "info"
}
env {
# keep re-importable originals under /data/drone-logbook/uploaded
name = "KEEP_UPLOADED_FILES"
value = "true"
}
env {
name = "SYNC_LOGS_PATH"
value = "/sync-logs"
}
env {
# 6-field cron (sec min hour dom mon dow): scan drop folder every 8h
name = "SYNC_INTERVAL"
value = "0 0 */8 * * *"
}
env {
name = "PROFILE_CREATION_PASS"
value_from {
secret_key_ref {
name = "drone-logbook-secrets"
key = "profile_creation_pass"
}
}
}
volume_mount {
name = "data"
mount_path = "/data/drone-logbook"
}
volume_mount {
name = "sync-logs"
mount_path = "/sync-logs"
read_only = true
}
port {
name = "http"
container_port = 80
protocol = "TCP"
}
resources {
requests = {
cpu = "25m"
memory = "512Mi"
}
limits = {
memory = "512Mi"
}
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
}
}
volume {
name = "sync-logs"
persistent_volume_claim {
claim_name = module.nfs_sync_logs.claim_name
}
}
}
}
}
depends_on = [kubernetes_manifest.external_secret]
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "drone_logbook" {
metadata {
name = "drone-logbook"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
labels = {
"app" = "drone-logbook"
}
}
spec {
selector = {
app = "drone-logbook"
}
port {
port = "80"
target_port = "80"
}
}
}
# -----------------------------------------------------------------------------
# Backup — required for every proxmox-lvm(-encrypted) app: daily copy of the
# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror ->
# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import
# windows, so the DuckDB file is quiescent; uploaded originals make even a
# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the
# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern.
# -----------------------------------------------------------------------------
module "nfs_backup" {
source = "../../modules/kubernetes/nfs_volume"
name = "drone-logbook-backup-host"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/drone-logbook-backup"
}
resource "kubernetes_cron_job_v1" "backup" {
metadata {
name = "drone-logbook-backup"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
}
spec {
concurrency_policy = "Replace"
failed_jobs_history_limit = 5
schedule = "30 1 * * *"
starting_deadline_seconds = 300
successful_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata {}
spec {
affinity {
pod_affinity {
required_during_scheduling_ignored_during_execution {
label_selector {
match_labels = {
app = "drone-logbook"
}
}
topology_key = "kubernetes.io/hostname"
}
}
}
container {
name = "drone-logbook-backup"
image = "docker.io/library/alpine"
command = ["/bin/sh", "-c", <<-EOT
set -euxo pipefail
_t0=$(date +%s)
now=$(date +"%Y_%m_%d_%H_%M")
mkdir -p /backup/$now
cp -a /data/. /backup/$now/
# Rotate — 30 day retention
find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} +
_dur=$(($(date +%s) - _t0))
_out_bytes=$(du -sb /backup/$now | awk '{print $1}')
wget -qO- --post-data "backup_duration_seconds $${_dur}
backup_output_bytes $${_out_bytes}
backup_last_success_timestamp $(date +%s)
" "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true
EOT
]
volume_mount {
name = "data"
mount_path = "/data"
read_only = true
}
volume_mount {
name = "backup"
mount_path = "/backup"
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
}
}
volume {
name = "backup"
persistent_volume_claim {
claim_name = module.nfs_backup.claim_name
}
}
dns_config {
option {
name = "ndots"
value = "2"
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# https://dronelog.viktorbarzin.me
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "required" # Authentik forward-auth — flight logs are GPS traces of home/travel
dns_type = "proxied"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
name = "dronelog"
service_name = "drone-logbook"
tls_secret_name = var.tls_secret_name
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Drone Logbook"
"gethomepage.dev/description" = "DJI flight log analyzer"
"gethomepage.dev/icon" = "mdi-quadcopter"
"gethomepage.dev/group" = "Media & Entertainment"
"gethomepage.dev/pod-selector" = ""
}
}
```
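
The claim that the `30 1 * * *` backup sits outside the `SYNC_INTERVAL` import windows can be checked by expanding the `*/8` hour field. The helpers below are hypothetical and only model the hour field of the cron expression, nothing else.

```shell
# Hypothetical helper: expand a "*/N" hour field into the hours it fires at.
step_hours() {
  step="$1"; h=0; out=""
  while [ "$h" -lt 24 ]; do
    out="${out:+$out }$h"
    h=$(( h + step ))
  done
  echo "$out"
}

# Hypothetical check: does a given hour coincide with a sync-import hour?
collides_with_sync() {
  hour="$1"; step="$2"
  for h in $(step_hours "$step"); do
    [ "$h" -eq "$hour" ] && return 0
  done
  return 1
}

step_hours 8   # hours at which the drop folder is scanned
collides_with_sync 1 8 || echo "01:30 backup is clear of the sync windows"
```

If `SYNC_INTERVAL` is ever changed to a step that fires at hour 1, the same check would flag the backup schedule for review.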
- [ ] **Step 3.3: Boilerplate**
```bash
ln -s ../../secrets ~/code/infra/.worktrees/drone-logbook/stacks/drone-logbook/secrets
```
- [ ] **Step 3.4: Format check**
```bash
WT=~/code/infra/.worktrees/drone-logbook
terraform fmt -check -diff $WT/stacks/drone-logbook/ || terraform fmt $WT/stacks/drone-logbook/
```
Expected: no diff (or auto-fixed).
- [ ] **Step 3.5: Commit on the branch (files by name, git-crypt filter flags per execution.md)**
```bash
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \
add docs/plans/2026-07-04-drone-logbook-design.md docs/plans/2026-07-04-drone-logbook-plan.md \
stacks/drone-logbook/main.tf stacks/drone-logbook/terragrunt.hcl stacks/drone-logbook/secrets \
.claude/reference/service-catalog.md
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \
commit -m "drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me
Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro
(fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog).
Upstream ghcr image with Keel auto-upgrade, DuckDB data on proxmox-lvm-encrypted PVC,
NFS /sync-logs drop folder auto-imported every 8h, Authentik-gated ingress,
PROFILE_CREATION_PASS from Vault via ESO. Design + plan in docs/plans/."
```
### Task 4: Land and apply
- [ ] **Step 4.1: Presence claim** (CI apply mutates shared infra)
```bash
~/code/scripts/presence claim infra:drone-logbook --purpose "deploy new drone-logbook stack (Open DroneLog) via CI apply"
```
- [ ] **Step 4.2: Merge latest master into the branch, push to master**
```bash
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false fetch forgejo
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false merge forgejo/master
git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master
```
Non-fast-forward → another agent landed first: fetch, merge, push again. Branch-protection rejection → fall back to PR via Forgejo API (token = password in `~/.git-credentials`).
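
The "fetch, merge, push again" loop for a lost race can be wrapped in a bounded retry so it cannot spin forever. This is a sketch only: `retry` and `flaky_push` are hypothetical, and `flaky_push` is a stand-in that fails twice before succeeding instead of touching git.

```shell
# Hypothetical helper: run a command up to MAX times, pausing between tries.
retry() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# Stand-in for "fetch, merge, push": fails twice, then succeeds.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky_push() {
  n=$(( $(cat "$counter_file") + 1 ))
  echo "$n" > "$counter_file"
  [ "$n" -ge 3 ]
}
retry 5 0 flaky_push && echo "landed after $(cat "$counter_file") attempts"
```

In real use the wrapped command would be a function chaining the three git commands from Step 4.2, so each retry starts from a fresh fetch.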
- [ ] **Step 4.3: Watch the CI apply to completion** — Woodpecker pipeline on the infra repo (`ci.viktorbarzin.me`), then confirm live:
```bash
kubectl get ns drone-logbook && kubectl -n drone-logbook get deploy,pvc,pods,externalsecret,cronjob
kubectl -n drone-logbook rollout status deploy/drone-logbook --timeout=300s
```
Expected: namespace present, ExternalSecret `SecretSynced`, data PVC `Bound` (the NFS PVCs bind on first pod/job use), CronJob `drone-logbook-backup` scheduled `30 1 * * *`, pod `Running 1/1`.
- [ ] **Step 4.4: Cleanup worktree + branch; release presence**
```bash
git -C ~/code/infra worktree remove .worktrees/drone-logbook
git -C ~/code/infra branch -d wizard/drone-logbook
git -C ~/code/infra pull --ff-only # only if main checkout clean/quiescent
~/code/scripts/presence release infra:drone-logbook
```
### Task 5: End-to-end verification
- [ ] **Step 5.1: Ingress + Authentik gate**
```bash
curl -sI https://dronelog.viktorbarzin.me | head -5
```
Expected: `302` redirect into Authentik (NOT `200`, NOT `404`).
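
Turning the header dump into a pass/fail verdict can be sketched like this. `http_status` is a hypothetical helper, and canned headers (with a made-up `location` value) stand in for the live `curl -sI` output.

```shell
# Hypothetical helper: pull the status code out of a header dump on stdin.
http_status() {
  tr -d '\r' | awk 'NR == 1 { print $2 }'
}

# Canned headers stand in for a live `curl -sI`; the location is made up.
status=$(printf 'HTTP/2 302 \r\nlocation: https://auth.example.invalid/\r\n' | http_status)
case "$status" in
  302) echo "gated: redirect into the auth flow" ;;
  200) echo "NOT gated: app answered directly" ;;
  *)   echo "unexpected status: $status" ;;
esac
```

Against the live host the input would be `curl -sI https://dronelog.viktorbarzin.me | http_status`.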
- [ ] **Step 5.2: App alive behind the gate** (bypass ingress via port-forward, read-only debug)
```bash
kubectl -n drone-logbook port-forward svc/drone-logbook 18080:80 &
pf_pid=$!
sleep 2; curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:18080/; kill "$pf_pid"
```
Expected: `200`.
- [ ] **Step 5.3: Sync folder visible in-pod**
```bash
kubectl -n drone-logbook exec deploy/drone-logbook -- ls -ld /sync-logs /data/drone-logbook
```
Expected: both directories listed; `/sync-logs` read-only mount.
- [ ] **Step 5.4: Monitor + homepage** — Uptime Kuma external monitor for `dronelog.viktorbarzin.me` auto-created (ingress annotation); homepage tile under "Media & Entertainment".
- [ ] **Step 5.5: Functional import** — Viktor uploads a real Mini 4 Pro `.txt` log via the web UI (or drops it in `/srv/nfs/drone-logbook/sync-logs`); confirms flight appears with charts/map. Requires pod egress to DJI once per new log (decryption key). If an upstream sample log is available, the agent may pre-verify import via the REST API through the port-forward.

@ -1,125 +0,0 @@
# immich-frame: LAN-only access, Portals untouched (2026-07-04)
## Goal
Strangers must no longer be able to view `highlights-immich.viktorbarzin.me`
(Viktor's London Portal Plus frame) or `highlights-immich-emo.viktorbarzin.me`
(Emo's Sofia Portal Mini frame) — pages or ImmichFrame API. Both were
`auth = "none"`, Cloudflare-proxied, fully public.
Who keeps access (per Viktor, this session): the two Portals plus **any
household device on the Sofia, London, or Valchedrym home networks**. No
public access, no tailnet requirement. Hard constraint: the Portal app is a
WebView with the URL **baked in at APK build time** (`portal-immich-frame`,
`-PframeUrl`), so the exact URLs must keep loading from where the Portals sit
— zero app rebuilds, zero device touches, zero router changes.
## Design
Two cooperating pieces — the gate and the reachability pointer:
1. **The gate — `home-lans-only` Traefik middleware** (traefik stack, next to
`local-only`): `ipAllowList` of `192.168.1.0/24` (Sofia LAN), `10.0.0.0/8`
(VLANs, K8s pods `10.10.0.0/16`, services `10.96.0.0/12`, WG tunnel
`10.3.2.0/24`), `192.168.8.0/24` (London LAN), `192.168.9.0/24` (London
GUEST net — post-rollout discovery: the Portal Plus actually leases here,
`Portal-75AE8F9C2A8A` = `192.168.9.198`, added same day), `192.168.0.0/24`
(Valchedrym LAN), `fc00::/7`, `fe80::/10`. Attached to both frame
ingresses via `extra_middlewares`. Everyone else gets a Traefik 403 —
including direct-to-WAN-IP requests carrying the right SNI, which DNS
changes alone cannot stop. A **separate** middleware rather than a widened
`local-only`, because widening would silently grant the remote LANs access
to the 9 admin surfaces using it (Prometheus, iDRAC, Loki, …).
2. **The pointer — `dns_type = "internal"`** (new `ingress_factory` tier,
Viktor's idea): a **non-proxied public A record → `10.0.20.203`** (module
var `internal_lb_ip`). Outsiders resolve it but get an unroutable RFC1918
address; every household resolver path delivers a working answer with no
config anywhere: Sofia LAN already gets the internal CNAME from Technitium,
London/Valchedrym resolve the public record via any upstream and
policy-route `10.0.0.0/8` down the WireGuard tunnel. IPv4-only (spokes
route no internal v6 range).
Interlock (the reason both flip together): with a *proxied* record, public
traffic arrives from cloudflared **pod IPs inside 10/8** and would sail
through the allowlist. `internal` removes the Cloudflare path entirely (CF
edge stops serving the hostname), so every request reaches Traefik with its
real source IP (ETP=Local). Verified: no wildcard `*.viktorbarzin.me` record
exists to resurrect public resolution.
`auth` stays `"none"` — there is still no *user* auth by design (kiosk
WebView; forward-auth would 302 the device to a login it can't complete, and
emo's Google-only account can't log in inside a WebView at all); the
convention comment now names the ipAllowList as the gate.
### Resulting flows
| Client | Path | Result |
|---|---|---|
| Emo's Portal Mini (Sofia LAN) | Technitium CNAME → `.203` direct (unchanged) | allowed (`192.168.1.x`) |
| Viktor's Portal Plus (London GUEST net) | public A → `10.0.20.203` → WG tunnel | allowed (`192.168.9.x`) |
| Household browsers (any of the 3 LANs) | same as above | allowed |
| In-cluster checks (`homelab browser`, blackbox) | CoreDNS → Technitium → `.203` | allowed (pod IP in 10/8) |
| Stranger, resolves hostname | gets `10.0.20.203` | unroutable |
| Stranger, hits WAN IP with SNI | pfSense NAT → Traefik (real source IP) | **403** |
| Stranger, via Cloudflare | no proxied record | CF edge won't serve the host |
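
The allow/deny column of the table above follows from a plain CIDR membership test on the source IP. The sketch below models only the IPv4 entries of the allowlist; the function names are hypothetical, and Traefik's actual `ipAllowList` implementation is not being reproduced here.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if IP ($1) falls inside CIDR ($2).
in_cidr() {
  net=${2%/*}; bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# IPv4 ranges from the home-lans-only allowlist (IPv6 entries omitted here).
ALLOWLIST="192.168.1.0/24 10.0.0.0/8 192.168.8.0/24 192.168.9.0/24 192.168.0.0/24"

# Hypothetical verdict function: "allowed" or the 403 a stranger would get.
verdict() {
  for cidr in $ALLOWLIST; do
    if in_cidr "$1" "$cidr"; then echo "allowed"; return 0; fi
  done
  echo "403"
}

verdict 192.168.9.198   # the Portal Plus lease on the London guest net
verdict 10.10.3.4       # any pod IP inside 10/8, cloudflared included
verdict 203.0.113.7     # documentation-range address standing in for a stranger
```

The second call illustrates the interlock: a request relayed by a cloudflared pod carries a `10.x` source and would pass, so the allowlist only gates anything once the proxied record is gone.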
### Rejected alternatives
- **ImmichFrame `AuthenticationSecret`** (supported upstream: web input field
or `?authsecret=` param + bearer API): real auth from anywhere, but family
browsers would face a secret prompt (fails "household devices just work"),
the secret leaks into URLs/analytics/APK, and robust rollout needs APK
rebuild + USB-adb sideload on both Portals (the Sofia one is high-friction).
- **Authentik forward-auth / `auth = "public"`**: WebView can't complete SSO
(Google blocks WebView logins; session expiry silently bricks an appliance);
the anonymous outpost is an audit trail, not a gate.
- **Remove DNS + London router AdGuardHome rewrites**: works, but adds an
out-of-band, un-IaC'd router dependency the internal-IP record makes
unnecessary. Kept as documented fallback if resolver-side private-IP
filtering ever appears in the London path.
## Pre-verified facts (2026-07-04)
- London Flint 2 DNS chain returns RFC1918 answers unfiltered
(`nslookup 10.0.20.203.nip.io 127.0.0.1` on the router → `10.0.20.203`;
dnsmasq `rebind_protection '0'`, no AdGuardHome rebind filtering).
- Technitium already CNAMEs both hostnames → apex → `10.0.20.203`
(`technitium-ingress-dns-sync` is ingress-driven, not DNS-record-driven, so
the internal answer survives the Cloudflare record swap).
- Pod CIDR `10.10.0.0/16`, service CIDR `10.96.0.0/12` — inside `10.0.0.0/8`.
- No public wildcard record in the zone.
## Blast radius & cleanups
- `external_monitor = false` set explicitly on both ingresses: the
external-monitor-sync default opt-in would otherwise keep the now-doomed
`[External] highlights-immich*` uptime-kuma monitors alive and red. Verify
the sync drops them post-apply.
- rybbit CF worker: `highlights-immich` removed from `SITE_IDS` (`index.js`)
and `wrangler.toml` routes — off Cloudflare the route can never fire.
Requires a `wrangler deploy` to take effect (route removal is hygiene, not
functional).
- Homepage dashboard link keeps working from LANs (hostname unchanged).
- Docs updated in the same change: `.claude/CLAUDE.md` (DNS tier +
external-monitor mechanism), `AGENTS.md`, `docs/architecture/networking.md`
(Internal-IP domains category). The `portal-immich-frame` repo's glossary
("public, login-less URL") updated separately in that repo.
## Failure-mode delta
London frame now depends on the WG tunnel instead of Cloudflare+cloudflared
(the app self-heals with 5s retries; tunnel-flap modes documented in
`docs/architecture/vpn.md`). A Traefik LB renumber must update
`internal_lb_ip` in the module alongside the split-horizon apex record.
Cutover window: cached proxied answers keep working ≤ ~5 min TTL, then the
WebView's own retry picks up the new path.
## Verification & rollback
Verify: public dig → `10.0.20.203` (both hosts); Technitium dig → `.203`;
curl from devvm (10/8) → 200; external vantage (WebFetch/cloud) → unreachable
or 403; middleware attached on both ingresses; Emo's frame renders via
`homelab browser`; London Portal image fetches visible in Traefik access logs
from `192.168.9.x` (the guest net the Portal Plus leases on). Rollback: `git revert` + apply traefik/immich; records
and middleware chain restore (`allow_overwrite = true` re-adopts the records).

@ -129,40 +129,3 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
correct pairing. A famous tool that "does OOM" still has to be proven to fire
under *your* configuration.
## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed
The soft-cap layer of this design was falsified in production on 2026-07-02
(~15:42–16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide
alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside
t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With
`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked
every allocating task of the cgroup in `mem_cgroup_handle_over_high`
(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`)
— including the `t3 serve` event loop (~0.5G RSS, pure collateral). The accept
queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104]
Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`,
and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by
hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G→1.4G
and the service recovered in seconds with no restart).
The Verification bullet above — a soft-capped balloon "throttled to a crawl,
making no progress and **harming nothing**" — holds only when the hog is alone
in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl
IS the harm: a hog that stabilises below `MemoryMax` never triggers the local
OOM the design counted on, so the band converts "runaway dies" into "everyone
in the cgroup stalls forever".
**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work
cgroup definitions** — `scripts/t3-serve@.service`, the `user-.slice.d`
drop-in, and `docker.slice` (`setup-devvm.sh` §10a/§10c). A runaway now runs
unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately
(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills
the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom — the layers
the stress tests actually validated — are unchanged. Applied live via
`daemon-reload` + runtime `set-property` on the running cgroups; no session
restarts.
Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max` — it is
an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill
beats throttle-and-pray for multi-tenant interactive services.
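
The signature of this failure (heavy full-stall pressure, a climbing `high` counter, zero OOM kills) is readable straight from the cgroup v2 files. The sketch below parses canned copies of `memory.events` and `memory.pressure`; the helper names and the 50% threshold are assumptions for illustration, and only `high=882948`, `oom_kill=0` and `full avg60` near 80% come from the incident.

```shell
# Read one counter from a cgroup v2 memory.events file.
events_field() {
  awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Read "full avg60" from a memory.pressure file as an integer percent.
full_avg60() {
  awk '$1 == "full" {
         for (i = 2; i <= NF; i++)
           if ($i ~ /^avg60=/) { sub(/^avg60=/, "", $i); print int($i) }
       }' "$1"
}

# Hypothetical heuristic: heavy full-stall pressure with zero OOM kills.
# The 50% threshold is an arbitrary illustration, not a tuned value.
looks_livelocked() {
  [ "$(full_avg60 "$2")" -ge 50 ] && [ "$(events_field "$1" oom_kill)" -eq 0 ]
}

# Canned files; every value not named in the incident text is filler.
d=$(mktemp -d)
printf 'low 0\nhigh 882948\nmax 0\noom 0\noom_kill 0\n' > "$d/memory.events"
printf 'some avg10=85.00 avg60=82.00 avg300=40.00 total=1\n' > "$d/memory.pressure"
printf 'full avg10=83.00 avg60=80.00 avg300=38.00 total=1\n' >> "$d/memory.pressure"
looks_livelocked "$d/memory.events" "$d/memory.pressure" \
  && echo "throttle livelock suspected"
```

With `MemoryHigh=infinity` in place this pattern should no longer occur, so a check like this is mainly useful for confirming that no cgroup still carries a finite `memory.high`.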

@ -1,135 +0,0 @@
# Paperless-ngx Mail Ingest (docs@viktorbarzin.me)
Last updated: 2026-07-03 (initial build)
Forward any email with document attachments to **`docs@viktorbarzin.me`** and
paperless-ngx ingests the attachments, owned by the paperless account mapped
from the **sender** (From) address. Built entirely from existing parts: a
docker-mailserver mailbox + Dovecot sieve, and paperless-ngx's native mail
consumer (the same machinery as the `utility:` rules).
## Flow
```
family member forwards email ──> MX ──> docker-mailserver
│ postfix virtual: docs@ has an explicit self-alias (extra/aliases.txt),
│ so the @domain catch-all (→ spam@, swept by TripIt) does NOT apply
Dovecot LMTP delivery to docs@
│ per-user sieve (docs@viktorbarzin.me.dovecot.sieve): sender NOT in
│ allowlist → discard (decision 2026-07-03: unmatched = ignore & delete)
docs@ INBOX ── paperless-ngx mail task (every 10 min, PAPERLESS_EMAIL_TASK_CRON
│ default) applies mail rules in order: filter_from = <sender>
│ → consume attachments (ALL parts incl. inline — see design
│ notes: Apple Mail marks real PDFs inline), owner = mapped user,
│ tag = email-ingest, title = mail subject
consumed mail is MOVED to the "Processed" IMAP folder (audit trail);
INBOX stays empty in steady state
```
## Sender → paperless account map (as built)
| Sender (From) | Paperless user | Rule |
|--------------------------|----------------|-----------------|
| me@viktorbarzin.me | root (id 3) | forward: Viktor (me@) |
| vbarzin@gmail.com | root (id 3) | forward: Viktor (gmail) |
| viktorbarzin@meta.com | root (id 3) | forward: Viktor (meta) |
| ancaelena98@gmail.com | anca (id 4) | forward: Anca |
| emil.barzin@gmail.com | emo (id 7) | forward: Emo |
The map lives in **two places by design** — keep them in sync:
1. **Delivery gate (infra, Terraform):**
`stacks/mailserver/modules/mailserver/extra/docs-at-viktorbarzin.me.dovecot.sieve`
— senders not listed here are discarded at delivery (spam control + the
"ignore and delete unmatched" behaviour; paperless cannot express
"delete without ingesting", so this must happen before the mailbox).
2. **Owner map (paperless DB, via API/UI):** one mail rule per sender on the
`docs@viktorbarzin.me` mail account. DB-state like workflows — NOT
Terraform.
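Because the two lists must be kept in sync by hand, a small drift check can help. The following is a minimal sketch and not part of the repo: `sender_map_drift` is a hypothetical helper, and it assumes both lists have already been exported to plain-text files with one address per line (extracting them from the sieve file and from `GET /api/mail_rules/` is not shown here). Addresses are compared case-insensitively, which is also an assumption.

```shell
# Hypothetical helper: print every sender that appears in only ONE of the two
# lists. Empty output means the sieve allowlist and the mail rules agree.
#   $1 = file with the sieve allowlist senders, one address per line (assumed format)
#   $2 = file with the paperless rule filter_from values, one per line (assumed format)
sender_map_drift() {
  {
    # Normalise case and de-duplicate each list on its own first, so an
    # address repeated inside one list is not mistaken for a match.
    tr '[:upper:]' '[:lower:]' <"$1" | sort -u
    tr '[:upper:]' '[:lower:]' <"$2" | sort -u
  } | sort | uniq -u   # uniq -u keeps lines that occur exactly once overall
}
```

Usage: `sender_map_drift sieve-senders.txt rule-senders.txt`. Any address it prints is missing from one side and should be added there (or removed from the other).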
## Add a family member / sender
1. Add the address to the sieve allowlist file above; commit; apply the
`mailserver` stack (normal apply is enough — the sieve CM key is not under
`ignore_changes`; Reloader restarts the pod).
2. Clone an existing `forward:` mail rule in the paperless admin UI
(Mail → Rules) or via API, changing `filter_from` and the rule **owner**
(documents are owned by the rule owner — `assign_owner_from_rule=true`).
Keep: action = Move to `Processed`, attachment type = **process all files
including inline** (`attachment_type=2` — NOT attachments-only, see design
notes), consumption scope = attachments only, tag `email-ingest`, order
after the existing rules.
## Operations
- **Trigger a fetch immediately** (instead of waiting ≤10 min):
`kubectl -n paperless-ngx exec deploy/paperless-ngx -c paperless-ngx -- s6-setuidgid paperless python3 manage.py mail_fetcher`
The `s6-setuidgid paperless` is **required**: `kubectl exec` runs as root, and a
root-run fetcher downloads attachments root-owned into the scratch dir, which
the celery consumer (uid 1000) then can't read — `PermissionError` on
`/tmp/paperless/paperless-mail-*/...`, consume task FAILURE (hit during the
2026-07-03 build E2E). The mail correctly stays in INBOX for retry (the move
action is a chord callback on successful consumption). Recover: `rm -rf
/tmp/paperless/paperless-mail-*` (as root) and let the next scheduled fetch
re-process.
- **Mailbox credentials:** Vault `secret/platform` → `mailserver_accounts`
JSON, key `docs@viktorbarzin.me` (also used by the paperless mail account).
- **Inspect the mailbox:**
`python3 -c` IMAP to `mailserver.mailserver.svc.cluster.local:993` (in-cluster,
from a pod) or `mail.viktorbarzin.me:993` (externally / devvm).
- **Paperless-side logs:** `kubectl -n paperless-ngx logs deploy/paperless-ngx | grep -i mail`
(also Loki, ns `paperless-ngx`). Rule/account state: `GET /api/mail_rules/`,
`GET /api/mail_accounts/` with the admin token
(k8s secret `paperless-ngx-secrets`, field `api_token`).
- **Account/mailbox provisioning:** adding/rotating anything in
`mailserver_accounts` requires the ConfigMap replace workaround —
`scripts/tg apply mailserver -- -replace=module.mailserver.kubernetes_config_map.mailserver_config`
— because `postfix-accounts.cf` is under `ignore_changes`
(non-deterministic bcrypt; see the module comment).
## Design notes / caveats
- **Why not the catch-all?** Mail to unknown `@viktorbarzin.me` addresses
lands in `spam@`, which the TripIt `ingest-plans` CronJob sweeps every
15 min: it marks everything `\Seen`, LLM-parses mail from linked senders and
replies with ack/failure emails. Forwarded bank statements would get
"couldn't parse a trip" replies. `docs@` being a real mailbox bypasses that
path entirely; TripIt, the `smoke-test@` roundtrip probe, and `dmarc@` are
untouched.
- **Spoofing:** the sender match is on the From header. Rspamd verifies
SPF/DKIM/DMARC on inbound mail, but gmail.com publishes `p=none`, so a
crafted spoof could ingest documents into a family member's account. Accepted
risk (worst case: unwanted documents appear, visible + deletable in
paperless).
- **Not PDF-only:** any attachment type paperless supports is consumed
(PDF, images, Office via the existing tika+gotenberg pipeline).
- **Inline attachments ARE processed (`attachment_type=2`, flipped
2026-07-03):** the rules originally used attachments-only (1) to skip
signature logos, but the very first real forward (Apple Mail, Viktor's
client) attached the invoice PDF with `Content-Disposition: inline`;
paperless matched the rule, consumed nothing, and recorded
`PROCESSED_WO_CONSUMPTION` (which, like any ProcessedMail row, blocks that
UID from ever being re-processed — delete the row via `manage.py shell` to
retry). Trade-off: signature/inline images in forwards may be ingested as
junk docs (tagged `email-ingest`, easy to spot). If that gets noisy, add
`filter_attachment_filename_exclude` patterns to the rules using the
actually-observed junk filenames — do NOT flip back to attachments-only.
- **No dedicated alerting** (deliberate, 2026-07-03): mail-task errors surface
in paperless logs; the mailserver inbound path is covered by
`email-roundtrip-monitor`. Revisit if forwards start silently failing.
- **Workflows:** the global `payslip-webhook` + `claude-mcp-readers
auto-permission` workflows fire for mail-ingested docs like any other
consumption source (verified pre-build; payslip receiver does its own
filtering).
## Rollback
1. Disable/delete the 5 `forward:` mail rules + the `docs@` mail account
(paperless admin UI or API).
2. Revert the infra commit (aliases.txt entry, sieve file, CM key + mount).
3. Remove `docs@viktorbarzin.me` from Vault `mailserver_accounts`, then apply
with the `-replace` workaround above. Mail to docs@ then falls back to the
catch-all (spam@) like any unknown address.


@ -109,17 +109,10 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
node_memory_SwapFree_bytes{instance="devvm"}
```
Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`):
per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and
`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog
plateauing between high and max never OOMs and the kernel high-throttle stalls
the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on
2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch
`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`,
`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable).
A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling
the WS server with it. Post-mortem addendum:
`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`.
Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`;
a runaway agent now OOMs alone inside the cgroup instead of taking the box
(and the WS server) with it.
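The risky combination described above (swap disabled plus a finite `MemoryHigh`) can be detected from `systemctl show` output. This is a hedged sketch: `has_soft_band` is a hypothetical helper, not something the repo ships. It only inspects the two properties named in the guardrails and assumes the `KEY=VALUE` format that `systemctl show -p` prints.

```shell
# Hypothetical helper: reads `systemctl show` KEY=VALUE lines on stdin.
# Exits 0 when MemorySwapMax is 0 AND MemoryHigh is set to a finite value,
# i.e. a soft throttle band exists below MemoryMax. Exits 1 otherwise.
has_soft_band() {
  awk -F= '
    $1 == "MemoryHigh"    { high = $2 }
    $1 == "MemorySwapMax" { swap = $2 }
    END { exit !(swap == "0" && high != "" && high != "infinity") }
  '
}
```

Usage: `systemctl show 't3-serve@wizard.service' -p MemoryHigh -p MemorySwapMax | has_soft_band && echo "soft throttle band present"`.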
## 4. Known root causes (2026-06-10 investigation)


@ -1,98 +0,0 @@
# Valia sites — add / update / retire
Off-infra static sites authored by Valia (ADR-0018, CONTEXT.md "Valia site").
Serving: Cloudflare Pages. Freshness: the `valia-sites-sync` CronJob
(`valia-sites` ns) mirrors each Content folder every 10 minutes and deploys
only when the folder's manifest hash changed. Registry: `local.sites` in
`stacks/valia-sites/main.tf` — one entry per site drives everything (Pages
project, custom domain, public CNAME, internal split-horizon CNAME, sync).
Current sites: `bridge` (ОбУ „Отец Паисий“ — "мост"), `stem95su` (95. СУ STEM
board).
## Add a site
1. Valia shares the Drive folder with **vbarzin@gmail.com** (viewer is enough —
the pipeline is strictly read-only towards Drive).
2. Get the folder id from its URL (`drive.google.com/drive/folders/<ID>`).
3. Pick the **English** subdomain name (Viktor's call — CONTEXT.md naming rule).
4. Add one entry to `local.sites` in `stacks/valia-sites/main.tf`:
```hcl
<name> = {
folder_id = "<ID>"
src_path = "" # or "sub/folder" if servable files live deeper
entry_file = "index.html" # or whatever her main HTML file is called
manage_dns = true
}
```
5. Commit + push; CI applies. Within ~10 min the sync deploys content and the
site serves at `https://<name>.viktorbarzin.me` (custom-domain TLS takes
~5-10 min extra on first attach; CF returns 522 for the hostname until
then). Internal LAN/VLAN/pod resolution appears when the hourly
`technitium-ingress-dns-sync` next runs — trigger it early with:
`kubectl create job --from=cronjob/technitium-ingress-dns-sync valia-dns-now -n technitium`
## Content rules (what Valia's folder must look like)
- The **entry file** must exist — the sync stages a copy as `index.html` at
deploy time, so `/` works; the original filename keeps working too (deep
links survive). If the folder is empty or the entry file is missing, the
sync **skips the site and leaves it as-is** (never wipes a live site).
- Google-native files (Docs/Sheets) are **ignored** (`--drive-skip-gdocs`) —
only real files (`.html`, images, …) deploy. Gemini's HTML exports are fine.
- Per-file limit 25 MB (Cloudflare Pages), 20k files max — far beyond a
1-page site.
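The entry-file rule above can be pictured as a small staging step. This is a sketch under stated assumptions, not the sync job's actual code: `stage_site`, its arguments, and the log wording are all hypothetical, and it works on a local mirror of the folder rather than on Drive.

```shell
# Hypothetical staging step for one site.
#   $1 = local mirror of the content folder, $2 = entry file name, $3 = staging dir
# Returns 1 without touching the staging dir when the entry file is missing
# (which also covers an empty folder), so the caller can skip the deploy.
stage_site() {
  local src="$1" entry="$2" out="$3"
  if [ ! -f "$src/$entry" ]; then
    echo "guard: $entry not found in $src, leaving the live site as-is" >&2
    return 1
  fi
  mkdir -p "$out"
  cp -R "$src/." "$out/"
  # Stage a copy of the entry file as index.html so "/" works; the original
  # filename is kept as well, so deep links keep working.
  [ "$entry" = "index.html" ] || cp "$src/$entry" "$out/index.html"
}
```

Returning non-zero before anything is written to the staging directory is what makes "never wipes a live site" hold in this sketch: the caller simply skips the deploy for that site.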
## Update a site
Nothing to do: Valia edits the folder, the site follows within ~10 minutes.
Force it early: `kubectl create job --from=cronjob/valia-sites-sync sync-now -n valia-sites`
## Rename / retire a site
Rename = retire + add (Pages projects can't be renamed). Retire:
1. Delete the entry from `local.sites`; commit + push. TF destroys the public
CNAME + custom domain + Pages project; the internal record is removed by
the next `technitium-ingress-dns-sync` run (its deletion pass drops any
internal `*.pages.dev` CNAME that left the `valia-sites-dns` ConfigMap —
scoped so it can never touch non-Pages records).
2. That's all — no manual DNS cleanup (the pre-ADR-0018 add-only gotcha is
fixed by the deletion pass).
## Failure modes / debugging
- **Visibility is failed-Job-only by choice** (ADR-0018): no alerts, no
notifications. Check: `kubectl get jobs -n valia-sites | tail`, logs of the
last `valia-sites-sync-*` pod.
- **Drive auth broken** (`FATAL … Drive list failed`): the shared
`secret/valia-sites.rclone_conf` token died. The GCP OAuth app
(`home-lab-1700868541205`) must stay published to "Production" or refresh
tokens expire weekly (same constraint as the old stem95su conf, which this
one was copied from). Re-mint and `vault kv patch secret/valia-sites
rclone_conf=@…`.
- **Wrangler auth broken**: `secret/valia-sites.cloudflare_pages_token` is a
SCOPED token (Pages Read+Write on the account, id
`355d2c9d11579bdad1e9498dafca30d5`) — re-mint via
`POST /user/tokens` with the Global API Key (`secret/platform`), patch
Vault. Do NOT put the Global API Key in the pod.
- **Site serves stale content**: check the state CM
(`kubectl get cm valia-sites-state -n valia-sites -o yaml`) — deleting a
site's key forces a redeploy on the next run.
- **`GUARD … skipping`** in logs: Valia's folder is empty or renamed the
entry file — the site deliberately kept its last content. Fix the folder or
update `entry_file`.
## History
- stem95su served in-cluster (nginx + NFS + its own rclone CronJob) until
2026-07-03, when it was cut over to this pattern and the old stack retired
(ADR-0018). The blocking 42.9 MB `stem_video.mp4` was compressed to 21.4 MB
(same 1080p, ~2.5 Mbps H.264) and replaced in Valia's folder with Viktor's
explicit one-time OK. `secret/stem95su` is superseded by
`secret/valia-sites`; `/srv/nfs/stem-site` on the PVE host is a harmless
leftover.
- bridge started as a hand-deployed wrangler experiment (2026-07-03, memory
id 7085) and was adopted into the stack the same day.


@ -82,48 +82,33 @@ tail -5 ~/.local/state/vault-token-renew.log # recent results
A healthy log line looks like:
`<ts> OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h).
After an OIDC login you'll instead see, at the next nightly run:
`<ts> HEALED: re-minted periodic token from foreign dn=oidc-… (revoked N stale periodic token(s))`
— that's the self-heal working as designed.
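For a quick health check, the last log line can be classified by the keywords shown in this runbook and in the script (`OK renewed`, `HEALED:`, `DRIFT:`, `FAIL:`). A minimal sketch, assuming those keywords are always preceded by the timestamp and a space; `classify_renew_line` is a hypothetical helper, not part of the renewer.

```shell
# Hypothetical helper: classify one renewer log line by its keyword.
# Prints one of: ok | healed | drift | fail | unknown
classify_renew_line() {
  case "$1" in
    *" OK renewed "*) echo ok ;;
    *" HEALED: "*)    echo healed ;;
    *" DRIFT: "*)     echo drift ;;
    *" FAIL: "*)      echo fail ;;
    *)                echo unknown ;;
  esac
}
```

Usage: `classify_renew_line "$(tail -1 ~/.local/state/vault-token-renew.log)"`.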
## Drift guard & self-heal
## Drift guard & recovery
`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
overwrites it. Two confirmed clobber vectors:
1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs
prescribe this login before applies, so it recurs — it went unnoticed for
weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires
weekly".
can't push past the OIDC role's 7-day `token_max_ttl`).
2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
**cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days.
**cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for
two days — reads worked, writes silently 403'd.
Since 2026-07-03 the renewer **self-heals**
(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token
it attempts the re-mint **with the clobbering token's own authority** and lets
Vault's authz decide:
To stop the renewer from silently keeping a foreign token alive, it runs a
**drift guard** first: it refuses to renew unless the token is
`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and
exits non-zero (the systemd unit goes `failed`) rather than renewing someone
else's token. Symptom in the log:
- **Admin-capable clobber (OIDC login)** → re-mints the periodic token,
sanity-checks it against the drift guard, atomically replaces
`~/.vault-token`, revokes stale `token-devvm-wizard` leftovers
(anti-sprawl), logs
`HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))`
and exits 0. The clobbering token is NOT revoked — it may still back a live
login session; it ages out on its own.
- **Weak clobber (read-only k8s token)** → the mint is denied; logs
`DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it`
and exits non-zero (unit `failed`). Deliberately loud: this signals a
misbehaving agent flow — exactly the 2026-06-05 case.
`<ts> DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...`
**Manual recovery** is only needed for the weak-clobber case (the DRIFT log
line still contains the exact command) — run the
[mint/re-mint](#mint--re-mint-the-token) block.
**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the
[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does
**not** auto-recover (a deliberate scope choice — version-only, no self-heal);
recovery is the manual re-mint above.
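The drift-guard decision described above can be sketched as a pure function. This is an illustration, not the script's actual body: the comma-wrapped policy match mirrors the one visible in `vault-token-renew.sh`, while the display-name comparison and the name `drift_ok` are assumptions made for this example.

```shell
EXPECTED_DN="token-devvm-wizard"   # display name of our periodic token
REQUIRED_POLICY="vault-admin"      # policy the token must carry

# drift_ok <display-name> <policies-csv>
# Returns 0 when the token looks like our periodic admin token, 1 on drift.
drift_ok() {
  local dn="$1" pols="$2"
  [ "$dn" = "$EXPECTED_DN" ] || return 1
  # Wrap the CSV in commas so the policy only matches as a whole element
  # (e.g. "vault-admin-ro" does not satisfy the check).
  printf ',%s,' "$pols" | grep -q ",$REQUIRED_POLICY," || return 1
}
```

With this shape, both clobber vectors are rejected on the display name alone, and a correctly named token that lost `vault-admin` is rejected by the policy check.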
## Tests
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision,
the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
case), and the self-heal's revoke filter (which stale periodic tokens a heal
may sweep). Run: `bash infra/scripts/test-vault-token-renew.sh`.
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision
and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
case). Run: `bash infra/scripts/test-vault-token-renew.sh`.


@ -127,29 +127,20 @@ variable "anti_ai_scraping" {
variable "dns_type" {
type = string
default = "none"
description = <<-EOT
Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to
public IP), 'internal' (A to the internal Traefik LB IP, resolvable from
any resolver but only ROUTABLE from home LANs / WG sites / VPN; the record
is a reachability pointer, NOT a gate: pair it with an ipAllowList via
extra_middlewares, e.g. traefik-home-lans-only@kubernetescrd, because
direct-to-WAN-IP requests with the right SNI still hit Traefik), or 'none'.
EOT
description = "Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to public IP), or 'none'"
validation {
condition = contains(["proxied", "non-proxied", "internal", "none"], var.dns_type)
error_message = "dns_type must be 'proxied', 'non-proxied', 'internal', or 'none'."
condition = contains(["proxied", "non-proxied", "none"], var.dns_type)
error_message = "dns_type must be 'proxied', 'non-proxied', or 'none'."
}
}
# Uptime Kuma external monitor: when true, annotate the ingress so the
# external-monitor-sync CronJob creates a `[External] <name>` monitor pointing
# at https://<host>. Null means "follow dns_type": enabled when the ingress
# has a PUBLIC DNS record (proxied or non-proxied; 'internal' records are not
# externally reachable, so no external monitor).
# at https://<host>. Null means "follow dns_type": enabled when proxied.
variable "external_monitor" {
type = bool
default = null
description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type is 'proxied' or 'non-proxied')."
description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type == 'proxied')."
}
variable "external_monitor_name" {
@ -180,15 +171,6 @@ variable "public_ipv6" {
default = "2001:470:6e:43d::2"
}
# Internal Traefik LB IP used by dns_type = "internal" records. Tracks the
# dedicated MetalLB IP from stacks/traefik (ETP=Local). A future LB renumber
# must update this default alongside the split-horizon apex record; see
# docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*.
variable "internal_lb_ip" {
type = string
default = "10.0.20.203"
}
variable "homepage_group" {
type = string
default = null # auto-detect from namespace
@ -219,10 +201,8 @@ locals {
)
# External monitor enabled by default when the ingress has a public DNS
# record (either CF-proxied or direct A/AAAA). 'internal' records resolve
# publicly but are unroutable from outside, so they get no external monitor.
# Explicit bool overrides.
effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type == "proxied" || var.dns_type == "non-proxied")
# record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type != "none")
# Emit the annotation when effective is true (positive signal), or when the
# caller explicitly set external_monitor=false (opt-out). When the caller
@ -444,19 +424,3 @@ resource "cloudflare_record" "non_proxied_aaaa" {
zone_id = var.cloudflare_zone_id
allow_overwrite = true
}
# 'internal': a publicly-resolvable A record carrying the INTERNAL Traefik LB
# IP. Outsiders resolve it but can't route to it; home-LAN/WG-site/VPN clients
# reach Traefik directly (the WG spokes policy-route 10.0.0.0/8 through the
# tunnel), so kiosk devices with baked-in URLs need no DNS overrides anywhere.
# IPv4-only on purpose: the spokes route no internal IPv6 range.
resource "cloudflare_record" "internal_a" {
count = var.dns_type == "internal" ? 1 : 0
name = local.dns_name
content = var.internal_lb_ip
proxied = false
ttl = 1
type = "A"
zone_id = var.cloudflare_zone_id
allow_overwrite = true
}


@ -21,19 +21,12 @@ WorkingDirectory=/home/%i
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
Restart=on-failure
RestartSec=5
# Memory containment (2026-06-10, amended 2026-07-02): agent children live in
# this cgroup; a runaway agent (10.8G anon on a 23G host) swap-thrashed the
# whole devvm — every >20s stall fires the t3 client watchdog (visible
# "disconnects") — then global-OOMed. Cap the cgroup so a runaway OOMs early
# and locally, and forbid swap so stalls can't smear into minutes-long freezes.
# MemoryHigh is DELIBERATELY infinity — do not add a soft band below MemoryMax:
# with swap=0 a hog that plateaus between high and max is unreclaimable but
# never OOMs, and the kernel's high-throttle stalls EVERY task in the cgroup
# (the t3 event loop included) indefinitely. A 12.3G agent ugrep livelocked
# this unit for ~50min on 2026-07-02 exactly this way. Straight-to-OOM at
# MemoryMax is the containment; OOMPolicy=continue below keeps the server up.
# See docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md addendum.
MemoryHigh=infinity
# Memory containment (2026-06-10): agent children live in this cgroup; a
# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm —
# every >20s stall fires the t3 client watchdog (visible "disconnects") —
# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally,
# and forbid swap so stalls can't smear into minutes-long freezes.
MemoryHigh=12G
MemoryMax=16G
MemorySwapMax=0
# Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10


@ -1,11 +1,10 @@
#!/usr/bin/env bash
# Unit tests for the pure functions in vault-token-renew.sh.
# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard
# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign
# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker
# clobber be silently renewed for two days, and (b) the self-heal's revoke
# filter — which stale token-devvm-wizard tokens a heal may sweep.
# Run: bash infra/scripts/test-vault-token-renew.sh
# Unit tests for the pure drift-guard functions in vault-token-renew.sh.
# Sources the script (vtr_main is guarded) and exercises the decision logic that
# decides whether ~/.vault-token is OUR periodic admin token (renew) or a foreign
# token that clobbered the file (refuse, fail loud). This is exactly the logic
# whose ABSENCE let the 2026-06-05 woodpecker-token clobber be silently renewed
# for two days. Run: bash infra/scripts/test-vault-token-renew.sh
set -uo pipefail
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=/dev/null
@ -54,21 +53,5 @@ ok "ours: parse+decide renews" vtr_drift_ok "$(vtr_display_name "$LOOKUP_
no "woodpecker: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_WP")" "$(vtr_policies_csv "$LOOKUP_WP")"
no "oidc: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_OIDC")" "$(vtr_policies_csv "$LOOKUP_OIDC")"
# --- vtr_accessor: parse accessor out of lookup JSON ---
LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}'
eq "accessor parsed" "acc-new" "$(vtr_accessor "$LOOKUP_NEW")"
eq "accessor absent -> empty" "" "$(vtr_accessor '{"data":{"display_name":"x"}}')"
# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard
# --- tokens are swept; the just-minted token, foreign tokens, and anything with an
# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe).
STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}'
ok "older periodic token is stale" vtr_is_stale_periodic "$STALE_OURS" "acc-new"
no "the just-minted token is kept" vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new"
no "foreign oidc token never swept" vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new"
no "woodpecker token never swept" vtr_is_stale_periodic "$LOOKUP_WP" "acc-new"
no "missing accessor never swept" vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new"
no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" ""
printf '\n%d passed, %d failed\n' "$pass" "$fail"
(( fail == 0 ))


@ -45,94 +45,6 @@ vtr_drift_ok() {
printf ',%s,' "$pols" | grep -q ",$REQUIRED_POLICY," || return 1
}
# vtr_accessor <lookup-json> -> the token accessor (empty if absent).
vtr_accessor() {
printf '%s' "$1" | jq -r '.data.accessor // ""'
}
# vtr_is_stale_periodic <lookup-json> <keep-accessor> -> 0 if this lookup
# describes one of OUR periodic tokens (display name matches) that is NOT the
# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise.
# Name-only on purpose (no policy check): anything named token-devvm-wizard
# that isn't the current token is garbage from a previous mint. An empty
# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know
# which token is current).
vtr_is_stale_periodic() {
local dn acc
[ -n "${2:-}" ] || return 1
dn=$(vtr_display_name "$1")
acc=$(vtr_accessor "$1")
[ "$dn" = "$EXPECTED_DN" ] || return 1
[ -n "$acc" ] || return 1
[ "$acc" != "$2" ]
}
# vtr_heal <foreign-dn> <log-file> -> 0 if ~/.vault-token was re-minted back to
# our periodic admin token using the foreign token's own authority, 1 if the
# heal was denied or failed (caller exits non-zero; the unit goes failed).
#
# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md):
# an OIDC login — which the infra docs prescribe before applies — clobbers
# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed
# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the
# clobbering token itself and let Vault's authz decide — a read-only clobber
# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud
# failure, because it signals a misbehaving flow that someone should look at.
vtr_heal() {
local foreign_dn="$1" log="$2"
local errf new_token new_info new_dn new_pols new_acc tmp
errf=$(mktemp)
if ! new_token=$(vault token create -orphan -period=768h \
-policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
-field=token 2>"$errf") || [ -z "$new_token" ]; then
printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
"$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log"
rm -f "$errf"
return 1
fi
rm -f "$errf"
# Sanity: the minted token must itself pass the drift guard before it may
# replace ~/.vault-token.
if ! new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then
printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \
"$(date -Is)" "$new_info" >>"$log"
return 1
fi
new_dn=$(vtr_display_name "$new_info")
new_pols=$(vtr_policies_csv "$new_info")
if ! vtr_drift_ok "$new_dn" "$new_pols"; then
printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \
"$(date -Is)" "$new_dn" "$new_pols" >>"$log"
return 1
fi
# Atomic replace: mktemp files are 0600 from birth; same-filesystem mv.
tmp=$(mktemp "$HOME/.vault-token.XXXXXX")
printf '%s' "$new_token" >"$tmp"
mv "$tmp" "$HOME/.vault-token"
# Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would
# otherwise strand the prior periodic ADMIN token server-side for up to 32d.
# The clobbering foreign token is deliberately NOT revoked: it may still back
# the user's live login session, and it ages out on its own (7d for OIDC).
local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0
new_acc=$(vtr_accessor "$new_info")
if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then
while IFS= read -r a; do
[ -n "$a" ] || continue
a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue
if vtr_is_stale_periodic "$a_info" "$new_acc"; then
VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1))
fi
done < <(printf '%s' "$accessors" | jq -r '.[]')
sweep="revoked $revoked stale periodic token(s)"
fi
printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \
"$(date -Is)" "$foreign_dn" "$sweep" >>"$log"
}
vtr_main() {
set -euo pipefail
export PATH="/usr/local/bin:/usr/bin:/bin:${PATH:-}"
@ -149,19 +61,16 @@ vtr_main() {
dn=$(vtr_display_name "$info")
pols=$(vtr_policies_csv "$info")
# Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not
# keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was
# silently renewed for two days, masking lost write access). But detect-only
# drift proved worse in practice: an OIDC login — which the infra docs
# prescribe before applies — clobbers this file too, and the resulting DRIFT
# failures went unnoticed for weeks while access degraded to a 7-day token
# (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal):
# re-mint the periodic token with the clobbering token's own authority.
# Vault's authz keeps the old guarantee — a token that couldn't legitimately
# hold vault-admin is denied the mint, and we still fail loud.
# Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive.
# On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token
# with a read-only woodpecker token, and this script then silently renewed THAT
# for two days — masking the loss of write access. So before renewing, confirm
# the token is our periodic admin token; if it has drifted, fail loudly (systemd
# marks the unit failed) instead of keeping someone else's token alive.
if ! vtr_drift_ok "$dn" "$pols"; then
vtr_heal "$dn" "$log" || exit 1
exit 0
printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \
"$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log"
exit 1
fi
# `vault token renew` with no argument renews the calling token (renew-self).


@ -244,15 +244,9 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
# virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22).
# t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped
# user-<uid>.slice (all ssh/tmux work). Design — per user, on BOTH trees:
# MemoryMax=16G hard + MemorySwapMax=0 (work never touches disk swap → no
# thrash; a runaway is cgroup-OOM-killed locally at the ceiling), plus
# fair-share CPU/IO weights.
# NO MemoryHigh soft band (removed 2026-07-02; was 12G "throttle to a crawl"):
# with swap=0, a hog that PLATEAUS between high and max is unreclaimable but
# never OOMs — the kernel parks every task of the cgroup in
# mem_cgroup_handle_over_high and the whole tree stalls indefinitely. A 12.3G
# agent ugrep livelocked t3-serve@wizard (t3 down ~50min) exactly this way.
# Cap-and-kill, never throttle-and-pray — see the post-mortem addendum.
# MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard,
# MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at
# the ceiling instead), plus fair-share CPU/IO weights.
# BACKSTOP = earlyoom, NOT systemd-oomd. We first shipped systemd-oomd but it is
# INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim
# (pgscan rising), and a no-swap anon workload never reclaims — verified live, a
@ -266,16 +260,12 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
# 10a) per-user caps + fair-share weights on EVERY user-<uid>.slice (ssh/tmux)
install -d -m 0755 /etc/systemd/system/user-.slice.d
cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF'
# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22;
# MemoryHigh dropped 2026-07-02). Applies to EACH user-<uid>.slice = all of one
# user's ssh/tmux work. Mirrors the t3-serve@.service caps so a user is bounded
# in whichever surface they work in. MemoryHigh stays infinity: with swap=0 a
# hog plateauing in a high..max band livelocks the entire slice (every ssh/tmux
# session of that user) instead of dying — straight-to-OOM at MemoryMax is the
# containment (see post-mortem addendum 2026-07-02).
# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22).
# Applies to EACH user-<uid>.slice = all of one user's ssh/tmux work. Mirrors the
# t3-serve@.service caps so a user is bounded in whichever surface they work in.
[Slice]
MemoryAccounting=yes
MemoryHigh=infinity
MemoryHigh=12G
MemoryMax=16G
MemorySwapMax=0
CPUAccounting=yes
@ -304,14 +294,12 @@ cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF'
# All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so
# they share one bounded budget and a runaway container is capped at MemoryMax
# (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice.
# setup-devvm.sh §10, 2026-06-22; MemoryHigh dropped 2026-07-02 — a container
# plateauing in the high..max band would throttle-livelock EVERY container in
# the slice (see post-mortem addendum); MemoryMax OOM is the containment.
# setup-devvm.sh §10, 2026-06-22.
[Unit]
Description=Docker containers slice (capped)
[Slice]
MemoryAccounting=yes
MemoryHigh=infinity
MemoryHigh=6G
MemoryMax=8G
MemorySwapMax=0
CPUAccounting=yes

Binary file not shown.


@ -235,12 +235,6 @@ resource "cloudflare_record" "keyserver" {
zone_id = var.cloudflare_zone_id
}
# bridge.viktorbarzin.me (Cloudflare Pages, "мост" school site) moved to
# stacks/valia-sites (ADR-0018); all Valia-site records live there now.
# State handoff was a manual `tg state rm` (2026-07-03): the CI terraform
# (<1.7) rejects removed{} blocks even at the stack root, so declarative
# forget wasn't available. valia-sites imported the live record by id.
# Enable HTTP/3 (QUIC) for Cloudflare-proxied domains
resource "cloudflare_zone_settings_override" "http3" {
zone_id = var.cloudflare_zone_id


@ -16,7 +16,7 @@ resource "kubernetes_namespace" "dawarich" {
name = "dawarich"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
@ -330,7 +330,7 @@ resource "kubernetes_deployment" "dawarich" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
@ -458,13 +458,6 @@ module "ingress" {
namespace = kubernetes_namespace.dawarich.metadata[0].name
name = "dawarich"
tls_secret_name = var.tls_secret_name
# Rails serves all its fingerprinted assets itself and the map view adds an
# API burst per page load; the default 10/50 limiter 429s the asset tail
# from a single client IP (and risks dropping OwnTracks/mobile ingestion
# POSTs on the same host). Dedicated 100/1000 limiter defined in
# stacks/traefik/modules/traefik/middleware.tf.
skip_default_rate_limit = true
extra_middlewares = ["traefik-dawarich-rate-limit@kubernetescrd"]
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Dawarich"


@ -1511,34 +1511,6 @@ resource "null_resource" "pg_instagram_poster_db" {
}
}
# Create tasks database for the tasks PWA (Reminders-style front-end over
# Nextcloud CalDAV; FastAPI + SvelteKit SPA; see ~/code/tasks). Stores
# Connected Accounts (Fernet-encrypted Nextcloud app passwords) + sync state.
# Role password is managed by Vault Database Secrets Engine (static role
# `pg-tasks`, 7d rotation). Tables are created by alembic on app startup.
resource "null_resource" "pg_tasks_db" {
depends_on = [null_resource.pg_cluster]
triggers = {
db_name = "tasks"
username = "tasks"
}
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'tasks'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE tasks WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'tasks'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE DATABASE tasks OWNER tasks"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE tasks TO tasks"
'
EOT
}
}
# Old PostgreSQL deployment kept commented for rollback reference
# resource "kubernetes_deployment" "postgres" {
# metadata {


@ -1,360 +0,0 @@
variable "tls_secret_name" {
type = string
sensitive = true
}
variable "nfs_server" { type = string }
# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog): self-hosted
# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the
# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest.
# Design: docs/plans/2026-07-04-drone-logbook-design.md
resource "kubernetes_namespace" "drone_logbook" {
metadata {
name = "drone-logbook"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "drone-logbook-secrets"
namespace = "drone-logbook"
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "drone-logbook-secrets"
}
dataFrom = [{
extract = {
key = "drone-logbook"
}
}]
}
}
depends_on = [kubernetes_namespace.drone_logbook]
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
tls_secret_name = var.tls_secret_name
}
# DuckDB database + cached DJI decryption keys + uploaded originals.
# Embedded DB -> block storage, not NFS (same rationale as freshrss data).
# Encrypted class: flight logs are GPS traces of home/travel (sensitive data
# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md).
resource "kubernetes_persistent_volume_claim" "data" {
wait_until_bound = false
metadata {
name = "drone-logbook-data-encrypted"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "2Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and PVCs
# can't shrink; without this every apply tries to revert the size.
ignore_changes = [spec[0].resources[0].requests]
}
}
# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands
# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL.
module "nfs_sync_logs" {
source = "../../modules/kubernetes/nfs_volume"
name = "drone-logbook-sync-logs"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/drone-logbook/sync-logs"
storage = "5Gi"
}
resource "kubernetes_deployment" "drone_logbook" {
metadata {
name = "drone-logbook"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
labels = {
app = "drone-logbook"
"kubernetes.io/cluster-service" = "true"
tier = local.tiers.aux
}
}
spec {
replicas = 1
strategy {
# DuckDB is single-writer; never overlap two pods on the same volume
type = "Recreate"
}
selector {
match_labels = {
app = "drone-logbook"
}
}
template {
metadata {
labels = {
app = "drone-logbook"
"kubernetes.io/cluster-service" = "true"
}
}
spec {
container {
name = "drone-logbook"
image = "ghcr.io/arpanghosh8453/open-dronelog:latest"
env {
name = "RUST_LOG"
value = "info"
}
env {
# keep re-importable originals under /data/drone-logbook/uploaded
name = "KEEP_UPLOADED_FILES"
value = "true"
}
env {
name = "SYNC_LOGS_PATH"
value = "/sync-logs"
}
env {
# 6-field cron (sec min hour dom mon dow): scan drop folder every 8h
name = "SYNC_INTERVAL"
value = "0 0 */8 * * *"
}
env {
name = "PROFILE_CREATION_PASS"
value_from {
secret_key_ref {
name = "drone-logbook-secrets"
key = "profile_creation_pass"
}
}
}
volume_mount {
name = "data"
mount_path = "/data/drone-logbook"
}
volume_mount {
name = "sync-logs"
mount_path = "/sync-logs"
read_only = true
}
port {
name = "http"
container_port = 80
protocol = "TCP"
}
resources {
requests = {
cpu = "25m"
memory = "512Mi"
}
limits = {
memory = "512Mi"
}
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
}
}
volume {
name = "sync-logs"
persistent_volume_claim {
claim_name = module.nfs_sync_logs.claim_name
}
}
}
}
}
depends_on = [kubernetes_manifest.external_secret]
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "drone_logbook" {
metadata {
name = "drone-logbook"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
labels = {
"app" = "drone-logbook"
}
}
spec {
selector = {
app = "drone-logbook"
}
port {
port = "80"
target_port = "80"
}
}
}
# -----------------------------------------------------------------------------
# Backup required for every proxmox-lvm(-encrypted) app: daily copy of the
# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror ->
# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import
# windows, so the DuckDB file is quiescent; uploaded originals make even a
# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the
# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern.
# -----------------------------------------------------------------------------
module "nfs_backup" {
source = "../../modules/kubernetes/nfs_volume"
name = "drone-logbook-backup-host"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/drone-logbook-backup"
}
resource "kubernetes_cron_job_v1" "backup" {
metadata {
name = "drone-logbook-backup"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
}
spec {
concurrency_policy = "Replace"
failed_jobs_history_limit = 5
schedule = "30 1 * * *"
starting_deadline_seconds = 300
successful_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata {}
spec {
affinity {
pod_affinity {
required_during_scheduling_ignored_during_execution {
label_selector {
match_labels = {
app = "drone-logbook"
}
}
topology_key = "kubernetes.io/hostname"
}
}
}
container {
name = "drone-logbook-backup"
image = "docker.io/library/alpine"
command = ["/bin/sh", "-c", <<-EOT
set -euxo pipefail
_t0=$(date +%s)
now=$(date +"%Y_%m_%d_%H_%M")
mkdir -p /backup/$now
cp -a /data/. /backup/$now/
# Rotate: 30-day retention
find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} +
_dur=$(($(date +%s) - _t0))
_out_bytes=$(du -sb /backup/$now | awk '{print $1}')
wget -qO- --post-data "backup_duration_seconds $${_dur}
backup_output_bytes $${_out_bytes}
backup_last_success_timestamp $(date +%s)
" "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true
EOT
]
volume_mount {
name = "data"
mount_path = "/data"
read_only = true
}
volume_mount {
name = "backup"
mount_path = "/backup"
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
}
}
volume {
name = "backup"
persistent_volume_claim {
claim_name = module.nfs_backup.claim_name
}
}
dns_config {
option {
name = "ndots"
value = "2"
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# https://dronelog.viktorbarzin.me
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "required" # Authentik forward-auth; flight logs are GPS traces of home/travel
dns_type = "proxied"
namespace = kubernetes_namespace.drone_logbook.metadata[0].name
name = "dronelog"
service_name = "drone-logbook"
tls_secret_name = var.tls_secret_name
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Drone Logbook"
"gethomepage.dev/description" = "DJI flight log analyzer"
"gethomepage.dev/icon" = "mdi-quadcopter"
"gethomepage.dev/group" = "Media & Entertainment"
"gethomepage.dev/pod-selector" = ""
}
}


@ -1 +0,0 @@
../../secrets


@ -1,8 +0,0 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}


@ -10,7 +10,7 @@ resource "kubernetes_namespace" "excalidraw" {
name = "excalidraw"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
@ -45,15 +45,6 @@ resource "kubernetes_deployment" "excalidraw" {
app = "excalidraw"
tier = local.tiers.aux
}
# Keel rolls new ghcr:latest digests (k8s-portal pattern). Values here are
# recreate-correct seeds only; the keys are in ignore_changes below, so
# the live annotations win on an existing deployment.
annotations = {
"keel.sh/policy" = "force"
"keel.sh/trigger" = "poll"
"keel.sh/match-tag" = "true"
"keel.sh/pollSchedule" = "@every 5m"
}
}
spec {
replicas = 1
@ -76,19 +67,9 @@ resource "kubernetes_deployment" "excalidraw" {
}
}
spec {
# GHCR pull secret: the ghcr-credentials Secret in this namespace is
# cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy
# (allowlisted private-ghcr namespaces only; ADR-0002). Source of
# truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf.
image_pull_secrets {
name = "ghcr-credentials"
}
container {
# ADR-0002: GHA-built (.github/workflows/build-excalidraw.yml),
# PRIVATE ghcr; Keel rolls new :latest digests. DockerHub
# viktorbarzin/excalidraw-library:v4 is the frozen rollback image.
image = "ghcr.io/viktorbarzin/excalidraw-library:latest"
image_pull_policy = "Always"
image = "viktorbarzin/excalidraw-library:v4"
image_pull_policy = "IfNotPresent"
name = "excalidraw"
port {
container_port = 8080
@ -126,7 +107,7 @@ resource "kubernetes_deployment" "excalidraw" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],


@ -4,28 +4,18 @@ A self-hosted Excalidraw library with per-user drawing storage and management.
## Features
- Dashboard to manage all your drawings (create, open, rename, delete)
- Dashboard to manage all your drawings
- Per-user storage (via Authentik SSO headers)
- Rename drawings from the dashboard or by clicking the drawing name in the editor
- Native Excalidraw export via the editor's hamburger menu: "Save to..."
(.excalidraw file) and "Export image..." (PNG / SVG / clipboard)
- Autosave (2s debounce) + manual save (Ctrl+S or menu "Save now")
- Create, edit, and delete drawings
- Persistent storage via NFS
## Docker Image
```
ghcr.io/viktorbarzin/excalidraw-library:latest
viktorbarzin/excalidraw-library:v4
```
Built by GitHub Actions (`.github/workflows/build-excalidraw.yml` in the infra
repo, ADR-0002) on every master push touching `stacks/excalidraw/project/**`;
tags `:latest` + `:<git-sha>`. The package is PRIVATE — cluster pulls use the
Kyverno-synced `ghcr-credentials` secret. Keel polls `:latest` and rolls the
deployment on digest change.
The legacy manually-built DockerHub image `viktorbarzin/excalidraw-library:v4`
is frozen as the rollback target; nothing pushes to it anymore.
Available on Docker Hub: https://hub.docker.com/r/viktorbarzin/excalidraw-library
## Configuration
@ -49,13 +39,54 @@ Mount a persistent volume to the `DATA_DIR` path. Drawings are stored as `.excal
└── my-diagram.excalidraw
```
The filename (without extension) is both the drawing ID and its display name;
renaming a drawing renames the file (`os.Rename`, mtime preserved).
## Deployment
Deployed by the `stacks/excalidraw` Terraform stack (namespace `excalidraw`,
service `draw`, ingress `draw.viktorbarzin.me` with `auth = "required"`).
### Docker
```bash
docker run -d \
--name excalidraw-rooms \
-p 8080:8080 \
-v /path/to/storage:/data \
viktorbarzin/excalidraw-library:v4
```
### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: excalidraw
spec:
replicas: 1
selector:
matchLabels:
app: excalidraw
template:
metadata:
labels:
app: excalidraw
spec:
containers:
- name: excalidraw
image: viktorbarzin/excalidraw-library:v4
ports:
- containerPort: 8080
env:
- name: DATA_DIR
value: /data
- name: PORT
value: "8080"
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
nfs:
server: 192.168.1.127
path: /srv/nfs/excalidraw
```
### With Authentik SSO
@ -65,7 +96,23 @@ The application reads user identity from Authentik headers:
- `X-Authentik-Email` - Displayed in UI
- `X-Authentik-Name` - Displayed in UI
Requests without `X-Authentik-Username` fall back to the `anonymous` user.
Configure your ingress to pass these headers:
```yaml
annotations:
nginx.ingress.kubernetes.io/auth-response-headers: "X-authentik-username,X-authentik-email,X-authentik-name"
```
## Building
```bash
# Build the Docker image
docker build -t excalidraw-library .
# Or build locally
go build -o excalidraw-library .
./excalidraw-library
```
## API Endpoints
@ -75,25 +122,10 @@ Requests without `X-Authentik-Username` fall back to the `anonymous` user.
| GET | `/api/drawings` | List all drawings for current user |
| GET | `/api/drawings/:id` | Get drawing data |
| PUT | `/api/drawings/:id` | Save drawing |
| PATCH | `/api/drawings/:id` | Rename drawing — body `{"name": "<new-name>"}`; returns `{"status":"renamed","id":"<new-id>"}`; 409 if the target name exists |
| DELETE | `/api/drawings/:id` | Delete drawing |
| GET | `/api/user` | Get current user info |
| GET | `/draw/:id` | Open drawing in editor |
Names supplied on rename are sanitized server-side to `[a-zA-Z0-9-_]` (other characters
become `-`; a trailing `.excalidraw` is stripped). Existing IDs are accepted
as-is for backward compatibility with API clients.
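
A small worked example of the sanitization rule, as a sketch rather than the server's exact code: the function name `sanitizeName` follows the rule described above, and the character class is spelled `[^a-zA-Z0-9_-]` here on the assumption that it matches the same set as the documented `[a-zA-Z0-9-_]`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// invalidNameChars matches anything outside letters, digits, '_' and '-'.
var invalidNameChars = regexp.MustCompile(`[^a-zA-Z0-9_-]`)

// sanitizeName trims whitespace, strips a trailing ".excalidraw", and replaces
// every disallowed character with '-'. It returns "" when nothing but dashes
// would remain, which a caller can treat as an invalid name.
func sanitizeName(name string) string {
	name = strings.TrimSpace(name)
	name = strings.TrimSuffix(name, ".excalidraw")
	name = invalidNameChars.ReplaceAllString(name, "-")
	if strings.Trim(name, "-") == "" {
		return ""
	}
	return name
}

func main() {
	for _, in := range []string{"My Drawing!", "net diag.excalidraw", `a/b\c`, "../..", "   "} {
		fmt.Printf("%q -> %q\n", in, sanitizeName(in))
	}
}
```

Replacing path separators and dots with `-` means a sanitized name can never contain `/` or `..`, so the resulting file stays inside the user's directory.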
## Development
```bash
# Run tests
go test ./...
# Run locally
DATA_DIR=/tmp/excalidraw-data go run .
```
## License
MIT


@ -9,7 +9,6 @@ import (
"net/http"
"os"
"path/filepath"
"regexp"
"sort"
"strings"
"time"
@ -64,21 +63,6 @@ func getUsername(r *http.Request) string {
return username
}
var invalidNameChars = regexp.MustCompile(`[^a-zA-Z0-9-_]`)
// sanitizeName normalizes a user-supplied drawing name into a safe file ID
// (same charset the dashboard applies on create). Returns "" if nothing
// meaningful remains.
func sanitizeName(name string) string {
name = strings.TrimSpace(name)
name = strings.TrimSuffix(name, ".excalidraw")
name = invalidNameChars.ReplaceAllString(name, "-")
if strings.Trim(name, "-") == "" {
return ""
}
return name
}
// getUserDataDir returns the data directory for a specific user and ensures it exists
func getUserDataDir(username string) string {
userDir := filepath.Join(dataDir, username)
@ -184,41 +168,6 @@ func handleDrawing(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "saved", "id": id})
case http.MethodPatch:
var req struct {
Name string `json:"name"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "Invalid JSON body", http.StatusBadRequest)
return
}
newID := sanitizeName(req.Name)
if newID == "" {
http.Error(w, "Invalid name", http.StatusBadRequest)
return
}
if _, err := os.Stat(filePath); err != nil {
if os.IsNotExist(err) {
http.Error(w, "Drawing not found", http.StatusNotFound)
} else {
http.Error(w, err.Error(), http.StatusInternalServerError)
}
return
}
if newID != id {
newPath := filepath.Join(userDataDir, newID+".excalidraw")
if _, err := os.Stat(newPath); err == nil {
http.Error(w, "A drawing with that name already exists", http.StatusConflict)
return
}
if err := os.Rename(filePath, newPath); err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "renamed", "id": newID})
case http.MethodDelete:
if err := os.Remove(filePath); err != nil {
if os.IsNotExist(err) {
@ -315,8 +264,6 @@ const dashboardHTML = `<!DOCTYPE html>
.btn:hover { background: #5b4cdb; }
.btn-danger { background: #e74c3c; }
.btn-danger:hover { background: #c0392b; }
.btn-secondary { background: #3d3d5c; }
.btn-secondary:hover { background: #4a4a70; }
.btn-small { padding: 0.4rem 0.8rem; font-size: 0.85rem; }
.drawings { display: grid; gap: 1rem; }
.drawing {
@ -395,11 +342,11 @@ const dashboardHTML = `<!DOCTYPE html>
<div id="modal" class="modal">
<div class="modal-content">
<h2 id="modal-title">New Drawing</h2>
<h2>New Drawing</h2>
<input type="text" id="drawingName" placeholder="Drawing name..." autofocus>
<div class="modal-actions">
<button class="btn" style="background:#444" onclick="hideModal()">Cancel</button>
<button class="btn" id="modal-confirm" onclick="confirmModal()">Create</button>
<button class="btn" onclick="createDrawing()">Create</button>
</div>
</div>
</div>
@ -422,63 +369,31 @@ const dashboardHTML = `<!DOCTYPE html>
}
}
function drawingRow(d) {
var row = document.createElement('div');
row.className = 'drawing';
var info = document.createElement('div');
info.className = 'drawing-info';
var nameLink = document.createElement('a');
nameLink.className = 'drawing-name';
nameLink.href = '/draw/' + encodeURIComponent(d.id);
nameLink.textContent = d.name;
var meta = document.createElement('div');
meta.className = 'drawing-meta';
meta.textContent = 'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' +
new Date(d.modified).toLocaleTimeString() + ' - ' + formatSize(d.size);
info.appendChild(nameLink);
info.appendChild(meta);
var actions = document.createElement('div');
actions.className = 'drawing-actions';
var open = document.createElement('a');
open.className = 'btn btn-small';
open.href = '/draw/' + encodeURIComponent(d.id);
open.textContent = 'Open';
var rename = document.createElement('button');
rename.className = 'btn btn-small btn-secondary';
rename.textContent = 'Rename';
rename.onclick = function() { showRenameModal(d.id); };
var del = document.createElement('button');
del.className = 'btn btn-small btn-danger';
del.textContent = 'Delete';
del.onclick = function() { deleteDrawing(d.id); };
actions.appendChild(open);
actions.appendChild(rename);
actions.appendChild(del);
row.appendChild(info);
row.appendChild(actions);
return row;
}
async function loadDrawings() {
const resp = await fetch('/api/drawings');
const drawings = await resp.json();
const container = document.getElementById('drawings');
container.replaceChildren();
if (!drawings || drawings.length === 0) {
var empty = document.createElement('div');
empty.className = 'empty';
empty.textContent = 'No drawings yet. Create your first one!';
container.appendChild(empty);
container.innerHTML = '<div class="empty">No drawings yet. Create your first one!</div>';
return;
}
drawings.forEach(function(d) {
container.appendChild(drawingRow(d));
});
container.innerHTML = drawings.map(function(d) {
return '<div class="drawing">' +
'<div class="drawing-info">' +
'<a href="/draw/' + d.id + '" class="drawing-name">' + d.name + '</a>' +
'<div class="drawing-meta">' +
'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' + new Date(d.modified).toLocaleTimeString() +
' - ' + formatSize(d.size) +
'</div>' +
'</div>' +
'<div class="drawing-actions">' +
'<a href="/draw/' + d.id + '" class="btn btn-small">Open</a>' +
'<button class="btn btn-small btn-danger" onclick="deleteDrawing(\'' + d.id + '\')">Delete</button>' +
'</div>' +
'</div>';
}).join('');
}
function formatSize(bytes) {
@ -487,64 +402,18 @@ const dashboardHTML = `<!DOCTYPE html>
return (bytes / (1024 * 1024)).toFixed(1) + ' MB';
}
var modalAction = null; // invoked with the input value on confirm
function showModal(title, confirmLabel, initialValue, action) {
document.getElementById('modal-title').textContent = title;
document.getElementById('modal-confirm').textContent = confirmLabel;
var input = document.getElementById('drawingName');
input.value = initialValue || '';
modalAction = action;
document.getElementById('modal').classList.add('active');
input.focus();
input.select();
}
function showNewModal() {
showModal('New Drawing', 'Create', '', createDrawing);
}
function showRenameModal(id) {
showModal('Rename Drawing', 'Rename', id, function(value) {
renameDrawing(id, value);
});
document.getElementById('modal').classList.add('active');
document.getElementById('drawingName').focus();
}
function hideModal() {
document.getElementById('modal').classList.remove('active');
document.getElementById('drawingName').value = '';
modalAction = null;
}
function confirmModal() {
if (modalAction) modalAction(document.getElementById('drawingName').value);
}
async function renameDrawing(id, newName) {
newName = (newName || '').trim();
if (!newName || newName === id) {
hideModal();
return;
}
var resp = await fetch('/api/drawings/' + encodeURIComponent(id), {
method: 'PATCH',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name: newName })
});
if (resp.status === 409) {
alert('A drawing with that name already exists.');
return; // keep the modal open so the user can pick another name
}
if (!resp.ok) {
alert('Rename failed: ' + await resp.text());
return;
}
hideModal();
loadDrawings();
}
async function createDrawing(name) {
name = (name || '').trim();
async function createDrawing() {
var name = document.getElementById('drawingName').value.trim();
if (!name) {
name = 'drawing-' + Date.now();
}
@ -577,7 +446,7 @@ const dashboardHTML = `<!DOCTYPE html>
}
document.getElementById('drawingName').addEventListener('keypress', function(e) {
if (e.key === 'Enter') confirmModal();
if (e.key === 'Enter') createDrawing();
});
document.getElementById('modal').addEventListener('click', function(e) {


@ -1,249 +0,0 @@
package main
import (
"encoding/json"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"strings"
"testing"
)
const testDrawing = `{"type":"excalidraw","version":2,"source":"excalidraw-library","elements":[{"id":"e1"}],"appState":{"viewBackgroundColor":"#ffffff"}}`
func setupDataDir(t *testing.T) {
t.Helper()
dataDir = t.TempDir()
}
// doDrawing sends a request to handleDrawing for the given user and returns the recorder.
func doDrawing(t *testing.T, method, id, body, user string) *httptest.ResponseRecorder {
t.Helper()
reader := strings.NewReader(body)
req := httptest.NewRequest(method, "/api/drawings/"+id, reader)
if user != "" {
req.Header.Set("X-Authentik-Username", user)
}
w := httptest.NewRecorder()
handleDrawing(w, req)
return w
}
func listDrawings(t *testing.T, user string) []Drawing {
t.Helper()
req := httptest.NewRequest(http.MethodGet, "/api/drawings", nil)
if user != "" {
req.Header.Set("X-Authentik-Username", user)
}
w := httptest.NewRecorder()
handleListDrawings(w, req)
if w.Code != http.StatusOK {
t.Fatalf("list: expected 200, got %d", w.Code)
}
var drawings []Drawing
if err := json.Unmarshal(w.Body.Bytes(), &drawings); err != nil {
t.Fatalf("list: bad JSON: %v", err)
}
return drawings
}
func TestPutGetRoundtrip(t *testing.T) {
setupDataDir(t)
if w := doDrawing(t, http.MethodPut, "foo", testDrawing, "alice"); w.Code != http.StatusOK {
t.Fatalf("PUT: expected 200, got %d: %s", w.Code, w.Body.String())
}
w := doDrawing(t, http.MethodGet, "foo", "", "alice")
if w.Code != http.StatusOK {
t.Fatalf("GET: expected 200, got %d", w.Code)
}
if w.Body.String() != testDrawing {
t.Errorf("GET: content mismatch: %s", w.Body.String())
}
}
func TestGetMissing(t *testing.T) {
setupDataDir(t)
if w := doDrawing(t, http.MethodGet, "nope", "", "alice"); w.Code != http.StatusNotFound {
t.Fatalf("expected 404, got %d", w.Code)
}
}
func TestListDrawings(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "one", testDrawing, "alice")
doDrawing(t, http.MethodPut, "two", testDrawing, "alice")
drawings := listDrawings(t, "alice")
if len(drawings) != 2 {
t.Fatalf("expected 2 drawings, got %d", len(drawings))
}
ids := map[string]bool{drawings[0].ID: true, drawings[1].ID: true}
if !ids["one"] || !ids["two"] {
t.Errorf("unexpected ids: %v", ids)
}
for _, d := range drawings {
if d.Name != d.ID {
t.Errorf("name should equal id: %+v", d)
}
}
}
func TestDelete(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusOK {
t.Fatalf("DELETE: expected 200, got %d", w.Code)
}
if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound {
t.Fatalf("GET after delete: expected 404, got %d", w.Code)
}
if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusNotFound {
t.Fatalf("second DELETE: expected 404, got %d", w.Code)
}
}
func TestPerUserIsolation(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "secret", testDrawing, "alice")
if w := doDrawing(t, http.MethodGet, "secret", "", "bob"); w.Code != http.StatusNotFound {
t.Fatalf("bob should not see alice's drawing, got %d", w.Code)
}
if drawings := listDrawings(t, "bob"); len(drawings) != 0 {
t.Fatalf("bob's list should be empty, got %d", len(drawings))
}
}
// --- rename (PATCH) ---
func renameReq(t *testing.T, id, newName, user string) *httptest.ResponseRecorder {
t.Helper()
return doDrawing(t, http.MethodPatch, id, `{"name":`+jsonQuote(newName)+`}`, user)
}
// jsonQuote returns s as a JSON string literal (quoted and escaped) using json.Marshal.
func jsonQuote(s string) string {
b, _ := json.Marshal(s)
return string(b)
}
func TestRenameSuccess(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
w := renameReq(t, "foo", "bar", "alice")
if w.Code != http.StatusOK {
t.Fatalf("PATCH: expected 200, got %d: %s", w.Code, w.Body.String())
}
var resp map[string]string
if err := json.Unmarshal(w.Body.Bytes(), &resp); err != nil {
t.Fatalf("PATCH: bad JSON: %v", err)
}
if resp["id"] != "bar" || resp["status"] != "renamed" {
t.Errorf("unexpected response: %v", resp)
}
if w := doDrawing(t, http.MethodGet, "bar", "", "alice"); w.Code != http.StatusOK || w.Body.String() != testDrawing {
t.Errorf("GET new id: code=%d content=%q", w.Code, w.Body.String())
}
if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound {
t.Errorf("GET old id: expected 404, got %d", w.Code)
}
}
func TestRenameConflict(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "a", testDrawing, "alice")
doDrawing(t, http.MethodPut, "b", testDrawing, "alice")
if w := renameReq(t, "a", "b", "alice"); w.Code != http.StatusConflict {
t.Fatalf("expected 409, got %d", w.Code)
}
// both drawings intact
for _, id := range []string{"a", "b"} {
if w := doDrawing(t, http.MethodGet, id, "", "alice"); w.Code != http.StatusOK {
t.Errorf("drawing %q should be intact, got %d", id, w.Code)
}
}
}
func TestRenameMissing(t *testing.T) {
setupDataDir(t)
if w := renameReq(t, "nope", "new", "alice"); w.Code != http.StatusNotFound {
t.Fatalf("expected 404, got %d", w.Code)
}
}
func TestRenameSameName(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
w := renameReq(t, "foo", "foo", "alice")
if w.Code != http.StatusOK {
t.Fatalf("same-name rename: expected 200, got %d: %s", w.Code, w.Body.String())
}
if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusOK {
t.Errorf("drawing should be intact, got %d", w.Code)
}
}
func TestRenameInvalidNames(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
for _, name := range []string{"", " ", "../..", "---"} {
if w := renameReq(t, "foo", name, "alice"); w.Code != http.StatusBadRequest {
t.Errorf("rename to %q: expected 400, got %d", name, w.Code)
}
}
// malformed body
if w := doDrawing(t, http.MethodPatch, "foo", `{not json`, "alice"); w.Code != http.StatusBadRequest {
t.Errorf("malformed body: expected 400, got %d", w.Code)
}
}
func TestRenameSanitization(t *testing.T) {
setupDataDir(t)
cases := []struct{ in, want string }{
{"My Drawing!", "My-Drawing-"},
{"net diag.excalidraw", "net-diag"}, // .excalidraw suffix stripped, not mangled
{"a/b\\c", "a-b-c"},
}
for _, c := range cases {
doDrawing(t, http.MethodPut, "src", testDrawing, "alice")
w := renameReq(t, "src", c.in, "alice")
if w.Code != http.StatusOK {
t.Errorf("rename to %q: expected 200, got %d: %s", c.in, w.Code, w.Body.String())
continue
}
var resp map[string]string
json.Unmarshal(w.Body.Bytes(), &resp)
if resp["id"] != c.want {
t.Errorf("rename to %q: expected id %q, got %q", c.in, c.want, resp["id"])
}
// file must be inside the user dir under the sanitized name
if _, err := os.Stat(filepath.Join(dataDir, "alice", c.want+".excalidraw")); err != nil {
t.Errorf("rename to %q: expected file %q on disk: %v", c.in, c.want, err)
}
doDrawing(t, http.MethodDelete, resp["id"], "", "alice")
}
}
func TestRenameTraversalStaysInUserDir(t *testing.T) {
setupDataDir(t)
doDrawing(t, http.MethodPut, "foo", testDrawing, "alice")
w := renameReq(t, "foo", "../../../etc/passwd", "alice")
if w.Code == http.StatusOK {
var resp map[string]string
json.Unmarshal(w.Body.Bytes(), &resp)
if strings.Contains(resp["id"], "/") || strings.Contains(resp["id"], "..") {
t.Fatalf("traversal characters survived: %q", resp["id"])
}
if _, err := os.Stat(filepath.Join(dataDir, "alice", resp["id"]+".excalidraw")); err != nil {
t.Fatalf("renamed file escaped user dir: %v", err)
}
}
// nothing outside the data dir
if _, err := os.Stat(filepath.Join(dataDir, "..", "etc")); err == nil {
t.Fatal("file escaped the data dir")
}
}

@ -8,41 +8,41 @@
* { margin: 0; padding: 0; }
html, body { width: 100%; height: 100%; overflow: hidden; }
#root { width: 100%; height: 100%; }
.top-right-ui {
.toolbar {
position: fixed;
top: 10px;
left: 10px;
z-index: 1000;
display: flex;
align-items: center;
gap: 8px;
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
}
.top-right-ui a, .top-right-ui button {
display: inline-flex;
align-items: center;
gap: 6px;
background: rgba(255,255,255,0.95);
padding: 8px 12px;
border: 1px solid transparent;
border-radius: 8px;
box-shadow: 0 2px 8px rgba(0,0,0,0.15);
}
.toolbar button, .toolbar a {
padding: 6px 14px;
border: none;
border-radius: 6px;
cursor: pointer;
font-size: 13px;
font-size: 14px;
background: #6c5ce7;
color: white;
text-decoration: none;
box-shadow: 0 1px 4px rgba(0,0,0,0.12);
max-width: 40vw;
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
display: inline-block;
}
.top-right-ui.theme-light a, .top-right-ui.theme-light button {
background: #ffffff;
color: #1b1b1f;
.toolbar button:hover, .toolbar a:hover { background: #5b4cdb; }
.toolbar .secondary { background: #ddd; color: #333; }
.toolbar .secondary:hover { background: #ccc; }
.toolbar .title {
font-weight: 600;
padding: 6px 0;
color: #333;
}
.top-right-ui.theme-dark a, .top-right-ui.theme-dark button {
background: #232329;
color: #e9ecef;
}
.top-right-ui button:hover, .top-right-ui a:hover { border-color: #a29bfe; }
.status {
position: fixed;
bottom: 10px;
right: 60px;
right: 10px;
padding: 6px 12px;
background: rgba(0,0,0,0.7);
color: white;
@ -51,7 +51,6 @@
z-index: 1000;
opacity: 0;
transition: opacity 0.3s;
pointer-events: none;
}
.status.show { opacity: 1; }
.loading {
@ -68,6 +67,11 @@
</style>
</head>
<body>
<div class="toolbar">
<a href="/" class="secondary">Back to Library</a>
<span class="title" id="title">Loading...</span>
<button onclick="saveDrawing()">Save</button>
</div>
<div id="root">
<div class="loading">
<div>Loading Excalidraw...</div>
@ -77,33 +81,16 @@
<div id="status" class="status">Saved</div>
<script>
// Replaces #root with an error panel (safe DOM methods, no innerHTML).
function showFatal(title, detail) {
var root = document.getElementById('root');
root.replaceChildren();
var panel = document.createElement('div');
panel.className = 'loading error';
var titleEl = document.createElement('div');
titleEl.textContent = title;
panel.appendChild(titleEl);
if (detail) {
var detailEl = document.createElement('div');
detailEl.style.fontSize = '0.9rem';
detailEl.textContent = detail;
panel.appendChild(detailEl);
}
root.appendChild(panel);
}
// Get drawing ID from URL path: /draw/{id}
var pathParts = window.location.pathname.split('/');
var drawingId = pathParts[pathParts.length - 1] || pathParts[pathParts.length - 2];
if (!drawingId) {
showFatal('No drawing ID specified');
document.getElementById('root').innerHTML = '<div class="loading error">No drawing ID specified</div>';
throw new Error('No drawing ID');
}
document.getElementById('title').textContent = drawingId;
document.title = drawingId + ' - Excalidraw';
var excalidrawAPI = null;
@ -172,46 +159,6 @@
autoSaveTimeout = setTimeout(saveDrawing, 2000);
}
// Renames the current drawing via the API. Returns the new ID, or null
// if the rename was cancelled or failed.
async function renameCurrentDrawing() {
var newName = window.prompt('Rename drawing', drawingId);
if (newName === null) return null;
newName = newName.trim();
if (!newName || newName === drawingId) return null;
// A pending autosave would resurrect the old file after the rename.
clearTimeout(autoSaveTimeout);
var resp;
try {
resp = await fetch('/api/drawings/' + drawingId, {
method: 'PATCH',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name: newName })
});
} catch (e) {
showStatus('Rename failed!');
return null;
}
if (resp.status === 409) {
window.alert('A drawing with that name already exists.');
return null;
}
if (!resp.ok) {
window.alert('Rename failed: ' + (await resp.text()));
return null;
}
var result = await resp.json();
drawingId = result.id;
document.title = drawingId + ' - Excalidraw';
window.history.replaceState(null, '', '/draw/' + encodeURIComponent(drawingId));
showStatus('Renamed');
// Flush any unsaved changes to the new file.
saveDrawing();
return drawingId;
}
// Load scripts dynamically
function loadScript(src) {
return new Promise(function(resolve, reject) {
@ -250,76 +197,33 @@
updateLoadStatus('Rendering Excalidraw...');
var e = React.createElement;
var MainMenu = ExcalidrawLib.MainMenu;
// Native default menu items, existence-guarded so a library
// update that drops one degrades gracefully.
function defaultItem(name) {
var C = MainMenu && MainMenu.DefaultItems && MainMenu.DefaultItems[name];
return C ? e(C, { key: name }) : null;
}
// Create Excalidraw component
function App() {
var nameState = React.useState(drawingId);
var name = nameState[0], setName = nameState[1];
function onRename() {
renameCurrentDrawing().then(function(newId) {
if (newId) setName(newId);
});
}
// The menu is where the native export features live:
// Export = "Save to..." (.excalidraw), SaveAsImage =
// "Export image..." (PNG / SVG / clipboard).
var menu = MainMenu ? e(MainMenu, { key: 'menu' },
e(MainMenu.Item, { key: 'back', onSelect: function() { window.location.href = '/'; } }, 'Back to Library'),
e(MainMenu.Item, { key: 'save', onSelect: saveDrawing }, 'Save now'),
e(MainMenu.Item, { key: 'rename', onSelect: onRename }, 'Rename drawing…'),
MainMenu.Separator ? e(MainMenu.Separator, { key: 'sep1' }) : null,
defaultItem('LoadScene'),
defaultItem('Export'),
defaultItem('SaveAsImage'),
MainMenu.Separator ? e(MainMenu.Separator, { key: 'sep2' }) : null,
defaultItem('ClearCanvas'),
defaultItem('ToggleTheme'),
defaultItem('ChangeCanvasBackground'),
defaultItem('Help')
) : null;
return e(ExcalidrawLib.Excalidraw, {
return React.createElement(ExcalidrawLib.Excalidraw, {
initialData: initialData ? {
elements: initialData.elements || [],
appState: initialData.appState || {}
} : undefined,
UIOptions: { canvasActions: { toggleTheme: true } },
excalidrawAPI: function(api) {
excalidrawAPI = api;
console.log('Excalidraw API ready');
},
onChange: onChange,
renderTopRightUI: function(isMobile, appState) {
return e('div', { className: 'top-right-ui theme-' + (appState.theme || 'light') },
e('a', { key: 'home', href: '/', title: 'Back to Library' }, '← Library'),
e('button', {
key: 'name',
title: 'Click to rename',
onClick: onRename
}, name + ' ✎')
);
}
}, menu);
onChange: onChange
});
}
var root = ReactDOM.createRoot(document.getElementById('root'));
root.render(e(App));
root.render(React.createElement(App));
console.log('Excalidraw rendered successfully');
} catch (err) {
console.error('Init error:', err);
showFatal('Failed to load Excalidraw', err.message);
} catch (e) {
console.error('Init error:', e);
document.getElementById('root').innerHTML =
'<div class="loading error">' +
'<div>Failed to load Excalidraw</div>' +
'<div style="font-size:0.9rem">' + e.message + '</div>' +
'</div>';
}
}

@ -1,49 +0,0 @@
# emo's Claude Excalidraw upload RBAC.
#
# emo's agent uploads drawings with `kubectl -n excalidraw port-forward svc/draw`
# + `PUT /api/drawings/<name>` carrying the X-Authentik-Username header (the
# documented recipe in emo's ~/.claude/CLAUDE.md; the app sits behind Authentik
# forward-auth, so direct curl gets redirected). His hands-off credential is the
# chrome-service/emo-browser ServiceAccount kubeconfig (stacks/chrome-service/rbac.tf);
# its cluster-wide grant (oidc-power-user-readonly) is read-only, so pods/portforward
# must be granted per namespace. This is the excalidraw-namespace grant
# (Viktor's call, 2026-07-02; same pattern as the chrome-service one).
#
# TRADE-OFF (accepted): port-forward into this namespace bypasses the Authentik
# ingress and the drawings API trusts the X-Authentik-Username header, so the SA
# can read/write ANY user's drawings, not only emo's. The namespace runs nothing
# but the drawings app, and the same class of trade-off was already accepted for
# the shared browser (CDP reach into Viktor's sessions).
resource "kubernetes_role" "portforward" {
metadata {
name = "excalidraw-portforward"
namespace = kubernetes_namespace.excalidraw.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods/portforward"]
verbs = ["create"]
}
}
resource "kubernetes_role_binding" "emo_browser_portforward" {
metadata {
name = "emo-browser-portforward"
namespace = kubernetes_namespace.excalidraw.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.portforward.metadata[0].name
}
subject {
kind = "ServiceAccount"
# Defined in stacks/chrome-service/rbac.tf; referenced by name across
# stacks, same as that file references the oidc-power-user-readonly
# ClusterRole. get/list on pods+services (needed to resolve svc/draw) comes
# from the SA's cluster-read binding there.
name = "emo-browser"
namespace = "chrome-service"
}
}

@ -166,33 +166,6 @@ resource "kubernetes_deployment" "f1-stream" {
name = "DISCORD_CHANNELS"
value = var.discord_f1_channel_ids
}
# Replays feature (app repo ADR-0002). optional=true so the pod still
# starts before the Reddit app credentials exist; the app treats missing
# creds as "replays off" (logs "Replays pipeline disabled"). The
# ExternalSecret above uses dataFrom.extract on the Vault "f1-stream"
# key, so adding reddit_client_id / reddit_client_secret there auto-syncs
# them into this Secret; no ExternalSecret change needed, just a pod
# restart to pick them up.
env {
name = "REDDIT_CLIENT_ID"
value_from {
secret_key_ref {
name = "f1-stream-secrets"
key = "reddit_client_id"
optional = true
}
}
}
env {
name = "REDDIT_CLIENT_SECRET"
value_from {
secret_key_ref {
name = "f1-stream-secrets"
key = "reddit_client_secret"
optional = true
}
}
}
# Verifier connects to in-cluster headed Chromium pool; see
# stacks/chrome-service/. Falls back to in-process headless if unset.
# 2026-06-04: migrated WS (:3000 / path-token) -> CDP (:9222 /

@ -117,9 +117,8 @@ resource "kubernetes_deployment" "frigate" {
limits = {
memory = "10Gi"
"nvidia.com/gpu" = "1"
# GPU VRAM budget (ADR-0016): detector + ffmpeg decode (~1.9 GiB),
# +~250 MiB NVDEC headroom for the vermont-garage camera (ADR-0017).
"viktorbarzin.me/gpumem" = "2300"
# GPU VRAM budget (ADR-0016): detector + ffmpeg decode (~1.9 GiB).
"viktorbarzin.me/gpumem" = "2000"
}
}
env {

@ -73,9 +73,7 @@ resource "kubernetes_deployment" "immich-frame-emo" {
}
spec {
container {
# immich_v3: upstream compat tag for Immich v3; see frame.tf for the
# full story; repin to a versioned tag once upstream releases v3 support.
image = "ghcr.io/immichframe/immichframe:immich_v3"
image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
name = "immich-frame-emo"
resources {
requests = {
@ -144,21 +142,14 @@ resource "kubernetes_service" "immich-frame-emo" {
module "ingress_emo" {
source = "../../modules/kubernetes/ingress_factory"
# Photo-frame kiosk display on Emo's Portal Mini (Sofia LAN): WebView
# pulling images via an Immich API key; no user login possible on the
# device. Same LAN-only gating as frame.tf: home-lans-only ipAllowList +
# dns_type "internal" (Emo's Portal already resolves this host internally
# via Technitium; the public internal-IP record covers any resolver).
# LAN-only design: docs/plans/2026-07-04-immich-frame-lan-only-design.md.
# auth = "none": kiosk WebView, no user auth by design; gated by the home-lans-only ipAllowList instead.
auth = "none"
dns_type = "internal"
extra_middlewares = ["traefik-home-lans-only@kubernetescrd"]
# Not externally reachable; explicit opt-out so external-monitor-sync
# drops the old [External] monitor instead of default-opting it back in.
external_monitor = false
namespace = "immich"
name = "highlights-immich-emo"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame-emo"
# Photo-frame kiosk display on Emo's Portal: headless browser pulling images
# via an Immich API key (no user login). Forward-auth would 302 the device to
# Authentik with no way to complete login.
# auth = "none": photo-frame kiosk; headless browser with API key; no user login.
auth = "none"
dns_type = "proxied"
namespace = "immich"
name = "highlights-immich-emo"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame-emo"
}

@ -69,11 +69,7 @@ resource "kubernetes_deployment" "immich-frame" {
}
spec {
container {
# immich_v3 is the upstream compat tag for Immich v3 servers; every
# versioned release (up to v1.0.33.0) crashes deserializing v3 API
# responses (immichFrame/immichFrame#653). Pin back to a vX.Y.Z.W tag
# once a stable release ships v3 support (upstream PR #654).
image = "ghcr.io/immichframe/immichframe:immich_v3"
image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
name = "immich-frame"
resources {
requests = {
@ -142,23 +138,14 @@ resource "kubernetes_service" "immich-frame" {
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# Photo-frame kiosk display (Viktor's London Portal Plus WebView) pulls
# images via an Immich API key; no user login possible on the device, so
# forward-auth would 302 it to Authentik with no way to complete login.
# The GATE is network-level: the home-lans-only ipAllowList (Sofia/London/
# Valchedrym LANs + 10/8) 403s everyone else, and dns_type "internal"
# publishes the Traefik LB IP publicly so the Portal's baked-in URL resolves
# from any resolver yet routes only via the home LANs / WG tunnel.
# LAN-only design: docs/plans/2026-07-04-immich-frame-lan-only-design.md.
# auth = "none": kiosk WebView, no user auth by design; gated by the home-lans-only ipAllowList instead.
auth = "none"
dns_type = "internal"
extra_middlewares = ["traefik-home-lans-only@kubernetescrd"]
# Not externally reachable; explicit opt-out so external-monitor-sync
# drops the old [External] monitor instead of default-opting it back in.
external_monitor = false
namespace = "immich"
name = "highlights-immich"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame"
# Photo-frame kiosk display runs in headless browser mode on a TV/frame
# device and pulls images via an Immich API key (no user login). Forward-auth
# would 302 the device to Authentik with no way to complete login.
# auth = "none": Photo-frame kiosk display; headless browser with API key; no user login; forward-auth breaks device automation.
auth = "none"
dns_type = "proxied"
namespace = "immich"
name = "highlights-immich"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame"
}

@ -15,7 +15,7 @@ locals {
variable "immich_version" {
type = string
# Change me to upgrade
default = "v3.0.0"
default = "v2.7.5"
}
variable "proxmox_host" { type = string }
variable "redis_host" { type = string }
@ -492,7 +492,7 @@ resource "kubernetes_deployment" "immich-postgres" {
}
spec {
container {
image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
name = "immich-postgresql"
port {
container_port = 5432
@ -882,7 +882,7 @@ resource "kubernetes_cron_job_v1" "clip-index-prewarm" {
restart_policy = "Never"
container {
name = "prewarm"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
# command overrides the postgres entrypoint; runs psql directly.
command = [
"psql", "-v", "ON_ERROR_STOP=1", "-c",
@ -964,7 +964,7 @@ resource "kubernetes_cron_job_v1" "immich-search-probe" {
}
init_container {
name = "measure"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.4.3-pgvectors0.2.0"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
command = ["/bin/bash", "-c", <<-EOT
set -uo pipefail
OUT=/shared/metrics.prom

@ -43,11 +43,6 @@ locals {
# ghcr.io/passionprojectsanca/book-plotter (built by GHA in Anca's repo,
# under her own org's ghcr). The deployment references the cloned secret.
"plotting-book",
# excalidraw: infra-owned image migrated from manual DockerHub pushes to
# PRIVATE ghcr.io/viktorbarzin/excalidraw-library (ADR-0002, built by
# .github/workflows/build-excalidraw.yml). The deployment references the
# cloned secret.
"excalidraw",
]
}

@ -19,12 +19,3 @@ plans@viktorbarzin.me spam@viktorbarzin.me
# to trips@, or every verification/recovery send is rejected (550 sender). Also
# routes any inbound trips@ to spam@.
trips@viktorbarzin.me spam@viktorbarzin.me
# docs@ -> docs@: explicit self-alias for the paperless-ngx ingest MAILBOX
# (a real account in secret/platform.mailserver_accounts). Without this the
# @domain catch-all above (Vault-side aliases) rewrites docs@ to spam@ and the
# mail lands in the TripIt-swept catch-all mailbox instead. Same pattern as
# me@ -> me@. Delivery-time sender allowlist: docs-at-viktorbarzin.me
# .dovecot.sieve (mounted as docs@viktorbarzin.me.dovecot.sieve).
# Runbook: docs/runbooks/paperless-mail-ingest.md
docs@viktorbarzin.me docs@viktorbarzin.me

@ -1,17 +0,0 @@
# Sender allowlist for the paperless-ngx ingest mailbox docs@viktorbarzin.me.
# Family members forward document emails here; paperless-ngx polls the INBOX
# over IMAP and maps each sender to a paperless account (1 mail rule per
# sender). Decision (Viktor, 2026-07-03): mail from any OTHER sender is
# ignored and deleted — discarded here at LMTP delivery, before paperless
# ever sees it. This also keeps spam to the guessable address out entirely.
#
# Keep this list in sync with the paperless mail rules (the sender -> owner
# map). Add-a-sender procedure: docs/runbooks/paperless-mail-ingest.md
if not address :is "from" ["me@viktorbarzin.me",
"vbarzin@gmail.com",
"viktorbarzin@meta.com",
"ancaelena98@gmail.com",
"emil.barzin@gmail.com"] {
discard;
stop;
}
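The allowlist semantics of the sieve script above (keep mail only when the From address matches one of a fixed set, otherwise discard) can be illustrated outside of Sieve. The sketch below is an approximation in Go, not what Dovecot executes: `senderAllowed` and the example addresses are hypothetical, and `strings.EqualFold` stands in for Sieve's default case-insensitive ASCII comparison.

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
)

// allowedSenders holds hypothetical addresses for illustration only.
var allowedSenders = []string{"alice@example.test", "bob@example.test"}

// senderAllowed reports whether the From header's address is on the list.
// An unparseable header is treated as not allowed (an assumption; the sieve
// script's behaviour on malformed headers is not shown in the source).
func senderAllowed(fromHeader string) bool {
	addr, err := mail.ParseAddress(fromHeader)
	if err != nil {
		return false
	}
	for _, a := range allowedSenders {
		if strings.EqualFold(addr.Address, a) {
			return true
		}
	}
	return false
}

func main() {
	for _, from := range []string{
		"Alice <ALICE@example.test>",
		"mallory@example.test",
		"not an address",
	} {
		fmt.Printf("%q allowed=%v\n", from, senderAllowed(from))
	}
}
```

The comparison is on the bare address, so a display name chosen by the sender has no effect on the decision.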

@ -14,15 +14,10 @@ variable "nfs_server" { type = string }
locals {
_account_set = keys(var.mailserver_accounts)
_virtual_lines = split("\n", format("%s%s", var.postfix_account_aliases, file("${path.module}/extra/aliases.txt")))
# NOTE: the length guard must live in a ternary, not a leading `&&` operand.
# Terraform only short-circuits && / || from v1.6; on the older terraform
# pinned in the infra-ci image, `split(" ", line)[1]` was still evaluated
# for blank/comment lines and failed the whole plan with "Invalid index"
# (first hit by CI pipeline #469, 2026-07-03). A conditional expression is
# lazy on every terraform version.
postfix_virtual = join("\n", [
for line in local._virtual_lines : line
if length(split(" ", line)) != 2 ? true : !(
if !(
length(split(" ", line)) == 2 &&
contains(local._account_set, split(" ", line)[0]) &&
contains(local._account_set, split(" ", line)[1]) &&
split(" ", line)[0] != split(" ", line)[1]
@ -115,12 +110,6 @@ resource "kubernetes_config_map" "mailserver_config" {
"postfix-main.cf" = var.postfix_cf
"postfix-virtual.cf" = local.postfix_virtual
# Per-user Dovecot sieve for the paperless-ngx ingest mailbox: DMS installs
# any /tmp/docker-mailserver/<login>.dovecot.sieve at startup. ConfigMap
# keys can't contain '@', so the key is sanitized ("-at-") and the
# volume_mount below restores the real filename.
"docs-at-viktorbarzin.me.dovecot.sieve" = file("${path.module}/extra/docs-at-viktorbarzin.me.dovecot.sieve")
KeyTable = "mail._domainkey.viktorbarzin.me viktorbarzin.me:mail:/etc/opendkim/keys/viktorbarzin.me-mail.key\n"
SigningTable = "*@viktorbarzin.me mail._domainkey.viktorbarzin.me\n"
TrustedHosts = "127.0.0.1\nlocalhost\n"
@ -415,12 +404,6 @@ resource "kubernetes_deployment" "mailserver" {
sub_path = "postfix-virtual.cf"
read_only = true
}
volume_mount {
name = "config"
mount_path = "/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve"
sub_path = "docs-at-viktorbarzin.me.dovecot.sieve"
read_only = true
}
volume_mount {
name = "config"
mount_path = "/tmp/docker-mailserver/fetchmail.cf"
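The `postfix_virtual` filter in this file's first hunk drops an alias line only when it has exactly two space-separated fields, both fields are real accounts, and the two differ; blank lines, comments, and self-aliases pass through. A minimal Go sketch of the same predicate is below. `filterVirtual` and the example addresses are hypothetical, and Go's `&&` short-circuits, so the length check can sit in the same condition, which is exactly the assumption the NOTE comment says did not hold on the older Terraform.

```go
package main

import (
	"fmt"
	"strings"
)

// filterVirtual keeps every line except "account -> different account" pairs.
// accounts plays the role of local._account_set (hypothetical data here).
func filterVirtual(lines []string, accounts map[string]bool) []string {
	kept := make([]string, 0, len(lines))
	for _, line := range lines {
		parts := strings.Split(line, " ") // mirrors split(" ", line)
		if len(parts) == 2 && accounts[parts[0]] && accounts[parts[1]] && parts[0] != parts[1] {
			continue // both sides are mailboxes and differ: drop the line
		}
		kept = append(kept, line)
	}
	return kept
}

func main() {
	accounts := map[string]bool{"a@example.test": true, "b@example.test": true}
	lines := []string{
		"# comment line with many words",
		"",
		"a@example.test a@example.test",     // self-alias: kept
		"a@example.test b@example.test",     // account to other account: dropped
		"alias@example.test b@example.test", // left side is not an account: kept
	}
	for _, l := range filterVirtual(lines, accounts) {
		fmt.Printf("%q\n", l)
	}
}
```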

@ -60,10 +60,6 @@ locals {
# t3 dispatch probe surface (auth="none" path carve-out on /probe): WS echo
# + healthz for the t3-probe drop-attribution client (stacks/t3code).
"t3-probe-ws" = "https://t3.viktorbarzin.me/probe/healthz"
# tasks PWA icons + manifest (auth="none" path carve-out, stacks/tasks
# module.ingress_icons): macOS/iOS/Android icon fetchers carry no session
# cookies, so an Authentik 302 here breaks Add-to-Dock icons.
"tasks-icons" = "https://tasks.viktorbarzin.me/apple-touch-icon.png"
# NOTE: openclaw task-webhook (auth="none") is intentionally NOT probed; it
# has no public DNS record (NXDOMAIN, external_monitor=false), so there is no
# externally GET-able URL to probe. Its carve-out is internal-only.

@ -18,6 +18,7 @@ const SITE_IDS = {
"stacks.viktorbarzin.me": "b38fda4285df",
"f1.viktorbarzin.me": "7e69786f66d5",
"frigate.viktorbarzin.me": "0d4044069ff5",
"highlights-immich.viktorbarzin.me": "602167601c6b",
"immich.viktorbarzin.me": "35eedb7a3d2b",
"mail.viktorbarzin.me": "082f164faa7d",
"navidrome.viktorbarzin.me": "8a3844ff75ba",

@ -28,6 +28,7 @@ routes = [
{ pattern = "stacks.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "f1.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "frigate.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "highlights-immich.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "immich.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "mail.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
{ pattern = "navidrome.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },

@ -0,0 +1,122 @@
# Automatic Google Drive -> site sync (added 2026-06-09; supersedes the
# earlier on-demand-only model now that content is actively maintained).
#
# A CronJob mirrors the READ-ONLY Drive folder "claude" (servable content in
# subfolder "stem claude/files/") onto the NFS content volume every 10 min via
# rclone. rclone is delta-aware: an unchanged run lists ~33 files' metadata and
# transfers nothing, so the schedule is cheap (not a 24MB re-download). nginx
# keeps serving the same volume read-only; updates appear within ~5s (actimeo).
#
# Drive is treated strictly READ-ONLY: scope=drive.readonly and rclone only ever
# reads the remote (sync gdrive: -> /data), never writes back.
#
# TOKEN LONGEVITY: the GCP OAuth app (project home-lab-1700868541205) MUST be
# published to "Production" or its refresh token expires ~weekly and this job
# fails. After publishing, re-mint the token and refresh
# `secret/stem95su.rclone_conf`. A failed run surfaces as a failed Job.
resource "kubernetes_manifest" "rclone_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "stem95su-rclone"
namespace = kubernetes_namespace.stem95su.metadata[0].name
}
spec = {
refreshInterval = "1h"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = { name = "stem95su-rclone" }
data = [{
secretKey = "rclone.conf"
remoteRef = {
key = "stem95su"
property = "rclone_conf"
}
}]
}
}
depends_on = [kubernetes_namespace.stem95su]
}
resource "kubernetes_cron_job_v1" "gdrive_sync" {
metadata {
name = "stem95su-gdrive-sync"
namespace = kubernetes_namespace.stem95su.metadata[0].name
labels = { run = "stem95su", component = "gdrive-sync" }
}
spec {
schedule = "*/10 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 2
failed_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
template {
metadata { labels = { run = "stem95su", component = "gdrive-sync" } }
spec {
restart_policy = "OnFailure"
container {
name = "rclone"
image = "docker.io/rclone/rclone:1.74.3"
# Mirror Drive folder -> /data. Guard: hard-fail on auth/list error
# (so an expired token is visible); skip quietly if the source is
# empty / missing the dashboard (never wipe the live site);
# --max-delete caps catastrophic deletes from a partial listing.
command = ["/bin/sh", "-c", <<-EOT
set -eu
cp /config/rclone.conf /tmp/rc.conf
SRC="gdrive:stem claude/files"
LIST=$(rclone --config /tmp/rc.conf lsf "$SRC" --files-only) || { echo "FATAL: Drive list failed (auth/network)"; exit 1; }
N=$(printf '%s\n' "$LIST" | grep -c . || true)
if [ "$N" -lt 1 ] || ! printf '%s\n' "$LIST" | grep -qx "stem_board.html"; then
echo "GUARD: source N=$N / stem_board.html missing -- skipping, site untouched"; exit 0
fi
echo "source OK ($N files) -- mirroring to /data"
rclone --config /tmp/rc.conf sync "$SRC" /data --exclude ".DS_Store" --fast-list --transfers 4 --max-delete 25 -v
EOT
]
resources {
requests = { cpu = "10m", memory = "64Mi" }
limits = { memory = "192Mi" }
}
volume_mount {
name = "rclone-config"
mount_path = "/config"
read_only = true
}
volume_mount {
name = "content"
mount_path = "/data"
}
}
volume {
name = "rclone-config"
secret { secret_name = "stem95su-rclone" }
}
volume {
name = "content"
persistent_volume_claim {
claim_name = module.nfs_content.claim_name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
depends_on = [kubernetes_manifest.rclone_external_secret]
}
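The shell guard in the CronJob above decides whether a sync may run: the listing must contain at least one non-empty line and an exact `stem_board.html` entry, otherwise the job exits without touching the site. The decision can be sketched in Go as follows. `shouldSync` is a hypothetical name, and the sketch covers only the decision, not the rclone calls or the hard failure on a listing error.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSync mirrors the guard: count non-empty lines (like `grep -c .`) and
// require one line that equals required exactly (like `grep -qx`).
func shouldSync(listing, required string) (int, bool) {
	n := 0
	found := false
	for _, line := range strings.Split(listing, "\n") {
		if line == "" {
			continue
		}
		n++
		if line == required {
			found = true
		}
	}
	return n, n >= 1 && found
}

func main() {
	for _, listing := range []string{
		"stem_board.html\nlesson1.html\n",
		"lesson1.html\nold_stem_board.html\n",
		"",
	} {
		n, ok := shouldSync(listing, "stem_board.html")
		fmt.Printf("files=%d sync=%v\n", n, ok)
	}
}
```

Requiring a known file rather than just a non-empty listing is what keeps a partial or wrong-folder listing from mirroring over the live content.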

@ -1,9 +1,173 @@
# stem95su moved OFF-INFRA to Cloudflare Pages (ADR-0018 cutover, 2026-07-03);
# registry entry `stem95su` in stacks/valia-sites; runbook
# docs/runbooks/valia-sites.md. This stack intentionally declares NOTHING:
# the apply that landed this file destroyed the old in-cluster serving
# (nginx + NFS content PVC + ingress + per-site gdrive-sync CronJob +
# namespace). Directory kept only so the destroy could run through CI;
# safe to delete the dir + its PG state schema in a later cleanup.
# Harmless leftovers (manual cleanup if ever wanted): /srv/nfs/stem-site on
# the PVE host, and Vault secret/stem95su (superseded by secret/valia-sites).
# STEM educational platform for 95. СУ "Проф. Иван Шишманов" (Sofia).
# Public, open static site at stem95su.viktorbarzin.me. Self-contained HTML
# pages + media authored externally (Gemini exports), served by a stock nginx
# straight off the PVE host NFS; NOT baked into an image, so content can be
# updated out-of-band (Nextcloud "PVE NFS Pool" or rsync to /srv/nfs/stem-site)
# without a rebuild. Auto-backed-up offsite by the existing nfs-mirror job.
resource "kubernetes_namespace" "stem95su" {
metadata {
name = "stem95su"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.stem95su.metadata[0].name
tls_secret_name = var.tls_secret_name
}
# Content lives on the PVE host NFS. NOTE: the nfs_volume module creates only
# the K8s PV+PVC; the export subdir (/srv/nfs/stem-site) must already exist on
# 192.168.1.127 or the pod fails to mount (mount.nfs exit 32). It is created
# during deploy and re-created on demand if ever lost.
module "nfs_content" {
source = "../../modules/kubernetes/nfs_volume"
name = "stem95su-content"
namespace = kubernetes_namespace.stem95su.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/stem-site"
storage = "1Gi"
access_modes = ["ReadWriteMany"]
}
# Minimal nginx server block: serve the static dir, with the dashboard
# (stem_board.html) as the directory index so "/" loads the platform home.
# All other pages/assets are reached by their exact filenames (the dashboard
# links to them by name; those must not be renamed).
resource "kubernetes_config_map" "nginx_conf" {
metadata {
name = "stem95su-nginx-conf"
namespace = kubernetes_namespace.stem95su.metadata[0].name
}
data = {
"default.conf" = <<-EOT
server {
listen 80;
server_name _;
root /usr/share/nginx/html;
index stem_board.html index.html;
}
EOT
}
}
resource "kubernetes_deployment" "stem95su" {
metadata {
name = "stem95su"
namespace = kubernetes_namespace.stem95su.metadata[0].name
labels = {
run = "stem95su"
tier = local.tiers.aux
}
}
spec {
replicas = 1
selector {
match_labels = {
run = "stem95su"
}
}
template {
metadata {
labels = {
run = "stem95su"
}
}
spec {
container {
image = "nginx:1.28-alpine"
name = "nginx"
resources {
limits = {
memory = "64Mi"
}
requests = {
cpu = "10m"
memory = "64Mi"
}
}
port {
container_port = 80
}
volume_mount {
name = "content"
mount_path = "/usr/share/nginx/html"
read_only = true
}
volume_mount {
name = "nginx-conf"
mount_path = "/etc/nginx/conf.d"
read_only = true
}
readiness_probe {
http_get {
path = "/"
port = 80
}
initial_delay_seconds = 3
period_seconds = 10
}
}
volume {
name = "content"
persistent_volume_claim {
claim_name = module.nfs_content.claim_name
}
}
volume {
name = "nginx-conf"
config_map {
name = kubernetes_config_map.nginx_conf.metadata[0].name
}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "stem95su" {
metadata {
name = "stem95su"
namespace = kubernetes_namespace.stem95su.metadata[0].name
labels = {
run = "stem95su"
}
}
spec {
selector = {
run = "stem95su"
}
port {
name = "http"
port = "80"
target_port = "80"
}
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "none": public static educational site for 95. СУ, open to the internet by design; CrowdSec + ai-bot-block gate bots; no login.
auth = "none"
namespace = kubernetes_namespace.stem95su.metadata[0].name
name = "stem95su"
service_name = kubernetes_service.stem95su.metadata[0].name
port = "80"
host = "stem95su"
dns_type = "proxied"
tls_secret_name = var.tls_secret_name
}

@ -0,0 +1,9 @@
variable "tls_secret_name" {
type = string
sensitive = true
}
variable "nfs_server" {
type = string
default = "192.168.1.127"
}

@ -1,53 +0,0 @@
# One-shot adoption of the live tasks-stack resources that exist in-cluster but
# were never persisted to Terraform state: pipeline 477 (2026-07-03, the stack's
# first apply) died mid-`[tasks] apply` after creating the resources, before
# the pg backend write, so `tasks.states` stayed empty and every later apply
# would create-fail with `namespaces "tasks" already exists` (same class as the
# monitoring alert-digest adoption in stacks/monitoring/imports.tf). Importing
# reconciles them into state so `terraform apply` UPDATES instead of failing to
# create. These blocks are idempotent (a no-op once the resources are in state)
# and may be removed after the next green apply. Defs: main.tf.
# (module.ingress_icons is deliberately NOT here: it does not exist live yet;
# the same apply creates it.)
import {
to = kubernetes_namespace.tasks
id = "tasks"
}
import {
to = kubernetes_manifest.external_secret
id = "apiVersion=external-secrets.io/v1,kind=ExternalSecret,namespace=tasks,name=tasks-secrets"
}
import {
to = kubernetes_manifest.db_external_secret
id = "apiVersion=external-secrets.io/v1,kind=ExternalSecret,namespace=tasks,name=tasks-db-creds"
}
import {
to = kubernetes_deployment.tasks
id = "tasks/tasks"
}
import {
to = kubernetes_service.tasks
id = "tasks/tasks"
}
import {
to = kubernetes_network_policy_v1.tasks_ingress
id = "tasks/tasks-ingress"
}
import {
to = module.ingress.kubernetes_ingress_v1.proxied-ingress
id = "tasks/tasks"
}
# Cloudflare record ID looked up via the API (zone fd2c5dd4 / record for
# tasks.viktorbarzin.me, CNAME to the cfargotunnel target, proxied).
import {
to = module.ingress.cloudflare_record.proxied[0]
id = "fd2c5dd4efe8fe38958944e74d0ced6d/a8e6901a074c5255d09700d93eaaf705"
}
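
The header comment says these import blocks may be removed after the next green apply. A minimal sketch of how that could be checked first, assuming the output of `terraform state list` (or its Terragrunt equivalent in this repo) has been saved to a file; the helper name `missing_from_state` and the file path are hypothetical, not part of the stack:

```shell
#!/bin/sh
# missing_from_state STATE_FILE ADDRESS...
# Prints every ADDRESS that does not appear as a full line in STATE_FILE,
# where STATE_FILE holds saved `terraform state list` output.
# Hypothetical helper for illustration; not part of the tasks stack.
missing_from_state() {
  state_file=$1
  shift
  for addr in "$@"; do
    # -F: match literally (addresses contain [ ] and .), -x: whole line only
    grep -qxF -- "$addr" "$state_file" || printf '%s\n' "$addr"
  done
}
```

Usage, under the same assumptions: save the state list to `/tmp/tasks.state`, then call `missing_from_state /tmp/tasks.state kubernetes_namespace.tasks 'module.ingress.cloudflare_record.proxied[0]'` with each address from the `to =` lines above. Empty output suggests every import target is already in state.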


@ -1,378 +0,0 @@
variable "image_tag" {
type = string
default = "latest"
description = "tasks image tag. Running tag is set by the Woodpecker deploy (kubectl set image)."
}
variable "postgresql_host" { type = string }
variable "tls_secret_name" {
type = string
sensitive = true
}
locals {
namespace = "tasks"
# ADR-0002: built on GHA from the public GitHub mirror, pushed to ghcr
# (public package, anonymous pulls). Running tag is managed by the
# Woodpecker deploy (kubectl set image); the image ref below is
# ignore_changes'd (KEEL_IGNORE_IMAGE), so this base only matters on
# (re)create.
image = "ghcr.io/viktorbarzin/tasks:${var.image_tag}"
labels = {
app = "tasks"
}
}
resource "kubernetes_namespace" "tasks" {
metadata {
name = local.namespace
labels = {
tier = local.tiers.aux
"istio-injection" = "disabled"
# Opt into Keel auto-update (inject-keel-annotations ClusterPolicy).
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label.
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# App secrets: seed these in Vault before applying:
# secret/tasks
# fernet_key: Fernet key encrypting the per-user Nextcloud app passwords
# stored in the Connected Accounts table (tasks ADR-0002).
#
# DB: CNPG database `tasks` (created in dbaas, null_resource.pg_tasks_db);
# role password managed via the Vault database engine; see
# static-creds/pg-tasks. Alembic runs migrations on app startup (no init
# container needed).
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "tasks-secrets"
namespace = local.namespace
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "tasks-secrets"
template = {
metadata = {
annotations = {
"reloader.stakater.com/match" = "true"
}
}
}
}
data = [
{ secretKey = "TASKS_FERNET_KEY", remoteRef = { key = "tasks", property = "fernet_key" } },
]
}
}
depends_on = [kubernetes_namespace.tasks]
}
# DB credentials from Vault database engine (7-day rotation).
# Builds the asyncpg DSN consumed by the FastAPI app as TASKS_DB_DSN.
# Pre-req in dbaas: CNPG cluster has DB `tasks`, role `tasks`, and Vault
# role `static-creds/pg-tasks`.
resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "tasks-db-creds"
namespace = local.namespace
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-database"
kind = "ClusterSecretStore"
}
target = {
name = "tasks-db-creds"
template = {
metadata = {
annotations = {
"reloader.stakater.com/match" = "true"
}
}
data = {
TASKS_DB_DSN = "postgresql+asyncpg://tasks:{{ .password }}@${var.postgresql_host}:5432/tasks"
DB_PASSWORD = "{{ .password }}"
}
}
}
data = [{
secretKey = "password"
remoteRef = {
key = "static-creds/pg-tasks"
property = "password"
}
}]
}
}
depends_on = [kubernetes_namespace.tasks]
}
resource "kubernetes_deployment" "tasks" {
metadata {
name = "tasks"
namespace = kubernetes_namespace.tasks.metadata[0].name
labels = merge(local.labels, {
tier = local.tiers.aux
})
annotations = {
# Reloader restarts the pod when tasks-secrets / tasks-db-creds change
# (both carry reloader.stakater.com/match=true); required because the
# DB password rotates every 7 days and is read only at startup.
"reloader.stakater.com/search" = "true"
}
}
spec {
# Single leader: the CalDAV sync engine wants one writer per user's
# sync-token cursor; the SPA is served by the same process.
replicas = 1
strategy {
type = "Recreate"
}
selector {
match_labels = local.labels
}
template {
metadata {
labels = local.labels
annotations = {
# Prometheus scrapes the service-endpoints (annotations live on the
# Service below); the pod annotations here let the kubernetes-pods
# SD job also discover /metrics directly.
"prometheus.io/scrape" = "true"
"prometheus.io/path" = "/metrics"
"prometheus.io/port" = "8000"
}
}
spec {
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "tasks"
image = local.image
port {
container_port = 8000
}
# TASKS_FERNET_KEY via tasks-secrets; TASKS_DB_DSN via tasks-db-creds.
env_from {
secret_ref { name = "tasks-secrets" }
}
env_from {
secret_ref { name = "tasks-db-creds" }
}
# Wall-clock zone for all-day due dates (DUE;VALUE=DATE) and the
# Today/Scheduled smart views.
env {
name = "TASKS_LOCAL_TZ"
value = "Europe/Sofia"
}
# SECURITY INVARIANT: DEV_USER must NEVER be set here. It is the
# dev-only identity fallback: when present the backend treats every
# request as that user, bypassing the Authentik forward-auth
# identity (X-authentik-username) entirely. Production identity
# comes ONLY from the header Traefik/Authentik injects.
readiness_probe {
http_get {
path = "/healthz"
port = 8000
}
initial_delay_seconds = 5
period_seconds = 10
}
liveness_probe {
http_get {
path = "/healthz"
port = 8000
}
initial_delay_seconds = 30
period_seconds = 30
}
resources {
requests = { cpu = "100m", memory = "384Mi" }
limits = { memory = "384Mi" }
}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Woodpecker deploy sets the running tag
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
depends_on = [
kubernetes_manifest.external_secret,
kubernetes_manifest.db_external_secret,
]
}
resource "kubernetes_service" "tasks" {
metadata {
name = "tasks"
namespace = kubernetes_namespace.tasks.metadata[0].name
labels = local.labels
annotations = {
# Prometheus kubernetes-service-endpoints SD scrapes /metrics here.
"prometheus.io/scrape" = "true"
"prometheus.io/path" = "/metrics"
"prometheus.io/port" = "8000"
}
}
spec {
type = "ClusterIP"
selector = local.labels
port {
name = "http"
port = 8000
target_port = 8000
}
}
}
# Kyverno ClusterPolicy `sync-tls-secret` auto-clones the wildcard TLS
# secret into every namespace, so we don't need a setup_tls_secret module.
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "required": Authentik forward-auth gates EVERY request. The app
# has no login of its own and blindly trusts the X-authentik-username
# header the outpost injects, so Authentik is the only thing standing
# between strangers and everyone's tasks. Do NOT relax this tier (tasks
# design decision #3; pairs with the NetworkPolicy below, SEC-1).
auth = "required"
dns_type = "proxied"
namespace = kubernetes_namespace.tasks.metadata[0].name
name = "tasks"
port = 8000
tls_secret_name = var.tls_secret_name
}
# Carve-out for the PWA icon assets + web manifest. macOS Safari's
# "Add to Dock" (and every other OS icon fetcher: iOS Add-to-Home-Screen,
# Android install prompt) fetches these in a cookie-less context; behind
# forward-auth it got the Authentik 302 and fell back to a letter monogram.
# Traefik prioritises these longer path prefixes over the main "/" router,
# so ONLY these five static files bypass Authentik; the SPA shell and /api
# stay gated by the main ingress above (and the app itself 401s /api
# without the identity header). Guarded against regression by the
# tasks-icons entry in the Authentik walling-off probe
# (stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf).
module "ingress_icons" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "none": public static icons + manifest, no user data; required for
# OS icon fetchers (Safari Add-to-Dock etc.) that carry no session and
# cannot complete the Authentik redirect dance.
auth = "none"
namespace = kubernetes_namespace.tasks.metadata[0].name
name = "tasks-icons"
service_name = kubernetes_service.tasks.metadata[0].name
port = 8000
ingress_path = [
"/apple-touch-icon.png",
"/favicon.png",
"/pwa-192x192.png",
"/pwa-512x512.png",
"/manifest.webmanifest",
]
full_host = "tasks.viktorbarzin.me" # MUST match the main ingress host; otherwise the factory derives tasks-icons.viktorbarzin.me and the carve-out never matches.
dns_type = "none" # host record already owned by the main tasks ingress
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false # Five static icons + a manifest; nothing for scrapers to mine.
homepage_enabled = false # path carve-out, not its own dashboard tile
}
# --- NetworkPolicy: scoped pod ingress (security-review finding SEC-1). ---
# The app trusts X-authentik-username unconditionally, so its ENTIRE auth
# model depends on requests only ever arriving through Traefik (where the
# Authentik forward-auth middleware sets that header). Any pod that could
# reach the pod IP directly could spoof the header and read/write anyone's
# tasks; hence ingress is restricted to:
# - TCP/8000 from the traefik namespace (user traffic, post-forward-auth);
# - TCP/8000 from the monitoring namespace (Prometheus /metrics scrape).
# The cluster has no default-deny, so this NP only takes effect inside the
# tasks ns; pods elsewhere remain unaffected. (Same shape as
# chrome-service's chrome-service-ws-ingress.)
resource "kubernetes_network_policy_v1" "tasks_ingress" {
metadata {
name = "tasks-ingress"
namespace = kubernetes_namespace.tasks.metadata[0].name
}
spec {
pod_selector {
match_labels = local.labels
}
policy_types = ["Ingress"]
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "traefik"
}
}
}
ports {
port = "8000"
protocol = "TCP"
}
}
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "monitoring"
}
}
}
ports {
port = "8000"
protocol = "TCP"
}
}
}
}


@ -1,23 +0,0 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
dependency "vault" {
config_path = "../vault"
skip_outputs = true
}
dependency "external-secrets" {
config_path = "../external-secrets"
skip_outputs = true
}
inputs = {
# Override per-deploy in CI / commit.
image_tag = "latest"
}


@ -873,14 +873,6 @@ resource "kubernetes_cluster_role" "ingress_dns_sync" {
resources = ["services"]
verbs = ["get", "list"]
}
# Read the Valia-sites internal-DNS feed (written by stacks/valia-sites,
# ADR-0018) so the sync can reconcile off-infra Pages CNAMEs declaratively.
rule {
api_groups = [""]
resources = ["configmaps"]
resource_names = ["valia-sites-dns"]
verbs = ["get"]
}
}
resource "kubernetes_cluster_role_binding" "ingress_dns_sync" {
@ -1010,42 +1002,6 @@ resource "kubernetes_cron_job_v1" "technitium_ingress_dns_sync" {
echo "mail-auth: MX present"
fi
# Valia sites (ADR-0018): off-infra Cloudflare Pages sites.
# The internal zone is authoritative (superset rule above), so
# these public-only names must exist here or every internal
# client NXDOMAINs on them. Reconciled DECLARATIVELY from the
# ConfigMap valia-sites-dns (written by stacks/valia-sites):
# ensure/update every entry, and DELETE stale records that
# left the map (site retired/renamed). Deletion is scoped to
# CNAMEs targeting *.pages.dev; nothing else is ever touched.
# Targets resolve upstream to CF edge IPs; no hairpin involved.
VALIA=$$(kubectl get configmap valia-sites-dns -n technitium -o go-template='{{range $$k, $$v := .data}}{{$$k}} {{$$v}}{{"\n"}}{{end}}' 2>/dev/null || true)
if [ -n "$$VALIA" ]; then
printf '%s\n' "$$VALIA" | while read -r VNAME VTARGET; do
[ -z "$$VNAME" ] && continue
CUR=$$(curl -sf "$$TECH_API/api/zones/records/get?token=$$TOKEN&zone=$$ZONE&domain=$$VNAME.$$ZONE" | grep -o '"cname":"[^"]*"' | head -1 | cut -d'"' -f4)
if [ "$$CUR" = "$$VTARGET" ]; then
echo "valia: $$VNAME.$$ZONE ok"
continue
fi
if [ -n "$$CUR" ]; then
curl -sf -G "$$TECH_API/api/zones/records/delete" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$VNAME.$$ZONE" --data-urlencode "type=CNAME" --data-urlencode "cname=$$CUR" > /dev/null || true
fi
R=$$(curl -sf -G "$$TECH_API/api/zones/records/add" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$VNAME.$$ZONE" --data-urlencode "type=CNAME" --data-urlencode "cname=$$VTARGET" --data-urlencode "ttl=3600") || true
echo "$$R" | grep -q '"status":"ok"' && echo "valia: set $$VNAME.$$ZONE -> $$VTARGET" || echo "valia: FAILED $$VNAME.$$ZONE -- $$R"
done
# Deletion pass: zone CNAMEs targeting *.pages.dev that are
# no longer in the map. ZONE_DUMP predates this run's adds,
# but just-set names are in $VALIA so they're never deleted.
printf '%s' "$$ZONE_DUMP" | tr ',' '\n' | awk -F'"' '/"name":/{n=$$4} /"cname":/{print n" "$$4}' | grep '\.pages\.dev *$$' | while read -r RNAME RTARGET; do
SHORT=$${RNAME%%.$$ZONE}
printf '%s\n' "$$VALIA" | grep -q "^$$SHORT " && continue
curl -sf -G "$$TECH_API/api/zones/records/delete" --data-urlencode "token=$$TOKEN" --data-urlencode "zone=$$ZONE" --data-urlencode "domain=$$RNAME" --data-urlencode "type=CNAME" --data-urlencode "cname=$$RTARGET" > /dev/null && echo "valia: removed stale $$RNAME -> $$RTARGET"
done
else
echo "valia: CM valia-sites-dns absent/unreadable -- skipping Pages CNAMEs this run"
fi
# Pin the .lan ingress anchor A record to the LIVE Traefik LB IP.
# *.viktorbarzin.lan ingress hosts CNAME to ingress.viktorbarzin.lan,
# so a Traefik LB IP move that misses the .lan zone silently breaks


@ -119,41 +119,6 @@ resource "kubernetes_manifest" "middleware_local_only" {
depends_on = [helm_release.traefik]
}
# IP allowlist for household access across ALL home sites: Sofia LAN + the
# WireGuard spoke LANs (London, Valchedrym) + 10/8 (VLANs, K8s pods/services,
# WG tunnel IPs). Deliberately a SEPARATE middleware from `local-only`:
# widening local-only would grant the remote LANs access to the admin surfaces
# that use it (Prometheus, iDRAC, Loki, ...). Use for family-facing services
# (e.g. the immich-frame kiosks) that every household device may open but the
# public internet must not. Pair with ingress_factory `dns_type = "internal"`:
# a Cloudflare-proxied record would deliver public traffic from cloudflared
# POD IPs (inside 10/8) and silently bypass this allowlist.
resource "kubernetes_manifest" "middleware_home_lans_only" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "home-lans-only"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
ipAllowList = {
sourceRange = [
"192.168.1.0/24", # Sofia LAN (hub site)
"10.0.0.0/8", # VLANs, K8s pod/svc CIDRs, WG tunnel subnet
"192.168.8.0/24", # London LAN (via WG tunnel)
"192.168.9.0/24", # London GUEST net the Portal Plus actually leases here (Portal-75AE8F9C2A8A = 192.168.9.198)
"192.168.0.0/24", # Valchedrym LAN (via WG tunnel)
"fc00::/7",
"fe80::/10",
]
}
}
}
depends_on = [helm_release.traefik]
}
# HTTPS redirect middleware
resource "kubernetes_manifest" "middleware_redirect_https" {
manifest = {
@ -403,33 +368,6 @@ resource "kubernetes_manifest" "middleware_authentik_rate_limit" {
depends_on = [helm_release.traefik]
}
# Dawarich-specific rate limit. The Rails app serves all its fingerprinted
# assets itself (JS/CSS chunks, SVG store badges, favicons, webmanifest) and
# the map view adds a points/API burst on load; a single page load from one
# client IP blows past the default 10/50 limiter and 429s the asset tail
# (seventh instance of the burst pattern, after ha-sofia, ActualBudget, noVNC,
# tripit, health and authentik). Background location ingestion (OwnTracks
# bridge + mobile api_key POSTs) rides the same host, so 429s here also risk
# dropped pings. Burst absorbs a couple of full page loads back-to-back.
resource "kubernetes_manifest" "middleware_dawarich_rate_limit" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "dawarich-rate-limit"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
rateLimit = {
average = 100
burst = 1000
}
}
}
depends_on = [helm_release.traefik]
}
# Compress responses to clients at the entrypoint level (outermost).
# Applied at websecure entrypoint so all responses get compressed.
# Uses includedContentTypes (whitelist) instead of excludedContentTypes:


@ -175,12 +175,6 @@ locals {
STORY_SOURCE_MODE = "web"
SCRIPT_WRITER_MODE = "chat"
PLACE_RESOLVER_MODE = "wikipedia"
# Saved Place preview photos (tripit ADR-0035/0040): the Wikipedia lead-image
# fetcher behind manual-add-time photos and the backfill sweep. Same
# fake-default gap as the resolver above: never set, so prod silently ran the
# fake and hand-added places (and any backfill) would store placeholder
# PNGs instead of real photos.
PLACE_PHOTO_PROVIDER = "wikipedia"
}
}


@ -1,368 +0,0 @@
# Valia sites (ADR-0018): small static sites authored by Valia in Google Drive,
# served OFF-INFRA on Cloudflare Pages, mirrored by the in-cluster CronJob below
# every 10 minutes. Registering a new site = one entry in local.sites (plus
# Valia sharing the folder with vbarzin@gmail.com). Full runbook:
# docs/runbooks/valia-sites.md
#
# Per site this stack fans out:
# - cloudflare_pages_project + custom domain <name>.viktorbarzin.me
# - public proxied CNAME <name> -> <project>.pages.dev (manage_dns gate)
# - internal split-horizon CNAME via ConfigMap valia-sites-dns consumed by
# the technitium-ingress-dns-sync script (declarative: add/update/REMOVE)
# - a slot in the shared sync CronJob (rclone mirror -> wrangler deploy)
locals {
cloudflare_account_id = "02e035473cfc4834fb10c5d35470d8b4" # vbarzin@gmail.com's account (not a secret)
# THE site registry. Keys are the public subdomain (English, Viktor picks;
# see CONTEXT.md "Valia site"). folder_id = the Drive folder Valia shared (the
# Content folder); src_path = subfolder holding servable files ("" = root);
# entry_file = what / must serve (staged as index.html at deploy time).
# manage_dns = false parks a site's public CNAME + internal record while the
# name is still owned elsewhere (used for the stem95su ingress cutover).
sites = {
bridge = {
folder_id = "1YWwAtSTsJD9HOzckGRIFXigWqCgYSGEa" # "мост" (Bulgarian for "bridge"), ОбУ Отец Паисий
src_path = ""
entry_file = "index.html"
manage_dns = true
}
stem95su = {
folder_id = "1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_" # "claude" 95. СУ STEM board
src_path = "stem claude/files"
entry_file = "stem_board.html"
manage_dns = true
}
}
dns_managed_sites = { for k, v in local.sites : k => v if v.manage_dns }
}
# ---------------------------------------------------------------------------
# Cloudflare Pages: project + custom domain per site
# ---------------------------------------------------------------------------
resource "cloudflare_pages_project" "site" {
for_each = local.sites
account_id = local.cloudflare_account_id
name = each.key
production_branch = "main"
}
# bridge was created by hand (wrangler) on 2026-07-03; adopt, don't recreate.
import {
to = cloudflare_pages_project.site["bridge"]
id = "02e035473cfc4834fb10c5d35470d8b4/bridge"
}
resource "cloudflare_pages_domain" "site" {
for_each = local.sites
account_id = local.cloudflare_account_id
project_name = cloudflare_pages_project.site[each.key].name
domain = "${each.key}.viktorbarzin.me"
}
import {
to = cloudflare_pages_domain.site["bridge"]
id = "02e035473cfc4834fb10c5d35470d8b4/bridge/bridge.viktorbarzin.me"
}
# Public proxied CNAME. Gated on manage_dns: a site whose name is still served
# by an in-cluster ingress keeps its ingress_factory record until cutover
# (two records can't share one name).
resource "cloudflare_record" "site" {
for_each = local.dns_managed_sites
zone_id = var.cloudflare_zone_id
name = each.key
content = cloudflare_pages_project.site[each.key].subdomain
type = "CNAME"
proxied = true
ttl = 1
}
# bridge's record predates this stack (created 2026-07-03 in stacks/cloudflared,
# handed off via removed{} there); adopt by id.
import {
to = cloudflare_record.site["bridge"]
id = "fd2c5dd4efe8fe38958944e74d0ced6d/ff4fb6f4900744d4b22de50d3fdd219b"
}
# ---------------------------------------------------------------------------
# Internal split-horizon DNS feed (docs/architecture/dns.md "superset rule"):
# the technitium-ingress-dns-sync script reads this CM and reconciles internal
# CNAMEs for every entry, including deleting stale *.pages.dev records when
# an entry disappears (site retired/renamed).
# ---------------------------------------------------------------------------
resource "kubernetes_config_map" "valia_sites_dns" {
metadata {
name = "valia-sites-dns"
namespace = "technitium"
labels = { "app.kubernetes.io/managed-by" = "valia-sites" }
}
data = { for k, v in local.dns_managed_sites : k => cloudflare_pages_project.site[k].subdomain }
}
# ---------------------------------------------------------------------------
# The shared sync CronJob
# ---------------------------------------------------------------------------
resource "kubernetes_namespace" "valia_sites" {
metadata {
name = "valia-sites"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Secrets: shared drive.readonly rclone conf + the SCOPED CF Pages token
# (Pages Read/Write only; the Global API Key never enters a pod).
resource "kubernetes_manifest" "sync_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "valia-sites-sync"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
spec = {
refreshInterval = "1h"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = { name = "valia-sites-sync" }
data = [
{
secretKey = "rclone.conf"
remoteRef = { key = "valia-sites", property = "rclone_conf" }
},
{
secretKey = "CLOUDFLARE_API_TOKEN"
remoteRef = { key = "valia-sites", property = "cloudflare_pages_token" }
},
{
secretKey = "CLOUDFLARE_ACCOUNT_ID"
remoteRef = { key = "valia-sites", property = "account_id" }
},
]
}
}
depends_on = [kubernetes_namespace.valia_sites]
}
# Site registry rendered for the job (folder ids aren't secrets).
resource "kubernetes_config_map" "sync_config" {
metadata {
name = "valia-sites-config"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
data = {
"sites.json" = jsonencode(local.sites)
}
}
# Last-deployed manifest hash per site, written by the job (merge-patch), so
# TF must never fight it over data.
resource "kubernetes_config_map" "sync_state" {
metadata {
name = "valia-sites-state"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
data = {}
lifecycle {
ignore_changes = [data]
}
}
resource "kubernetes_service_account" "sync" {
metadata {
name = "valia-sites-sync"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
}
resource "kubernetes_role" "sync_state" {
metadata {
name = "valia-sites-sync-state"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
rule {
api_groups = [""]
resources = ["configmaps"]
resource_names = ["valia-sites-state"]
verbs = ["get", "patch"]
}
}
resource "kubernetes_role_binding" "sync_state" {
metadata {
name = "valia-sites-sync-state"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.sync_state.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.sync.metadata[0].name
namespace = kubernetes_namespace.valia_sites.metadata[0].name
}
}
resource "kubernetes_cron_job_v1" "sync" {
metadata {
name = "valia-sites-sync"
namespace = kubernetes_namespace.valia_sites.metadata[0].name
labels = { app = "valia-sites", component = "sync" }
}
spec {
schedule = "*/10 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 2
failed_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
template {
metadata { labels = { app = "valia-sites", component = "sync" } }
spec {
restart_policy = "OnFailure"
service_account_name = kubernetes_service_account.sync.metadata[0].name
container {
name = "sync"
image = "ghcr.io/viktorbarzin/valia-sites-sync:latest"
# Guards mirror stem95su's proven set: hard-fail on Drive
# list/auth errors (visible as a failed Job, the chosen
# visibility, ADR-0018), skip quietly when a folder is empty or
# missing its entry file (never wipe a live site), capped
# deletes. Deploy ONLY on remote-manifest change: CF Pages caps
# monthly deployments on the free tier, so 144 no-op
# deploys/day is not an option.
command = ["/bin/sh", "-c", <<-EOT
set -u
cp /config/rclone.conf /tmp/rc.conf
APISERVER="https://kubernetes.default.svc"
SA=/var/run/secrets/kubernetes.io/serviceaccount
KTOKEN=$$(cat $$SA/token); NS=$$(cat $$SA/namespace)
STATE_URL="$$APISERVER/api/v1/namespaces/$$NS/configmaps/valia-sites-state"
FAILED=0
for SITE in $$(jq -r 'keys[]' /sites/sites.json); do
FOLDER=$$(jq -r --arg s "$$SITE" '.[$$s].folder_id' /sites/sites.json)
SRC_PATH=$$(jq -r --arg s "$$SITE" '.[$$s].src_path' /sites/sites.json)
ENTRY=$$(jq -r --arg s "$$SITE" '.[$$s].entry_file' /sites/sites.json)
RC="rclone --config /tmp/rc.conf --drive-root-folder-id=$$FOLDER --drive-skip-gdocs"
# 1. Remote manifest (path;hash;size): metadata only, no download.
MANIFEST=$$($$RC lsf "gdrive:$$SRC_PATH" -R --files-only --format phs 2>/tmp/lsf.err) || {
echo "FATAL [$$SITE]: Drive list failed (auth/network):"; cat /tmp/lsf.err; FAILED=1; continue; }
N=$$(printf '%s\n' "$$MANIFEST" | grep -c . || true)
if [ "$$N" -lt 1 ] || ! printf '%s\n' "$$MANIFEST" | cut -d';' -f1 | grep -qx "$$ENTRY"; then
echo "GUARD [$$SITE]: N=$$N / $$ENTRY missing -- skipping, site untouched"; continue
fi
# Cloudflare Pages hard-caps files at 25 MB; deploying
# without an oversize file would silently break the pages
# that reference it, so skip the whole site instead (last
# deployed content keeps serving) and say so loudly.
OVERSIZE=$$(printf '%s\n' "$$MANIFEST" | awk -F';' '$$3 > 26214400 {print $$1" ("$$3" B)"}')
if [ -n "$$OVERSIZE" ]; then
echo "GUARD [$$SITE]: file(s) exceed the 25MB Pages limit -- skipping, site untouched:"; echo "$$OVERSIZE"; continue
fi
HASH=$$(printf '%s' "$$MANIFEST" | sha256sum | cut -d' ' -f1)
LAST=$$(curl -sf --cacert $$SA/ca.crt -H "Authorization: Bearer $$KTOKEN" "$$STATE_URL" | jq -r --arg s "$$SITE" '.data[$$s] // ""')
if [ "$$HASH" = "$$LAST" ]; then echo "OK [$$SITE]: unchanged"; continue; fi
# 2. Content changed: pull and deploy.
$$RC sync "gdrive:$$SRC_PATH" "/work/$$SITE" --exclude ".DS_Store" --fast-list --transfers 4 --max-delete 25 -v || {
echo "FATAL [$$SITE]: rclone sync failed"; FAILED=1; continue; }
if [ "$$ENTRY" != "index.html" ]; then
cp "/work/$$SITE/$$ENTRY" "/work/$$SITE/index.html"
fi
wrangler pages deploy "/work/$$SITE" --project-name="$$SITE" --branch=main --commit-dirty=true || {
echo "FATAL [$$SITE]: wrangler deploy failed"; FAILED=1; continue; }
curl -sf --cacert $$SA/ca.crt -H "Authorization: Bearer $$KTOKEN" \
-X PATCH -H "Content-Type: application/merge-patch+json" \
-d "{\"data\":{\"$$SITE\":\"$$HASH\"}}" "$$STATE_URL" > /dev/null || {
echo "WARN [$$SITE]: state patch failed (will redeploy next run)"; FAILED=1; }
echo "DEPLOYED [$$SITE]: $$HASH"
done
exit $$FAILED
EOT
]
env {
name = "CLOUDFLARE_API_TOKEN"
value_from {
secret_key_ref {
name = "valia-sites-sync"
key = "CLOUDFLARE_API_TOKEN"
}
}
}
env {
name = "CLOUDFLARE_ACCOUNT_ID"
value_from {
secret_key_ref {
name = "valia-sites-sync"
key = "CLOUDFLARE_ACCOUNT_ID"
}
}
}
resources {
requests = { cpu = "25m", memory = "128Mi" }
limits = { memory = "512Mi" }
}
volume_mount {
name = "rclone-config"
mount_path = "/config"
read_only = true
}
volume_mount {
name = "sites-config"
mount_path = "/sites"
read_only = true
}
volume_mount {
name = "work"
mount_path = "/work"
}
}
volume {
name = "rclone-config"
secret {
secret_name = "valia-sites-sync"
items {
key = "rclone.conf"
path = "rclone.conf"
}
}
}
volume {
name = "sites-config"
config_map { name = kubernetes_config_map.sync_config.metadata[0].name }
}
volume {
name = "work"
empty_dir {}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
depends_on = [kubernetes_manifest.sync_external_secret]
}
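
The per-site guards in the CronJob script above (entry file present, no file above the 25 MB Pages cap, deploy only when the manifest fingerprint changes) can be sketched as standalone shell functions. This is an illustrative sketch, not the job itself: the function names are hypothetical, the manifest is assumed to be in the `path;hash;size` layout that `rclone lsf --format phs` prints, and the entry check here is a literal match where the job uses a regex match.

```shell
#!/bin/sh
# Sketch of the sync job's guard logic over a manifest of "path;hash;size" lines.
# All function names are hypothetical; the threshold mirrors the job's awk filter.
LIMIT=26214400  # 25 * 1024 * 1024 bytes

# has_entry MANIFEST ENTRY: succeeds when ENTRY is one of the listed paths.
has_entry() {
  printf '%s\n' "$1" | cut -d';' -f1 | grep -qxF -- "$2"
}

# oversize MANIFEST: prints "path (N B)" for every file larger than LIMIT.
oversize() {
  printf '%s\n' "$1" | awk -F';' -v max="$LIMIT" '$3 + 0 > max {print $1 " (" $3 " B)"}'
}

# manifest_hash MANIFEST: fingerprint that changes when any path, hash or size changes.
manifest_hash() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# decide MANIFEST ENTRY LAST_HASH
# Prints one of: skip-missing, skip-oversize, unchanged, deploy.
decide() {
  if [ -z "$1" ] || ! has_entry "$1" "$2"; then echo skip-missing; return; fi
  if [ -n "$(oversize "$1")" ]; then echo skip-oversize; return; fi
  if [ "$(manifest_hash "$1")" = "$3" ]; then echo unchanged; else echo deploy; fi
}
```

Ordering the checks this way means a missing entry file or an oversize file leaves the live site untouched before any download happens, and an unchanged fingerprint avoids spending a Pages deployment.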


@ -1,15 +0,0 @@
# valia-sites-sync: everything the 10-min Content-folder mirror needs, baked in
# (no runtime installs — CronJob pods must not apk/npm on every start).
# rclone pinned to match the proven stem95su version; wrangler pinned to major 4.
FROM node:22-alpine
RUN apk add --no-cache curl unzip ca-certificates jq \
&& curl -fsSL https://downloads.rclone.org/v1.74.3/rclone-v1.74.3-linux-amd64.zip -o /tmp/rclone.zip \
&& unzip -j /tmp/rclone.zip '*/rclone' -d /usr/local/bin \
&& chmod +x /usr/local/bin/rclone \
&& rm /tmp/rclone.zip \
&& npm install -g wrangler@4 \
&& npm cache clean --force
# wrangler writes config/cache under $HOME; the CronJob runs as non-root node (uid 1000)
ENV HOME=/tmp


@ -1,8 +0,0 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}


@ -1,3 +0,0 @@
variable "cloudflare_zone_id" {
type = string
}


@ -675,7 +675,6 @@ resource "vault_database_secret_backend_connection" "postgresql" {
"pg-nextcloud-todos",
"pg-technitium",
"pg-goldmane-edges",
"pg-tasks",
]
postgresql {
@ -904,17 +903,6 @@ resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
rotation_period = 604800
}
# tasks PWA (Reminders-style front-end over Nextcloud CalDAV): 7-day rotation
# for the `tasks` CNPG role. Consumed by stacks/tasks via a vault-database
# ExternalSecret -> TASKS_DB_DSN (remoteRef static-creds/pg-tasks).
resource "vault_database_secret_backend_static_role" "pg_tasks" {
backend = vault_mount.database.path
db_name = vault_database_secret_backend_connection.postgresql.name
name = "pg-tasks"
username = "tasks"
rotation_period = 604800
}
# =============================================================================
# Kubernetes Secrets Engine: Dynamic K8s Credentials
# =============================================================================

File diff suppressed because one or more lines are too long