diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index facc4c2a..301e31c5 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -36,7 +36,7 @@ Violations cause state drift, which causes future applies to break or silently r - `auth = "none"` — no Authentik, no own-auth claim. Use for Anubis-fronted content (Anubis is the gate), native-client APIs (Git, `/v2/`, WebDAV/CalDAV, CardDAV), webhook receivers, OAuth callbacks, and Authentik outposts themselves. - **Anti-exposure rule** (the reason `"app"` exists): only pick `"app"` or `"none"` AFTER you've verified the app has its own user auth (`"app"`) OR the endpoint is intentionally public (`"none"`). Default is `"required"` so accidental omission fails closed. **Convention**: when using `"app"` or `"none"`, add a comment line above the `auth = "..."` line stating what gates the app or why it's public. **Enforced by `scripts/tg`**: every `tg plan/apply/destroy/refresh` runs `scripts/check-ingress-auth-comments.py` against the current stack and aborts if any `auth = "app|none"` line lacks the preceding `# auth = "": ...` comment. Stack-scoped — untouched stacks aren't blocked until they're next edited. - **Anti-AI**: on by default when `auth = "none"` or `auth = "app"` (no Authentik to discourage bots); redundant on `"required"` and `"public"`. - - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend). 
+ - **DNS**: `dns_type = "proxied"` (Cloudflare CDN), `"non-proxied"` (direct A/AAAA to the public IP), or `"internal"` (public A record carrying the INTERNAL Traefik LB IP `10.0.20.203` — resolvable everywhere, routable only from home LANs/WG sites/VPN; the record is reachability, NOT a gate — pair with `extra_middlewares = ["traefik-home-lans-only@kubernetescrd"]`, since direct-to-WAN-IP SNI requests still reach Traefik, and NEVER combine that allowlist with `"proxied"` — cloudflared pod source IPs sit inside 10/8 and would bypass it. First users: the immich-frame kiosks, `docs/plans/2026-07-04-immich-frame-lan-only-design.md`). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend). - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://..svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. 
wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering. - **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern. - **Image registry**: **Owned images now live on `ghcr.io/viktorbarzin/`** (ADR-0002, built by GHA — see the CI/CD Architecture section). The **Forgejo container registry is FROZEN + emptied** (break-glass only — `docs/runbooks/forgejo-registry-breakglass.md`); nothing pushes to it. The rest of this bullet documents the **still-live forgejo-pull DNS/mirror machinery** (it remains in place for the break-glass path + because `registry-credentials` is still Kyverno-synced; the hairpin lessons apply to any internal-registry pull). Historical usage was `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. 
**Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **In-cluster pods are ordinary internal clients too** (since 2026-06-10 evening) — CoreDNS's dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`) forwards to the Technitium ClusterIP `10.96.0.53`, so pods get the same split-horizon answers as everyone else; forgejo stays pinned to Traefik's **ClusterIP** in that block (TF-interpolated from the live Service) so CI pushes survive a Technitium outage. 
This relies on a k8s-1.34 behavior verified 2026-06-10: **pods CAN reach the ETP=Local Traefik LB IP** (kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path) — re-verify after major k8s upgrades; canary = the uptime-kuma `[External]` fleet going red. (The block briefly forwarded to `8.8.8.8/1.1.1.1` earlier that day, which kept pods on the WAN IP and the broken TP-Link NAT loopback — 27 non-proxied `[External]` monitors dark; beads code-yh33.) **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). 
Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. @@ -137,7 +137,7 @@ audiobook-search) now also land on ghcr. chrome-service-novnc, android-emulator. - **PRIVATE ghcr:** f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, - infra-ci, k8s-portal. Pulled via the Kyverno-synced `ghcr-credentials` allowlist + infra-ci, k8s-portal, excalidraw-library. Pulled via the Kyverno-synced `ghcr-credentials` allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; NOT cluster-wide; cred = Vault `secret/viktor/ghcr_pull_token`, a dedicated classic PAT scoped to `read:packages` (UI-minted 2026-06-15; no longer the admin `github_pat` @@ -153,7 +153,9 @@ github↔forgejo divergence was deliberately NOT reconciled): `build-cli.yml` → DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli`; `build-infra-ci.yml` → `ghcr.io/viktorbarzin/infra-ci`; `build-k8s-portal.yml` → PRIVATE `ghcr.io/viktorbarzin/k8s-portal` (Keel-deployed; the LAST in-cluster -Woodpecker build, migrated 2026-06-13 — completes "no local builds"). 
**infra-ci** +Woodpecker build, migrated 2026-06-13 — completes "no local builds"); `build-excalidraw.yml` → +PRIVATE `ghcr.io/viktorbarzin/excalidraw-library` (Keel-deployed; replaced +manual DockerHub pushes 2026-07-02 — DockerHub `:v4` frozen as rollback). **infra-ci** is the image the `.woodpecker/default.yml` apply step + `drift-detection.yml` run in (proven by pipelines 165/166). chatterbox-tts is already built by tripit's GHA → ghcr. The Woodpecker `build-ci-image.yml` + `build-cli.yml` pipelines were REMOVED; @@ -216,7 +218,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). | Service | Key Operational Knowledge | |---------|--------------------------| | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe | -| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. 
**Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). 
**Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). 
See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | +| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which had no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). **Since 2026-07-02 the T4 has a scheduler-level VRAM budget + watchdog (ADR-0016)**: each GPU tenant declares `viktorbarzin.me/gpumem` MiB (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; node advertises 14000) and the `gpu-vram-watchdog` (nvidia ns) recycles the biggest over-budget tenant when free VRAM < 1536 MiB — currently **DRY_RUN=true** (observe-only; flip `watchdog_dry_run` in `stacks/nvidia/modules/nvidia/gpu_memory_budget.tf` to arm). KNOWN MISCALIBRATION (2026-07-02): llama-swap's real qwen3-8b@16k resident is ~7 GB (the 4.35 GiB figure was weights-only cudaMalloc), so retune budgets (ctx 16k→8k + llama-swap 6144 + immich-ml 2500, or rebalance) BEFORE arming, else the watchdog would recycle llama-swap first. TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. 
**Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). 
**Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). 
See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` | | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU | | Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. 
Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. **Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login//` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. 
**Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. | @@ -230,7 +232,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). - **Cascade inhibitions** (`inhibit_rules`): `NodeDown` AND `NodeConditionBad`/`NodeDiskPressure` suppress downstream pod-churn alerts (PodCrashLooping/PodImagePullBackOff/PodsStuckContainerCreating/ScrapeTargetDown/*ReplicasMismatch); `T3ProbeLegDown` suppresses `T3ProbeDropBurst` for the same `leg`; plus existing NFS/Traefik/Authentik/Power/Tuya/iDRAC cascades. No `equal` on the node rules (pod alerts carry no `node` label → cluster-wide, like NodeDown). - **ScrapeTargetDown scrapes only Ready endpoints** (relabel `keep __meta_kubernetes_endpoint_ready=true` on both `kubernetes-service-endpoints` jobs) — completed CronJob pods lingering as NotReady EndpointSlice addresses no longer fire phantom "down" alerts (tts/tripit/beads, id=4895). Replaces the old "exclude completed CronJob pods" guidance; a Ready pod with a broken metrics endpoint still fires. - Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable. -- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. 
The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions. +- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever the ingress has a public DNS record (`dns_type` `"proxied"`/`"non-proxied"`; `"internal"` and `"none"` get none — set `external_monitor = false` explicitly on internal-only ingresses so the sync's default opt-in doesn't re-add a doomed monitor; see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions. - **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param). - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). 
**The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable. - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay). diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md index 447620d9..7c84dd3b 100644 --- a/.claude/reference/service-catalog.md +++ b/.claude/reference/service-catalog.md @@ -81,7 +81,7 @@ | ytdlp | YouTube downloader | ytdlp | | wealthfolio | Finance tracking | wealthfolio | | audiobookshelf | Audiobook server (may be merged into ebooks stack) | audiobookshelf | -| paperless-ngx | Document management | paperless-ngx | +| paperless-ngx | Document management. 
Mail ingest: forward document emails to `docs@viktorbarzin.me` — sender maps 1:1 to a paperless account (runbook `paperless-mail-ingest.md`) | paperless-ngx | | jsoncrack | JSON visualizer | jsoncrack | | servarr | Media automation (Sonarr/Radarr/etc) | servarr | | aiostreams | Stremio stream aggregator (Real-Debrid + Torrentio/Comet/StremThru Torz/Knaben; **MediaFusion removed 2026-06-07** — broken upstream `500`). `auth=app` (own UUID+password); stream-probe tests **both series+movie paths** with per-source breakdown (`aiostreams_streams_{comet,torrentio,stremthru_torz,knaben}`) + `aiostreams_error_streams` + `aiostreams_movie_stream_count`, success gated on Comet (workhorse) being alive; weekly NFS config + Stremio-account-collection backups to `/srv/nfs/aiostreams-backup/`. PG-backed user config (Comet timeout bumped 5s→10s 2026-06-07). | servarr/aiostreams | @@ -99,6 +99,7 @@ | tor-proxy | Tor proxy | tor-proxy | | forgejo | Git forge. Open native self-signup (Turnstile captcha + email confirm) + Authentik & GitHub OAuth sign-in; see `docs/runbooks/forgejo-open-signups.md` | forgejo | | freshrss | RSS reader | freshrss | +| drone-logbook | DJI flight-log analyzer (Open DroneLog, upstream image) — dronelog.viktorbarzin.me | drone-logbook | | navidrome | Music streaming | navidrome | | networking-toolbox | Network tools | networking-toolbox | | stirling-pdf | PDF tools | stirling-pdf | @@ -120,7 +121,9 @@ | status-page | Status page | status-page | | plotting-book | Book plotting/world-building app | plotting-book | | tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). 
`auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit | -| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. 
**Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su | +| tasks | Reminders-style tasks PWA over Nextcloud CalDAV (FastAPI + SvelteKit SPA same-origin, single container; code `~/code/tasks`, design `tasks/docs/2026-07-03-tasks-pwa-design.md`). Nextcloud stays the source of truth (VTODOs); the app is the front-end Apple Reminders stopped being. CNPG (`tasks` db, Vault static role `pg-tasks`) stores Connected Accounts — per-user Nextcloud app passwords Fernet-encrypted with `fernet_key` from `secret/tasks`. `auth=required` (Authentik forward-auth; identity = `X-authentik-username`, NO app-level login — `DEV_USER` must never be set in prod) at tasks.viktorbarzin.me (proxied). Exception: the five PWA icon/manifest files (`/apple-touch-icon.png`, `/favicon.png`, `/pwa-192x192.png`, `/pwa-512x512.png`, `/manifest.webmanifest`) are a path-scoped `auth=none` carve-out (`module.ingress_icons`) so cookie-less OS icon fetchers (macOS Safari Add-to-Dock, mobile home-screen installs) get the real icon instead of the Authentik 302; guarded by the `tasks-icons` walloff-probe target. NetworkPolicy `tasks-ingress` (SEC-1) restricts pod ingress to traefik + monitoring namespaces so the trusted header can't be spoofed pod-to-pod. GHA → public ghcr `tasks` → Woodpecker deploy (ADR-0002). | tasks | +| stem95su | STEM educational platform for **95. СУ „Проф. 
Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me — **a Valia site on Cloudflare Pages since 2026-07-03** (ADR-0018): registry entry in `stacks/valia-sites`, synced from Drive folder "claude" every 10 min, deploy-on-change. The old in-cluster stack (nginx off PVE NFS + per-site rclone CronJob) is RETIRED — stacks/stem95su is a tombstone; `secret/stem95su` superseded by `secret/valia-sites`; `stem_video.mp4` was compressed 42.9→21.4MB (25MB Pages cap) with Viktor's OK. See docs/runbooks/valia-sites.md. | — | +| valia-sites | **Valia-site registry + sync** (ADR-0018): all sites authored by Valia serve OFF-INFRA on Cloudflare Pages (`bridge` + `stem95su` live). One map entry in `stacks/valia-sites/main.tf` per site fans out Pages project + custom domain + public CNAME + internal split-horizon CNAME (ConfigMap `valia-sites-dns` → technitium sync, declarative incl. removal). CronJob `valia-sites-sync` (`*/10`, image ghcr `valia-sites-sync`) mirrors each Drive Content folder (rclone `drive.readonly`, stem95su-style guards + 25MB Pages-cap guard) and wrangler-deploys ONLY on manifest change (free-tier deploy cap). Secrets `secret/valia-sites` (shared rclone conf + SCOPED CF Pages token — Global API Key never in pods). Failed-Job-only visibility by choice. Runbook: docs/runbooks/valia-sites.md. | valia-sites | | trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). 
For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek | ## Cloudflare Domains @@ -130,7 +133,7 @@ blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, -travel, netbox, phpipam, tripit, t3, stem95su +travel, netbox, phpipam, tripit, t3, stem95su, tasks ``` ### Non-Proxied (Direct DNS) diff --git a/.github/workflows/build-excalidraw.yml b/.github/workflows/build-excalidraw.yml new file mode 100644 index 00000000..7f58131f --- /dev/null +++ b/.github/workflows/build-excalidraw.yml @@ -0,0 +1,42 @@ +name: Build excalidraw-library + +# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind +# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls +# ghcr:latest and rolls the deployment. Replaces the manual DockerHub pushes +# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image). +on: + push: + branches: [master] + paths: + - 'stacks/excalidraw/project/**' + workflow_dispatch: {} + +permissions: + contents: read + packages: write + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-go@v5 + with: + go-version: '1.21' + - run: go test ./... 
+ working-directory: stacks/excalidraw/project + - uses: docker/setup-buildx-action@v3 + - uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - uses: docker/build-push-action@v6 + with: + context: stacks/excalidraw/project + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/excalidraw-library:latest + ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }} diff --git a/.github/workflows/build-valia-sites-sync.yml b/.github/workflows/build-valia-sites-sync.yml new file mode 100644 index 00000000..090b7f5c --- /dev/null +++ b/.github/workflows/build-valia-sites-sync.yml @@ -0,0 +1,39 @@ +name: Build valia-sites-sync + +# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public). +# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob. +# Rebuilds are rare (tool pins only change deliberately) → dispatch + path. +# Security note: no untrusted event inputs are interpolated anywhere (only +# github.actor / github.sha / GITHUB_TOKEN — same shape as the other +# build-*.yml workflows in this repo). 
+on: + push: + branches: [master] + paths: + - 'stacks/valia-sites/sync-image/**' + workflow_dispatch: {} + +permissions: + contents: read + packages: write + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: docker/setup-buildx-action@v3 + - uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - uses: docker/build-push-action@v6 + with: + context: stacks/valia-sites/sync-image + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/valia-sites-sync:latest + ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }} diff --git a/AGENTS.md b/AGENTS.md index 4e3ea2de..43f06b8e 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -95,7 +95,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro ## Key Paths - `stacks//main.tf` — service definition - `stacks/platform/modules//` — core infra modules -- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"` or `"non-proxied"`) +- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"`, `"non-proxied"`, or `"internal"` — a public A record carrying the internal Traefik LB IP for household-only services; pair with the `home-lans-only` ipAllowList middleware, never with `"proxied"`) - `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount) - `config.tfvars` — non-secret configuration (plaintext) - `secrets.sops.json` — all secrets (SOPS-encrypted JSON) diff --git a/CONTEXT.md b/CONTEXT.md index fa5113d5..76b101d0 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -118,6 +118,14 @@ _Avoid_: "external", "outside". `viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network. _Avoid_: bare "lan", "private", "intranet". 
+**Segment**: +One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q. +_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept). + +**CCTV segment**: +The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017). +_Avoid_: "camera VLAN", "CCTV LAN". + **Ingress auth**: The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed). _Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier. @@ -229,6 +237,20 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**. **Anubis**: A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW). +### Externally-authored sites + +**Valia site**: +A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. 
Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`. +_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**. + +**Content folder**: +The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site. +_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root). + +**Entry file**: +The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring. +_Avoid_: asking Valia to rename her files to fit hosting conventions. + ## Relationships - A **Service** is defined by exactly one **Stack** — **flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads. @@ -240,6 +262,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content - A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither. - An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**. - Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store. 
+- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra. ## Example dialogue diff --git a/cli/VERSION b/cli/VERSION index fd2726c9..87a1cf59 100644 --- a/cli/VERSION +++ b/cli/VERSION @@ -1 +1 @@ -v0.11.0 +v0.12.0 diff --git a/cli/cmd_memory.go b/cli/cmd_memory.go index 7ae11ea0..129d07b2 100644 --- a/cli/cmd_memory.go +++ b/cli/cmd_memory.go @@ -30,11 +30,21 @@ func memoryCommands() []Command { } } -// printMemories renders a {memories:[…]} response as compact lines, or raw JSON. +// printMemories renders a {memories:[…]} response as one line per memory, or raw JSON. func printMemories(raw []byte, jsonOut bool) error { + fmt.Print(renderMemories(raw, jsonOut)) + return nil +} + +// renderMemories formats each memory as a single line with its FULL content +// (newlines flattened to spaces). Content is deliberately never truncated: the +// old 240-rune preview cut memories mid-sentence, misled agents into believing +// no full-content read-back existed, and made blind `update --content` from +// the preview silently destroy the stored tail. Full passthrough also can't +// produce invalid UTF-8 (the old mid-rune cut crashed the recall hook). 
+func renderMemories(raw []byte, jsonOut bool) string { if jsonOut { - fmt.Println(string(raw)) - return nil + return string(raw) + "\n" } var r struct { Memories []struct { @@ -46,36 +56,20 @@ func printMemories(raw []byte, jsonOut bool) error { } `json:"memories"` } if err := json.Unmarshal(raw, &r); err != nil { - fmt.Println(string(raw)) - return nil + return string(raw) + "\n" } if len(r.Memories) == 0 { - fmt.Println("(no memories)") - return nil + return "(no memories)\n" } + var b strings.Builder for _, m := range r.Memories { - c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240) - fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) + c := strings.ReplaceAll(m.Content, "\n", " ") + fmt.Fprintf(&b, "#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) if m.Tags != "" { - fmt.Printf(" tags: %s\n", m.Tags) + fmt.Fprintf(&b, " tags: %s\n", m.Tags) } } - return nil -} - -// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it -// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240] -// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte -// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict -// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit -// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit -// hook error" for Cyrillic-language users. 
-func truncatePreview(s string, maxRunes int) string { - r := []rune(s) - if len(r) <= maxRunes { - return s - } - return string(r[:maxRunes]) + "…" + return b.String() } func memoryRecall(args []string) error { diff --git a/cli/memory_test.go b/cli/memory_test.go index 1c673c7b..ee21ad12 100644 --- a/cli/memory_test.go +++ b/cli/memory_test.go @@ -8,25 +8,53 @@ import ( "unicode/utf8" ) -func TestTruncatePreviewKeepsValidUTF8(t *testing.T) { - // Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits - // invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must - // cut on a rune boundary and always stay valid UTF-8. - long := strings.Repeat("я", 300) // 300 runes / 600 bytes - got := truncatePreview(long, 240) +func TestRenderMemoriesFullContent(t *testing.T) { + // The pretty view must NOT truncate content: the old 240-rune preview cut + // memories mid-sentence, misled agents into thinking no full-content + // read-back existed, and made blind `update --content` from the preview + // destroy the stored tail. Full passthrough also removes the mid-rune-cut + // invalid-UTF-8 class by construction — nothing is ever sliced. 
+ long := strings.Repeat("я", 300) + strings.Repeat("a", 300) + raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{ + {"id": 7, "content": long, "category": "facts", "tags": "t1,t2", "importance": 0.7}, + }}) + got := renderMemories(raw, false) + if !strings.Contains(got, long) { + t.Fatalf("content was truncated: %q", got) + } + if strings.Contains(got, "…") { + t.Fatalf("ellipsis in output — truncation still active: %q", got) + } if !utf8.ValidString(got) { - t.Fatalf("truncatePreview produced invalid UTF-8: %q", got) + t.Fatalf("invalid UTF-8 in output: %q", got) } - if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' { - t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r)) + if !strings.Contains(got, "#7 [facts] (0.70) ") || !strings.Contains(got, "tags: t1,t2") { + t.Fatalf("line format broken: %q", got) } - // Short multibyte strings pass through untouched (no ellipsis). - if got := truncatePreview("кратко", 240); got != "кратко" { - t.Fatalf("short string altered: %q", got) +} + +func TestRenderMemoriesFlattensNewlinesToOneLine(t *testing.T) { + // Consumers (the recall hook, terminal skims) rely on one memory per line; + // multi-line content is flattened, never split across lines. + raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{ + {"id": 1, "content": "line one\nline two\nline three", "category": "facts", "importance": 0.5}, + }}) + got := renderMemories(raw, false) + if !strings.Contains(got, "line one line two line three") { + t.Fatalf("newlines not flattened: %q", got) } - // ASCII boundary still works. 
- if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" { - t.Fatalf("ascii truncation wrong: %q", got) +} + +func TestRenderMemoriesEdgeCases(t *testing.T) { + if got := renderMemories([]byte(`{"memories":[]}`), false); got != "(no memories)\n" { + t.Fatalf("empty list: %q", got) + } + // --json and unparseable responses pass through raw. + if got := renderMemories([]byte(`{"x":1}`), true); got != "{\"x\":1}\n" { + t.Fatalf("json passthrough: %q", got) + } + if got := renderMemories([]byte(`not json`), false); got != "not json\n" { + t.Fatalf("unparseable passthrough: %q", got) } } diff --git a/config.tfvars b/config.tfvars index 790a48ae..9ce566ed 100644 Binary files a/config.tfvars and b/config.tfvars differ diff --git a/docs/adr/0017-cctv-physical-cabling.svg b/docs/adr/0017-cctv-physical-cabling.svg new file mode 100644 index 00000000..6088f9e3 --- /dev/null +++ b/docs/adr/0017-cctv-physical-cabling.svg @@ -0,0 +1,126 @@ + + + + + + + + + + + ADR-0017 — physical cabling (single-switch, rev 3) + wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio + + + + APARTMENT + + ☁ ISP (internet) + + + + AX6000 router + 192.168.1.1 · WAN←ISP · 8×LAN + + + Synology NAS · .13 + on an AX6000 LAN port + + + 📶 wifi clients (phones, laptops) + + + + + in-wall run → garage + + + + GARAGE — RACK + + + + TL-SG105PE · 5-port gigabit PoE switch + mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare) + + + P1 + ← apartment + + P2 + ← 4G router + + P3 + ← UPS mgmt + + P4 ⚡PoE + ← camera + + P5 + ← R730 eno1 + + every cable below re-plugs old-switch → PE on camera day (≈3 min) + + + + 4G router · 192.168.1.7 + ~cellular uplink (out-of-band) + + + 📡 cellular + + + + UPS (Huawei) + network mgmt card + + + + + Dell R730 · PVE host · 192.168.1.127 + + + eno1 · LAN1 + ← switch P5 · 1GbE + + eno2 · LAN2 + dark · fallback leg + + eno3 / eno4 + free, uncabled + + iDRAC · .4 + shared-LOM/eno1 + + no other 
network cables — everything else on this host is VIRTUAL: + pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM … + (power: host + switch fed from the UPS — power wiring not drawn) + + + LAN1 cable + + + + GARAGE ENTRANCE + + vermont-garage camera + HiLook IPC-T241H-C · 10.0.30.70 + powered over the data cable (PoE) + outdoor · armored conduit + + + single cat6 in conduit · data + PoE power (camera day) + + + + + copper, in place + + camera-day cable / dark port + + radio (wifi / cellular) + total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3 + + diff --git a/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md b/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md new file mode 100644 index 00000000..d9de098d --- /dev/null +++ b/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md @@ -0,0 +1,99 @@ +# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable + +Status: accepted (2026-07-02, rev 3 — single-switch) + +![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg) + +![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg) + +The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook +IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is +physically exposed outside the apartment, so anything plugged into that cable +must land in a segment that can reach nothing. The original design doc +(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk +to pfSense" — but nothing in this network terminates dot1q on pfSense; the +site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean +untagged pfSense interface per segment. + +**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old +garage TL-SG105E (Viktor prefers not running two switches; retired unit +becomes a cold spare, its 192.168.1.6 mgmt IP passes to the PE). 
Five ports, +all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged +VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1` +carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable. +pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site +idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged +vNIC; pfSense still terminates no dot1q itself). The earlier dedicated +`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving +net3 back to vmbr2 restores pure physical isolation in one `qm set`). +This narrows the earlier 802.1Q objection rather than contradicting it: the +rejection assumed *unmanaged* switches, where any LAN device could inject +tagged frames; with the managed PE as the only device on eno1, VLAN-30 +membership is {camera port, trunk port} only, so tag-30 ingress from every +other port — and from the exposed camera cable — is dropped or contained. +Cameras are untrusted: default-deny on dCCTV with a single +NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8) +may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static +route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the +10.0.20.0/22 trusted source-IP allowlist. + +## Traffic on the trunk — how one cable carries two networks + +The LAN1 cable is shared, but the two networks on it diverge at `vmbr0` +(the vlan-aware bridge on the PVE host), and only ONE of them ever touches +pfSense: + +- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it + between the trunk, the host's own IP (192.168.1.127) and pfSense `net0` — + where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home + LAN's gateway is and remains the AX6000; home-LAN traffic never transits + pfSense. 
Consequently a pfSense (or R730 VM-level) outage does not affect + the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave + the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the + 4G router survives the whole rack being down. +- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers + VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera + segment's gateway, firewall and sole exit. "Camera → AX6000 → internet" + is impossible by construction, not merely by firewall rule. +- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed + out of its WAN toward the AX6000. Load-wise the trunk gained only the + camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic. + +![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg) + +*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)* + +## Considered options + +- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan + read this way) — rejected: any LAN device could inject tagged frames into + vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is + undefined. Rev 3 adopts the tagged path ONLY because the managed PE now + polices VLAN-30 membership at the single entry point to eno1; no bridge + reconfiguration was needed (vmbr0 was already vlan-aware). +- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role** + (rev 1/2 as-built) — superseded by rev 3: it forced either a second switch + (6 connections vs 5 ports once the PE also replaced the old switch) or new + hardware. Strongest isolation of all options; kept dormant as the fallback. +- **AX6000 as the camera gateway** — rejected earlier in the design (consumer + router, no inter-VLAN firewall). 
+ +## Consequences + +- The switch is now single-point and load-bearing for everything in the rack + (apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV) AND its VLAN + table + mgmt password are part of the isolation boundary — the Easy Smart + mgmt UI answers on every port, so the password is the gate between a + compromised camera and the switch config. All 5 ports are consumed: the + next camera forces an 8-port PoE upgrade (the wiring plan already fits it). +- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical + leg); eno3/eno4 remain free. +- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6 + (Kea reservation by MAC). +- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a + port-VLAN split (conflated the two devices); rev 2 split into two switches + after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3 + consolidated back to one switch — the PE replacing the SG105E — per + Viktor's preference, moving CCTV onto a managed tagged trunk. +- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra + NVDEC stream. 
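
The isolation argument above rests on the switch's VLAN table, so it may help to see that table as data. What follows is a minimal sketch, assuming the switch applies standard 802.1Q ingress filtering (untagged frames are classified by PVID, and frames for a VLAN the ingress port does not belong to are dropped). The `port` type, `revThreeTable` and `forward` are hypothetical names introduced for illustration; the P1..P5 labels and memberships are taken from the rev 3 description, not exported from the real switch. Egress tagging and MAC learning are deliberately left out.

```go
package main

import "fmt"

// untagged marks a frame that arrives without an 802.1Q tag.
const untagged = 0

// port is a simplified 802.1Q port: a PVID for untagged ingress plus the
// set of VLANs the port is a member of.
type port struct {
	name  string
	pvid  int
	vlans map[int]bool
}

// revThreeTable is an illustrative copy of the rev 3 membership described
// in the ADR, not a dump of the actual switch configuration.
var revThreeTable = []port{
	{"P1", 1, map[int]bool{1: true}},           // apartment uplink
	{"P2", 1, map[int]bool{1: true}},           // 4G router
	{"P3", 1, map[int]bool{1: true}},           // UPS mgmt
	{"P4", 30, map[int]bool{30: true}},         // camera, untagged VLAN 30
	{"P5", 1, map[int]bool{1: true, 30: true}}, // trunk to R730 eno1
}

// forward returns the ports a frame may leave on, or nil when ingress
// filtering drops it. vid is the frame's 802.1Q tag, or untagged.
func forward(table []port, in string, vid int) []string {
	var src *port
	for i := range table {
		if table[i].name == in {
			src = &table[i]
		}
	}
	if src == nil {
		return nil // unknown port
	}
	if vid == untagged {
		vid = src.pvid // untagged ingress is classified by the port's PVID
	}
	if !src.vlans[vid] {
		return nil // ingress port is not a member of this VLAN: drop
	}
	var out []string
	for _, p := range table {
		if p.name != in && p.vlans[vid] {
			out = append(out, p.name)
		}
	}
	return out
}

func main() {
	fmt.Println(forward(revThreeTable, "P4", untagged)) // camera traffic: [P5]
	fmt.Println(forward(revThreeTable, "P1", 30))       // tag 30 injected on the uplink: []
	fmt.Println(forward(revThreeTable, "P4", 1))        // tag 1 injected on the camera cable: []
	fmt.Println(forward(revThreeTable, "P1", untagged)) // home LAN: [P2 P3 P5]
}
```

Under these assumptions the only way out for anything plugged into the camera cable is the trunk port, and a tag-30 frame from any VLAN 1 port never reaches it. That is the property the rev 3 decision relies on, which is also why the VLAN table and the management password count as part of the isolation boundary.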
diff --git a/docs/adr/0017-cctv-segment-topology.svg b/docs/adr/0017-cctv-segment-topology.svg new file mode 100644 index 00000000..007b7e16 --- /dev/null +++ b/docs/adr/0017-cctv-segment-topology.svg @@ -0,0 +1,178 @@ + + + + + + + + + + + + + + + + + ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable + Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1 + + + + + + + + DENY · camera → LAN / other segments / internet (default deny on dCCTV) + + + + GARAGE ENTRANCE + + vermont-garage + HiLook IPC-T241H-C · pure IR + 10.0.30.70 (Kea reservation) + DNS: garage-cam.viktorbarzin.lan + PoE from switch · cloud/P2P off + + + cat6 in conduit · PoE → P4 + + + + RACK — GARAGE · ONE SWITCH + + + TL-SG105PE replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used + + + P1 · V1 + apartment + uplink + + P2 · V1 + 4G router + 192.168.1.7 + + P3 · V1 + UPS mgmt + + P4 · V30 + camera + PoE ON + + P5 · trunk + V1 untagged + + V30 tagged + + 802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged} + tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path + old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports + + + + + LAN1 cable + + + + DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK) + + + + eno1 → vmbr0 + untag V1 + tag 30 + + + eno2 → vmbr2 + dormant fallback leg + + + vmbr1 + internal · tags 10/20 + + + + + pfSense (VM 101) + gateway + firewall for every segment + + + net0 · WAN 192.168.1.2 · vmbr0 untagged + + net1 · dManagementsVms 10.0.10.1 + + net2 · dKubernetes 10.0.20.1 + + net3 · dCCTV 10.0.30.1/24 · vmbr0 tag 30 + + + + + + + + + k8s VMs · 10.0.20.0/24 + vmbr1 tag 20 · pod egress SNATs + to node IPs + + Frigate · k8s-node1 (T4) + detect sub / record main + gpumem budget 2300 MiB + + go2rtc LB 10.0.20.204 + restream → HA live view (MSE/HLS) + + + + HOME LAN 192.168.1.0/24 + + AX6000 · .1 + 
+ route 10.0.30.0/24 → .2 + + ha-sofia · .8 + Frigate card + hikvision_next + + apartment clients + laptops, phones + + CAMERA DAY: static route + 10.0.30.0/24 via 192.168.1.2 + + + apartment uplink · switch P1 · trunk · eno1 + + + + ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all) + + + ALLOW · ha-sofia → camera :80 ISAPI + :554 + enters pfSense WAN · reply-to off · needs the AX6000 route + + + ALLOW · camera → 10.0.30.1:123 (NTP) + + + + + home LAN / VLAN 1 + + CCTV / VLAN 30 / dCCTV 10.0.30.0/24 + + dKubernetes + + dManagementsVms + + allowed flow + + denied + + camera-day step + ADR-0017 · rev 3 + + diff --git a/docs/adr/0017-cctv-vlan-tagging.excalidraw b/docs/adr/0017-cctv-vlan-tagging.excalidraw new file mode 100644 index 00000000..26eb9abd --- /dev/null +++ b/docs/adr/0017-cctv-vlan-tagging.excalidraw @@ -0,0 +1,1771 @@ +{ + "type": "excalidraw", + "version": 2, + "source": "https://excalidraw.viktorbarzin.me", + "elements": [ + { + "id": "el001", + "type": "text", + "x": 40, + "y": 20, + "width": 621.6, + "height": 35.0, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1778837932, + "version": 1, + "versionNonce": 1303193991, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "VLAN tagging \u2014 where traffic can flow", + "fontSize": 28, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "VLAN tagging \u2014 where traffic can flow", + "lineHeight": 1.25, + "baseline": 28 + }, + { + "id": "el002", + "type": "text", + "x": 40, + "y": 62, + "width": 758.4, + "height": 20.0, + "angle": 0, + "strokeColor": "#868e96", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + 
"opacity": 100, + "seed": 1570340888, + "version": 1, + "versionNonce": 1243931547, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "the 802.1Q tag exists only between switch P5 and vmbr0 \u2014 endpoints never see it", + "fontSize": 16, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "the 802.1Q tag exists only between switch P5 and vmbr0 \u2014 endpoints never see it", + "lineHeight": 1.25, + "baseline": 16 + }, + { + "id": "el003", + "type": "rectangle", + "x": 700, + "y": 110, + "width": 210, + "height": 560, + "angle": 0, + "strokeColor": "#868e96", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "dashed", + "roughness": 1, + "opacity": 100, + "seed": 750280512, + "version": 1, + "versionNonce": 1195188524, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el004", + "type": "text", + "x": 742, + "y": 122, + "width": 97.2, + "height": 22.5, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 473142373, + "version": 1, + "versionNonce": 115692583, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "ONE CABLE", + "fontSize": 18, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "ONE CABLE", + "lineHeight": 1.25, + "baseline": 18 + }, + { + "id": "el005", + "type": "text", + "x": 716, + "y": 148, + "width": 171.6, + "height": 16.25, + "angle": 0, + "strokeColor": "#868e96", + "backgroundColor": "transparent", + "fillStyle": "solid", + 
"strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1069030696, + "version": 1, + "versionNonce": 1650002323, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "the LAN1 run \u00b7 P5\u2194eno1", + "fontSize": 13, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "the LAN1 run \u00b7 P5\u2194eno1", + "lineHeight": 1.25, + "baseline": 13 + }, + { + "id": "el006", + "type": "text", + "x": 40, + "y": 120, + "width": 276.0, + "height": 25.0, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1802024079, + "version": 1, + "versionNonce": 1083980019, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "VLAN 30 \u00b7 CCTV (camera)", + "fontSize": 20, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "VLAN 30 \u00b7 CCTV (camera)", + "lineHeight": 1.25, + "baseline": 20 + }, + { + "id": "el007", + "type": "rectangle", + "x": 40, + "y": 160, + "width": 170, + "height": 100, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "#d0bfff", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1363373344, + "version": 1, + "versionNonce": 1724819963, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el008", + "type": "text", + "x": 56, + "y": 172, + "width": 126.0, + "height": 56.25, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + 
"strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 590735843, + "version": 1, + "versionNonce": 267116025, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "camera\n10.0.30.70\nsends untagged", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "camera\n10.0.30.70\nsends untagged", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el009", + "type": "arrow", + "x": 210, + "y": 210, + "width": 50, + "height": 0, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 600787264, + "version": 1, + "versionNonce": 844240212, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 50, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el010", + "type": "rectangle", + "x": 260, + "y": 160, + "width": 190, + "height": 100, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 648177040, + "version": 1, + "versionNonce": 901986117, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el011", + "type": "text", + "x": 274, + "y": 170, + "width": 153.0, + "height": 37.5, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + 
"strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1421789145, + "version": 1, + "versionNonce": 530430174, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "P4 ingress\nPVID 30 \u2192 VLAN 30", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "P4 ingress\nPVID 30 \u2192 VLAN 30", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el012", + "type": "text", + "x": 274, + "y": 226, + "width": 126.0, + "height": 17.5, + "angle": 0, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 297119438, + "version": 1, + "versionNonce": 1328001885, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "\u2717 not in VLAN 1", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "\u2717 not in VLAN 1", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el013", + "type": "arrow", + "x": 450, + "y": 210, + "width": 50, + "height": 0, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1759537933, + "version": 1, + "versionNonce": 351602578, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 50, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el014", + "type": "rectangle", + "x": 500, + "y": 
160, + "width": 170, + "height": 100, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 2036237420, + "version": 1, + "versionNonce": 608198039, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el015", + "type": "text", + "x": 514, + "y": 172, + "width": 99.0, + "height": 37.5, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1755241687, + "version": 1, + "versionNonce": 1444750360, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "P5 egress\nadds 802.1Q", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "P5 egress\nadds 802.1Q", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el016", + "type": "text", + "x": 514, + "y": 226, + "width": 81.6, + "height": 21.25, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 76597799, + "version": 1, + "versionNonce": 1858784829, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "+ tag 30", + "fontSize": 17, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "+ tag 30", + "lineHeight": 1.25, + "baseline": 17 + }, + { + "id": "el017", + "type": "arrow", + "x": 670, + "y": 200, + "width": 270, + "height": 0, + "angle": 0, + "strokeColor": 
"#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 3, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1598556093, + "version": 1, + "versionNonce": 221916615, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 270, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el018", + "type": "arrow", + "x": 670, + "y": 222, + "width": 270, + "height": 0, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 3, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1523174671, + "version": 1, + "versionNonce": 216018217, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 270, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el019", + "type": "text", + "x": 724, + "y": 172, + "width": 126.0, + "height": 18.75, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 2049719155, + "version": 1, + "versionNonce": 1609878353, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "carries tag 30", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "carries tag 30", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": 
"el020", + "type": "rectangle", + "x": 940, + "y": 160, + "width": 180, + "height": 100, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 22152744, + "version": 1, + "versionNonce": 1741428563, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el021", + "type": "text", + "x": 954, + "y": 170, + "width": 144.0, + "height": 37.5, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1026267703, + "version": 1, + "versionNonce": 502895922, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "vmbr0 vlan-aware\nVID 30 \u2192 net3", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "vmbr0 vlan-aware\nVID 30 \u2192 net3", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el022", + "type": "text", + "x": 954, + "y": 226, + "width": 151.2, + "height": 17.5, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 918449769, + "version": 1, + "versionNonce": 1067599022, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "ONLY, nowhere else", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "ONLY, nowhere else", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el023", + "type": 
"arrow", + "x": 1120, + "y": 210, + "width": 50, + "height": 0, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1544933330, + "version": 1, + "versionNonce": 249589260, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 50, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el024", + "type": "rectangle", + "x": 1170, + "y": 130, + "width": 300, + "height": 190, + "angle": 0, + "strokeColor": "#7048e8", + "backgroundColor": "#d0bfff", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1147616804, + "version": 1, + "versionNonce": 275900123, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el025", + "type": "text", + "x": 1186, + "y": 142, + "width": 198.0, + "height": 56.25, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1183197673, + "version": 1, + "versionNonce": 827844211, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "pfSense net3 \u00b7 dCCTV\n10.0.30.1/24\ntag stripped by bridge", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "pfSense net3 \u00b7 dCCTV\n10.0.30.1/24\ntag stripped by bridge", + "lineHeight": 1.25, + "baseline": 15 + }, + { 
+ "id": "el026", + "type": "text", + "x": 1186, + "y": 212, + "width": 268.8, + "height": 35.0, + "angle": 0, + "strokeColor": "#2f9e44", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 556137867, + "version": 1, + "versionNonce": 1074481459, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "\u2713 in: Frigate :554 \u00b7 HA :80+:554\n\u2713 out: NTP :123 only", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "\u2713 in: Frigate :554 \u00b7 HA :80+:554\n\u2713 out: NTP :123 only", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el027", + "type": "text", + "x": 1186, + "y": 268, + "width": 193.2, + "height": 17.5, + "angle": 0, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1321167842, + "version": 1, + "versionNonce": 1493882225, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "\u2717 everything else: DENY", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "\u2717 everything else: DENY", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el028", + "type": "text", + "x": 40, + "y": 380, + "width": 480.0, + "height": 25.0, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1369574852, + "version": 1, + "versionNonce": 733267986, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": 
null, + "locked": false, + "text": "VLAN 1 \u00b7 home LAN (the rest of the rack)", + "fontSize": 20, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "VLAN 1 \u00b7 home LAN (the rest of the rack)", + "lineHeight": 1.25, + "baseline": 20 + }, + { + "id": "el029", + "type": "rectangle", + "x": 40, + "y": 420, + "width": 170, + "height": 120, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "#a5d8ff", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1426243518, + "version": 1, + "versionNonce": 404213796, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el030", + "type": "text", + "x": 54, + "y": 432, + "width": 142.79999999999998, + "height": 70.0, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1170712377, + "version": 1, + "versionNonce": 1439293404, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "apartment uplink\n4G router \u00b7 .7\nUPS \u00b7 switch mgmt\nall untagged", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "apartment uplink\n4G router \u00b7 .7\nUPS \u00b7 switch mgmt\nall untagged", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el031", + "type": "arrow", + "x": 210, + "y": 480, + "width": 50, + "height": 0, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 41933292, + "version": 1, + "versionNonce": 
217435681, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 50, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el032", + "type": "rectangle", + "x": 260, + "y": 420, + "width": 190, + "height": 120, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1494665817, + "version": 1, + "versionNonce": 82528369, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el033", + "type": "text", + "x": 274, + "y": 430, + "width": 135.0, + "height": 37.5, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 2006432221, + "version": 1, + "versionNonce": 1170391402, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "P1 / P2 / P3\nPVID 1 \u2192 VLAN 1", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "P1 / P2 / P3\nPVID 1 \u2192 VLAN 1", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el034", + "type": "text", + "x": 274, + "y": 488, + "width": 142.79999999999998, + "height": 35.0, + "angle": 0, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 2035003054, + "version": 1, + "versionNonce": 231739024, + 
"isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "\u2717 tag-30 arriving\nhere is DROPPED", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "\u2717 tag-30 arriving\nhere is DROPPED", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el035", + "type": "arrow", + "x": 450, + "y": 480, + "width": 50, + "height": 0, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 851649342, + "version": 1, + "versionNonce": 1330529717, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 50, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el036", + "type": "rectangle", + "x": 500, + "y": 420, + "width": 170, + "height": 120, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1108429504, + "version": 1, + "versionNonce": 322250604, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el037", + "type": "text", + "x": 514, + "y": 434, + "width": 117.0, + "height": 37.5, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 2082654793, + "version": 1, + "versionNonce": 88739979, + "isDeleted": false, + 
"groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "P5 egress\nnative VLAN 1", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "P5 egress\nnative VLAN 1", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el038", + "type": "text", + "x": 514, + "y": 496, + "width": 108.0, + "height": 18.75, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 594390025, + "version": 1, + "versionNonce": 1730926570, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "no tag added", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "no tag added", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el039", + "type": "arrow", + "x": 670, + "y": 480, + "width": 270, + "height": 0, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 3, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 2082581262, + "version": 1, + "versionNonce": 1681796809, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 270, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el040", + "type": "text", + "x": 716, + "y": 452, + "width": 189.0, + "height": 18.75, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", 
+ "roughness": 1, + "opacity": 100, + "seed": 787209477, + "version": 1, + "versionNonce": 840302416, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "plain untagged frames", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "plain untagged frames", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el041", + "type": "rectangle", + "x": 940, + "y": 420, + "width": 180, + "height": 120, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1079834069, + "version": 1, + "versionNonce": 647687454, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el042", + "type": "text", + "x": 954, + "y": 432, + "width": 168.0, + "height": 70.0, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 474197814, + "version": 1, + "versionNonce": 912206893, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "vmbr0 untagged\n= plain L2 switching\nhost .127 + pfSense\nWAN \u2014 no routing", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "vmbr0 untagged\n= plain L2 switching\nhost .127 + pfSense\nWAN \u2014 no routing", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el043", + "type": "arrow", + "x": 1120, + "y": 480, + "width": 50, + "height": 0, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "transparent", 
+ "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 215726947, + "version": 1, + "versionNonce": 1310489154, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "points": [ + [ + 0, + 0 + ], + [ + 50, + 0 + ] + ], + "lastCommittedPoint": null, + "startBinding": null, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "roundness": { + "type": 2 + } + }, + { + "id": "el044", + "type": "rectangle", + "x": 1170, + "y": 410, + "width": 300, + "height": 160, + "angle": 0, + "strokeColor": "#1971c2", + "backgroundColor": "#a5d8ff", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1355096973, + "version": 1, + "versionNonce": 1357902601, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el045", + "type": "text", + "x": 1186, + "y": 422, + "width": 218.4, + "height": 52.5, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 212355785, + "version": 1, + "versionNonce": 693422793, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "pfSense net0 \u00b7 WAN .2\njust a LAN client \u2014\nhome LAN never transits it", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "pfSense net0 \u00b7 WAN .2\njust a LAN client \u2014\nhome LAN never transits it", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el046", + "type": "text", + "x": 1186, + "y": 494, + "width": 201.6, + "height": 35.0, + "angle": 
0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1799580904, + "version": 1, + "versionNonce": 398539541, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "gateway = AX6000\npfSense NATs only 10.0.x", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "gateway = AX6000\npfSense NATs only 10.0.x", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el047", + "type": "rectangle", + "x": 40, + "y": 600, + "width": 630, + "height": 90, + "angle": 0, + "strokeColor": "#868e96", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "dashed", + "roughness": 1, + "opacity": 100, + "seed": 1339321764, + "version": 1, + "versionNonce": 1076065263, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "roundness": { + "type": 3 + } + }, + { + "id": "el048", + "type": "text", + "x": 56, + "y": 612, + "width": 554.4, + "height": 35.0, + "angle": 0, + "strokeColor": "#868e96", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1733803932, + "version": 1, + "versionNonce": 2062677415, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "not on this cable: vmbr1 tag 10 \u2192 dMgmt \u00b7 tag 20 \u2192 dK8s (Frigate)\ndormant fallback: eno2 \u2192 vmbr2 (revert = one qm set)", + "fontSize": 14, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "not on this cable: vmbr1 tag 10 \u2192 dMgmt \u00b7 tag 20 
\u2192 dK8s (Frigate)\ndormant fallback: eno2 \u2192 vmbr2 (revert = one qm set)", + "lineHeight": 1.25, + "baseline": 14 + }, + { + "id": "el049", + "type": "text", + "x": 940, + "y": 620, + "width": 396.0, + "height": 37.5, + "angle": 0, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 322195856, + "version": 1, + "versionNonce": 365731358, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "L2 drops (membership) happen in the switch \u2014\nL3 allow/deny happens in pfSense", + "fontSize": 15, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "L2 drops (membership) happen in the switch \u2014\nL3 allow/deny happens in pfSense", + "lineHeight": 1.25, + "baseline": 15 + }, + { + "id": "el050", + "type": "text", + "x": 940, + "y": 676, + "width": 109.2, + "height": 16.25, + "angle": 0, + "strokeColor": "#868e96", + "backgroundColor": "transparent", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "seed": 1038112083, + "version": 1, + "versionNonce": 966092898, + "isDeleted": false, + "groupIds": [], + "frameId": null, + "boundElements": null, + "updated": 1, + "link": null, + "locked": false, + "text": "ADR-0017 rev 3", + "fontSize": 13, + "fontFamily": 1, + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "ADR-0017 rev 3", + "lineHeight": 1.25, + "baseline": 13 + } + ], + "appState": { + "gridSize": null, + "viewBackgroundColor": "#ffffff" + }, + "files": {} +} \ No newline at end of file diff --git a/docs/adr/0017-cctv-vlan-tagging.svg b/docs/adr/0017-cctv-vlan-tagging.svg new file mode 100644 index 00000000..868aa746 --- /dev/null +++ b/docs/adr/0017-cctv-vlan-tagging.svg @@ -0,0 +1 @@ 
+VLAN tagging — where traffic can flowthe 802.1Q tag exists only between switch P5 and vmbr0 — endpoints never see itONE CABLEthe LAN1 run - P5 to eno1VLAN 30 - CCTV (camera)camera10.0.30.70sends untaggedP4 ingressPVID 30 -> VLAN 30x not in VLAN 1P5 egressadds 802.1Q:+ tag 30carries tag 30vmbr0 vlan-awareVID 30 -> net3ONLY, nowhere elsepfSense net3 - dCCTV 10.0.30.1/24tag stripped by the bridgeok in: Frigate :554 - HA :80 + :554ok out: NTP :123 onlyx everything else: DENYVLAN 1 - home LAN (the rest of the rack)apartment uplink4G router - .7UPS - switch mgmtall untaggedP1 / P2 / P3PVID 1 -> VLAN 1x tag-30 arrivinghere is DROPPEDP5 egressnative VLAN 1:no tag addedplain untagged framesvmbr0 untagged =plain L2 switching:host .127 + pfSenseWAN - no routingpfSense net0 - WAN 192.168.1.2just a LAN client - home LANnever transits pfSensegateway = AX6000 - pfSense NATs only 10.0.xnot on this cable: vmbr1 tag 10 -> dMgmt - tag 20 -> dK8s (Frigate)dormant fallback: eno2 -> vmbr2 (revert = one qm set)L2 drops (membership) happen in the switch,L3 allow/deny happens in pfSenseADR-0017 rev 3 - editable source: 0017-cctv-vlan-tagging.excalidraw \ No newline at end of file diff --git a/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md b/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md new file mode 100644 index 00000000..5344382a --- /dev/null +++ b/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md @@ -0,0 +1,47 @@ +# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster + +Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she +shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`) +and more are expected. 
We decided all **Valia sites** are served **off-infra on Cloudflare +Pages** under `.viktorbarzin.me`, kept fresh by **one shared in-cluster +CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes +(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The +existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync) +migrates onto this and is retired. + +Why off-infra serving: these are her sites, shown to teachers/parents — they must survive +homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster +site down). With Pages, a homelab outage degrades to "content frozen until we're back", +never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/ +Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA +secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never +wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The +deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an +accident. + +## Considered options + +- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no + Cloudflare Pages dependency — but her sites share the homelab's fate and each site + spends cluster resources to serve static files a free CDN serves better. +- **Pages for new sites only**: less work now, two patterns and two runbooks forever. +- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but + Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault. + +## Consequences + +- Registration is one entry in the `sites` map (name, Content folder, optional Entry + file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config + together. Names are English, picked by Viktor (most → bridge set the precedent). 
+- The internal split-horizon zone learns Valia sites from a ConfigMap the + `technitium-ingress-dns-sync` script consumes — declaratively, including **removal** + (the previous static-CNAME approach was add-only; a retired site left a stale record). +- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on + the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs + deployed. +- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no + per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't + update" reports, consistent with the alert-noise-reduction posture. Revisit if a + silent stall actually bites. +- If the homelab is down, content updates pause; the sites keep serving last-deployed + content. Accepted degradation. diff --git a/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md b/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md new file mode 100644 index 00000000..708d8624 --- /dev/null +++ b/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md @@ -0,0 +1,97 @@ +# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free + +`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12 +inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only +outage protection — a documented "No Backup MX" decision made after ForwardEmail's +forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email +Routing proved pass-through-only. Viktor now wants inbound mail to survive +homelab outages **without loss** (2026-07-04): delayed delivery is fine, +mid-outage reading is not required, and the budget is **$0** — a hard +constraint that eliminated every managed option (see below). + +We run a minimal **Postfix store-and-forward relay on an Oracle Cloud +Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved** +public IP, MX preference 20; primary untouched at 1). 
It accepts everything +for the domain (catch-all — every RCPT is valid; reputation may only ever +4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM — +never 5xx: a backup MX that hard-rejects manufactures the loss it exists to +prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never +deliver a DSN, its only egress is the drain), and drains to the primary over +**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy +frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is +tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as +mid-outage break-glass since headscale itself lives in the cluster); TLS via +certbot HTTP-01 (port 80 permanently open — LE validation is +multi-perspective and unscopeable); the VM is a cattle-rebuild from a new +`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must +also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT). +On the primary, the drain stream (one /32) is enabled at the layers that +actually bite — `check_client_access` permits past +`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit +exception, and rspamd `external_relay` (score against the *original* sender +IP) with the reject action capped to tag/fold so drained spam can never force +the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25 +reachability (recurring probe — Oracle publishes no commitment), drain +end-to-end, and a live failover test that includes a high-spam-score and a +>10 MB message. Two independent adversarial reviews (2026-07-04) shaped this +final form. Design: +[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md). 
+
+## Considered options
+
+- **Roller Network free Secondary MX** — v1 of this decision, killed at the
+  validation gates the same day: free tier caps at 200 relayed messages or
+  10 MB per rolling 7 days, and overage suspends the domain for 48 h
+  answering **SMTP 5xx** (permanent bounces) — since spammers target backup
+  MXes even while the primary is up, background spam alone can hold it
+  suspended, making it *worse than no backup MX*. Free accounts are also
+  being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
+  the documented fallback if the OCI route sours.)
+- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
+  12–24 h, barely beating sender retry); filtering black-box; not free.
+- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
+  inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
+- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
+  blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
+  plan is a 6-month credit; Azure has no always-free VM and blocks 25;
+  Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
+  trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
+  is the only standing free option.
+- **Harden-only** (5xx-misconfig guards + paging) — does not address
+  multi-day outages or short-retry senders; deferred as a complementary
+  track.
+
+## Consequences
+
+- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
+  Terraform + cloud-init, patched by unattended-upgrades, scraped by the
+  cluster's Prometheus (exporters on the reserved public IP, allowlisted to
+  the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
+  scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
+  besides). Never a backup target itself.
+- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
+  free allowance in June 2026 and terminated over-limit instances, and
+  publishes no commitment that inbound 25 stays open. Mitigations:
+  **Pay-As-You-Go conversion is a required prerequisite** (exempts idle
+  reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
+  the queue being empty outside outages (a surprise reclamation loses
+  coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
+  once.
+- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
+  and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
+  the original IP via `external_relay`), and content scoring stay on — spam
+  arriving via the backup is tagged and folded to Junk, never bounced. The VM
+  is deliberately NOT in the primary's `mynetworks` (a compromised VM must
+  not relay through us).
+- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
+  VM. Stated and accepted (6× better than the status quo).
+- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
+  off-premises; accepted (same class as Brevo holding outbound today).
+- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
+  host found dangling during design — inert today; must list `mx2` when
+  fixed) needs 1–2 more → schedule the next record purge proactively.
+- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
+  new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
+  `vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
+  failure semantics change (a "failing" probe may now mean "delayed via mx2,
+  drains shortly" — noted in alert description).
diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md index 118c0895..f77518b4 100644 --- a/docs/architecture/chrome-service.md +++ b/docs/architecture/chrome-service.md @@ -329,6 +329,12 @@ Two independent grants make up "browser access" for a user: the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate a token by deleting its `-browser-token` Secret). +Because the SA is the user's DEFAULT kubectl credential, other per-namespace +port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf` +grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's +agent can upload drawings via the port-forward + `X-Authentik-Username` recipe +in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too. + ## Limits + risks - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index b8cfcdd5..17d0859f 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin. 
| Visibility | Packages | Pull mechanism | |------------|----------|----------------| | **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous | -| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson | +| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson | Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit @@ -188,6 +188,8 @@ reconciled — the workflows were added to the GitHub lineage via PR): | android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` | | infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` | | infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` | +| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) | +| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) | **`infra-ci`** is the image the `.woodpecker/default.yml` apply step and `drift-detection.yml` run in (proven by pipelines 165/166). 
`chatterbox-tts` is diff --git a/docs/architecture/dns.md b/docs/architecture/dns.md index 6150d226..e3fe6ee5 100644 --- a/docs/architecture/dns.md +++ b/docs/architecture/dns.md @@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h). -**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. +**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. 
Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched). ## NodeLocal DNSCache @@ -368,6 +368,7 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`) | TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement | | TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting | | A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver | +| CNAME (CF Pages) | 2 | `.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` | ### Proxied vs Non-Proxied @@ -513,6 +514,7 @@ For external `.viktorbarzin.me` records: 1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack 2. Run `scripts/tg apply` on the service stack — DNS record is auto-created 3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf` +4. 
For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`) ## Incident History diff --git a/docs/architecture/mailserver.md b/docs/architecture/mailserver.md index 0edeffb4..a7925849 100644 --- a/docs/architecture/mailserver.md +++ b/docs/architecture/mailserver.md @@ -161,6 +161,17 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail DB: MySQL (mysql.dbaas.svc.cluster.local) ``` +### Paperless ingest mailbox (docs@) + +`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in +`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that +paperless-ngx polls over IMAP; family members forward document emails to it +and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve +(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap, +mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`) +discards mail from non-allowlisted senders at delivery. Full flow, sender map, +and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md). + ## DNS Records All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`. @@ -300,6 +311,21 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External ## Troubleshooting +### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin) + +Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`: +`postfix/cleanup: warning: tcp:localhost:10001 lookup error` + +`sender_canonical_maps map lookup problem ... message not accepted, try again later`. +Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`) +came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it +`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. 
Postfix then +tempfails every message (inbound AND submission); senders retry so nothing is +lost, and the roundtrip probe alerts within the hour. +Fix: `supervisorctl restart postsrsd` inside the container; if the fresh +process spins again (it did once), `kubectl -n mailserver delete pod` for a +full re-init — that healed it. Root cause not pinned down (one-off bad init; +postsrsd 1.10). + ### Inbound mail not arriving 1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me` 2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md index b93df195..1e17d95d 100644 --- a/docs/architecture/networking.md +++ b/docs/architecture/networking.md @@ -1,10 +1,10 @@ # Networking Architecture -Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed) +Last updated: 2026-07-02 (dCCTV segment added — dedicated pfSense leg for the garage camera, ADR-0017) ## Overview -The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. +The homelab network is built on three isolated segments behind pfSense (management VLAN 10, Kubernetes VLAN 20, and the physically-legged dCCTV camera segment — see ADR-0017) with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. 
Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. ## Architecture Diagram @@ -24,9 +24,14 @@ graph TB CSdrop[CrowdSec drop
nftables / CF edge
out-of-band, pre-Traefik] - subgraph "Proxmox Host (eno1)" + subgraph "Proxmox Host (eno1, eno2)" vmbr0[vmbr0 Bridge
192.168.1.127/24] vmbr1[vmbr1 Internal
VLAN-aware] + vmbr2[vmbr2 Bridge
eno2 → TL-SG105PE] + + subgraph "dCCTV - 10.0.30.0/24
ADR-0017" + Camera[vermont-garage
10.0.30.70] + end subgraph "VLAN 10 - Management
10.0.10.0/24" Proxmox[Proxmox Host
10.0.10.1] @@ -71,6 +76,9 @@ graph TB vmbr1 -.VLAN 20.- Tech vmbr1 -.VLAN 20.- Master vmbr1 -.VLAN 20.- Node1 + vmbr2 -.physical link.- eno2 + vmbr2 -.untagged.- Camera + vmbr2 -.pfSense net3 = dCCTV 10.0.30.1.- pfSense ``` ## Components @@ -81,6 +89,7 @@ graph TB | phpIPAM | v1.7.0 | phpipam.viktorbarzin.me | IP address management, device inventory, DNS sync | | vmbr0 | Linux bridge | 192.168.1.127/24 | Physical bridge on eno1, uplink to LAN | | vmbr1 | Linux bridge (VLAN-aware) | Internal | VLAN trunk for VM isolation | +| vmbr2 | Linux bridge | Physical (eno2) | DORMANT fallback leg for dCCTV (ADR-0017 rev 3) — live dCCTV rides vmbr0 tag 30 over the LAN1 trunk | | Technitium DNS | Container | 10.0.20.201 (LB) / 10.96.0.53 (ClusterIP) | Internal DNS (viktorbarzin.lan) + full recursive resolver | | Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me | | Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding | @@ -90,6 +99,22 @@ graph TB | MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 | | Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 | +## CCTV Segment (dCCTV) — as-built 2026-07-02 + +Isolated camera segment for owned cameras at the Sofia site (first: `vermont-garage`, HiLook IPC-T241H-C at the garage entrance). Decision + rejected alternatives: `docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md`. + +**Physical path (rev 3, single switch)**: camera → TL-SG105PE PoE port (untagged VLAN 30) → trunk port (home LAN untagged + CCTV **tagged 30**) → the existing LAN1 cable → R730 `eno1` → `vmbr0` (vlan-aware) → pfSense `net3`/vtnet3 = `vmbr0 tag=30` = interface **dCCTV `10.0.30.1/24`**. The TL-SG105PE **replaces** the old garage TL-SG105E (retired to cold spare) and carries everything: apartment uplink, 4G router `192.168.1.7`, UPS mgmt (VLAN 1), camera (VLAN 30), trunk — all 5 ports used. 
VLAN-30 membership is {camera port, trunk port} only, so tagged injection from other ports is dropped. `eno2`/`vmbr2` remain dormant as the fallback physical leg (rev 2). + +**Addressing**: Kea DHCP pool `10.0.30.100-199`; devices get MAC reservations (camera `10.0.30.70`; the PE switch mgmt inherits the retired switch's `192.168.1.6` on the home LAN). Kea DDNS auto-registers names in Technitium; `phpipam-pfsense-import` picks up leases hourly. + +**Firewall** (all on pfSense): +- dCCTV in: pass `udp OPT4-net → 10.0.30.1:123` (NTP) — everything else hits the interface's default deny. Cameras cannot reach LAN, other segments, or the internet. +- WAN in (home LAN side): pass `192.168.1.8` (ha-sofia) → `10.0.30.70:80` (ISAPI/hikvision_next) and `:554` (RTSP), reply-to disabled on both. +- dKubernetes is allow-all, so cluster Frigate/go2rtc pulls RTSP with no extra rule (pod egress SNATs to node IPs). +- Home-LAN clients need the **AX6000 static route** `10.0.30.0/24 via 192.168.1.2` (camera-day step) to reach the camera UI. + +**Consumers**: cluster Frigate (`/srv/nfs/frigate/config/config.yml` — NOT Terraform) pulls `rtsp://10.0.30.70:554` main+sub as `vermont-garage`; HA integrates via Frigate plus direct hikvision_next for tamper events. + ## IPAM & DNS Auto-Registration Devices are automatically discovered, named, and registered in DNS without manual intervention. @@ -207,6 +232,8 @@ VMs tag traffic on vmbr1 to isolate workloads. 
pfSense bridges VLAN 20 to the up - blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, travel, netbox - **Non-proxied domains** (grey cloud, direct IP resolution): - mail, wg, headscale, immich, calibre, vaultwarden, and other services requiring direct connections +- **Internal-IP domains** (grey cloud, A → `10.0.20.203` Traefik LB, `ingress_factory` `dns_type = "internal"`): + - highlights-immich, highlights-immich-emo — publicly *resolvable* but only *routable* from home LANs / WG sites / VPN (spokes policy-route `10.0.0.0/8` down the tunnel, so kiosk devices with baked-in URLs need no per-site DNS overrides). The record is reachability, not a gate — enforcement is the `home-lans-only` Traefik ipAllowList (Sofia/London/Valchedrym LANs + 10/8) on the ingress. See `docs/plans/2026-07-04-immich-frame-lan-only-design.md`. - CNAME records for proxied domains point to Cloudflared tunnel FQDNs ### Ingress Flow @@ -261,7 +288,7 @@ Traefik chain: 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`). 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend. -3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). 
Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients). +3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients), tripit (`tripit-rate-limit`, 100/1000, photo-tab thumbnail bursts), health (`health-rate-limit`, 100/1000, SPA shell + API burst per page), and dawarich (`dawarich-rate-limit`, 100/1000 — the Rails app self-serves all fingerprinted assets and the map adds an API burst per load; the default burst 429'd the asset tail and risked dropping OwnTracks/mobile location POSTs on the same host). 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors). 
Additional middleware: @@ -552,7 +579,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che **Diagnosis**: Check Traefik middleware config for the affected IngressRoute. -**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik--rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen). +**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik--rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, and tripit/health/authentik/dawarich each 100/1000 (SPA or asset-heavy page loads bursting past the default from one client IP). 
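
The burst figures in the fix above can be sanity-checked with a rough token-bucket estimate. This is only a sketch: it assumes every request of a page load arrives at the same instant, so no tokens refill mid-burst, and `rejected_in_burst` is a hypothetical helper for illustration, not something in the repo or in Traefik.

```shell
#!/bin/sh
# Idealised token-bucket estimate (assumption: the whole burst arrives at
# once, so the bucket never refills while the burst is being served).
# Usage: rejected_in_burst REQUESTS BURST
# Prints how many requests of the burst would be answered with 429.
rejected_in_burst() {
  if [ "$1" -gt "$2" ]; then
    echo $(( $1 - $2 ))
  else
    echo 0
  fi
}

rejected_in_burst 70 50     # shared default, burst 50   -> prints 20
rejected_in_burst 70 1000   # dedicated 100/1000 budget  -> prints 0
```

Under this model a cold load of roughly 70 parallel requests loses its last 20 or so to 429s on the shared default, while a 100/1000 middleware absorbs the whole burst. Real behaviour differs slightly because requests never arrive perfectly simultaneously.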
### Large Downloads or Uploads Truncate / Fail Partway diff --git a/docs/plans/2026-07-03-vault-token-self-heal-design.md b/docs/plans/2026-07-03-vault-token-self-heal-design.md new file mode 100644 index 00000000..a88aff46 --- /dev/null +++ b/docs/plans/2026-07-03-vault-token-self-heal-design.md @@ -0,0 +1,103 @@ +# Vault Token Renewer Self-Heal Design + +**Date**: 2026-07-03 +**Status**: Approved (brainstorm complete; implementation pending) +**Owner**: wizard@devvm +**Supersedes**: the "version-only, no self-heal" scope choice recorded in +`docs/runbooks/vault-token-renew-devvm.md` (2026-06-07) + +## Problem + +`wizard@devvm` holds a maintenance-free periodic Vault token +(`token-devvm-wizard`, `period=768h`, renewed daily by the +`vault-token-renew` user timer) precisely so no weekly re-login is needed. +But `~/.vault-token` is the Vault CLI's default token sink, so any +`vault login -method=oidc` — which the infra docs themselves instruct before +applies — overwrites it with a 7-day OIDC token. The renewer's drift guard +(deliberately detect-only) then refuses to renew the foreign token and fails +the unit daily, into a log nobody watches. + +Observed consequence: a self-perpetuating weekly-expiry loop. The OIDC token +expires after 7 days → Vault 403s → the natural response is another +`vault login -method=oidc` → clobbers again. Drift persisted unnoticed +2026-06-18 → 06-26 and 2026-06-29 → 07-03 (memory #7121); Viktor experienced +it as "the token expires maybe once a week". + +**Goal**: `vault login -method=oidc` becomes harmless on devvm. The renewer +converts any admin-capable clobber back into the permanent periodic token, +unattended. (Chosen over "never log in" doc-fixes and over instant path-unit +healing — see Alternatives.) + +## Decisions + +| # | Decision | Notes | +|---|----------|-------| +| 1 | Heal in the existing renewer's drift branch, at its nightly run | ~20-line diff to an already-tested script; no new units. 
A few-hours window holding the 7-day OIDC token is harmless (heal window 24h ≪ 7d TTL) | +| 2 | Heal = *attempt* re-mint using the foreign token itself; let Vault's 403 decide | No policy-list guessing — identity-vs-token-policies burned us before (memory #4211). OIDC tokens carry `vault-admin` via `identity_policies`, so the create succeeds | +| 3 | Weak foreign token (create denied) → keep today's loud DRIFT failure | A read-only clobber (e.g. the 2026-06-05 `kubernetes-woodpecker-default` incident) signals a misbehaving agent flow; auto-papering over it would hide the offender. Log gains a "heal denied — investigate what wrote it" suffix | +| 4 | Do NOT revoke the clobbering OIDC token | It may still back the user's live login session; it ages out in 7 days on its own | +| 5 | After a successful heal, revoke stale `token-devvm-wizard` accessors | Anti-sprawl: each heal would otherwise strand the previous periodic **admin** token server-side for up to 32 days. Walk `auth/token/accessors`, revoke every `display_name=token-devvm-wizard` except the just-minted one. Runs only on heal (rare), never on the happy path | +| 6 | Minted-token sanity check before writing the file | Look up the new token; require `display_name=token-devvm-wizard`. Write via temp file + `mv` + `chmod 600` so a failed mint can never truncate `~/.vault-token` | +| 7 | Keep timer cadence (daily) and all happy-path behavior unchanged | | +| 8 | No notification plumbing in this change | devvm alerting is tracked separately (beads `code-aslh`). 
Heal events are logged; heal-denied/FAIL still fail the unit | + +## Behavior matrix + +| Token found in `~/.vault-token` | Before | After | +|---|---|---| +| Our periodic token | renew-self, log `OK` | unchanged | +| Foreign, admin-capable (OIDC login) | log `DRIFT`, exit 1 | re-mint periodic token with it, sanity-check, atomic write, revoke stale periodic accessors, log `HEALED: re-minted from foreign dn= (revoked N stale)`, exit 0 | +| Foreign, weak (read-only k8s clobber) | log `DRIFT`, exit 1 | log `DRIFT … heal denied — foreign token lacks create authority; investigate what wrote it`, exit 1 | +| Vault unreachable / lookup fails | log `FAIL`, exit 1 | unchanged | + +Re-mint command (identical to the manual recovery the DRIFT log already +prescribes): + +``` +vault token create -orphan -period=768h \ + -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard +``` + +## Testing + +- **Unit** (`scripts/test-vault-token-renew.sh`, existing source-the-functions + harness): new pure functions for (a) the stale-accessor revoke filter + (match on `display_name`, exclude the current accessor) and (b) the + minted-token sanity predicate; regression cases for the existing drift + predicate stay green. +- **Live, post-deploy** (on devvm): + 1. Mint a fake 1h admin token (`-display-name=fake-oidc`, + `-policy=vault-admin -policy=sops-admin`), write to `~/.vault-token`, + start the service → expect `HEALED`, file holds `token-devvm-wizard`. + 2. Mint a fake 10m no-privilege token (`-policy=default`), write it, start + the service → expect `DRIFT … heal denied`, unit `failed`; restore real + token. + 3. Revoke both fakes; one-off sweep of stale periodic accessors left by the + June 26 / July 3 manual re-mints. 
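
The behavior matrix above can be read as a pure decision function, which is also a convenient shape for the unit harness. The sketch below is only an illustration: `heal_outcome` and its three yes/no arguments are hypothetical names, not part of `vault-token-renew.sh`, and the real script derives these facts from `vault token lookup` and the re-mint attempt.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: heal_outcome is NOT a function in vault-token-renew.sh.
# Args (each "yes" or "no"):
#   $1 lookup_ok  did the token lookup succeed (Vault reachable)?
#   $2 is_ours    is the token our periodic token-devvm-wizard token?
#   $3 mint_ok    did Vault allow the re-mint with the foreign token?
# Prints "<LOG_TAG> <exit_code>" following the behavior matrix.
heal_outcome() {
  local lookup_ok="$1" is_ours="$2" mint_ok="$3"
  if [ "$lookup_ok" != yes ]; then
    echo "FAIL 1"      # Vault unreachable / lookup fails
  elif [ "$is_ours" = yes ]; then
    echo "OK 0"        # happy path: renew-self
  elif [ "$mint_ok" = yes ]; then
    echo "HEALED 0"    # foreign, admin-capable clobber
  else
    echo "DRIFT 1"     # foreign, weak clobber: stay loud
  fi
}

heal_outcome yes yes no   # prints: OK 0
heal_outcome yes no yes   # prints: HEALED 0
heal_outcome yes no no    # prints: DRIFT 1
heal_outcome no no no     # prints: FAIL 1
```

Note that `mint_ok` is an observed result rather than a prediction, which mirrors decision 2: the script attempts the mint and lets Vault's 403 decide.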
+ +## Docs & rollout + +- Same commit rewrites the runbook's "Drift guard & recovery" section: + self-heal is the recovery for admin-capable clobbers; manual re-mint remains + only for weak clobbers (or a dead token with no admin-capable replacement in + the file). +- `vault login -method=oidc` instructions across the docs stay as-is — the + login is now harmless by design. +- Deploy per the runbook's manual model: `install -m 0755` to + `~/.local/bin/vault-token-renew`. Units unchanged — no daemon-reload. +- After landing: update memories #4204/#4211 (gotcha now self-healing). + +## Alternatives considered + +- **Instant heal** (systemd path unit + protected source-copy of the token): + strictly more capable (seconds-latency, heals weak clobbers too, zero + re-minting), but 2 new units + a second secret file + inotify re-trigger + edge cases — machinery disproportionate to the residual risk. Revisit only + if the few-hour heal window ever bites. +- **Vault CLI `token_helper` interception**: right interception point in + theory, but a helper bug breaks every `vault` CLI call, Terraform reads + `~/.vault-token` natively anyway, and it adds latency inside login. Rejected. +- **Docs-only ("never log in")**: rejected by user — the login should keep + working, not become forbidden knowledge. +- **Raise the OIDC role's 7-day `token_max_ttl`**: shared role, affects every + OIDC user; rejected previously for the same reason (memory #4205). diff --git a/docs/plans/2026-07-03-vault-token-self-heal-plan.md b/docs/plans/2026-07-03-vault-token-self-heal-plan.md new file mode 100644 index 00000000..1bfd7978 --- /dev/null +++ b/docs/plans/2026-07-03-vault-token-self-heal-plan.md @@ -0,0 +1,443 @@ +# Vault Token Renewer Self-Heal Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. 
+ +**Goal:** Make `vault login -method=oidc` harmless on devvm — the nightly renewer re-mints the permanent periodic token from any admin-capable clobber of `~/.vault-token`, unattended. + +**Architecture:** Extend the drift branch of `scripts/vault-token-renew.sh` (deployed to `~/.local/bin/vault-token-renew`, driven by an existing systemd user timer). On drift, *attempt* the re-mint with the clobbering token itself and let Vault's 403 be the authority; sanity-check the minted token, replace the file atomically, then revoke stale `token-devvm-wizard` leftovers. Weak clobbers keep today's loud failure. Design: `docs/plans/2026-07-03-vault-token-self-heal-design.md`. + +**Tech Stack:** bash + jq + vault CLI; existing test harness `scripts/test-vault-token-renew.sh` (sources the script, `vtr_main` is guarded). + +**Working copy:** everything below runs in the worktree +`~/code/infra/.worktrees/vault-token-self-heal` on branch `wizard/vault-token-self-heal`. +Per repo policy, EVERY git command in this git-crypt repo worktree carries: +`-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false` +(abbreviated as `$GCFLAGS` below; define once per shell: +`GCFLAGS="-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false"` +and use it unquoted: `git $GCFLAGS …`). 
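
Why `$GCFLAGS` must be used unquoted: the string holds six separate arguments, and only unquoted expansion lets bash word-split it into them. The sketch below demonstrates this with a hypothetical `count_args` helper (not part of the repo) so it runs without git. The array form at the end is an alternative not used by this plan, shown only for comparison.

```bash
#!/usr/bin/env bash
GCFLAGS="-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false"

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: bash splits on whitespace, so git would see three -c pairs.
count_args $GCFLAGS      # prints: 6

# Quoted: one single argument, which git would reject as an unknown option.
count_args "$GCFLAGS"    # prints: 1

# Alternative (assumption, not what the plan prescribes): a bash array
# keeps the words separate even when the expansion is quoted.
GCARGS=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
count_args "${GCARGS[@]}"   # prints: 6
```

The unquoted form relies on bash's default word splitting; in a shell that does not split unquoted variables (zsh by default), the array form would be the safer choice.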
+ +--- + +### Task 1: Unit tests for the two new pure functions (RED) + +**Files:** +- Modify: `scripts/test-vault-token-renew.sh` (append before the final `printf`/exit lines) + +- [ ] **Step 1: Append the failing tests** + +Insert this block immediately after the existing "parse + decide end-to-end" section (after the line `no "oidc: parse+decide refused" …`, before the final `printf '\n%d passed…'`): + +```bash +# --- vtr_accessor: parse accessor out of lookup JSON --- +LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}' +eq "accessor parsed" "acc-new" "$(vtr_accessor "$LOOKUP_NEW")" +eq "accessor absent -> empty" "" "$(vtr_accessor '{"data":{"display_name":"x"}}')" + +# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard +# --- tokens are swept; the just-minted token, foreign tokens, and anything with an +# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe). +STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}' +ok "older periodic token is stale" vtr_is_stale_periodic "$STALE_OURS" "acc-new" +no "the just-minted token is kept" vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new" +no "foreign oidc token never swept" vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new" +no "woodpecker token never swept" vtr_is_stale_periodic "$LOOKUP_WP" "acc-new" +no "missing accessor never swept" vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new" +no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" "" +``` + +(`LOOKUP_OIDC` / `LOOKUP_WP` and the `ok`/`no`/`eq` helpers already exist in the file.) 
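
For readers without the test file open, the calling convention of the `ok`/`no`/`eq` helpers can be inferred from the block above: `ok` and `no` take a label followed by a command, and `eq` takes a label, an expected value, and an actual value. The following is a hypothetical reconstruction under that assumption. The real definitions in `scripts/test-vault-token-renew.sh` are not shown in this plan and may differ in names of counters and in output format.

```bash
#!/usr/bin/env bash
# Hypothetical reconstruction of the harness helpers (assumed, not copied).
pass=0
fail=0

# ok <label> <cmd...>: passes when the command exits 0.
ok() {
  local label="$1"; shift
  if "$@"; then pass=$((pass + 1))
  else fail=$((fail + 1)); printf 'FAIL: %s\n' "$label"; fi
}

# no <label> <cmd...>: passes when the command exits non-zero.
no() {
  local label="$1"; shift
  if "$@"; then fail=$((fail + 1)); printf 'FAIL: %s\n' "$label"
  else pass=$((pass + 1)); fi
}

# eq <label> <expected> <actual>: passes when the two strings are identical.
eq() {
  local label="$1" want="$2" got="$3"
  if [ "$want" = "$got" ]; then pass=$((pass + 1))
  else fail=$((fail + 1)); printf 'FAIL: %s (want %q, got %q)\n' "$label" "$want" "$got"; fi
}

ok "true exits 0" true
no "false exits non-zero" false
eq "strings match" "acc-new" "acc-new"
printf '%d passed, %d failed\n' "$pass" "$fail"   # prints: 3 passed, 0 failed
```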
+ +- [ ] **Step 2: Run tests, verify they fail** + +Run: `bash scripts/test-vault-token-renew.sh` +Expected: FAILs / `command not found` for `vtr_accessor` and `vtr_is_stale_periodic`; the 17 pre-existing tests stay green. + +### Task 2: Implement the pure functions (GREEN) + +**Files:** +- Modify: `scripts/vault-token-renew.sh` (insert after `vtr_drift_ok()`, before `vtr_main()`) + +- [ ] **Step 1: Add the two functions** + +```bash +# vtr_accessor -> the token accessor (empty if absent). +vtr_accessor() { + printf '%s' "$1" | jq -r '.data.accessor // ""' +} + +# vtr_is_stale_periodic -> 0 if this lookup +# describes one of OUR periodic tokens (display name matches) that is NOT the +# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise. +# Name-only on purpose (no policy check): anything named token-devvm-wizard +# that isn't the current token is garbage from a previous mint. An empty +# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know +# which token is current). +vtr_is_stale_periodic() { + local dn acc + [ -n "${2:-}" ] || return 1 + dn=$(vtr_display_name "$1") + acc=$(vtr_accessor "$1") + [ "$dn" = "$EXPECTED_DN" ] || return 1 + [ -n "$acc" ] || return 1 + [ "$acc" != "$2" ] +} +``` + +- [ ] **Step 2: Run tests, verify all pass** + +Run: `bash scripts/test-vault-token-renew.sh` +Expected: `25 passed, 0 failed`, exit 0. + +- [ ] **Step 3: Commit** + +```bash +cd ~/code/infra/.worktrees/vault-token-self-heal +git $GCFLAGS add scripts/vault-token-renew.sh scripts/test-vault-token-renew.sh +git $GCFLAGS commit -m "vault-token-renew: pure helpers for the self-heal revoke filter + +vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic +decides which old token-devvm-wizard tokens a heal may revoke (never the +just-minted one, never foreign tokens, nothing when the keeper is unknown). +TDD red-green for the heal branch that lands next." 
+``` + +### Task 3: The heal branch (`vtr_heal` + `vtr_main` wiring) + +**Files:** +- Modify: `scripts/vault-token-renew.sh` + +- [ ] **Step 1: Add `vtr_heal` after `vtr_is_stale_periodic()`, before `vtr_main()`** + +```bash +# vtr_heal -> 0 if ~/.vault-token was re-minted back to +# our periodic admin token using the foreign token's own authority, 1 if the +# heal was denied or failed (caller exits non-zero; the unit goes failed). +# +# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md): +# an OIDC login — which the infra docs prescribe before applies — clobbers +# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed +# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the +# clobbering token itself and let Vault's authz decide — a read-only clobber +# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud +# failure, because it signals a misbehaving flow that someone should look at. +vtr_heal() { + local foreign_dn="$1" log="$2" + local errf new_token new_info new_dn new_pols new_acc tmp + errf=$(mktemp) + if ! new_token=$(vault token create -orphan -period=768h \ + -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \ + -field=token 2>"$errf") || [ -z "$new_token" ]; then + printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \ + "$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log" + rm -f "$errf" + return 1 + fi + rm -f "$errf" + + # Sanity: the minted token must itself pass the drift guard before it may + # replace ~/.vault-token. + if ! 
new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then + printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \ + "$(date -Is)" "$new_info" >>"$log" + return 1 + fi + new_dn=$(vtr_display_name "$new_info") + new_pols=$(vtr_policies_csv "$new_info") + if ! vtr_drift_ok "$new_dn" "$new_pols"; then + printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \ + "$(date -Is)" "$new_dn" "$new_pols" >>"$log" + return 1 + fi + + # Atomic replace: mktemp files are 0600 from birth; same-filesystem mv. + tmp=$(mktemp "$HOME/.vault-token.XXXXXX") + printf '%s' "$new_token" >"$tmp" + mv "$tmp" "$HOME/.vault-token" + + # Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would + # otherwise strand the prior periodic ADMIN token server-side for up to 32d. + # The clobbering foreign token is deliberately NOT revoked: it may still back + # the user's live login session, and it ages out on its own (7d for OIDC). + local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0 + new_acc=$(vtr_accessor "$new_info") + if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then + while IFS= read -r a; do + [ -n "$a" ] || continue + a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue + if vtr_is_stale_periodic "$a_info" "$new_acc"; then + VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1)) + fi + done < <(printf '%s' "$accessors" | jq -r '.[]') + sweep="revoked $revoked stale periodic token(s)" + fi + + printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \ + "$(date -Is)" "$foreign_dn" "$sweep" >>"$log" +} +``` + +- [ ] **Step 2: Rewire the drift branch in `vtr_main`** + +Replace this exact block (comment + if): + +```bash + # Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token 
alive. + # On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token + # with a read-only woodpecker token, and this script then silently renewed THAT + # for two days — masking the loss of write access. So before renewing, confirm + # the token is our periodic admin token; if it has drifted, fail loudly (systemd + # marks the unit failed) instead of keeping someone else's token alive. + if ! vtr_drift_ok "$dn" "$pols"; then + printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \ + "$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log" + exit 1 + fi +``` + +with: + +```bash + # Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not + # keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was + # silently renewed for two days, masking lost write access). But detect-only + # drift proved worse in practice: an OIDC login — which the infra docs + # prescribe before applies — clobbers this file too, and the resulting DRIFT + # failures went unnoticed for weeks while access degraded to a 7-day token + # (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal): + # re-mint the periodic token with the clobbering token's own authority. + # Vault's authz keeps the old guarantee — a token that couldn't legitimately + # hold vault-admin is denied the mint, and we still fail loud. + if ! 
vtr_drift_ok "$dn" "$pols"; then + vtr_heal "$dn" "$log" || exit 1 + exit 0 + fi +``` + +- [ ] **Step 3: Syntax + lint + regression check** + +Run: `bash -n scripts/vault-token-renew.sh && bash scripts/test-vault-token-renew.sh; command -v shellcheck >/dev/null && shellcheck scripts/vault-token-renew.sh` +Expected: syntax OK, `25 passed, 0 failed`; shellcheck (if installed) reports nothing new. + +- [ ] **Step 4: Commit** + +```bash +git $GCFLAGS add scripts/vault-token-renew.sh +git $GCFLAGS commit -m "vault-token-renew: self-heal the periodic token on admin-capable clobber + +Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC +login the docs prescribe kept clobbering ~/.vault-token with a 7-day token, +and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry +loop, twice in June). On drift the renewer now re-mints the periodic token +with the clobbering token's own authority (Vault's 403 is the judge — no +policy guessing), sanity-checks it, replaces the file atomically, and +revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still +fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md" +``` + +### Task 4: Docs — runbook + test-file header + +**Files:** +- Modify: `docs/runbooks/vault-token-renew-devvm.md` (the `## Drift guard & recovery` section + the healthy-log-line note + `## Tests`) +- Modify: `scripts/test-vault-token-renew.sh` (header comment only) + +- [ ] **Step 1: Replace the runbook's `## Drift guard & recovery` section with:** + +```markdown +## Drift guard & self-heal + +`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login` +overwrites it. Two confirmed clobber vectors: + +1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer + can't push past the OIDC role's 7-day `token_max_ttl`). 
The infra docs + prescribe this login before applies, so it recurs — it went unnoticed for + weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires + weekly". +2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) → + writes a read-only `kubernetes-woodpecker-default` token (can read Vault but + **cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days. + +Since 2026-07-03 the renewer **self-heals** +(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token +it attempts the re-mint **with the clobbering token's own authority** and lets +Vault's authz decide: + +- **Admin-capable clobber (OIDC login)** → re-mints the periodic token, + sanity-checks it against the drift guard, atomically replaces + `~/.vault-token`, revokes stale `token-devvm-wizard` leftovers + (anti-sprawl), logs + `HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))` + and exits 0. The clobbering token is NOT revoked — it may still back a live + login session; it ages out on its own. +- **Weak clobber (read-only k8s token)** → the mint is denied; logs + `DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it` + and exits non-zero (unit `failed`). Deliberately loud: this signals a + misbehaving agent flow — exactly the 2026-06-05 case. + +**Manual recovery** is only needed for the weak-clobber case (the DRIFT log +line still contains the exact command) — run the +[mint/re-mint](#mint--re-mint-the-token) block. +``` + +- [ ] **Step 2: In the runbook's `## Health check` section**, after the "A healthy log line looks like…" sentence, add: + +```markdown +After an OIDC login you'll instead see, at the next nightly run: +` HEALED: re-minted periodic token from foreign dn="oidc-…" (revoked N stale periodic token(s))` — that's the self-heal working as designed. 
+``` + +- [ ] **Step 3: In the runbook's `## Tests` section**, replace the first sentence with: + +```markdown +`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision, +the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber +case), and the self-heal's revoke filter (which stale periodic tokens a heal +may sweep). +``` + +- [ ] **Step 4: Update the test file's header comment** (lines 2–7) to: + +```bash +# Unit tests for the pure functions in vault-token-renew.sh. +# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard +# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign +# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker +# clobber be silently renewed for two days, and (b) the self-heal's revoke +# filter — which stale token-devvm-wizard tokens a heal may sweep. +# Run: bash infra/scripts/test-vault-token-renew.sh +``` + +- [ ] **Step 5: Run tests once more, then commit** + +Run: `bash scripts/test-vault-token-renew.sh` +Expected: `25 passed, 0 failed`. + +```bash +git $GCFLAGS add docs/runbooks/vault-token-renew-devvm.md scripts/test-vault-token-renew.sh +git $GCFLAGS commit -m "vault-token-renew runbook: document the self-heal behavior + +Drift guard section rewritten: admin-capable clobbers now self-heal at the +nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure; +manual re-mint is only the weak-clobber recovery now." +``` + +### Task 5: Deploy + live verification (on devvm, as wizard) + +**Files:** none (host deploy + live checks) + +- [ ] **Step 1: Install from the worktree** + +```bash +install -m 0755 ~/code/infra/.worktrees/vault-token-self-heal/scripts/vault-token-renew.sh ~/.local/bin/vault-token-renew +``` + +(Units unchanged — no `daemon-reload` needed.) 
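
Because the deploy is a manual copy, a quick check that the installed file really matches the worktree source can save a confusing live test. This is an optional sketch, not a step the plan requires, and `verify_install` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# verify_install <source> <installed>: hypothetical post-install check.
# Prints OK when the installed copy is byte-identical to the source and
# executable; otherwise prints the reason and returns non-zero.
verify_install() {
  local src="$1" dst="$2"
  cmp -s "$src" "$dst" || { echo "MISMATCH"; return 1; }
  [ -x "$dst" ] || { echo "NOT_EXECUTABLE"; return 1; }
  echo "OK"
}

# Self-contained demo against throwaway files (no real paths touched).
demo_dir=$(mktemp -d)
printf '#!/usr/bin/env bash\necho hi\n' > "$demo_dir/src.sh"
install -m 0755 "$demo_dir/src.sh" "$demo_dir/installed"
verify_install "$demo_dir/src.sh" "$demo_dir/installed"   # prints: OK
```

On devvm the call would take the two paths from the `install` command above as its arguments.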
+ +- [ ] **Step 2: Live case 1 — admin-capable clobber heals** + +```bash +export VAULT_ADDR=https://vault.viktorbarzin.me +export XDG_RUNTIME_DIR=/run/user/$(id -u) +FAKE_ADMIN=$(vault token create -ttl=1h -policy=vault-admin -policy=sops-admin -display-name=fake-oidc -field=token) +printf '%s' "$FAKE_ADMIN" > ~/.vault-token +systemctl --user start vault-token-renew.service; echo "exit=$?" +tail -1 ~/.local/state/vault-token-renew.log +vault token lookup | grep -E 'display_name|period' +``` + +Expected: `exit=0`; log line `HEALED: re-minted periodic token from foreign dn="token-fake-oidc" (revoked N stale periodic token(s))` with N ≥ 1 (the pre-clobber periodic token is itself swept as stale — by design — along with any strays from the June 26 / July 3 manual re-mints); lookup shows `display_name token-devvm-wizard`, `period 768h`. Note: `FAKE_ADMIN` is a child of the swept old token, so the cascade revokes it too — no cleanup needed. + +- [ ] **Step 3: Verify exactly ONE periodic token remains server-side** + +```bash +for a in $(vault list -format=json auth/token/accessors | jq -r '.[]'); do + vault token lookup -format=json -accessor "$a" 2>/dev/null \ + | jq -r 'select(.data.display_name=="token-devvm-wizard") | .data.accessor' +done +``` + +Expected: exactly one line, matching `vault token lookup -format=json | jq -r .data.accessor`. + +- [ ] **Step 4: Live case 2 — weak clobber stays a loud failure** + +```bash +GOOD=$(cat ~/.vault-token) +FAKE_WEAK=$(vault token create -ttl=10m -policy=default -display-name=fake-weak -field=token) +printf '%s' "$FAKE_WEAK" > ~/.vault-token +systemctl --user start vault-token-renew.service; echo "exit=$?" 
+systemctl --user is-failed vault-token-renew.service +tail -1 ~/.local/state/vault-token-renew.log +printf '%s' "$GOOD" > ~/.vault-token && chmod 600 ~/.vault-token +vault token revoke "$FAKE_WEAK" >/dev/null +``` + +Expected: `exit=1` (start reports the oneshot failure), `is-failed` prints `failed`, log line `DRIFT: ~/.vault-token is dn="token-fake-weak" — heal denied, foreign token lacks create authority (… permission denied …); investigate what wrote it. Manual re-mint: …`. + +- [ ] **Step 5: Happy path still green** + +```bash +systemctl --user start vault-token-renew.service; echo "exit=$?" +tail -1 ~/.local/state/vault-token-renew.log +``` + +Expected: `exit=0`, log `OK renewed (dn=token-devvm-wizard ttl=2764800s)`. + +### Task 6: Land on master + cleanup + +- [ ] **Step 1: Merge latest master into the branch, re-verify, push** + +```bash +cd ~/code/infra/.worktrees/vault-token-self-heal +git $GCFLAGS fetch forgejo +git $GCFLAGS merge forgejo/master +bash scripts/test-vault-token-renew.sh +git $GCFLAGS push forgejo HEAD:master +``` + +Expected: clean merge (or already up to date), `25 passed, 0 failed`, push accepted. Non-fast-forward → fetch, merge, push again. + +- [ ] **Step 2: Watch CI to completion** + +The push fires the infra Woodpecker `default.yml` (terragrunt apply for changed stacks). This change touches only `scripts/` + `docs/` → expect a fast success / no-op apply. Check (Forgejo-forge infra repo = Woodpecker repo id 82): + +```bash +export VAULT_ADDR=https://vault.viktorbarzin.me +vault kv get -format=json secret/ci/global | jq -r '.data.data | keys[]' # find the woodpecker admin token key +WP_TOKEN=$(vault kv get -field= secret/ci/global) +curl -s -H "Authorization: Bearer $WP_TOKEN" 'https://ci.viktorbarzin.me/api/repos/82/pipelines?perPage=1' | jq '.[0] | {number, status, commit: .commit[0:8]}' +``` + +Expected: the pipeline for the pushed commit reaches `status: "success"` (poll until terminal). If it fails, fix before proceeding. 
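
"Poll until terminal" can be sketched as a small loop around any command that prints the current pipeline status. Everything here is an assumption for illustration: `poll_until_terminal` is a hypothetical helper, and the set of status strings treated as terminal (`success`, `failure`, `error`, `killed`, `declined`) should be checked against the Woodpecker version in use.

```bash
#!/usr/bin/env bash
# poll_until_terminal <interval-seconds> <max-tries> <status-cmd...>
# Runs the status command repeatedly. Prints the final status and returns
# 0 on success, 1 on a failed terminal status, 2 when tries run out.
poll_until_terminal() {
  local interval="$1" tries="$2" status
  shift 2
  while [ "$tries" -gt 0 ]; do
    status=$("$@") || status=""
    case "$status" in
      success) echo "$status"; return 0 ;;
      failure|error|killed|declined) echo "$status"; return 1 ;;
    esac
    tries=$((tries - 1))
    sleep "$interval"
  done
  echo "timeout"
  return 2
}

# Demo with a stub that reports "running" twice, then "success".
# State lives in a file because the stub runs in a command substitution.
state=$(mktemp); echo 0 > "$state"
stub_status() {
  local n; n=$(cat "$state"); echo $((n + 1)) > "$state"
  if [ "$n" -lt 2 ]; then echo running; else echo success; fi
}
poll_until_terminal 0 5 stub_status   # prints: success
```

In real use the status command would wrap the `curl` call above with `jq -r '.[0].status'`, and the interval would be several seconds rather than 0.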
+ +- [ ] **Step 3: Remove worktree + branch, reconcile main checkout** + +```bash +git -C ~/code/infra $GCFLAGS worktree remove .worktrees/vault-token-self-heal +git -C ~/code/infra $GCFLAGS branch -d wizard/vault-token-self-heal +git -C ~/code/infra status --porcelain # expect clean before pulling +git -C ~/code/infra $GCFLAGS pull --ff-only forgejo master +``` + +Expected: worktree gone, branch deleted (already merged), main checkout fast-forwards to the landed commit. + +### Task 7: Memory + wrap-up + +- [ ] **Step 1: Update the stale memories** (they say the drift guard is detect-only / recovery is manual): + +```bash +homelab memory recall "vault periodic token renewer drift" # confirm ids 4204, 4211, 7121 still say detect-only +homelab memory update 4211 "" +homelab memory update 7121 "" +``` + +(Fetch each memory's current text first and preserve it — amend, don't replace wholesale.) + +- [ ] **Step 2: End-of-task extraction** — dispatch the standard M.3 memory-mining subagent per `~/.claude/rules/execution.md`, then give the final summary. + +--- + +## Plan self-review (done at write time) + +- **Spec coverage**: heal-on-admin-clobber (T3), loud-fail-on-weak (T3 + live T5.4), no-revoke-foreign (T3 comment + design decision 4), anti-sprawl sweep + fail-safe filter (T2/T3, live T5.3), minted-token sanity + atomic write (T3), unit tests (T1/T2), runbook (T4), deploy + live sim (T5), memory updates (T7). ✓ +- **Placeholders**: `` in T6.2 is a deliberate discovery step (key name verified live from Vault, not invented). No other TBDs. ✓ +- **Name consistency**: `vtr_accessor`, `vtr_is_stale_periodic`, `vtr_heal`, `EXPECTED_DN` match across tasks; test count 17→25 consistent (8 new cases). 
✓ diff --git a/docs/plans/2026-07-04-backup-mx-design.md b/docs/plans/2026-07-04-backup-mx-design.md new file mode 100644 index 00000000..fe54af61 --- /dev/null +++ b/docs/plans/2026-07-04-backup-mx-design.md @@ -0,0 +1,335 @@ +# Backup MX — self-hosted store-and-forward relay on Oracle Always-Free — design + +Date: 2026-07-04 (v3 — post-challenge; v2 Oracle pivot same day) · Status: design, +pre-implementation · ADR: [0019](../adr/0019-backup-mx-self-hosted-oracle-relay.md) + +v3 incorporates two independent adversarial-challenge reviews (same day). Their +material corrections are marked **[CH]** throughout — the largest: the v2 drain +path would never have drained (primary-side smtpd rejects), monitoring-over- +tailnet was fiction (no cluster→tailnet route exists), and the VM's bounce +model was wrong (it can never deliver a DSN). + +## Goal + +Inbound mail for `viktorbarzin.me` must survive homelab outages without loss. +Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is +acceptable; budget is $0** (hard constraint — reaffirmed after the Rollernet +gates failed). A store-and-forward backup MX queues mail while the homelab is +down and re-delivers when it returns. + +Out of scope, explicitly: + +- Reading new mail *during* an outage. +- Outbound mail during outages. +- The "primary up but hard-bouncing 5xx" misconfig class — a backup MX is + never consulted when the primary answers. Separate hardening/alerting track. + +Known residual limit (state it plainly): an outage **longer than 30 days** +loses the queued mail *silently* — the VM cannot emit a bounce to anyone +(egress 25 blocked), so no sender ever learns. Accepted; 30 days is already +6× the sender-retry status quo. + +## v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04) + +v1 selected Roller Network's free Secondary MX. 
The validation gates killed it +before any DNS change: + +- **G2 FAILED**: the [free-accounts policy](https://rollernet.us/policy/free-accounts.html) + caps free mail service at **200 relayed messages or 10 MB per rolling 7 + days**; overage → domain suspended **48 h answering SMTP 5xx** (permanent + bounces), repeatable. Spammers deliberately target backup MXes even while + the primary is up, so background spam alone can hold the domain suspended — + worse than no backup MX. +- **G1 SHAKY**: same policy page says free accounts are being discontinued. +- **G3 PASSED** (for posterity): `mail{,2}.rollernet.us` present valid LE + certs over STARTTLS. +- Signup is Cloudflare-Turnstile-gated — moot given G1/G2. + +Viktor's decision: stay free → self-host on Oracle Always-Free. **[CH]** The +external challenger re-searched the free landscape (DNSExit, KisoLabs, +DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed: +no credible free managed backup-MX or free VM with a usable port-25 story +exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and +is US-regions-only (wrong continent). + +## Decision + +A minimal **Postfix store-and-forward relay** (`mx2.viktorbarzin.me`) on an +Oracle Cloud **Always-Free** compute instance, published as a lower-preference +MX. It accepts mail for `viktorbarzin.me` when the primary is unreachable, +queues up to 30 days, and drains to the primary when it returns. No mailboxes, +no third-party terms — the queue-lifetime and reject-behavior knobs are ours. 
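
"Lower-preference MX" means a higher preference number: senders try the lowest number first and fall back to the next only when it is unreachable. The sketch below orders MX records the way a sender would. `mx_try_order` is a hypothetical helper, and the input lines are assumed to have the shape of `dig +short MX` output, using the preference values 1 and 20 from the Architecture section that follows.

```bash
#!/usr/bin/env bash
# mx_try_order: reads "<preference> <host>" lines on stdin and prints the
# hosts in the order a sending MTA tries them (lowest preference first).
# The trailing dot of a fully qualified name is stripped for readability.
mx_try_order() {
  sort -n -k1,1 | awk '{ sub(/\.$/, "", $2); print $2 }'
}

# Sample input in the assumed shape of: dig +short MX viktorbarzin.me
printf '%s\n' '20 mx2.viktorbarzin.me.' '1 mail.viktorbarzin.me.' | mx_try_order
# prints:
#   mail.viktorbarzin.me
#   mx2.viktorbarzin.me
```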
+ +## Architecture + +``` + ┌── pri 1 mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod +sender MTA ──► MX lookup ┤ ▲ + └── pri 20 mx2.viktorbarzin.me │ drain: smtp to + (Oracle VM, Postfix relay, │ mail.viktorbarzin.me:2526 + queue ≤ 30 days) ───────────────────┘ (pfSense WAN NAT rdr + 2526 → 10.0.20.1:25, + existing HAProxy frontend) +``` + +- **Normal operation**: senders use pri 1; the VM idles (spammers targeting + the backup + transient-blip retries get relayed onward immediately). +- **Outage**: senders fall back to pri 20 → VM accepts + queues → Postfix + retries the primary on its native schedule → queue drains after recovery + through the standard external ingress path (PROXY v2 → :2525 → rspamd → + Dovecot). +- **Custom drain port**: Oracle blocks **egress TCP 25** tenancy-wide + (post-2021; exemptions unreliable) — the VM cannot reach + `mail.viktorbarzin.me:25`. One pfSense WAN NAT rule `TCP 2526 → + 10.0.20.1:25` reuses the existing HAProxy frontend unchanged. **[CH] + Verified against the runbook**: the frontend binds `*:25` on pfSense (not + strictly 10.0.20.1), rdr dst-port rewrite is the existing production + pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides + with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 **to** + the VM is unaffected by Oracle's egress-only block per practitioner + evidence (iRedMail/mailcow on OCI: receive works, send doesn't) — **to be + proven at gate O2 before any DNS change** (Oracle publishes no positive + commitment). + +## Oracle account & instance + +- **Account**: Viktor creates it (human signup; card for identity, $0 + charged). **Home region is fixed at signup and Always-Free compute exists + only there — choose `eu-frankfurt-1` deliberately; there is no + try-another-region fallback without a new account. 
[CH]** +- **[CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation**: + Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days — an + idle Postfix box qualifies) and demonstrably changes free-tier terms without + notice, enforcing by termination (June 2026: A1 allowance silently halved, + over-limit instances shut down). PAYG keeps Always-Free resources free and + exempts them from idle reclamation. +- **Shape**: `VM.Standard.E2.1.Micro` (x86, 1/8 OCPU burst, 1 GB RAM; 2 + always-free instances allowed; ample for queue-only Postfix — and untouched + by the 2026 A1 cuts). ARM A1 fallback is **unreliable** (halved quota, + chronic Frankfurt capacity) — treat E2.1.Micro availability as the gate. +- **[CH] Reserved public IP is mandatory** (`oci_core_public_ip`, reserved): + an ephemeral IP rotates on stop/start and would silently break all four + IP-keyed controls at once (pfSense NAT source-restriction, the primary's + smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape + allowlist) — discovered only at the next outage's drain. +- **OS**: Ubuntu 24.04. **[CH] OCI Ubuntu images ship an OS-level iptables + ruleset (`/etc/iptables/rules.v4`) that ACCEPTs 22 and REJECTs everything + else, independent of security lists** — cloud-init must insert ACCEPT rules + for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2 + fails on day 1 with a correct security list. +- **Credentials**: OCI API key for Terraform → Vault `secret/viktor` + (`oci_*`); web login → Vaultwarden item `Oracle Cloud (backup MX)`. + +## Networking & security posture + +- **Ingress on the VM**: TCP 25 world-open (the service). **[CH] TCP 80 + world-open permanently** — Let's Encrypt validation is multi-perspective + with no published source IPs, so it cannot be source-scoped, and an + "open-only-during-renewal" toggle is unspecified automation whose realistic + failure mode is an expired cert at day ~90.
Nothing listens on 80 outside + certbot's seconds-long renewal windows; connection-refused surface is + negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32 + (176.12.22.76) in both the Oracle security list and the VM firewall. +- **No public SSH**: management rides the headscale tailnet — cloud-init + enrolls via a **preauth key for a dedicated non-OIDC headscale user** with + node tag `tag:backup-mx` (headscale 0.28.0 file-mode ACL, content in Vault + `secret/headscale` → `headscale_acl`); SSH bound to the tailnet interface. + ACL grant: `group:admin → tag:backup-mx:22` (cluster pods are NOT tailnet + members — see monitoring). **[CH] Outage caveat**: headscale's control + plane + DERP live in the cluster, so mid-outage tailnet reachability is + cached-netmap best-effort — the runbook documents the **OCI instance + console connection as break-glass** management. (Also fix `vpn.md`'s stale + "0.23.x / OIDC-only" claims while in there.) +- **VM compromise blast radius**: plaintext of outage-queued mail + a relay + surface contained by `relay_domains = viktorbarzin.me` only, no submission + ports, no SASL, no local delivery. The VM is deliberately NOT added to the + primary's `mynetworks` (that would let a compromised VM relay arbitrary + mail *through* the primary) — per-stage exemptions instead, below. + +## Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene) + +- `relay_domains = viktorbarzin.me`; `mydestination =` (empty). +- **[CH]** `smtpd_relay_restrictions = permit_mynetworks, + reject_unauth_destination` — explicit 5xx for foreign-domain RCPTs (the + default tail is `defer_unauth_destination`, whose 4xx invites every relay + probe to retry forever). +- **[CH]** `relay_recipient_maps` explicitly set to the wildcard form + (`@viktorbarzin.me OK`) — documents accept-all-recipients as a decision + (the domain is catch-all; every RCPT is valid by definition). 
+- `transport_maps`: `viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526`. +- `maximal_queue_lifetime = 30d`. **[CH]** `bounce_queue_lifetime = 1d` and + `delay_warning_time = 0` — this host can never deliver a DSN to anyone + (egress 25 blocked; its only egress is 2526 to the primary), so undeliverable + bounces must be discarded quickly or they rot in the queue for a month and + permanently poison the queue-depth alert. +- **[CH]** `message_size_limit = 209715200` — exactly the primary's 200 MB + (`POSTFIX_MESSAGE_SIZE_LIMIT`, mailserver main.tf:88). The stock 10 MB + default would 552-reject large legitimate mail during outages — the exact + loss mode this project exists to prevent. Equal, never higher (higher + recreates drain-time rejects). +- **[CH] postscreen on the VM in 4xx-only posture**: pregreet test ON + (fire-and-forget bots don't retry; real MTAs do — the whole design already + rests on sender retry, so 4xx filtering is loss-free by construction), + optionally `postscreen_dnsbl_action = defer` with a conservative threshold. + v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly banned) + with 4xx tempfail (harmless); without any hygiene the backup is a 24/7 + spam backdoor since spammers deliberately deliver to the highest-numbered + MX. Zero 5xx from reputation, ever. +- `inet_protocols = ipv4` **[CH]** — the primary publishes an AAAA (HE + tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted + v6 attempt per delivery. +- `smtpd_tls_cert_file` = LE cert for `mx2.viktorbarzin.me` (opportunistic + STARTTLS inbound; `smtp_tls_security_level = may` on the drain leg). +- Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day + accumulation for a personal domain. + +## TLS + +certbot standalone HTTP-01 for `mx2.viktorbarzin.me` (no Cloudflare API token +on an internet-facing VM). Port 80 permanently open (see above); certbot renew +timer. 
The MTA-STS follow-up (separate task; policy host currently dangling — +below) must list `mx2.viktorbarzin.me` when implemented. + +## Primary-side drain enablement **[CH — this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]** + +The v2 exemptions targeted postscreen DNSBL (which is **off** on the primary — +`ENABLE_DNSBL` unset) and rspamd SPF/DMARC scoring — but missed the three +mechanisms that would actually break the drain. All are keyed on the VM's +reserved /32 (the PROXY-v2-recovered client IP): + +1. **`reject_unknown_client_hostname` bypass** — the primary sets + `POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1` (main.tf:89); an Oracle IP + without full FCrDNS (PTR needs an Oracle SR; limited on free accounts) + would be **450-deferred on every drain attempt → the queue never drains → + mass-bounces at day 30**. Fix: `check_client_access` permit for the VM /32 + early in `smtpd_client_restrictions`, and a matching permit at the sender + stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope + senders — drained self-addressed/bounced mail would 5xx). Attempt the + Oracle PTR anyway (belt and braces). +2. **Anvil rate-limit exception** — `smtpd_client_message_rate_limit = 30`/min + keys on the VM's IP at drain; a >3,600-message backlog would throttle for + hours and false-fire the queue alert. Add the VM /32 to + `smtpd_client_event_limit_exceptions`. +3. 
**rspamd: evaluate the original sender, never 5xx the drain stream** — via + the existing override.d ConfigMap pattern (same mount as + `dkim_signing.conf`): (a) configure rspamd's **`external_relay`** module + (ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the + *original* client IP parsed from the VM's Received header — this keeps + DMARC protection for the entire drain stream instead of v2's blanket + disable; (b) cap rspamd's **action at the VM /32 to tag/fold — never + milter-reject**: the primary's default reject tier (DMS default, active + since only dkim_signing is overridden today) would 5xx high-score spam at + DATA, forcing the VM to generate DSNs to forged senders = classic + backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in + the catch-all's Junk instead. Validate the external_relay ↔ settings-rule + interplay at gate O5 with a high-spam-score message. +4. postscreen permit for the /32 (harmless; pregreet never trips a real + Postfix client and DNSBL is off — kept for future-proofing only). + +## Our-side changes (Terraform unless noted) + +1. **New stack `stacks/backup-mx/`** (Tier 1): OCI provider (creds from + Vault), VCN + subnet + security list + **reserved public IP** + + `VM.Standard.E2.1.Micro` + cloud-init (`templatefile`): **OS iptables + ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule + (persisted)**, postfix + config above, certbot, tailscale→headscale + enrollment (preauth key from Vault), node_exporter, postfix_exporter, + unattended-upgrades. +2. **DNS** — `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: A + `mx2.viktorbarzin.me` → reserved IP (non-proxied), MX pref 20 → `mx2`. + **[CH] Live zone count verified: 195/200 → 197/200 after this change; only + 3 slots remain and the MTA-STS follow-up needs 1–2 → plan the next + record-purge now, not at collision time.** +3. 
**pfSense (live network device — approved as part of this plan)**: WAN NAT + rdr `TCP 2526 → 10.0.20.1:25` + firewall rule, source-restricted to the + reserved IP. **[CH] Scripted** (extend the existing + `scripts/pfsense-*-haproxy*.php` bootstrap-script family), not + hand-clicked — keeps the git-rebuildable parity the rest of the pfSense + mail config has. Config.xml rides the nightly backup. +4. **Mailserver stack**: the four-layer drain enablement above (client+sender + `check_client_access` permits, anvil exception, rspamd external_relay + + action cap, postscreen permit) — all keyed to one /32, via the existing + `postfix_cf` / `user-patches.sh` / rspamd-override hook points (verified + present: main.tf:129-144, 222-281, 467-474). +5. **Monitoring [CH — replaces v2's tailnet scraping, which had no transport: + no cluster→tailnet route exists and no existing target is scraped that + way]**: Prometheus scrapes `node_exporter`/`postfix_exporter` on the VM's + **public reserved IP**, allowed only from the homelab WAN /32 (Oracle SL + + VM firewall); blackbox TCP:25 from the cluster (`BackupMxDown`, warning); + MX-set drift assertion (both MX records present). Alerts: + `BackupMxQueueStuck` = **non-bounce** queue depth > 0 for 2 h while the + primary is healthy (gate on the existing `MailServerDown`/roundtrip + series, machine-readable — not prose); bounce residue is excluded by the + 1-day bounce lifetime. Note: during a full homelab outage Prometheus + itself is down — queue growth is unobservable live under ANY transport; + what we actually watch is the post-recovery drain. A WAN-IP change stales + the Oracle allowlist → visible as ScrapeTargetDown (self-signaling). + **Probe semantics note**: once mx2 exists, the Brevo roundtrip probe's + mail fails over to mx2 on transient primary blips and arrives minutes late + via the drain — `EmailRoundtripFailing` may then mean "delayed via mx2", + not "lost"; note in the alert description and runbook. +6. 
**Docs (same commit as implementation)**: rewrite `mailserver.md` §"No + Backup MX", new runbook `docs/runbooks/backup-mx.md` (`postqueue -p`, + forced drain `postqueue -f`, cert renewal, **OCI console break-glass**, VM + rebuild from stack, Oracle account facts incl. PAYG + home-region lock), + `vpn.md` headscale-version/OIDC staleness fix, monitoring rows. + +### MTA-STS finding (unchanged; no action in this change) + +`_mta-sts` TXT is published but `mta-sts.viktorbarzin.me` has no record and +nothing serves the policy — MTA-STS is inert today. When fixed, the policy +MUST include `mx: mx2.viktorbarzin.me` (and budget its DNS records against the +3 remaining zone slots). + +## Validation gates (in order; any failure → stop and report) + +| # | Gate | Method | Failure handling | +|---|------|--------|------------------| +| O1 | Oracle account (home region `eu-frankfurt-1`, **fixed forever at signup**), **PAYG conversion done**, E2.1.Micro capacity | Viktor signs up + converts; TF apply | A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor | +| O2 | Inbound TCP 25 reachable from the internet (after the OS-iptables fix) | `nc -zv 25` from outside + recurring Uptime-Kuma TCP monitor (keeps proving it — Oracle publishes no commitment) | Stop; decision returns to Viktor | +| O3 | Drain works: VM → `mail.viktorbarzin.me:2526` delivers end-to-end | Test message injected on the VM | Debug pfSense NAT / HAProxy path | +| O4 | LE cert issued | certbot standalone | STARTTLS is opportunistic — non-blocking for go-live; fix before MTA-STS | +| O5 | Live failover test — **hardened [CH]** | presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo **plus a high-spam-score message and a >10 MB message** → confirm queued (`postqueue -p`) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on 
the VM, roundtrip probe recovers | Debug or roll back (remove MX record) | + +## Failure modes + +Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP +changes, short-retry senders. If pfSense is down the drain waits — Postfix +retries until it heals. + +Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox +access; **outages > 30 days lose queued mail silently (no DSN possible)**. +Simultaneous Oracle+homelab outage = status quo ante (sender retries). + +Newly introduced, accepted: + +- **A pet outside the cluster** — deliberately cattle: rebuilt from TF + + cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a + backup target. +- **Oracle free-tier caprice [CH — upgraded from v2's framing]**: Oracle has + silently cut Always-Free allowances and terminated over-limit instances + (June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe, + `BackupMxDown`, and the fact that outside an active outage the queue is + empty — a surprise reclamation loses nothing, only coverage until rebuilt. + Rollernet Basic ($30/yr) stays the documented fallback if OCI sours. +- **Spam hygiene**: 4xx-only postscreen on the VM (pregreet + conservative + DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by + rspamd, never bounced. +- Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant; + accepted). + +## Rollback + +Remove the MX + A records; wait for `postqueue -p` empty; `terraform destroy` +on `backup-mx`; delete the pfSense NAT rule (scripted); drop the mailserver +/32 exemptions. Order matters: MX record first. + +## Viktor's manual steps (everything else is mine) + +1. Create the Oracle Cloud account — **home region `eu-frankfurt-1`** (fixed + forever), card for identity, $0 charged. +2. **Convert the tenancy to Pay-As-You-Go** (required — idle-reclamation + exemption; Always-Free stays $0). +3. 
Hand me the tenancy OCID + a console user → I mint the API key, store + creds (Vault + Vaultwarden), and build the stack. +4. Approve the (scripted) pfSense NAT rule when I reach that step. diff --git a/docs/plans/2026-07-04-drone-logbook-design.md b/docs/plans/2026-07-04-drone-logbook-design.md new file mode 100644 index 00000000..78e3b469 --- /dev/null +++ b/docs/plans/2026-07-04-drone-logbook-design.md @@ -0,0 +1,89 @@ +# Drone Logbook (Open DroneLog) — Design + +**Date:** 2026-07-04 +**Status:** Approved (Viktor, 2026-07-04) +**Owner request:** "I have a DJI Mini 4 Pro. I'm interested in github.com/ViktorBarzin/drone-logbook" → self-host it in the cluster. + +## Goal + +Self-host [Open DroneLog](https://github.com/arpanghosh8453/open-dronelog) (upstream of the +`ViktorBarzin/drone-logbook` fork) at **https://dronelog.viktorbarzin.me** so Viktor can import +DJI Fly flight logs from his DJI Mini 4 Pro and analyze them privately: telemetry charts, 3D map +replay, per-flight and lifetime stats. All data stays in the cluster (single DuckDB database). + +## Decisions (interview, 2026-07-04) + +| Question | Decision | +|---|---| +| Deployment form | Self-hosted Docker web app in k8s (not desktop app, not hosted webapp) | +| Exposure | Public `dronelog.viktorbarzin.me`, **Authentik forward-auth** (`auth = "required"`) | +| Log ingestion | **Both** manual web upload *and* a server-side auto-import drop folder from day one | +| Image source | **Upstream** `ghcr.io/arpanghosh8453/open-dronelog:latest` — NOT the fork | +| Fork disposition | Fork is 0 ahead / 372 behind, adds nothing; delete or park it. 
Only revive (sync + ADR-0002 GHA build) if Viktor starts modifying the code | + +## Architecture + +New Tier-1 stack `stacks/drone-logbook/`, modeled line-by-line on `stacks/freshrss/` +(the closest existing shape: single upstream-image app, own data volume, Keel-updated): + +- **Namespace** `drone-logbook`, tier `4-aux`, label `keel.sh/enrolled=true` → Kyverno injects + Keel poll annotations → auto-upgrades as upstream releases (project is actively maintained). +- **Deployment** (1 replica, `Recreate` — DuckDB is single-writer/embedded): + - image `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx frontend + Axum REST backend, port 80) + - memory request=limit **512Mi** (DuckDB import/analytics spikes), cpu request 25m, no cpu limit + - standard `KYVERNO_LIFECYCLE_V1` / `KEEL_IGNORE_IMAGE` / `KEEL_LIFECYCLE_V1` lifecycle ignores +- **App data** `/data/drone-logbook` (DuckDB db, cached DJI decryption keys, uploaded originals): + **`proxmox-lvm-encrypted` block PVC** `drone-logbook-data-encrypted`, 2Gi, topolvm autoresize → + 10Gi ceiling. Encrypted class because flight logs are GPS traces of home/travel — sensitive data + defaults to `proxmox-lvm-encrypted` per the storage decision rule (`.claude/CLAUDE.md`). + Embedded DBs stay off NFS (same rationale documented in the freshrss stack: NFS only for static files). +- **Backup CronJob** `drone-logbook-backup` (mandatory for every proxmox-lvm app): daily 01:30 + file copy of the data volume → NFS `/srv/nfs/drone-logbook-backup` (dated dirs, 30-day retention, + Pushgateway metrics), pod-affinity co-scheduled with the app pod (RWO volume). 01:30 sits outside + the 00:00/08:00/16:00 sync-import windows so the DuckDB file is quiescent; retained upload + originals make even a torn copy recoverable by re-import. `nfs-mirror` (02:00) ships it to sda → + Synology offsite. Vaultwarden pattern. 
+- **Sync drop folder**: static NFS volume (`modules/kubernetes/nfs_volume`) + `192.168.1.127:/srv/nfs/drone-logbook/sync-logs`, mounted **read-only** at `/sync-logs`; + `SYNC_LOGS_PATH=/sync-logs`, `SYNC_INTERVAL="0 0 */8 * * *"` (every 8 h). + Any producer (Nextcloud sync, scp, a future phone pipeline) drops `.txt` logs there; the app + imports them automatically. `KEEP_UPLOADED_FILES=true` keeps re-importable originals in the PVC. +- **Ingress** via `ingress_factory`: `name = "dronelog"`, `auth = "required"` (Authentik + forward-auth), `dns_type = "proxied"`. External Uptime Kuma HTTPS monitor comes automatically + with the ingress annotation. Homepage tile (group "Media & Entertainment", icon `mdi-quadcopter`). +- **Secrets**: Vault KV `secret/drone-logbook` (`profile_creation_pass`) → ExternalSecret + (`vault-kv` ClusterSecretStore) → k8s secret `drone-logbook-secrets` → env + `PROFILE_CREATION_PASS`. Gates profile create/delete even for other Authentik-logged-in users. + No plan-time secret reads needed (no `data "kubernetes_secret"`). + No `DJI_API_KEY` — bundled default is fine at personal import volume; add later if rate-limited. + +## Operational notes + +- **DJI egress dependency**: importing a *new* log file requires the pod to reach DJI's servers + once (flight-log decryption key fetch; keys are then cached in the data dir). Remember this when + egress enforcement lands (Security wave 1, beads `code-8ywc`). +- The web UI is desktop-first; mobile is functional but basic. +- NFS host prerequisite: `/srv/nfs/drone-logbook/sync-logs` (root:www-data, 2775 — same shape as + sibling dirs) and `/srv/nfs/drone-logbook-backup` created on 192.168.1.127 and recorded in + `secrets/nfs_directories.txt`. `/srv/nfs` is exported whole-tree, so no `/etc/exports` + (`scripts/pve-nfs-exports`) change. 
+- Backup story = the daily app-level backup CronJob (above) + the host `daily-backup` LVM-snapshot + leg + original log files retained both in the drop folder and in the data volume + (`KEEP_UPLOADED_FILES=true`). + +## Alternatives considered + +- **Build from the fork** (`ghcr.io/viktorbarzin/...` via GHA, ADR-0002): rejected for now — fork + has zero custom commits; a build chain adds maintenance for no benefit. Revisit if code changes + are wanted. +- **`auth = "app"` + app profile passwords** (would enable the `opendronelog-sync` native uploader + from anywhere): rejected — a single app password guarding GPS traces of home/travel on the open + internet is weaker than Authentik; the sync drop folder covers automated ingestion instead. +- **Internal-only (.lan + VPN)**: rejected — Authentik-gated public matches the rest of the + homelab and works without VPN while traveling. +- **NFS for the DuckDB data**: rejected — embedded-DB-on-NFS locking risk; freshrss precedent + keeps app DB data on proxmox-lvm. + +## Implementation + +See `2026-07-04-drone-logbook-plan.md`. diff --git a/docs/plans/2026-07-04-drone-logbook-plan.md b/docs/plans/2026-07-04-drone-logbook-plan.md new file mode 100644 index 00000000..588c7ab1 --- /dev/null +++ b/docs/plans/2026-07-04-drone-logbook-plan.md @@ -0,0 +1,542 @@ +# Drone Logbook (Open DroneLog) Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Deploy Open DroneLog (DJI flight-log analyzer) at https://dronelog.viktorbarzin.me — new Tier-1 stack `stacks/drone-logbook/`, upstream image, Authentik-gated, with a DuckDB data PVC and an NFS auto-import drop folder. 
+ +**Architecture:** Single Deployment running `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx + Axum + DuckDB, port 80) in namespace `drone-logbook`; data on a `proxmox-lvm-encrypted` PVC (GPS logs = sensitive data), `/sync-logs` drop folder on static NFS, daily backup CronJob to `/srv/nfs/drone-logbook-backup` (vaultwarden pattern), `ingress_factory` with `auth = "required"`, Keel auto-upgrades via namespace enrollment. Modeled line-by-line on `stacks/freshrss/`. Design: `2026-07-04-drone-logbook-design.md`. + +**Tech Stack:** Terraform/Terragrunt (Tier-1 PG state), Vault KV + ESO, ingress_factory, nfs_volume module, Keel/Kyverno. + +Terraform is exempt from TDD (execution.md); each task ends with a concrete verification instead. + +--- + +### Task 1: Vault secret + +**Files:** none (Vault KV only) + +- [ ] **Step 1.1: Create `secret/drone-logbook` with a generated profile-creation password** + +```bash +vault kv put secret/drone-logbook profile_creation_pass="$(openssl rand -base64 24)" +``` + +- [ ] **Step 1.2: Verify** + +```bash +vault kv get -field=profile_creation_pass secret/drone-logbook | wc -c +``` + +Expected: `32` (the 32 base64 chars; `33` if the CLI appends a trailing newline). Never echo the value itself. + +### Task 2: NFS drop folder on 192.168.1.127 + +**Files:** +- Modify: `secrets/nfs_directories.txt` (git-crypt'd — **edit from the MAIN checkout only**, never the worktree; sorted list, add `drone-logbook-backup` and `drone-logbook/sync-logs`) + +- [ ] **Step 2.1: Create the directories** — world-writable + setgid like `vaultwarden-backup` (the `/srv/nfs` export root-squashes, so pod-root writes land as `nobody`): + +```bash +ssh root@192.168.1.127 'mkdir -p /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && chown -R root:www-data /srv/nfs/drone-logbook /srv/nfs/drone-logbook-backup && chmod 2777 /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && ls -ld /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup' +``` + +Expected: `drwxrwsrwx ... root www-data ...` for both.
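
The expected `ls -ld` result can be asserted rather than eyeballed. This is a minimal sketch, assuming the listing is piped in on stdin. `verify_dropdirs` is a hypothetical helper name, not something that exists in the repo.

```shell
# verify_dropdirs (hypothetical helper): reads `ls -ld` output on stdin and
# exits 0 only if exactly two entries are listed and each one is a setgid,
# world-writable directory (drwxrwsrwx) owned by root:www-data.
verify_dropdirs() {
    awk '
        { seen++ }
        $1 !~ /^drwxrwsrwx/ || $3 != "root" || $4 != "www-data" { bad++ }
        END { exit (seen == 2 && bad == 0) ? 0 : 1 }
    '
}

# Usage against the NFS host (left commented out; it needs SSH access):
#   ssh root@192.168.1.127 'ls -ld /srv/nfs/drone-logbook/sync-logs \
#       /srv/nfs/drone-logbook-backup' | verify_dropdirs && echo "dirs OK"
```

The mode is matched as a prefix because `ls` may append `.` or `+` to it on hosts with SELinux contexts or ACLs.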
+No `/etc/exports` (`scripts/pve-nfs-exports`) change — `/srv/nfs` is exported whole-tree. + +- [ ] **Step 2.2: Record them in the declarative list (MAIN checkout, plaintext there)** — insert `drone-logbook-backup` and `drone-logbook/sync-logs` (after `diun`, before `etcd-backup`) in `~/code/infra/secrets/nfs_directories.txt`, then commit that single file to master: + +```bash +git -C ~/code/infra add secrets/nfs_directories.txt +git -C ~/code/infra commit -m "nfs_directories: add drone-logbook-backup + drone-logbook/sync-logs + +Drop folder for the new drone-logbook stack's auto-import (SYNC_LOGS_PATH) +and the target of its daily backup CronJob. +Directories created on 192.168.1.127 root:www-data 2777." +git -C ~/code/infra push forgejo master +``` + +(Trivial single-file exception per execution.md; encrypted files cannot be edited from the worktree.) + +### Task 3: Stack files (in the `wizard/drone-logbook` worktree) + +**Files:** +- Create: `stacks/drone-logbook/main.tf` (content below) +- Create: `stacks/drone-logbook/terragrunt.hcl` (content below) +- Create: `stacks/drone-logbook/secrets` → symlink to `../../secrets` +- (`backend.tf`, `tiers.tf`, `cloudflare_provider.tf`, `providers.tf`, `.terraform.lock.hcl` are terragrunt-generated and **gitignored** — do NOT create or commit them; the tracked copies in old stacks like freshrss predate the ignore rule) + +- [ ] **Step 3.1: `terragrunt.hcl`** + +```hcl +include "root" { + path = find_in_parent_folders() +} + +dependency "platform" { + config_path = "../platform" + skip_outputs = true +} +``` + +- [ ] **Step 3.2: `main.tf`** — exact content: + +```hcl +variable "tls_secret_name" { + type = string + sensitive = true +} +variable "nfs_server" { type = string } + +# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog) — self-hosted +# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the +# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest.
+# Design: docs/plans/2026-07-04-drone-logbook-design.md +resource "kubernetes_namespace" "drone_logbook" { + metadata { + name = "drone-logbook" + labels = { + tier = local.tiers.aux + "keel.sh/enrolled" = "true" + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace + ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] + } +} + +resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } + manifest = { + apiVersion = "external-secrets.io/v1" + kind = "ExternalSecret" + metadata = { + name = "drone-logbook-secrets" + namespace = "drone-logbook" + } + spec = { + refreshInterval = "15m" + secretStoreRef = { + name = "vault-kv" + kind = "ClusterSecretStore" + } + target = { + name = "drone-logbook-secrets" + } + dataFrom = [{ + extract = { + key = "drone-logbook" + } + }] + } + } + depends_on = [kubernetes_namespace.drone_logbook] +} + +module "tls_secret" { + source = "../../modules/kubernetes/setup_tls_secret" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + tls_secret_name = var.tls_secret_name +} + +# DuckDB database + cached DJI decryption keys + uploaded originals. +# Embedded DB -> block storage, not NFS (same rationale as freshrss data). +# Encrypted class: flight logs are GPS traces of home/travel (sensitive data +# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md). 
+resource "kubernetes_persistent_volume_claim" "data" { + wait_until_bound = false + metadata { + name = "drone-logbook-data-encrypted" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + annotations = { + "resize.topolvm.io/threshold" = "10%" + "resize.topolvm.io/increase" = "100%" + "resize.topolvm.io/storage_limit" = "10Gi" + } + } + spec { + access_modes = ["ReadWriteOnce"] + storage_class_name = "proxmox-lvm-encrypted" + resources { + requests = { + storage = "2Gi" + } + } + } + lifecycle { + # The autoresizer expands requests.storage up to storage_limit and PVCs + # can't shrink; without this every apply tries to revert the size. + ignore_changes = [spec[0].resources[0].requests] + } +} + +# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands +# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL. +module "nfs_sync_logs" { + source = "../../modules/kubernetes/nfs_volume" + name = "drone-logbook-sync-logs" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/srv/nfs/drone-logbook/sync-logs" + storage = "5Gi" +} + +resource "kubernetes_deployment" "drone_logbook" { + metadata { + name = "drone-logbook" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + labels = { + app = "drone-logbook" + "kubernetes.io/cluster-service" = "true" + tier = local.tiers.aux + } + } + spec { + replicas = 1 + strategy { + # DuckDB is single-writer; never overlap two pods on the same volume + type = "Recreate" + } + selector { + match_labels = { + app = "drone-logbook" + } + } + template { + metadata { + labels = { + app = "drone-logbook" + "kubernetes.io/cluster-service" = "true" + } + } + spec { + container { + name = "drone-logbook" + image = "ghcr.io/arpanghosh8453/open-dronelog:latest" + env { + name = "RUST_LOG" + value = "info" + } + env { + # keep re-importable originals under /data/drone-logbook/uploaded + name = "KEEP_UPLOADED_FILES" + value = 
"true" + } + env { + name = "SYNC_LOGS_PATH" + value = "/sync-logs" + } + env { + # 6-field cron (sec min hour dom mon dow): scan drop folder every 8h + name = "SYNC_INTERVAL" + value = "0 0 */8 * * *" + } + env { + name = "PROFILE_CREATION_PASS" + value_from { + secret_key_ref { + name = "drone-logbook-secrets" + key = "profile_creation_pass" + } + } + } + volume_mount { + name = "data" + mount_path = "/data/drone-logbook" + } + volume_mount { + name = "sync-logs" + mount_path = "/sync-logs" + read_only = true + } + port { + name = "http" + container_port = 80 + protocol = "TCP" + } + resources { + requests = { + cpu = "25m" + memory = "512Mi" + } + limits = { + memory = "512Mi" + } + } + } + volume { + name = "data" + persistent_volume_claim { + claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name + } + } + volume { + name = "sync-logs" + persistent_volume_claim { + claim_name = module.nfs_sync_logs.claim_name + } + } + } + } + } + depends_on = [kubernetes_manifest.external_secret] + lifecycle { + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates + metadata[0].annotations["kubernetes.io/change-cause"], + metadata[0].annotations["deployment.kubernetes.io/revision"], + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] + } +} + +resource "kubernetes_service" "drone_logbook" { + metadata { + name = "drone-logbook" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + labels = { + "app" = "drone-logbook" + } + } + + spec { + selector = { + app = "drone-logbook" + } + port { + port = "80" + target_port = "80" + } + } +} + +# 
----------------------------------------------------------------------------- +# Backup — required for every proxmox-lvm(-encrypted) app: daily copy of the +# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror -> +# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import +# windows, so the DuckDB file is quiescent; uploaded originals make even a +# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the +# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern. +# ----------------------------------------------------------------------------- + +module "nfs_backup" { + source = "../../modules/kubernetes/nfs_volume" + name = "drone-logbook-backup-host" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/srv/nfs/drone-logbook-backup" +} + +resource "kubernetes_cron_job_v1" "backup" { + metadata { + name = "drone-logbook-backup" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + } + spec { + concurrency_policy = "Replace" + failed_jobs_history_limit = 5 + schedule = "30 1 * * *" + starting_deadline_seconds = 300 + successful_jobs_history_limit = 3 + job_template { + metadata {} + spec { + backoff_limit = 3 + ttl_seconds_after_finished = 10 + template { + metadata {} + spec { + affinity { + pod_affinity { + required_during_scheduling_ignored_during_execution { + label_selector { + match_labels = { + app = "drone-logbook" + } + } + topology_key = "kubernetes.io/hostname" + } + } + } + container { + name = "drone-logbook-backup" + image = "docker.io/library/alpine" + command = ["/bin/sh", "-c", <<-EOT + set -euxo pipefail + _t0=$(date +%s) + now=$(date +"%Y_%m_%d_%H_%M") + mkdir -p /backup/$now + cp -a /data/. 
/backup/$now/ + # Rotate — 30 day retention + find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} + + _dur=$(($(date +%s) - _t0)) + _out_bytes=$(du -sb /backup/$now | awk '{print $1}') + wget -qO- --post-data "backup_duration_seconds $${_dur} + backup_output_bytes $${_out_bytes} + backup_last_success_timestamp $(date +%s) + " "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true + EOT + ] + volume_mount { + name = "data" + mount_path = "/data" + read_only = true + } + volume_mount { + name = "backup" + mount_path = "/backup" + } + } + volume { + name = "data" + persistent_volume_claim { + claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name + } + } + volume { + name = "backup" + persistent_volume_claim { + claim_name = module.nfs_backup.claim_name + } + } + dns_config { + option { + name = "ndots" + value = "2" + } + } + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +} + +# https://dronelog.viktorbarzin.me +module "ingress" { + source = "../../modules/kubernetes/ingress_factory" + auth = "required" # Authentik forward-auth — flight logs are GPS traces of home/travel + dns_type = "proxied" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + name = "dronelog" + service_name = "drone-logbook" + tls_secret_name = var.tls_secret_name + extra_annotations = { + "gethomepage.dev/enabled" = "true" + "gethomepage.dev/name" = "Drone Logbook" + "gethomepage.dev/description" = "DJI flight log analyzer" + "gethomepage.dev/icon" = "mdi-quadcopter" + "gethomepage.dev/group" = "Media & Entertainment" + "gethomepage.dev/pod-selector" = "" + } +} +``` + +- [ ] **Step 3.3: Boilerplate** + +```bash +ln -s ../../secrets ~/code/infra/.worktrees/drone-logbook/stacks/drone-logbook/secrets +``` + +- [ ] **Step 3.4: Format check** + 
+```bash +terraform fmt -check -diff $WT/stacks/drone-logbook/ || terraform fmt $WT/stacks/drone-logbook/ +``` + +Expected: no diff (or auto-fixed). + +- [ ] **Step 3.5: Commit on the branch (files by name, git-crypt filter flags per execution.md)** + +```bash +git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \ + add docs/plans/2026-07-04-drone-logbook-design.md docs/plans/2026-07-04-drone-logbook-plan.md \ + stacks/drone-logbook/main.tf stacks/drone-logbook/terragrunt.hcl stacks/drone-logbook/secrets \ + .claude/reference/service-catalog.md +git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \ + commit -m "drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me + +Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro +(fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog). +Upstream ghcr image with Keel auto-upgrade, DuckDB data on proxmox-lvm PVC, +NFS /sync-logs drop folder auto-imported every 8h, Authentik-gated ingress, +PROFILE_CREATION_PASS from Vault via ESO. Design + plan in docs/plans/." +``` + +### Task 4: Land and apply + +- [ ] **Step 4.1: Presence claim** (CI apply mutates shared infra) + +```bash +~/code/scripts/presence claim infra:drone-logbook --purpose "deploy new drone-logbook stack (Open DroneLog) via CI apply" +``` + +- [ ] **Step 4.2: Merge latest master into the branch, push to master** + +```bash +git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false fetch forgejo +git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false merge forgejo/master +git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master +``` + +Non-fast-forward → another agent landed first: fetch, merge, push again. 
Branch-protection rejection → fall back to PR via Forgejo API (token = password in `~/.git-credentials`). + +- [ ] **Step 4.3: Watch the CI apply to completion** — Woodpecker pipeline on the infra repo (`ci.viktorbarzin.me`), then confirm live: + +```bash +kubectl get ns drone-logbook && kubectl -n drone-logbook get deploy,pvc,pods,externalsecret,cronjob +kubectl -n drone-logbook rollout status deploy/drone-logbook --timeout=300s +``` + +Expected: namespace present, ExternalSecret `SecretSynced`, data PVC `Bound` (the NFS PVCs bind on first pod/job use), CronJob `drone-logbook-backup` scheduled `30 1 * * *`, pod `Running 1/1`. + +- [ ] **Step 4.4: Cleanup worktree + branch; release presence** + +```bash +git -C ~/code/infra worktree remove .worktrees/drone-logbook +git -C ~/code/infra branch -d wizard/drone-logbook +git -C ~/code/infra pull --ff-only # only if main checkout clean/quiescent +~/code/scripts/presence release infra:drone-logbook +``` + +### Task 5: End-to-end verification + +- [ ] **Step 5.1: Ingress + Authentik gate** + +```bash +curl -sI https://dronelog.viktorbarzin.me | head -5 +``` + +Expected: `302` redirect into Authentik (NOT `200`, NOT `404`). + +- [ ] **Step 5.2: App alive behind the gate** (bypass ingress via port-forward, read-only debug) + +```bash +kubectl -n drone-logbook port-forward svc/drone-logbook 18080:80 & +sleep 2 && curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:18080/ && kill %1 +``` + +Expected: `200`. + +- [ ] **Step 5.3: Sync folder visible in-pod** + +```bash +kubectl -n drone-logbook exec deploy/drone-logbook -- ls -ld /sync-logs /data/drone-logbook +``` + +Expected: both directories listed; `/sync-logs` read-only mount. + +- [ ] **Step 5.4: Monitor + homepage** — Uptime Kuma external monitor for `dronelog.viktorbarzin.me` auto-created (ingress annotation); homepage tile under "Media & Entertainment". 
+ +- [ ] **Step 5.5: Functional import** — Viktor uploads a real Mini 4 Pro `.txt` log via the web UI (or drops it in `/srv/nfs/drone-logbook/sync-logs`); confirms flight appears with charts/map. Requires pod egress to DJI once per new log (decryption key). If an upstream sample log is available, the agent may pre-verify import via the REST API through the port-forward. diff --git a/docs/plans/2026-07-04-immich-frame-lan-only-design.md b/docs/plans/2026-07-04-immich-frame-lan-only-design.md new file mode 100644 index 00000000..199316cf --- /dev/null +++ b/docs/plans/2026-07-04-immich-frame-lan-only-design.md @@ -0,0 +1,125 @@ +# immich-frame: LAN-only access, Portals untouched (2026-07-04) + +## Goal + +Strangers must no longer be able to view `highlights-immich.viktorbarzin.me` +(Viktor's London Portal Plus frame) or `highlights-immich-emo.viktorbarzin.me` +(Emo's Sofia Portal Mini frame) — pages or ImmichFrame API. Both were +`auth = "none"`, Cloudflare-proxied, fully public. + +Who keeps access (per Viktor, this session): the two Portals plus **any +household device on the Sofia, London, or Valchedrym home networks**. No +public access, no tailnet requirement. Hard constraint: the Portal app is a +WebView with the URL **baked in at APK build time** (`portal-immich-frame`, +`-PframeUrl`), so the exact URLs must keep loading from where the Portals sit +— zero app rebuilds, zero device touches, zero router changes. + +## Design + +Two cooperating pieces — the gate and the reachability pointer: + +1. 
**The gate — `home-lans-only` Traefik middleware** (traefik stack, next to + `local-only`): `ipAllowList` of `192.168.1.0/24` (Sofia LAN), `10.0.0.0/8` + (VLANs, K8s pods `10.10.0.0/16`, services `10.96.0.0/12`, WG tunnel + `10.3.2.0/24`), `192.168.8.0/24` (London LAN), `192.168.9.0/24` (London + GUEST net — post-rollout discovery: the Portal Plus actually leases here, + `Portal-75AE8F9C2A8A` = `192.168.9.198`, added same day), `192.168.0.0/24` + (Valchedrym LAN), `fc00::/7`, `fe80::/10`. Attached to both frame + ingresses via `extra_middlewares`. Everyone else gets a Traefik 403 — + including direct-to-WAN-IP requests carrying the right SNI, which DNS + changes alone cannot stop. A **separate** middleware rather than a widened + `local-only`, because widening would silently grant the remote LANs access + to the 9 admin surfaces using it (Prometheus, iDRAC, Loki, …). + +2. **The pointer — `dns_type = "internal"`** (new `ingress_factory` tier, + Viktor's idea): a **non-proxied public A record → `10.0.20.203`** (module + var `internal_lb_ip`). Outsiders resolve it but get an unroutable RFC1918 + address; every household resolver path delivers a working answer with no + config anywhere: Sofia LAN already gets the internal CNAME from Technitium, + London/Valchedrym resolve the public record via any upstream and + policy-route `10.0.0.0/8` down the WireGuard tunnel. IPv4-only (spokes + route no internal v6 range). + +Interlock (the reason both flip together): with a *proxied* record, public +traffic arrives from cloudflared **pod IPs inside 10/8** and would sail +through the allowlist. `internal` removes the Cloudflare path entirely (CF +edge stops serving the hostname), so every request reaches Traefik with its +real source IP (ETP=Local). Verified: no wildcard `*.viktorbarzin.me` record +exists to resurrect public resolution. 
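
The interlock can be checked by hand with a small sketch. This is not the Traefik middleware itself, only an illustration of the same CIDR matching using Python's standard `ipaddress` module; the CIDR list is copied from the design above, while the function name `allowed` and the sample pod address `10.10.5.7` are assumptions made for the example.

```python
import ipaddress

# CIDRs copied from the `home-lans-only` list in the design above.
HOME_LANS_ONLY = [
    "192.168.1.0/24",  # Sofia LAN
    "10.0.0.0/8",      # VLANs, K8s pods, services, WG tunnel
    "192.168.8.0/24",  # London LAN
    "192.168.9.0/24",  # London GUEST net
    "192.168.0.0/24",  # Valchedrym LAN
    "fc00::/7",
    "fe80::/10",
]
_NETWORKS = [ipaddress.ip_network(cidr) for cidr in HOME_LANS_ONLY]


def allowed(source_ip: str) -> bool:
    """Return True if source_ip falls inside any allowlisted network.

    Hypothetical helper: it mirrors what an ipAllowList decides, nothing more.
    A v4 address is never "in" a v6 network (and vice versa), so mixed
    comparisons simply evaluate to False.
    """
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in _NETWORKS)


if __name__ == "__main__":
    # 10.10.5.7 is an assumed pod IP inside the 10.10.0.0/16 pod CIDR: this is
    # the source address a proxied request would arrive with via cloudflared.
    for ip in ("192.168.9.198", "10.10.5.7", "203.0.113.10"):
        print(ip, "allowed" if allowed(ip) else "403")
```

The middle case is the reason the allowlist must never be combined with a proxied record: the stranger's real address is hidden behind a pod IP that the list accepts.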
+ +`auth` stays `"none"` — there is still no *user* auth by design (kiosk +WebView; forward-auth would 302 the device to a login it can't complete, and +emo's Google-only account can't log in inside a WebView at all); the +convention comment now names the ipAllowList as the gate. + +### Resulting flows + +| Client | Path | Result | +|---|---|---| +| Emo's Portal Mini (Sofia LAN) | Technitium CNAME → `.203` direct (unchanged) | allowed (`192.168.1.x`) | +| Viktor's Portal Plus (London GUEST net) | public A → `10.0.20.203` → WG tunnel | allowed (`192.168.9.x`) | +| Household browsers (any of the 3 LANs) | same as above | allowed | +| In-cluster checks (`homelab browser`, blackbox) | CoreDNS → Technitium → `.203` | allowed (pod IP in 10/8) | +| Stranger, resolves hostname | gets `10.0.20.203` | unroutable | +| Stranger, hits WAN IP with SNI | pfSense NAT → Traefik (real source IP) | **403** | +| Stranger, via Cloudflare | no proxied record | CF edge won't serve the host | + +### Rejected alternatives + +- **ImmichFrame `AuthenticationSecret`** (supported upstream: web input field + or `?authsecret=` param + bearer API): real auth from anywhere, but family + browsers would face a secret prompt (fails "household devices just work"), + the secret leaks into URLs/analytics/APK, and robust rollout needs APK + rebuild + USB-adb sideload on both Portals (the Sofia one is high-friction). +- **Authentik forward-auth / `auth = "public"`**: WebView can't complete SSO + (Google blocks WebView logins; session expiry silently bricks an appliance); + the anonymous outpost is an audit trail, not a gate. +- **Remove DNS + London router AdGuardHome rewrites**: works, but adds an + out-of-band, un-IaC'd router dependency the internal-IP record makes + unnecessary. Kept as documented fallback if resolver-side private-IP + filtering ever appears in the London path. 
+ +## Pre-verified facts (2026-07-04) + +- London Flint 2 DNS chain returns RFC1918 answers unfiltered + (`nslookup 10.0.20.203.nip.io 127.0.0.1` on the router → `10.0.20.203`; + dnsmasq `rebind_protection '0'`, no AdGuardHome rebind filtering). +- Technitium already CNAMEs both hostnames → apex → `10.0.20.203` + (`technitium-ingress-dns-sync` is ingress-driven, not DNS-record-driven, so + the internal answer survives the Cloudflare record swap). +- Pod CIDR `10.10.0.0/16`, service CIDR `10.96.0.0/12` — inside `10.0.0.0/8`. +- No public wildcard record in the zone. + +## Blast radius & cleanups + +- `external_monitor = false` set explicitly on both ingresses: the + external-monitor-sync default opt-in would otherwise keep the now-doomed + `[External] highlights-immich*` uptime-kuma monitors alive and red. Verify + the sync drops them post-apply. +- rybbit CF worker: `highlights-immich` removed from `SITE_IDS` (`index.js`) + and `wrangler.toml` routes — off Cloudflare the route can never fire. + Requires a `wrangler deploy` to take effect (route removal is hygiene, not + functional). +- Homepage dashboard link keeps working from LANs (hostname unchanged). +- Docs updated in the same change: `.claude/CLAUDE.md` (DNS tier + + external-monitor mechanism), `AGENTS.md`, `docs/architecture/networking.md` + (Internal-IP domains category). The `portal-immich-frame` repo's glossary + ("public, login-less URL") updated separately in that repo. + +## Failure-mode delta + +London frame now depends on the WG tunnel instead of Cloudflare+cloudflared +(the app self-heals with 5s retries; tunnel-flap modes documented in +`docs/architecture/vpn.md`). A Traefik LB renumber must update +`internal_lb_ip` in the module alongside the split-horizon apex record. +Cutover window: cached proxied answers keep working ≤ ~5 min TTL, then the +WebView's own retry picks up the new path. 
+ +## Verification & rollback + +Verify: public dig → `10.0.20.203` (both hosts); Technitium dig → `.203`; +curl from devvm (10/8) → 200; external vantage (WebFetch/cloud) → unreachable +or 403; middleware attached on both ingresses; Emo's frame renders via +`homelab browser`; London Portal image fetches visible in Traefik access logs +from `192.168.9.x` (the London GUEST net the Portal Plus leases on). Rollback: `git revert` + apply traefik/immich — records +and middleware chain restore (`allow_overwrite = true` re-adopts the records). diff --git a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md index 664869fa..27a4484a 100644 --- a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md +++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md @@ -129,3 +129,40 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites. storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the correct pairing. A famous tool that "does OOM" still has to be proven to fire under *your* configuration. + +## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed + +The soft-cap layer of this design was falsified in production on 2026-07-02 +(~15:42–16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide +alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside +t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With +`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked +every allocating task of the cgroup in `mem_cgroup_handle_over_high` +(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`) +— including the `t3 serve` event loop (~0.5G RSS, pure collateral).
The accept +queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104] +Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`, +and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by +hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G→1.4G +and the service recovered in seconds with no restart). + +The Verification bullet above — a soft-capped balloon "throttled to a crawl, +making no progress and **harming nothing**" — holds only when the hog is alone +in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl +IS the harm: a hog that stabilises below `MemoryMax` never triggers the local +OOM the design counted on, so the band converts "runaway dies" into "everyone +in the cgroup stalls forever". + +**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work +cgroup definitions** — `scripts/t3-serve@.service`, the `user-.slice.d` +drop-in, and `docker.slice` (`setup-devvm.sh` §10a/§10c). A runaway now runs +unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately +(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills +the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom — the layers +the stress tests actually validated — are unchanged. Applied live via +`daemon-reload` + runtime `set-property` on the running cgroups; no session +restarts. + +Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max` — it is +an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill +beats throttle-and-pray for multi-tenant interactive services. 
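
The signature quoted in the addendum (a large `high` counter, `oom_kill=0`, and a high `full avg60` pressure figure) can be recognised mechanically. The sketch below parses the text of a cgroup v2 `memory.events` and `memory.pressure` file; the function names and the 50% stall threshold are assumptions chosen for illustration, not part of the deployed tooling.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.events ("key value" per line) into a dict."""
    events = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            events[fields[0]] = int(fields[1])
    return events


def full_avg60(pressure_text: str) -> float:
    """Extract the avg60 figure from the "full" line of memory.pressure."""
    for line in pressure_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "full":
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "avg60":
                    return float(value)
    return 0.0


def looks_high_throttled(events_text: str, pressure_text: str,
                         stall_pct: float = 50.0) -> bool:
    """Heuristic: throttle events recorded, nothing OOM-killed, tasks stalled.

    stall_pct is a hypothetical threshold; tune it before relying on it.
    """
    events = parse_memory_events(events_text)
    return (
        events.get("high", 0) > 0
        and events.get("oom_kill", 0) == 0
        and full_avg60(pressure_text) >= stall_pct
    )
```

Fed the incident's figures (`high=882948`, `oom_kill=0`, `full avg60` around 80), the heuristic reports a stall; with `MemoryHigh=infinity` the `high` counter should stay at zero, so it would never fire.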
diff --git a/docs/runbooks/paperless-mail-ingest.md b/docs/runbooks/paperless-mail-ingest.md new file mode 100644 index 00000000..50c404be --- /dev/null +++ b/docs/runbooks/paperless-mail-ingest.md @@ -0,0 +1,135 @@ +# Paperless-ngx Mail Ingest (docs@viktorbarzin.me) + +Last updated: 2026-07-03 (initial build) + +Forward any email with document attachments to **`docs@viktorbarzin.me`** and +paperless-ngx ingests the attachments, owned by the paperless account mapped +from the **sender** (From) address. Built entirely from existing parts: a +docker-mailserver mailbox + Dovecot sieve, and paperless-ngx's native mail +consumer (the same machinery as the `utility:` rules). + +## Flow + +``` +family member forwards email ──> MX ──> docker-mailserver + │ postfix virtual: docs@ has an explicit self-alias (extra/aliases.txt), + │ so the @domain catch-all (→ spam@, swept by TripIt) does NOT apply + ▼ +Dovecot LMTP delivery to docs@ + │ per-user sieve (docs@viktorbarzin.me.dovecot.sieve): sender NOT in + │ allowlist → discard (decision 2026-07-03: unmatched = ignore & delete) + ▼ +docs@ INBOX ── paperless-ngx mail task (every 10 min, PAPERLESS_EMAIL_TASK_CRON + │ default) applies mail rules in order: filter_from = + │ → consume attachments (ALL parts incl. 
inline — see design + │ notes: Apple Mail marks real PDFs inline), owner = mapped user, + │ tag = email-ingest, title = mail subject + ▼ +consumed mail is MOVED to the "Processed" IMAP folder (audit trail); +INBOX stays empty in steady state +``` + +## Sender → paperless account map (as built) + +| Sender (From) | Paperless user | Rule | +|--------------------------|----------------|-----------------| +| me@viktorbarzin.me | root (id 3) | forward: Viktor (me@) | +| vbarzin@gmail.com | root (id 3) | forward: Viktor (gmail) | +| viktorbarzin@meta.com | root (id 3) | forward: Viktor (meta) | +| ancaelena98@gmail.com | anca (id 4) | forward: Anca | +| emil.barzin@gmail.com | emo (id 7) | forward: Emo | + +The map lives in **two places by design** — keep them in sync: + +1. **Delivery gate (infra, Terraform):** + `stacks/mailserver/modules/mailserver/extra/docs-at-viktorbarzin.me.dovecot.sieve` + — senders not listed here are discarded at delivery (spam control + the + "ignore and delete unmatched" behaviour; paperless cannot express + "delete without ingesting", so this must happen before the mailbox). +2. **Owner map (paperless DB, via API/UI):** one mail rule per sender on the + `docs@viktorbarzin.me` mail account. DB-state like workflows — NOT + Terraform. + +## Add a family member / sender + +1. Add the address to the sieve allowlist file above; commit; apply the + `mailserver` stack (normal apply is enough — the sieve CM key is not under + `ignore_changes`; Reloader restarts the pod). +2. Clone an existing `forward:` mail rule in the paperless admin UI + (Mail → Rules) or via API, changing `filter_from` and the rule **owner** + (documents are owned by the rule owner — `assign_owner_from_rule=true`). + Keep: action = Move to `Processed`, attachment type = **process all files + including inline** (`attachment_type=2` — NOT attachments-only, see design + notes), consumption scope = attachments only, tag `email-ingest`, order + after the existing rules. 
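
Because the sender map lives in two places, a quick drift check is useful after adding a sender. The sketch below is a hypothetical helper, not something in the repo: it takes the two address lists as plain Python lists (how you extract them from the sieve file and from `GET /api/mail_rules/` is left out, since neither format is shown here) and reports which side is missing what.

```python
def map_drift(sieve_allowlist: list[str], rule_senders: list[str]) -> dict[str, list[str]]:
    """Compare the delivery gate with the paperless owner map.

    Addresses are normalised to lower case before comparison.
    - missing_from_sieve: a mail rule exists, but the sieve would discard the
      sender's mail at delivery, so the rule can never match.
    - missing_from_rules: the sieve lets the mail in, but no rule claims it
      (assumption: such mail would simply sit in INBOX unconsumed).
    """
    gate = {addr.strip().lower() for addr in sieve_allowlist}
    rules = {addr.strip().lower() for addr in rule_senders}
    return {
        "missing_from_sieve": sorted(rules - gate),
        "missing_from_rules": sorted(gate - rules),
    }


if __name__ == "__main__":
    # The five senders from the table above, present on both sides.
    senders = [
        "me@viktorbarzin.me",
        "vbarzin@gmail.com",
        "viktorbarzin@meta.com",
        "ancaelena98@gmail.com",
        "emil.barzin@gmail.com",
    ]
    print(map_drift(senders, senders))
```

Two empty lists mean the gate and the owner map agree.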
+ +## Operations + +- **Trigger a fetch immediately** (instead of waiting ≤10 min): + `kubectl -n paperless-ngx exec deploy/paperless-ngx -c paperless-ngx -- s6-setuidgid paperless python3 manage.py mail_fetcher` + The `s6-setuidgid paperless` is **required**: `kubectl exec` runs as root, and a + root-run fetcher downloads attachments root-owned into the scratch dir, which + the celery consumer (uid 1000) then can't read — `PermissionError` on + `/tmp/paperless/paperless-mail-*/...`, consume task FAILURE (hit during the + 2026-07-03 build E2E). The mail correctly stays in INBOX for retry (the move + action is a chord callback on successful consumption). Recover: `rm -rf + /tmp/paperless/paperless-mail-*` (as root) and let the next scheduled fetch + re-process. +- **Mailbox credentials:** Vault `secret/platform` → `mailserver_accounts` + JSON, key `docs@viktorbarzin.me` (also used by the paperless mail account). +- **Inspect the mailbox:** + `python3 -c` IMAP to `mailserver.mailserver.svc.cluster.local:993` (in-cluster, + from a pod) or `mail.viktorbarzin.me:993` (externally / devvm). +- **Paperless-side logs:** `kubectl -n paperless-ngx logs deploy/paperless-ngx | grep -i mail` + (also Loki, ns `paperless-ngx`). Rule/account state: `GET /api/mail_rules/`, + `GET /api/mail_accounts/` with the admin token + (k8s secret `paperless-ngx-secrets`, field `api_token`). +- **Account/mailbox provisioning:** adding/rotating anything in + `mailserver_accounts` requires the ConfigMap replace workaround — + `scripts/tg apply mailserver -- -replace=module.mailserver.kubernetes_config_map.mailserver_config` + — because `postfix-accounts.cf` is under `ignore_changes` + (non-deterministic bcrypt; see the module comment). 
+ +## Design notes / caveats + +- **Why not the catch-all?** Mail to unknown `@viktorbarzin.me` addresses + lands in `spam@`, which the TripIt `ingest-plans` CronJob sweeps every + 15 min: it marks everything `\Seen`, LLM-parses mail from linked senders and + replies with ack/failure emails. Forwarded bank statements would get + "couldn't parse a trip" replies. `docs@` being a real mailbox bypasses that + path entirely; TripIt, the `smoke-test@` roundtrip probe, and `dmarc@` are + untouched. +- **Spoofing:** the sender match is on the From header. Rspamd verifies + SPF/DKIM/DMARC on inbound mail, but gmail.com publishes `p=none`, so a + crafted spoof could ingest documents into a family member's account. Accepted + risk (worst case: unwanted documents appear, visible + deletable in + paperless). +- **Not PDF-only:** any attachment type paperless supports is consumed + (PDF, images, Office via the existing tika+gotenberg pipeline). +- **Inline attachments ARE processed (`attachment_type=2`, flipped + 2026-07-03):** the rules originally used attachments-only (1) to skip + signature logos, but the very first real forward (Apple Mail, Viktor's + client) attached the invoice PDF with `Content-Disposition: inline` — + paperless matched the rule, consumed nothing, and recorded + `PROCESSED_WO_CONSUMPTION` (which, like any ProcessedMail row, blocks that + UID from ever being re-processed — delete the row via `manage.py shell` to + retry). Trade-off: signature/inline images in forwards may be ingested as + junk docs (tagged `email-ingest`, easy to spot). If that gets noisy, add + `filter_attachment_filename_exclude` patterns to the rules using the + actually-observed junk filenames — do NOT flip back to attachments-only. +- **No dedicated alerting** (deliberate, 2026-07-03): mail-task errors surface + in paperless logs; the mailserver inbound path is covered by + `email-roundtrip-monitor`. Revisit if forwards start silently failing. 
+- **Workflows:** the global `payslip-webhook` + `claude-mcp-readers + auto-permission` workflows fire for mail-ingested docs like any other + consumption source (verified pre-build; payslip receiver does its own + filtering). + +## Rollback + +1. Disable/delete the 5 `forward:` mail rules + the `docs@` mail account + (paperless admin UI or API). +2. Revert the infra commit (aliases.txt entry, sieve file, CM key + mount). +3. Remove `docs@viktorbarzin.me` from Vault `mailserver_accounts`, then apply + with the `-replace` workaround above. Mail to docs@ then falls back to the + catch-all (spam@) like any unknown address. diff --git a/docs/runbooks/t3-drop-attribution.md b/docs/runbooks/t3-drop-attribution.md index df4cef09..e05f163b 100644 --- a/docs/runbooks/t3-drop-attribution.md +++ b/docs/runbooks/t3-drop-attribution.md @@ -109,10 +109,17 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m]) node_memory_SwapFree_bytes{instance="devvm"} ``` -Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit -`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` — -a runaway agent now OOMs alone inside the cgroup instead of taking the box -(and the WS server) with it. +Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`): +per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and +`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog +plateauing between high and max never OOMs and the kernel high-throttle stalls +the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on +2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch +`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`, +`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable). +A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling +the WS server with it. 
Post-mortem addendum: +`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`. ## 4. Known root causes (2026-06-10 investigation) diff --git a/docs/runbooks/valia-sites.md b/docs/runbooks/valia-sites.md new file mode 100644 index 00000000..ee10a866 --- /dev/null +++ b/docs/runbooks/valia-sites.md @@ -0,0 +1,98 @@ +# Valia sites — add / update / retire + +Off-infra static sites authored by Valia (ADR-0018, CONTEXT.md "Valia site"). +Serving: Cloudflare Pages. Freshness: the `valia-sites-sync` CronJob +(`valia-sites` ns) mirrors each Content folder every 10 minutes and deploys +only when the folder's manifest hash changed. Registry: `local.sites` in +`stacks/valia-sites/main.tf` — one entry per site drives everything (Pages +project, custom domain, public CNAME, internal split-horizon CNAME, sync). + +Current sites: `bridge` (ОбУ „Отец Паисий“ — "мост"), `stem95su` (95. СУ STEM +board). + +## Add a site + +1. Valia shares the Drive folder with **vbarzin@gmail.com** (viewer is enough — + the pipeline is strictly read-only towards Drive). +2. Get the folder id from its URL (`drive.google.com/drive/folders/`). +3. Pick the **English** subdomain name (Viktor's call — CONTEXT.md naming rule). +4. Add one entry to `local.sites` in `stacks/valia-sites/main.tf`: + + ```hcl + = { + folder_id = "" + src_path = "" # or "sub/folder" if servable files live deeper + entry_file = "index.html" # or whatever her main HTML file is called + manage_dns = true + } + ``` + +5. Commit + push; CI applies. Within ~10 min the sync deploys content and the + site serves at `https://.viktorbarzin.me` (custom-domain TLS takes + ~5–10 min extra on first attach — CF returns 522 for the hostname until + then). 
Internal LAN/VLAN/pod resolution appears when the hourly + `technitium-ingress-dns-sync` next runs — trigger it early with: + `kubectl create job --from=cronjob/technitium-ingress-dns-sync valia-dns-now -n technitium` + +## Content rules (what Valia's folder must look like) + +- The **entry file** must exist — the sync stages a copy as `index.html` at + deploy time, so `/` works; the original filename keeps working too (deep + links survive). If the folder is empty or the entry file is missing, the + sync **skips the site and leaves it as-is** (never wipes a live site). +- Google-native files (Docs/Sheets) are **ignored** (`--drive-skip-gdocs`) — + only real files (`.html`, images, …) deploy. Gemini's HTML exports are fine. +- Per-file limit 25 MB (Cloudflare Pages), 20k files max — far beyond a + 1-page site. + +## Update a site + +Nothing to do: Valia edits the folder, the site follows within ~10 minutes. +Force it early: `kubectl create job --from=cronjob/valia-sites-sync sync-now -n valia-sites` + +## Rename / retire a site + +Rename = retire + add (Pages projects can't be renamed). Retire: + +1. Delete the entry from `local.sites`; commit + push. TF destroys the public + CNAME + custom domain + Pages project; the internal record is removed by + the next `technitium-ingress-dns-sync` run (its deletion pass drops any + internal `*.pages.dev` CNAME that left the `valia-sites-dns` ConfigMap — + scoped so it can never touch non-Pages records). +2. That's all — no manual DNS cleanup (the pre-ADR-0018 add-only gotcha is + fixed by the deletion pass). + +## Failure modes / debugging + +- **Visibility is failed-Job-only by choice** (ADR-0018): no alerts, no + notifications. Check: `kubectl get jobs -n valia-sites | tail`, logs of the + last `valia-sites-sync-*` pod. +- **Drive auth broken** (`FATAL … Drive list failed`): the shared + `secret/valia-sites.rclone_conf` token died. 
The GCP OAuth app + (`home-lab-1700868541205`) must stay published to "Production" or refresh + tokens expire weekly (same constraint as the old stem95su conf, which this + one was copied from). Re-mint and `vault kv patch secret/valia-sites + rclone_conf=@…`. +- **Wrangler auth broken**: `secret/valia-sites.cloudflare_pages_token` is a + SCOPED token (Pages Read+Write on the account, id + `355d2c9d11579bdad1e9498dafca30d5`) — re-mint via + `POST /user/tokens` with the Global API Key (`secret/platform`), patch + Vault. Do NOT put the Global API Key in the pod. +- **Site serves stale content**: check the state CM + (`kubectl get cm valia-sites-state -n valia-sites -o yaml`) — deleting a + site's key forces a redeploy on the next run. +- **`GUARD … skipping`** in logs: Valia's folder is empty or renamed the + entry file — the site deliberately kept its last content. Fix the folder or + update `entry_file`. + +## History + +- stem95su served in-cluster (nginx + NFS + its own rclone CronJob) until + 2026-07-03, when it was cut over to this pattern and the old stack retired + (ADR-0018). The blocking 42.9 MB `stem_video.mp4` was compressed to 21.4 MB + (same 1080p, ~2.5 Mbps H.264) and replaced in Valia's folder with Viktor's + explicit one-time OK. `secret/stem95su` is superseded by + `secret/valia-sites`; `/srv/nfs/stem-site` on the PVE host is a harmless + leftover. +- bridge started as a hand-deployed wrangler experiment (2026-07-03, memory + id 7085) and was adopted into the stack the same day. diff --git a/docs/runbooks/vault-token-renew-devvm.md b/docs/runbooks/vault-token-renew-devvm.md index 2dc4d35b..2ccddb8e 100644 --- a/docs/runbooks/vault-token-renew-devvm.md +++ b/docs/runbooks/vault-token-renew-devvm.md @@ -82,33 +82,48 @@ tail -5 ~/.local/state/vault-token-renew.log # recent results A healthy log line looks like: ` OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h). 
-## Drift guard & recovery +After an OIDC login you'll instead see, at the next nightly run: +` HEALED: re-minted periodic token from foreign dn=oidc-… (revoked N stale periodic token(s))` +— that's the self-heal working as designed. + +## Drift guard & self-heal `~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login` overwrites it. Two confirmed clobber vectors: 1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer - can't push past the OIDC role's 7-day `token_max_ttl`). + can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs + prescribe this login before applies, so it recurs — it went unnoticed for + weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires + weekly". 2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) → writes a read-only `kubernetes-woodpecker-default` token (can read Vault but - **cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for - two days — reads worked, writes silently 403'd. + **cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days. -To stop the renewer from silently keeping a foreign token alive, it runs a -**drift guard** first: it refuses to renew unless the token is -`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and -exits non-zero (the systemd unit goes `failed`) rather than renewing someone -else's token. Symptom in the log: +Since 2026-07-03 the renewer **self-heals** +(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token +it attempts the re-mint **with the clobbering token's own authority** and lets +Vault's authz decide: -` DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. 
Re-mint: ...` +- **Admin-capable clobber (OIDC login)** → re-mints the periodic token, + sanity-checks it against the drift guard, atomically replaces + `~/.vault-token`, revokes stale `token-devvm-wizard` leftovers + (anti-sprawl), logs + `HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))` + and exits 0. The clobbering token is NOT revoked — it may still back a live + login session; it ages out on its own. +- **Weak clobber (read-only k8s token)** → the mint is denied; logs + `DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it` + and exits non-zero (unit `failed`). Deliberately loud: this signals a + misbehaving agent flow — exactly the 2026-06-05 case. -**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the -[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does -**not** auto-recover (a deliberate scope choice — version-only, no self-heal); -recovery is the manual re-mint above. +**Manual recovery** is only needed for the weak-clobber case (the DRIFT log +line still contains the exact command) — run the +[mint/re-mint](#mint--re-mint-the-token) block. ## Tests -`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision -and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber -case). Run: `bash infra/scripts/test-vault-token-renew.sh`. +`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision, +the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber +case), and the self-heal's revoke filter (which stale periodic tokens a heal +may sweep). Run: `bash infra/scripts/test-vault-token-renew.sh`. 
diff --git a/modules/kubernetes/ingress_factory/main.tf b/modules/kubernetes/ingress_factory/main.tf index fc9bc9f5..ddcc7105 100644 --- a/modules/kubernetes/ingress_factory/main.tf +++ b/modules/kubernetes/ingress_factory/main.tf @@ -127,20 +127,29 @@ variable "anti_ai_scraping" { variable "dns_type" { type = string default = "none" - description = "Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to public IP), or 'none'" + description = <<-EOT + Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to + public IP), 'internal' (A to the internal Traefik LB IP — resolvable from + any resolver but only ROUTABLE from home LANs / WG sites / VPN; the record + is a reachability pointer, NOT a gate: pair it with an ipAllowList via + extra_middlewares, e.g. traefik-home-lans-only@kubernetescrd, because + direct-to-WAN-IP requests with the right SNI still hit Traefik), or 'none'. + EOT validation { - condition = contains(["proxied", "non-proxied", "none"], var.dns_type) - error_message = "dns_type must be 'proxied', 'non-proxied', or 'none'." + condition = contains(["proxied", "non-proxied", "internal", "none"], var.dns_type) + error_message = "dns_type must be 'proxied', 'non-proxied', 'internal', or 'none'." } } # Uptime Kuma external monitor: when true, annotate the ingress so the # external-monitor-sync CronJob creates a `[External] ` monitor pointing -# at https://. Null means "follow dns_type" — enabled when proxied. +# at https://. Null means "follow dns_type" — enabled when the ingress +# has a PUBLIC DNS record (proxied or non-proxied; 'internal' records are not +# externally reachable, so no external monitor). variable "external_monitor" { type = bool default = null - description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type == 'proxied')." + description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type is 'proxied' or 'non-proxied')." 
} variable "external_monitor_name" { @@ -171,6 +180,15 @@ variable "public_ipv6" { default = "2001:470:6e:43d::2" } +# Internal Traefik LB IP used by dns_type = "internal" records. Tracks the +# dedicated MetalLB IP from stacks/traefik (ETP=Local). A future LB renumber +# must update this default alongside the split-horizon apex record — see +# docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*. +variable "internal_lb_ip" { + type = string + default = "10.0.20.203" +} + variable "homepage_group" { type = string default = null # auto-detect from namespace @@ -201,8 +219,10 @@ locals { ) # External monitor enabled by default when the ingress has a public DNS - # record (either CF-proxied or direct A/AAAA). Explicit bool overrides. - effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type != "none") + # record (either CF-proxied or direct A/AAAA). 'internal' records resolve + # publicly but are unroutable from outside, so they get no external monitor. + # Explicit bool overrides. + effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type == "proxied" || var.dns_type == "non-proxied") # Emit the annotation when effective is true (positive signal), or when the # caller explicitly set external_monitor=false (opt-out). When the caller @@ -424,3 +444,19 @@ resource "cloudflare_record" "non_proxied_aaaa" { zone_id = var.cloudflare_zone_id allow_overwrite = true } + +# 'internal': a publicly-resolvable A record carrying the INTERNAL Traefik LB +# IP. Outsiders resolve it but can't route to it; home-LAN/WG-site/VPN clients +# reach Traefik directly (the WG spokes policy-route 10.0.0.0/8 through the +# tunnel), so kiosk devices with baked-in URLs need no DNS overrides anywhere. +# IPv4-only on purpose: the spokes route no internal IPv6 range. +resource "cloudflare_record" "internal_a" { + count = var.dns_type == "internal" ? 
1 : 0 + name = local.dns_name + content = var.internal_lb_ip + proxied = false + ttl = 1 + type = "A" + zone_id = var.cloudflare_zone_id + allow_overwrite = true +} diff --git a/scripts/t3-serve@.service b/scripts/t3-serve@.service index 7f3d765d..0ab84e74 100644 --- a/scripts/t3-serve@.service +++ b/scripts/t3-serve@.service @@ -21,12 +21,19 @@ WorkingDirectory=/home/%i ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3 Restart=on-failure RestartSec=5 -# Memory containment (2026-06-10): agent children live in this cgroup; a -# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm — -# every >20s stall fires the t3 client watchdog (visible "disconnects") — -# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally, -# and forbid swap so stalls can't smear into minutes-long freezes. -MemoryHigh=12G +# Memory containment (2026-06-10, amended 2026-07-02): agent children live in +# this cgroup; a runaway agent (10.8G anon on a 23G host) swap-thrashed the +# whole devvm — every >20s stall fires the t3 client watchdog (visible +# "disconnects") — then global-OOMed. Cap the cgroup so a runaway OOMs early +# and locally, and forbid swap so stalls can't smear into minutes-long freezes. +# MemoryHigh is DELIBERATELY infinity — do not add a soft band below MemoryMax: +# with swap=0 a hog that plateaus between high and max is unreclaimable but +# never OOMs, and the kernel's high-throttle stalls EVERY task in the cgroup +# (the t3 event loop included) indefinitely. A 12.3G agent ugrep livelocked +# this unit for ~50min on 2026-07-02 exactly this way. Straight-to-OOM at +# MemoryMax is the containment; OOMPolicy=continue below keeps the server up. +# See docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md addendum. 
+MemoryHigh=infinity MemoryMax=16G MemorySwapMax=0 # Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10 diff --git a/scripts/test-vault-token-renew.sh b/scripts/test-vault-token-renew.sh index d64d02b4..313ff362 100644 --- a/scripts/test-vault-token-renew.sh +++ b/scripts/test-vault-token-renew.sh @@ -1,10 +1,11 @@ #!/usr/bin/env bash -# Unit tests for the pure drift-guard functions in vault-token-renew.sh. -# Sources the script (vtr_main is guarded) and exercises the decision logic that -# decides whether ~/.vault-token is OUR periodic admin token (renew) or a foreign -# token that clobbered the file (refuse, fail loud). This is exactly the logic -# whose ABSENCE let the 2026-06-05 woodpecker-token clobber be silently renewed -# for two days. Run: bash infra/scripts/test-vault-token-renew.sh +# Unit tests for the pure functions in vault-token-renew.sh. +# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard +# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign +# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker +# clobber be silently renewed for two days, and (b) the self-heal's revoke +# filter — which stale token-devvm-wizard tokens a heal may sweep. 
+# Run: bash infra/scripts/test-vault-token-renew.sh set -uo pipefail DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" # shellcheck source=/dev/null @@ -53,5 +54,21 @@ ok "ours: parse+decide renews" vtr_drift_ok "$(vtr_display_name "$LOOKUP_ no "woodpecker: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_WP")" "$(vtr_policies_csv "$LOOKUP_WP")" no "oidc: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_OIDC")" "$(vtr_policies_csv "$LOOKUP_OIDC")" +# --- vtr_accessor: parse accessor out of lookup JSON --- +LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}' +eq "accessor parsed" "acc-new" "$(vtr_accessor "$LOOKUP_NEW")" +eq "accessor absent -> empty" "" "$(vtr_accessor '{"data":{"display_name":"x"}}')" + +# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard +# --- tokens are swept; the just-minted token, foreign tokens, and anything with an +# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe). 
+STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}' +ok "older periodic token is stale" vtr_is_stale_periodic "$STALE_OURS" "acc-new" +no "the just-minted token is kept" vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new" +no "foreign oidc token never swept" vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new" +no "woodpecker token never swept" vtr_is_stale_periodic "$LOOKUP_WP" "acc-new" +no "missing accessor never swept" vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new" +no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" "" + printf '\n%d passed, %d failed\n' "$pass" "$fail" (( fail == 0 )) diff --git a/scripts/vault-token-renew.sh b/scripts/vault-token-renew.sh index 2d73c862..42e78603 100644 --- a/scripts/vault-token-renew.sh +++ b/scripts/vault-token-renew.sh @@ -45,6 +45,94 @@ vtr_drift_ok() { printf ',%s,' "$pols" | grep -q ",$REQUIRED_POLICY," || return 1 } +# vtr_accessor -> the token accessor (empty if absent). +vtr_accessor() { + printf '%s' "$1" | jq -r '.data.accessor // ""' +} + +# vtr_is_stale_periodic -> 0 if this lookup +# describes one of OUR periodic tokens (display name matches) that is NOT the +# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise. +# Name-only on purpose (no policy check): anything named token-devvm-wizard +# that isn't the current token is garbage from a previous mint. An empty +# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know +# which token is current). 
+vtr_is_stale_periodic() { + local dn acc + [ -n "${2:-}" ] || return 1 + dn=$(vtr_display_name "$1") + acc=$(vtr_accessor "$1") + [ "$dn" = "$EXPECTED_DN" ] || return 1 + [ -n "$acc" ] || return 1 + [ "$acc" != "$2" ] +} + +# vtr_heal -> 0 if ~/.vault-token was re-minted back to +# our periodic admin token using the foreign token's own authority, 1 if the +# heal was denied or failed (caller exits non-zero; the unit goes failed). +# +# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md): +# an OIDC login — which the infra docs prescribe before applies — clobbers +# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed +# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the +# clobbering token itself and let Vault's authz decide — a read-only clobber +# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud +# failure, because it signals a misbehaving flow that someone should look at. +vtr_heal() { + local foreign_dn="$1" log="$2" + local errf new_token new_info new_dn new_pols new_acc tmp + errf=$(mktemp) + if ! new_token=$(vault token create -orphan -period=768h \ + -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \ + -field=token 2>"$errf") || [ -z "$new_token" ]; then + printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \ + "$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log" + rm -f "$errf" + return 1 + fi + rm -f "$errf" + + # Sanity: the minted token must itself pass the drift guard before it may + # replace ~/.vault-token. + if ! 
new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then + printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \ + "$(date -Is)" "$new_info" >>"$log" + return 1 + fi + new_dn=$(vtr_display_name "$new_info") + new_pols=$(vtr_policies_csv "$new_info") + if ! vtr_drift_ok "$new_dn" "$new_pols"; then + printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \ + "$(date -Is)" "$new_dn" "$new_pols" >>"$log" + return 1 + fi + + # Atomic replace: mktemp files are 0600 from birth; same-filesystem mv. + tmp=$(mktemp "$HOME/.vault-token.XXXXXX") + printf '%s' "$new_token" >"$tmp" + mv "$tmp" "$HOME/.vault-token" + + # Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would + # otherwise strand the prior periodic ADMIN token server-side for up to 32d. + # The clobbering foreign token is deliberately NOT revoked: it may still back + # the user's live login session, and it ages out on its own (7d for OIDC). + local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0 + new_acc=$(vtr_accessor "$new_info") + if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then + while IFS= read -r a; do + [ -n "$a" ] || continue + a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue + if vtr_is_stale_periodic "$a_info" "$new_acc"; then + VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1)) + fi + done < <(printf '%s' "$accessors" | jq -r '.[]') + sweep="revoked $revoked stale periodic token(s)" + fi + + printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \ + "$(date -Is)" "$foreign_dn" "$sweep" >>"$log" +} + vtr_main() { set -euo pipefail export PATH="/usr/local/bin:/usr/bin:/bin:${PATH:-}" @@ -61,16 +149,19 @@ vtr_main() { dn=$(vtr_display_name "$info") pols=$(vtr_policies_csv "$info") - # Drift 
guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive. - # On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token - # with a read-only woodpecker token, and this script then silently renewed THAT - # for two days — masking the loss of write access. So before renewing, confirm - # the token is our periodic admin token; if it has drifted, fail loudly (systemd - # marks the unit failed) instead of keeping someone else's token alive. + # Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not + # keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was + # silently renewed for two days, masking lost write access). But detect-only + # drift proved worse in practice: an OIDC login — which the infra docs + # prescribe before applies — clobbers this file too, and the resulting DRIFT + # failures went unnoticed for weeks while access degraded to a 7-day token + # (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal): + # re-mint the periodic token with the clobbering token's own authority. + # Vault's authz keeps the old guarantee — a token that couldn't legitimately + # hold vault-admin is denied the mint, and we still fail loud. if ! vtr_drift_ok "$dn" "$pols"; then - printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \ - "$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log" - exit 1 + vtr_heal "$dn" "$log" || exit 1 + exit 0 fi # `vault token renew` with no argument renews the calling token (renew-self). 
diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh index 02bd9257..3e05b8a0 100755 --- a/scripts/workstation/setup-devvm.sh +++ b/scripts/workstation/setup-devvm.sh @@ -244,9 +244,15 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us # virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22). # t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped # user-.slice (all ssh/tmux work). Design — per user, on BOTH trees: -# MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard, -# MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at -# the ceiling instead), plus fair-share CPU/IO weights. +# MemoryMax=16G hard + MemorySwapMax=0 (work never touches disk swap → no +# thrash; a runaway is cgroup-OOM-killed locally at the ceiling), plus +# fair-share CPU/IO weights. +# NO MemoryHigh soft band (removed 2026-07-02; was 12G "throttle to a crawl"): +# with swap=0, a hog that PLATEAUS between high and max is unreclaimable but +# never OOMs — the kernel parks every task of the cgroup in +# mem_cgroup_handle_over_high and the whole tree stalls indefinitely. A 12.3G +# agent ugrep livelocked t3-serve@wizard (t3 down ~50min) exactly this way. +# Cap-and-kill, never throttle-and-pray — see the post-mortem addendum. # BACKSTOP = earlyoom, NOT systemd-oomd. We first shipped systemd-oomd but it is # INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim # (pgscan rising), and a no-swap anon workload never reclaims — verified live, a @@ -260,12 +266,16 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us # 10a) per-user caps + fair-share weights on EVERY user-.slice (ssh/tmux) install -d -m 0755 /etc/systemd/system/user-.slice.d cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF' -# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22). 
-# Applies to EACH user-.slice = all of one user's ssh/tmux work. Mirrors the -# t3-serve@.service caps so a user is bounded in whichever surface they work in. +# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22; +# MemoryHigh dropped 2026-07-02). Applies to EACH user-.slice = all of one +# user's ssh/tmux work. Mirrors the t3-serve@.service caps so a user is bounded +# in whichever surface they work in. MemoryHigh stays infinity: with swap=0 a +# hog plateauing in a high..max band livelocks the entire slice (every ssh/tmux +# session of that user) instead of dying — straight-to-OOM at MemoryMax is the +# containment (see post-mortem addendum 2026-07-02). [Slice] MemoryAccounting=yes -MemoryHigh=12G +MemoryHigh=infinity MemoryMax=16G MemorySwapMax=0 CPUAccounting=yes @@ -294,12 +304,14 @@ cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF' # All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so # they share one bounded budget and a runaway container is capped at MemoryMax # (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice. -# setup-devvm.sh §10, 2026-06-22. +# setup-devvm.sh §10, 2026-06-22; MemoryHigh dropped 2026-07-02 — a container +# plateauing in the high..max band would throttle-livelock EVERY container in +# the slice (see post-mortem addendum); MemoryMax OOM is the containment. 
[Unit] Description=Docker containers slice (capped) [Slice] MemoryAccounting=yes -MemoryHigh=6G +MemoryHigh=infinity MemoryMax=8G MemorySwapMax=0 CPUAccounting=yes diff --git a/secrets/nfs_directories.txt b/secrets/nfs_directories.txt index 51e11aad..cc89391f 100644 Binary files a/secrets/nfs_directories.txt and b/secrets/nfs_directories.txt differ diff --git a/stacks/cloudflared/modules/cloudflared/cloudflare.tf b/stacks/cloudflared/modules/cloudflared/cloudflare.tf index ad4d9de8..59e748ae 100644 --- a/stacks/cloudflared/modules/cloudflared/cloudflare.tf +++ b/stacks/cloudflared/modules/cloudflared/cloudflare.tf @@ -235,6 +235,12 @@ resource "cloudflare_record" "keyserver" { zone_id = var.cloudflare_zone_id } +# bridge.viktorbarzin.me (Cloudflare Pages, "мост" school site) moved to +# stacks/valia-sites (ADR-0018) — all Valia-site records live there now. +# State handoff was a manual `tg state rm` (2026-07-03): the CI terraform +# (<1.7) rejects removed{} blocks even at the stack root, so declarative +# forget wasn't available. valia-sites imported the live record by id. 
+ # Enable HTTP/3 (QUIC) for Cloudflare-proxied domains resource "cloudflare_zone_settings_override" "http3" { zone_id = var.cloudflare_zone_id diff --git a/stacks/dawarich/main.tf b/stacks/dawarich/main.tf index 3eeb1540..1d2d1f81 100644 --- a/stacks/dawarich/main.tf +++ b/stacks/dawarich/main.tf @@ -16,7 +16,7 @@ resource "kubernetes_namespace" "dawarich" { name = "dawarich" labels = { "istio-injection" : "disabled" - tier = local.tiers.edge + tier = local.tiers.edge "keel.sh/enrolled" = "true" } } @@ -330,7 +330,7 @@ resource "kubernetes_deployment" "dawarich" { } lifecycle { ignore_changes = [ - spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates metadata[0].annotations["keel.sh/policy"], metadata[0].annotations["keel.sh/trigger"], @@ -458,6 +458,13 @@ module "ingress" { namespace = kubernetes_namespace.dawarich.metadata[0].name name = "dawarich" tls_secret_name = var.tls_secret_name + # Rails serves all its fingerprinted assets itself and the map view adds an + # API burst per page load — the default 10/50 limiter 429s the asset tail + # from a single client IP (and risks dropping OwnTracks/mobile ingestion + # POSTs on the same host). Dedicated 100/1000 limiter defined in + # stacks/traefik/modules/traefik/middleware.tf. 
+ skip_default_rate_limit = true + extra_middlewares = ["traefik-dawarich-rate-limit@kubernetescrd"] extra_annotations = { "gethomepage.dev/enabled" = "true" "gethomepage.dev/name" = "Dawarich" diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf index bd380fe1..5f86110a 100644 --- a/stacks/dbaas/modules/dbaas/main.tf +++ b/stacks/dbaas/modules/dbaas/main.tf @@ -1511,6 +1511,34 @@ resource "null_resource" "pg_instagram_poster_db" { } } +# Create tasks database for the tasks PWA (Reminders-style front-end over +# Nextcloud CalDAV; FastAPI + SvelteKit SPA — see ~/code/tasks). Stores +# Connected Accounts (Fernet-encrypted Nextcloud app passwords) + sync state. +# Role password is managed by Vault Database Secrets Engine (static role +# `pg-tasks`, 7d rotation). Tables are created by alembic on app startup. +resource "null_resource" "pg_tasks_db" { + depends_on = [null_resource.pg_cluster] + + triggers = { + db_name = "tasks" + username = "tasks" + } + + provisioner "local-exec" { + command = <<-EOT + PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}') + kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \ + bash -c ' + psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'tasks'"'"'" | grep -q 1 || \ + psql -U postgres -c "CREATE ROLE tasks WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'" + psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'tasks'"'"'" | grep -q 1 || \ + psql -U postgres -c "CREATE DATABASE tasks OWNER tasks" + psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE tasks TO tasks" + ' + EOT + } +} + # Old PostgreSQL deployment — kept commented for rollback reference # resource "kubernetes_deployment" "postgres" { # metadata { diff --git a/stacks/drone-logbook/main.tf b/stacks/drone-logbook/main.tf new file mode 100644 index 00000000..e5f8b219 --- 
/dev/null +++ b/stacks/drone-logbook/main.tf @@ -0,0 +1,360 @@ +variable "tls_secret_name" { + type = string + sensitive = true +} +variable "nfs_server" { type = string } + +# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog) — self-hosted +# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the +# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest. +# Design: docs/plans/2026-07-04-drone-logbook-design.md +resource "kubernetes_namespace" "drone_logbook" { + metadata { + name = "drone-logbook" + labels = { + tier = local.tiers.aux + "keel.sh/enrolled" = "true" + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace + ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] + } +} + +resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } + manifest = { + apiVersion = "external-secrets.io/v1" + kind = "ExternalSecret" + metadata = { + name = "drone-logbook-secrets" + namespace = "drone-logbook" + } + spec = { + refreshInterval = "15m" + secretStoreRef = { + name = "vault-kv" + kind = "ClusterSecretStore" + } + target = { + name = "drone-logbook-secrets" + } + dataFrom = [{ + extract = { + key = "drone-logbook" + } + }] + } + } + depends_on = [kubernetes_namespace.drone_logbook] +} + +module "tls_secret" { + source = "../../modules/kubernetes/setup_tls_secret" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + tls_secret_name = var.tls_secret_name +} + +# DuckDB database + cached DJI decryption keys + uploaded originals. +# Embedded DB -> block storage, not NFS (same rationale as freshrss data). +# Encrypted class: flight logs are GPS traces of home/travel (sensitive data +# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md). 
+resource "kubernetes_persistent_volume_claim" "data" { + wait_until_bound = false + metadata { + name = "drone-logbook-data-encrypted" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + annotations = { + "resize.topolvm.io/threshold" = "10%" + "resize.topolvm.io/increase" = "100%" + "resize.topolvm.io/storage_limit" = "10Gi" + } + } + spec { + access_modes = ["ReadWriteOnce"] + storage_class_name = "proxmox-lvm-encrypted" + resources { + requests = { + storage = "2Gi" + } + } + } + lifecycle { + # The autoresizer expands requests.storage up to storage_limit and PVCs + # can't shrink; without this every apply tries to revert the size. + ignore_changes = [spec[0].resources[0].requests] + } +} + +# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands +# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL. +module "nfs_sync_logs" { + source = "../../modules/kubernetes/nfs_volume" + name = "drone-logbook-sync-logs" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/srv/nfs/drone-logbook/sync-logs" + storage = "5Gi" +} + +resource "kubernetes_deployment" "drone_logbook" { + metadata { + name = "drone-logbook" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + labels = { + app = "drone-logbook" + "kubernetes.io/cluster-service" = "true" + tier = local.tiers.aux + } + } + spec { + replicas = 1 + strategy { + # DuckDB is single-writer; never overlap two pods on the same volume + type = "Recreate" + } + selector { + match_labels = { + app = "drone-logbook" + } + } + template { + metadata { + labels = { + app = "drone-logbook" + "kubernetes.io/cluster-service" = "true" + } + } + spec { + container { + name = "drone-logbook" + image = "ghcr.io/arpanghosh8453/open-dronelog:latest" + env { + name = "RUST_LOG" + value = "info" + } + env { + # keep re-importable originals under /data/drone-logbook/uploaded + name = "KEEP_UPLOADED_FILES" + value = 
"true" + } + env { + name = "SYNC_LOGS_PATH" + value = "/sync-logs" + } + env { + # 6-field cron (sec min hour dom mon dow): scan drop folder every 8h + name = "SYNC_INTERVAL" + value = "0 0 */8 * * *" + } + env { + name = "PROFILE_CREATION_PASS" + value_from { + secret_key_ref { + name = "drone-logbook-secrets" + key = "profile_creation_pass" + } + } + } + volume_mount { + name = "data" + mount_path = "/data/drone-logbook" + } + volume_mount { + name = "sync-logs" + mount_path = "/sync-logs" + read_only = true + } + port { + name = "http" + container_port = 80 + protocol = "TCP" + } + resources { + requests = { + cpu = "25m" + memory = "512Mi" + } + limits = { + memory = "512Mi" + } + } + } + volume { + name = "data" + persistent_volume_claim { + claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name + } + } + volume { + name = "sync-logs" + persistent_volume_claim { + claim_name = module.nfs_sync_logs.claim_name + } + } + } + } + } + depends_on = [kubernetes_manifest.external_secret] + lifecycle { + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates + metadata[0].annotations["kubernetes.io/change-cause"], + metadata[0].annotations["deployment.kubernetes.io/revision"], + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] + } +} + +resource "kubernetes_service" "drone_logbook" { + metadata { + name = "drone-logbook" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + labels = { + "app" = "drone-logbook" + } + } + + spec { + selector = { + app = "drone-logbook" + } + port { + port = "80" + target_port = "80" + } + } +} + +# 
----------------------------------------------------------------------------- +# Backup — required for every proxmox-lvm(-encrypted) app: daily copy of the +# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror -> +# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import +# windows, so the DuckDB file is quiescent; uploaded originals make even a +# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the +# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern. +# ----------------------------------------------------------------------------- + +module "nfs_backup" { + source = "../../modules/kubernetes/nfs_volume" + name = "drone-logbook-backup-host" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/srv/nfs/drone-logbook-backup" +} + +resource "kubernetes_cron_job_v1" "backup" { + metadata { + name = "drone-logbook-backup" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + } + spec { + concurrency_policy = "Replace" + failed_jobs_history_limit = 5 + schedule = "30 1 * * *" + starting_deadline_seconds = 300 + successful_jobs_history_limit = 3 + job_template { + metadata {} + spec { + backoff_limit = 3 + ttl_seconds_after_finished = 10 + template { + metadata {} + spec { + affinity { + pod_affinity { + required_during_scheduling_ignored_during_execution { + label_selector { + match_labels = { + app = "drone-logbook" + } + } + topology_key = "kubernetes.io/hostname" + } + } + } + container { + name = "drone-logbook-backup" + image = "docker.io/library/alpine" + command = ["/bin/sh", "-c", <<-EOT + set -euxo pipefail + _t0=$(date +%s) + now=$(date +"%Y_%m_%d_%H_%M") + mkdir -p /backup/$now + cp -a /data/. 
/backup/$now/ + # Rotate — 30 day retention + find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} + + _dur=$(($(date +%s) - _t0)) + _out_bytes=$(du -sb /backup/$now | awk '{print $1}') + wget -qO- --post-data "backup_duration_seconds $${_dur} + backup_output_bytes $${_out_bytes} + backup_last_success_timestamp $(date +%s) + " "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true + EOT + ] + volume_mount { + name = "data" + mount_path = "/data" + read_only = true + } + volume_mount { + name = "backup" + mount_path = "/backup" + } + } + volume { + name = "data" + persistent_volume_claim { + claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name + } + } + volume { + name = "backup" + persistent_volume_claim { + claim_name = module.nfs_backup.claim_name + } + } + dns_config { + option { + name = "ndots" + value = "2" + } + } + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +} + +# https://dronelog.viktorbarzin.me +module "ingress" { + source = "../../modules/kubernetes/ingress_factory" + auth = "required" # Authentik forward-auth — flight logs are GPS traces of home/travel + dns_type = "proxied" + namespace = kubernetes_namespace.drone_logbook.metadata[0].name + name = "dronelog" + service_name = "drone-logbook" + tls_secret_name = var.tls_secret_name + extra_annotations = { + "gethomepage.dev/enabled" = "true" + "gethomepage.dev/name" = "Drone Logbook" + "gethomepage.dev/description" = "DJI flight log analyzer" + "gethomepage.dev/icon" = "mdi-quadcopter" + "gethomepage.dev/group" = "Media & Entertainment" + "gethomepage.dev/pod-selector" = "" + } +} diff --git a/stacks/drone-logbook/secrets b/stacks/drone-logbook/secrets new file mode 120000 index 00000000..ca54a7cf --- /dev/null +++ b/stacks/drone-logbook/secrets @@ -0,0 +1 @@ 
+../../secrets \ No newline at end of file diff --git a/stacks/drone-logbook/terragrunt.hcl b/stacks/drone-logbook/terragrunt.hcl new file mode 100644 index 00000000..0d1c8e53 --- /dev/null +++ b/stacks/drone-logbook/terragrunt.hcl @@ -0,0 +1,8 @@ +include "root" { + path = find_in_parent_folders() +} + +dependency "platform" { + config_path = "../platform" + skip_outputs = true +} diff --git a/stacks/excalidraw/main.tf b/stacks/excalidraw/main.tf index 41ab48a0..b7a33117 100644 --- a/stacks/excalidraw/main.tf +++ b/stacks/excalidraw/main.tf @@ -10,7 +10,7 @@ resource "kubernetes_namespace" "excalidraw" { name = "excalidraw" labels = { "istio-injection" : "disabled" - tier = local.tiers.aux + tier = local.tiers.aux "keel.sh/enrolled" = "true" } } @@ -45,6 +45,15 @@ resource "kubernetes_deployment" "excalidraw" { app = "excalidraw" tier = local.tiers.aux } + # Keel rolls new ghcr:latest digests (k8s-portal pattern). Values here are + # recreate-correct seeds only — the keys are in ignore_changes below, so + # the live annotations win on an existing deployment. + annotations = { + "keel.sh/policy" = "force" + "keel.sh/trigger" = "poll" + "keel.sh/match-tag" = "true" + "keel.sh/pollSchedule" = "@every 5m" + } } spec { replicas = 1 @@ -67,9 +76,19 @@ resource "kubernetes_deployment" "excalidraw" { } } spec { + # GHCR pull secret: the ghcr-credentials Secret in this namespace is + # cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy + # (allowlisted private-ghcr namespaces only — ADR-0002). Source of + # truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf. + image_pull_secrets { + name = "ghcr-credentials" + } container { - image = "viktorbarzin/excalidraw-library:v4" - image_pull_policy = "IfNotPresent" + # ADR-0002: GHA-built (.github/workflows/build-excalidraw.yml), + # PRIVATE ghcr; Keel rolls new :latest digests. DockerHub + # viktorbarzin/excalidraw-library:v4 is the frozen rollback image. 
+ image = "ghcr.io/viktorbarzin/excalidraw-library:latest" + image_pull_policy = "Always" name = "excalidraw" port { container_port = 8080 @@ -107,7 +126,7 @@ resource "kubernetes_deployment" "excalidraw" { } lifecycle { ignore_changes = [ - spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates metadata[0].annotations["keel.sh/policy"], metadata[0].annotations["keel.sh/trigger"], diff --git a/stacks/excalidraw/project/README.md b/stacks/excalidraw/project/README.md index 0f017e85..c9c95078 100644 --- a/stacks/excalidraw/project/README.md +++ b/stacks/excalidraw/project/README.md @@ -4,18 +4,28 @@ A self-hosted Excalidraw library with per-user drawing storage and management. ## Features -- Dashboard to manage all your drawings +- Dashboard to manage all your drawings (create, open, rename, delete) - Per-user storage (via Authentik SSO headers) -- Create, edit, and delete drawings +- Rename drawings from the dashboard or by clicking the drawing name in the editor +- Native Excalidraw export via the editor's hamburger menu: "Save to..." + (.excalidraw file) and "Export image..." (PNG / SVG / clipboard) +- Autosave (2s debounce) + manual save (Ctrl+S or menu "Save now") - Persistent storage via NFS ## Docker Image ``` -viktorbarzin/excalidraw-library:v4 +ghcr.io/viktorbarzin/excalidraw-library:latest ``` -Available on Docker Hub: https://hub.docker.com/r/viktorbarzin/excalidraw-library +Built by GitHub Actions (`.github/workflows/build-excalidraw.yml` in the infra +repo, ADR-0002) on every master push touching `stacks/excalidraw/project/**`; +tags `:latest` + `:`. The package is PRIVATE — cluster pulls use the +Kyverno-synced `ghcr-credentials` secret. Keel polls `:latest` and rolls the +deployment on digest change. 
+ +The legacy manually-built DockerHub image `viktorbarzin/excalidraw-library:v4` +is frozen as the rollback target; nothing pushes to it anymore. ## Configuration @@ -39,54 +49,13 @@ Mount a persistent volume to the `DATA_DIR` path. Drawings are stored as `.excal └── my-diagram.excalidraw ``` +The filename (without extension) is both the drawing ID and its display name; +renaming a drawing renames the file (`os.Rename`, mtime preserved). + ## Deployment -### Docker - -```bash -docker run -d \ - --name excalidraw-rooms \ - -p 8080:8080 \ - -v /path/to/storage:/data \ - viktorbarzin/excalidraw-library:v4 -``` - -### Kubernetes - -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: excalidraw -spec: - replicas: 1 - selector: - matchLabels: - app: excalidraw - template: - metadata: - labels: - app: excalidraw - spec: - containers: - - name: excalidraw - image: viktorbarzin/excalidraw-library:v4 - ports: - - containerPort: 8080 - env: - - name: DATA_DIR - value: /data - - name: PORT - value: "8080" - volumeMounts: - - name: data - mountPath: /data - volumes: - - name: data - nfs: - server: 192.168.1.127 - path: /srv/nfs/excalidraw -``` +Deployed by the `stacks/excalidraw` Terraform stack (namespace `excalidraw`, +service `draw`, ingress `draw.viktorbarzin.me` with `auth = "required"`). ### With Authentik SSO @@ -96,23 +65,7 @@ The application reads user identity from Authentik headers: - `X-Authentik-Email` - Displayed in UI - `X-Authentik-Name` - Displayed in UI -Configure your ingress to pass these headers: - -```yaml -annotations: - nginx.ingress.kubernetes.io/auth-response-headers: "X-authentik-username,X-authentik-email,X-authentik-name" -``` - -## Building - -```bash -# Build the Docker image -docker build -t excalidraw-library . - -# Or build locally -go build -o excalidraw-library . -./excalidraw-library -``` +Requests without `X-Authentik-Username` fall back to the `anonymous` user. 
## API Endpoints @@ -122,10 +75,25 @@ go build -o excalidraw-library . | GET | `/api/drawings` | List all drawings for current user | | GET | `/api/drawings/:id` | Get drawing data | | PUT | `/api/drawings/:id` | Save drawing | +| PATCH | `/api/drawings/:id` | Rename drawing — body `{"name": ""}`; returns `{"status":"renamed","id":""}`; 409 if the target name exists | | DELETE | `/api/drawings/:id` | Delete drawing | | GET | `/api/user` | Get current user info | | GET | `/draw/:id` | Open drawing in editor | +Rename names are sanitized server-side to `[a-zA-Z0-9-_]` (other characters +become `-`; a trailing `.excalidraw` is stripped). Existing IDs are accepted +as-is for backward compatibility with API clients. + +## Development + +```bash +# Run tests +go test ./... + +# Run locally +DATA_DIR=/tmp/excalidraw-data go run . +``` + ## License MIT diff --git a/stacks/excalidraw/project/main.go b/stacks/excalidraw/project/main.go index e6dfbd83..b444f6cf 100644 --- a/stacks/excalidraw/project/main.go +++ b/stacks/excalidraw/project/main.go @@ -9,6 +9,7 @@ import ( "net/http" "os" "path/filepath" + "regexp" "sort" "strings" "time" @@ -63,6 +64,21 @@ func getUsername(r *http.Request) string { return username } +var invalidNameChars = regexp.MustCompile(`[^a-zA-Z0-9-_]`) + +// sanitizeName normalizes a user-supplied drawing name into a safe file ID +// (same charset the dashboard applies on create). Returns "" if nothing +// meaningful remains. 
+func sanitizeName(name string) string { + name = strings.TrimSpace(name) + name = strings.TrimSuffix(name, ".excalidraw") + name = invalidNameChars.ReplaceAllString(name, "-") + if strings.Trim(name, "-") == "" { + return "" + } + return name +} + // getUserDataDir returns the data directory for a specific user and ensures it exists func getUserDataDir(username string) string { userDir := filepath.Join(dataDir, username) @@ -168,6 +184,41 @@ func handleDrawing(w http.ResponseWriter, r *http.Request) { w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(map[string]string{"status": "saved", "id": id}) + case http.MethodPatch: + var req struct { + Name string `json:"name"` + } + if err := json.NewDecoder(r.Body).Decode(&req); err != nil { + http.Error(w, "Invalid JSON body", http.StatusBadRequest) + return + } + newID := sanitizeName(req.Name) + if newID == "" { + http.Error(w, "Invalid name", http.StatusBadRequest) + return + } + if _, err := os.Stat(filePath); err != nil { + if os.IsNotExist(err) { + http.Error(w, "Drawing not found", http.StatusNotFound) + } else { + http.Error(w, err.Error(), http.StatusInternalServerError) + } + return + } + if newID != id { + newPath := filepath.Join(userDataDir, newID+".excalidraw") + if _, err := os.Stat(newPath); err == nil { + http.Error(w, "A drawing with that name already exists", http.StatusConflict) + return + } + if err := os.Rename(filePath, newPath); err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + return + } + } + w.Header().Set("Content-Type", "application/json") + json.NewEncoder(w).Encode(map[string]string{"status": "renamed", "id": newID}) + case http.MethodDelete: if err := os.Remove(filePath); err != nil { if os.IsNotExist(err) { @@ -264,6 +315,8 @@ const dashboardHTML = ` .btn:hover { background: #5b4cdb; } .btn-danger { background: #e74c3c; } .btn-danger:hover { background: #c0392b; } + .btn-secondary { background: #3d3d5c; } + .btn-secondary:hover { 
background: #4a4a70; } .btn-small { padding: 0.4rem 0.8rem; font-size: 0.85rem; } .drawings { display: grid; gap: 1rem; } .drawing { @@ -342,11 +395,11 @@ const dashboardHTML = ` @@ -369,31 +422,63 @@ const dashboardHTML = ` } } + function drawingRow(d) { + var row = document.createElement('div'); + row.className = 'drawing'; + + var info = document.createElement('div'); + info.className = 'drawing-info'; + var nameLink = document.createElement('a'); + nameLink.className = 'drawing-name'; + nameLink.href = '/draw/' + encodeURIComponent(d.id); + nameLink.textContent = d.name; + var meta = document.createElement('div'); + meta.className = 'drawing-meta'; + meta.textContent = 'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' + + new Date(d.modified).toLocaleTimeString() + ' - ' + formatSize(d.size); + info.appendChild(nameLink); + info.appendChild(meta); + + var actions = document.createElement('div'); + actions.className = 'drawing-actions'; + var open = document.createElement('a'); + open.className = 'btn btn-small'; + open.href = '/draw/' + encodeURIComponent(d.id); + open.textContent = 'Open'; + var rename = document.createElement('button'); + rename.className = 'btn btn-small btn-secondary'; + rename.textContent = 'Rename'; + rename.onclick = function() { showRenameModal(d.id); }; + var del = document.createElement('button'); + del.className = 'btn btn-small btn-danger'; + del.textContent = 'Delete'; + del.onclick = function() { deleteDrawing(d.id); }; + actions.appendChild(open); + actions.appendChild(rename); + actions.appendChild(del); + + row.appendChild(info); + row.appendChild(actions); + return row; + } + async function loadDrawings() { const resp = await fetch('/api/drawings'); const drawings = await resp.json(); const container = document.getElementById('drawings'); + container.replaceChildren(); if (!drawings || drawings.length === 0) { - container.innerHTML = '
No drawings yet. Create your first one!
'; + var empty = document.createElement('div'); + empty.className = 'empty'; + empty.textContent = 'No drawings yet. Create your first one!'; + container.appendChild(empty); return; } - container.innerHTML = drawings.map(function(d) { - return '
' + - '
' + - '' + d.name + '' + - '
' + - 'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' + new Date(d.modified).toLocaleTimeString() + - ' - ' + formatSize(d.size) + - '
' + - '
' + - '
' + - 'Open' + - '' + - '
' + - '
'; - }).join(''); + drawings.forEach(function(d) { + container.appendChild(drawingRow(d)); + }); } function formatSize(bytes) { @@ -402,18 +487,64 @@ const dashboardHTML = ` return (bytes / (1024 * 1024)).toFixed(1) + ' MB'; } - function showNewModal() { + var modalAction = null; // invoked with the input value on confirm + + function showModal(title, confirmLabel, initialValue, action) { + document.getElementById('modal-title').textContent = title; + document.getElementById('modal-confirm').textContent = confirmLabel; + var input = document.getElementById('drawingName'); + input.value = initialValue || ''; + modalAction = action; document.getElementById('modal').classList.add('active'); - document.getElementById('drawingName').focus(); + input.focus(); + input.select(); + } + + function showNewModal() { + showModal('New Drawing', 'Create', '', createDrawing); + } + + function showRenameModal(id) { + showModal('Rename Drawing', 'Rename', id, function(value) { + renameDrawing(id, value); + }); } function hideModal() { document.getElementById('modal').classList.remove('active'); document.getElementById('drawingName').value = ''; + modalAction = null; } - async function createDrawing() { - var name = document.getElementById('drawingName').value.trim(); + function confirmModal() { + if (modalAction) modalAction(document.getElementById('drawingName').value); + } + + async function renameDrawing(id, newName) { + newName = (newName || '').trim(); + if (!newName || newName === id) { + hideModal(); + return; + } + var resp = await fetch('/api/drawings/' + encodeURIComponent(id), { + method: 'PATCH', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ name: newName }) + }); + if (resp.status === 409) { + alert('A drawing with that name already exists.'); + return; // keep the modal open so the user can pick another name + } + if (!resp.ok) { + alert('Rename failed: ' + await resp.text()); + return; + } + hideModal(); + loadDrawings(); + } + + async 
function createDrawing(name) { + name = (name || '').trim(); if (!name) { name = 'drawing-' + Date.now(); } @@ -446,7 +577,7 @@ const dashboardHTML = ` } document.getElementById('drawingName').addEventListener('keypress', function(e) { - if (e.key === 'Enter') createDrawing(); + if (e.key === 'Enter') confirmModal(); }); document.getElementById('modal').addEventListener('click', function(e) { diff --git a/stacks/excalidraw/project/main_test.go b/stacks/excalidraw/project/main_test.go new file mode 100644 index 00000000..b4ab14f8 --- /dev/null +++ b/stacks/excalidraw/project/main_test.go @@ -0,0 +1,249 @@ +package main + +import ( + "encoding/json" + "net/http" + "net/http/httptest" + "os" + "path/filepath" + "strings" + "testing" +) + +const testDrawing = `{"type":"excalidraw","version":2,"source":"excalidraw-library","elements":[{"id":"e1"}],"appState":{"viewBackgroundColor":"#ffffff"}}` + +func setupDataDir(t *testing.T) { + t.Helper() + dataDir = t.TempDir() +} + +// doDrawing sends a request to handleDrawing for the given user and returns the recorder. 
+func doDrawing(t *testing.T, method, id, body, user string) *httptest.ResponseRecorder { + t.Helper() + var reader *strings.Reader + if body == "" { + reader = strings.NewReader("") + } else { + reader = strings.NewReader(body) + } + req := httptest.NewRequest(method, "/api/drawings/"+id, reader) + if user != "" { + req.Header.Set("X-Authentik-Username", user) + } + w := httptest.NewRecorder() + handleDrawing(w, req) + return w +} + +func listDrawings(t *testing.T, user string) []Drawing { + t.Helper() + req := httptest.NewRequest(http.MethodGet, "/api/drawings", nil) + if user != "" { + req.Header.Set("X-Authentik-Username", user) + } + w := httptest.NewRecorder() + handleListDrawings(w, req) + if w.Code != http.StatusOK { + t.Fatalf("list: expected 200, got %d", w.Code) + } + var drawings []Drawing + if err := json.Unmarshal(w.Body.Bytes(), &drawings); err != nil { + t.Fatalf("list: bad JSON: %v", err) + } + return drawings +} + +func TestPutGetRoundtrip(t *testing.T) { + setupDataDir(t) + if w := doDrawing(t, http.MethodPut, "foo", testDrawing, "alice"); w.Code != http.StatusOK { + t.Fatalf("PUT: expected 200, got %d: %s", w.Code, w.Body.String()) + } + w := doDrawing(t, http.MethodGet, "foo", "", "alice") + if w.Code != http.StatusOK { + t.Fatalf("GET: expected 200, got %d", w.Code) + } + if w.Body.String() != testDrawing { + t.Errorf("GET: content mismatch: %s", w.Body.String()) + } +} + +func TestGetMissing(t *testing.T) { + setupDataDir(t) + if w := doDrawing(t, http.MethodGet, "nope", "", "alice"); w.Code != http.StatusNotFound { + t.Fatalf("expected 404, got %d", w.Code) + } +} + +func TestListDrawings(t *testing.T) { + setupDataDir(t) + doDrawing(t, http.MethodPut, "one", testDrawing, "alice") + doDrawing(t, http.MethodPut, "two", testDrawing, "alice") + drawings := listDrawings(t, "alice") + if len(drawings) != 2 { + t.Fatalf("expected 2 drawings, got %d", len(drawings)) + } + ids := map[string]bool{drawings[0].ID: true, drawings[1].ID: true} + if 
!ids["one"] || !ids["two"] { + t.Errorf("unexpected ids: %v", ids) + } + for _, d := range drawings { + if d.Name != d.ID { + t.Errorf("name should equal id: %+v", d) + } + } +} + +func TestDelete(t *testing.T) { + setupDataDir(t) + doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") + if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusOK { + t.Fatalf("DELETE: expected 200, got %d", w.Code) + } + if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound { + t.Fatalf("GET after delete: expected 404, got %d", w.Code) + } + if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusNotFound { + t.Fatalf("second DELETE: expected 404, got %d", w.Code) + } +} + +func TestPerUserIsolation(t *testing.T) { + setupDataDir(t) + doDrawing(t, http.MethodPut, "secret", testDrawing, "alice") + if w := doDrawing(t, http.MethodGet, "secret", "", "bob"); w.Code != http.StatusNotFound { + t.Fatalf("bob should not see alice's drawing, got %d", w.Code) + } + if drawings := listDrawings(t, "bob"); len(drawings) != 0 { + t.Fatalf("bob's list should be empty, got %d", len(drawings)) + } +} + +// --- rename (PATCH) --- + +func renameReq(t *testing.T, id, newName, user string) *httptest.ResponseRecorder { + t.Helper() + return doDrawing(t, http.MethodPatch, id, `{"name":`+strconv(newName)+`}`, user) +} + +// strconv JSON-quotes a string via json.Marshal; the name shadows the strconv package, which is not imported here.
+func strconv(s string) string { + b, _ := json.Marshal(s) + return string(b) +} + +func TestRenameSuccess(t *testing.T) { + setupDataDir(t) + doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") + w := renameReq(t, "foo", "bar", "alice") + if w.Code != http.StatusOK { + t.Fatalf("PATCH: expected 200, got %d: %s", w.Code, w.Body.String()) + } + var resp map[string]string + if err := json.Unmarshal(w.Body.Bytes(), &resp); err != nil { + t.Fatalf("PATCH: bad JSON: %v", err) + } + if resp["id"] != "bar" || resp["status"] != "renamed" { + t.Errorf("unexpected response: %v", resp) + } + if w := doDrawing(t, http.MethodGet, "bar", "", "alice"); w.Code != http.StatusOK || w.Body.String() != testDrawing { + t.Errorf("GET new id: code=%d content=%q", w.Code, w.Body.String()) + } + if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound { + t.Errorf("GET old id: expected 404, got %d", w.Code) + } +} + +func TestRenameConflict(t *testing.T) { + setupDataDir(t) + doDrawing(t, http.MethodPut, "a", testDrawing, "alice") + doDrawing(t, http.MethodPut, "b", testDrawing, "alice") + if w := renameReq(t, "a", "b", "alice"); w.Code != http.StatusConflict { + t.Fatalf("expected 409, got %d", w.Code) + } + // both drawings intact + for _, id := range []string{"a", "b"} { + if w := doDrawing(t, http.MethodGet, id, "", "alice"); w.Code != http.StatusOK { + t.Errorf("drawing %q should be intact, got %d", id, w.Code) + } + } +} + +func TestRenameMissing(t *testing.T) { + setupDataDir(t) + if w := renameReq(t, "nope", "new", "alice"); w.Code != http.StatusNotFound { + t.Fatalf("expected 404, got %d", w.Code) + } +} + +func TestRenameSameName(t *testing.T) { + setupDataDir(t) + doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") + w := renameReq(t, "foo", "foo", "alice") + if w.Code != http.StatusOK { + t.Fatalf("same-name rename: expected 200, got %d: %s", w.Code, w.Body.String()) + } + if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); 
w.Code != http.StatusOK { + t.Errorf("drawing should be intact, got %d", w.Code) + } +} + +func TestRenameInvalidNames(t *testing.T) { + setupDataDir(t) + doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") + for _, name := range []string{"", " ", "../..", "---"} { + if w := renameReq(t, "foo", name, "alice"); w.Code != http.StatusBadRequest { + t.Errorf("rename to %q: expected 400, got %d", name, w.Code) + } + } + // malformed body + if w := doDrawing(t, http.MethodPatch, "foo", `{not json`, "alice"); w.Code != http.StatusBadRequest { + t.Errorf("malformed body: expected 400, got %d", w.Code) + } +} + +func TestRenameSanitization(t *testing.T) { + setupDataDir(t) + cases := []struct{ in, want string }{ + {"My Drawing!", "My-Drawing-"}, + {"net diag.excalidraw", "net-diag"}, // .excalidraw suffix stripped, not mangled + {"a/b\\c", "a-b-c"}, + } + for _, c := range cases { + doDrawing(t, http.MethodPut, "src", testDrawing, "alice") + w := renameReq(t, "src", c.in, "alice") + if w.Code != http.StatusOK { + t.Errorf("rename to %q: expected 200, got %d: %s", c.in, w.Code, w.Body.String()) + continue + } + var resp map[string]string + json.Unmarshal(w.Body.Bytes(), &resp) + if resp["id"] != c.want { + t.Errorf("rename to %q: expected id %q, got %q", c.in, c.want, resp["id"]) + } + // file must be inside the user dir under the sanitized name + if _, err := os.Stat(filepath.Join(dataDir, "alice", c.want+".excalidraw")); err != nil { + t.Errorf("rename to %q: expected file %q on disk: %v", c.in, c.want, err) + } + doDrawing(t, http.MethodDelete, resp["id"], "", "alice") + } +} + +func TestRenameTraversalStaysInUserDir(t *testing.T) { + setupDataDir(t) + doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") + w := renameReq(t, "foo", "../../../etc/passwd", "alice") + if w.Code == http.StatusOK { + var resp map[string]string + json.Unmarshal(w.Body.Bytes(), &resp) + if strings.Contains(resp["id"], "/") || strings.Contains(resp["id"], "..") { + t.Fatalf("traversal 
characters survived: %q", resp["id"]) + } + if _, err := os.Stat(filepath.Join(dataDir, "alice", resp["id"]+".excalidraw")); err != nil { + t.Fatalf("renamed file escaped user dir: %v", err) + } + } + // nothing outside the data dir + if _, err := os.Stat(filepath.Join(dataDir, "..", "etc")); err == nil { + t.Fatal("file escaped the data dir") + } +} diff --git a/stacks/excalidraw/project/static/editor.html b/stacks/excalidraw/project/static/editor.html index aba6390b..f374c115 100644 --- a/stacks/excalidraw/project/static/editor.html +++ b/stacks/excalidraw/project/static/editor.html @@ -8,41 +8,41 @@ * { margin: 0; padding: 0; } html, body { width: 100%; height: 100%; overflow: hidden; } #root { width: 100%; height: 100%; } - .toolbar { - position: fixed; - top: 10px; - left: 10px; - z-index: 1000; + .top-right-ui { display: flex; + align-items: center; gap: 8px; - background: rgba(255,255,255,0.95); + font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; + } + .top-right-ui a, .top-right-ui button { + display: inline-flex; + align-items: center; + gap: 6px; padding: 8px 12px; + border: 1px solid transparent; border-radius: 8px; - box-shadow: 0 2px 8px rgba(0,0,0,0.15); - } - .toolbar button, .toolbar a { - padding: 6px 14px; - border: none; - border-radius: 6px; cursor: pointer; - font-size: 14px; - background: #6c5ce7; - color: white; + font-size: 13px; text-decoration: none; - display: inline-block; + box-shadow: 0 1px 4px rgba(0,0,0,0.12); + max-width: 40vw; + white-space: nowrap; + overflow: hidden; + text-overflow: ellipsis; } - .toolbar button:hover, .toolbar a:hover { background: #5b4cdb; } - .toolbar .secondary { background: #ddd; color: #333; } - .toolbar .secondary:hover { background: #ccc; } - .toolbar .title { - font-weight: 600; - padding: 6px 0; - color: #333; + .top-right-ui.theme-light a, .top-right-ui.theme-light button { + background: #ffffff; + color: #1b1b1f; } + .top-right-ui.theme-dark a, .top-right-ui.theme-dark 
button { + background: #232329; + color: #e9ecef; + } + .top-right-ui button:hover, .top-right-ui a:hover { border-color: #a29bfe; } .status { position: fixed; bottom: 10px; - right: 10px; + right: 60px; padding: 6px 12px; background: rgba(0,0,0,0.7); color: white; @@ -51,6 +51,7 @@ z-index: 1000; opacity: 0; transition: opacity 0.3s; + pointer-events: none; } .status.show { opacity: 1; } .loading { @@ -67,11 +68,6 @@ -
- Back to Library - Loading... - -
Loading Excalidraw...
@@ -81,16 +77,33 @@
Saved