diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 301e31c5..facc4c2a 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -36,7 +36,7 @@ Violations cause state drift, which causes future applies to break or silently r
 - `auth = "none"` — no Authentik, no own-auth claim. Use for Anubis-fronted content (Anubis is the gate), native-client APIs (Git, `/v2/`, WebDAV/CalDAV, CardDAV), webhook receivers, OAuth callbacks, and Authentik outposts themselves.
 - **Anti-exposure rule** (the reason `"app"` exists): only pick `"app"` or `"none"` AFTER you've verified the app has its own user auth (`"app"`) OR the endpoint is intentionally public (`"none"`). Default is `"required"` so accidental omission fails closed. **Convention**: when using `"app"` or `"none"`, add a comment line above the `auth = "..."` line stating what gates the app or why it's public. **Enforced by `scripts/tg`**: every `tg plan/apply/destroy/refresh` runs `scripts/check-ingress-auth-comments.py` against the current stack and aborts if any `auth = "app|none"` line lacks the preceding `# auth = "": ...` comment. Stack-scoped — untouched stacks aren't blocked until they're next edited.
 - **Anti-AI**: on by default when `auth = "none"` or `auth = "app"` (no Authentik to discourage bots); redundant on `"required"` and `"public"`.
- - **DNS**: `dns_type = "proxied"` (Cloudflare CDN), `"non-proxied"` (direct A/AAAA to the public IP), or `"internal"` (public A record carrying the INTERNAL Traefik LB IP `10.0.20.203` — resolvable everywhere, routable only from home LANs/WG sites/VPN; the record is reachability, NOT a gate — pair with `extra_middlewares = ["traefik-home-lans-only@kubernetescrd"]`, since direct-to-WAN-IP SNI requests still reach Traefik, and NEVER combine that allowlist with `"proxied"` — cloudflared pod source IPs sit inside 10/8 and would bypass it. First users: the immich-frame kiosks, `docs/plans/2026-07-04-immich-frame-lan-only-design.md`).
DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend). + - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend). - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://..svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering. - **Docker images**: Always build for `linux/amd64`. 
SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern. - **Image registry**: **Owned images now live on `ghcr.io/viktorbarzin/`** (ADR-0002, built by GHA — see the CI/CD Architecture section). The **Forgejo container registry is FROZEN + emptied** (break-glass only — `docs/runbooks/forgejo-registry-breakglass.md`); nothing pushes to it. The rest of this bullet documents the **still-live forgejo-pull DNS/mirror machinery** (it remains in place for the break-glass path + because `registry-credentials` is still Kyverno-synced; the hairpin lessons apply to any internal-registry pull). Historical usage was `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). 
The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **In-cluster pods are ordinary internal clients too** (since 2026-06-10 evening) — CoreDNS's dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`) forwards to the Technitium ClusterIP `10.96.0.53`, so pods get the same split-horizon answers as everyone else; forgejo stays pinned to Traefik's **ClusterIP** in that block (TF-interpolated from the live Service) so CI pushes survive a Technitium outage. This relies on a k8s-1.34 behavior verified 2026-06-10: **pods CAN reach the ETP=Local Traefik LB IP** (kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path) — re-verify after major k8s upgrades; canary = the uptime-kuma `[External]` fleet going red. (The block briefly forwarded to `8.8.8.8/1.1.1.1` earlier that day, which kept pods on the WAN IP and the broken TP-Link NAT loopback — 27 non-proxied `[External]` monitors dark; beads code-yh33.) **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. 
Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. @@ -137,7 +137,7 @@ audiobook-search) now also land on ghcr. chrome-service-novnc, android-emulator. 
- **PRIVATE ghcr:** f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, - infra-ci, k8s-portal, excalidraw-library. Pulled via the Kyverno-synced `ghcr-credentials` allowlist + infra-ci, k8s-portal. Pulled via the Kyverno-synced `ghcr-credentials` allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; NOT cluster-wide; cred = Vault `secret/viktor/ghcr_pull_token`, a dedicated classic PAT scoped to `read:packages` (UI-minted 2026-06-15; no longer the admin `github_pat` @@ -153,9 +153,7 @@ github↔forgejo divergence was deliberately NOT reconciled): `build-cli.yml` → DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli`; `build-infra-ci.yml` → `ghcr.io/viktorbarzin/infra-ci`; `build-k8s-portal.yml` → PRIVATE `ghcr.io/viktorbarzin/k8s-portal` (Keel-deployed; the LAST in-cluster -Woodpecker build, migrated 2026-06-13 — completes "no local builds"); `build-excalidraw.yml` → -PRIVATE `ghcr.io/viktorbarzin/excalidraw-library` (Keel-deployed; replaced -manual DockerHub pushes 2026-07-02 — DockerHub `:v4` frozen as rollback). **infra-ci** +Woodpecker build, migrated 2026-06-13 — completes "no local builds"). **infra-ci** is the image the `.woodpecker/default.yml` apply step + `drift-detection.yml` run in (proven by pipelines 165/166). chatterbox-tts is already built by tripit's GHA → ghcr. The Woodpecker `build-ci-image.yml` + `build-cli.yml` pipelines were REMOVED; @@ -218,7 +216,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). | Service | Key Operational Knowledge | |---------|--------------------------| | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe | -| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. 
**`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which had no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). **Since 2026-07-02 the T4 has a scheduler-level VRAM budget + watchdog (ADR-0016)**: each GPU tenant declares `viktorbarzin.me/gpumem` MiB (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; node advertises 14000) and the `gpu-vram-watchdog` (nvidia ns) recycles the biggest over-budget tenant when free VRAM < 1536 MiB — currently **DRY_RUN=true** (observe-only; flip `watchdog_dry_run` in `stacks/nvidia/modules/nvidia/gpu_memory_budget.tf` to arm). KNOWN MISCALIBRATION (2026-07-02): llama-swap's real qwen3-8b@16k resident is ~7 GB (the 4.35 GiB figure was weights-only cudaMalloc), so retune budgets (ctx 16k→8k + llama-swap 6144 + immich-ml 2500, or rebalance) BEFORE arming, else the watchdog would recycle llama-swap first. TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. 
The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. 
To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | +| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). 
At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` | | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU | | Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. 
Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. **Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). 
The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login//` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. | @@ -232,7 +230,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). - **Cascade inhibitions** (`inhibit_rules`): `NodeDown` AND `NodeConditionBad`/`NodeDiskPressure` suppress downstream pod-churn alerts (PodCrashLooping/PodImagePullBackOff/PodsStuckContainerCreating/ScrapeTargetDown/*ReplicasMismatch); `T3ProbeLegDown` suppresses `T3ProbeDropBurst` for the same `leg`; plus existing NFS/Traefik/Authentik/Power/Tuya/iDRAC cascades. No `equal` on the node rules (pod alerts carry no `node` label → cluster-wide, like NodeDown). - **ScrapeTargetDown scrapes only Ready endpoints** (relabel `keep __meta_kubernetes_endpoint_ready=true` on both `kubernetes-service-endpoints` jobs) — completed CronJob pods lingering as NotReady EndpointSlice addresses no longer fire phantom "down" alerts (tts/tripit/beads, id=4895). Replaces the old "exclude completed CronJob pods" guidance; a Ready pod with a broken metrics endpoint still fires. - Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable. 
-- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever the ingress has a public DNS record (`dns_type` `"proxied"`/`"non-proxied"`; `"internal"` and `"none"` get none — set `external_monitor = false` explicitly on internal-only ingresses so the sync's default opt-in doesn't re-add a doomed monitor; see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
+- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
 - **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
 - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index 7c84dd3b..447620d9 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -81,7 +81,7 @@
 | ytdlp | YouTube downloader | ytdlp |
 | wealthfolio | Finance tracking | wealthfolio |
 | audiobookshelf | Audiobook server (may be merged into ebooks stack) | audiobookshelf |
-| paperless-ngx | Document management. Mail ingest: forward document emails to `docs@viktorbarzin.me` — sender maps 1:1 to a paperless account (runbook `paperless-mail-ingest.md`) | paperless-ngx |
+| paperless-ngx | Document management | paperless-ngx |
 | jsoncrack | JSON visualizer | jsoncrack |
 | servarr | Media automation (Sonarr/Radarr/etc) | servarr |
 | aiostreams | Stremio stream aggregator (Real-Debrid + Torrentio/Comet/StremThru Torz/Knaben; **MediaFusion removed 2026-06-07** — broken upstream `500`). `auth=app` (own UUID+password); stream-probe tests **both series+movie paths** with per-source breakdown (`aiostreams_streams_{comet,torrentio,stremthru_torz,knaben}`) + `aiostreams_error_streams` + `aiostreams_movie_stream_count`, success gated on Comet (workhorse) being alive; weekly NFS config + Stremio-account-collection backups to `/srv/nfs/aiostreams-backup/`. PG-backed user config (Comet timeout bumped 5s→10s 2026-06-07). | servarr/aiostreams |
@@ -99,7 +99,6 @@
 | tor-proxy | Tor proxy | tor-proxy |
 | forgejo | Git forge.
Open native self-signup (Turnstile captcha + email confirm) + Authentik & GitHub OAuth sign-in; see `docs/runbooks/forgejo-open-signups.md` | forgejo | | freshrss | RSS reader | freshrss | -| drone-logbook | DJI flight-log analyzer (Open DroneLog, upstream image) — dronelog.viktorbarzin.me | drone-logbook | | navidrome | Music streaming | navidrome | | networking-toolbox | Network tools | networking-toolbox | | stirling-pdf | PDF tools | stirling-pdf | @@ -121,9 +120,7 @@ | status-page | Status page | status-page | | plotting-book | Book plotting/world-building app | plotting-book | | tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). `auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit | -| tasks | Reminders-style tasks PWA over Nextcloud CalDAV (FastAPI + SvelteKit SPA same-origin, single container; code `~/code/tasks`, design `tasks/docs/2026-07-03-tasks-pwa-design.md`). 
Nextcloud stays the source of truth (VTODOs); the app is the front-end Apple Reminders stopped being. CNPG (`tasks` db, Vault static role `pg-tasks`) stores Connected Accounts — per-user Nextcloud app passwords Fernet-encrypted with `fernet_key` from `secret/tasks`. `auth=required` (Authentik forward-auth; identity = `X-authentik-username`, NO app-level login — `DEV_USER` must never be set in prod) at tasks.viktorbarzin.me (proxied). Exception: the five PWA icon/manifest files (`/apple-touch-icon.png`, `/favicon.png`, `/pwa-192x192.png`, `/pwa-512x512.png`, `/manifest.webmanifest`) are a path-scoped `auth=none` carve-out (`module.ingress_icons`) so cookie-less OS icon fetchers (macOS Safari Add-to-Dock, mobile home-screen installs) get the real icon instead of the Authentik 302; guarded by the `tasks-icons` walloff-probe target. NetworkPolicy `tasks-ingress` (SEC-1) restricts pod ingress to traefik + monitoring namespaces so the trusted header can't be spoofed pod-to-pod. GHA → public ghcr `tasks` → Woodpecker deploy (ADR-0002). | tasks | -| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me — **a Valia site on Cloudflare Pages since 2026-07-03** (ADR-0018): registry entry in `stacks/valia-sites`, synced from Drive folder "claude" every 10 min, deploy-on-change. The old in-cluster stack (nginx off PVE NFS + per-site rclone CronJob) is RETIRED — stacks/stem95su is a tombstone; `secret/stem95su` superseded by `secret/valia-sites`; `stem_video.mp4` was compressed 42.9→21.4MB (25MB Pages cap) with Viktor's OK. See docs/runbooks/valia-sites.md. | — | -| valia-sites | **Valia-site registry + sync** (ADR-0018): all sites authored by Valia serve OFF-INFRA on Cloudflare Pages (`bridge` + `stem95su` live). One map entry in `stacks/valia-sites/main.tf` per site fans out Pages project + custom domain + public CNAME + internal split-horizon CNAME (ConfigMap `valia-sites-dns` → technitium sync, declarative incl. 
removal). CronJob `valia-sites-sync` (`*/10`, image ghcr `valia-sites-sync`) mirrors each Drive Content folder (rclone `drive.readonly`, stem95su-style guards + 25MB Pages-cap guard) and wrangler-deploys ONLY on manifest change (free-tier deploy cap). Secrets `secret/valia-sites` (shared rclone conf + SCOPED CF Pages token — Global API Key never in pods). Failed-Job-only visibility by choice. Runbook: docs/runbooks/valia-sites.md. | valia-sites | +| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. 
Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su | | trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek | ## Cloudflare Domains @@ -133,7 +130,7 @@ blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, -travel, netbox, phpipam, tripit, t3, stem95su, tasks +travel, netbox, phpipam, tripit, t3, stem95su ``` ### Non-Proxied (Direct DNS) diff --git a/.github/workflows/build-excalidraw.yml b/.github/workflows/build-excalidraw.yml deleted file mode 100644 index 7f58131f..00000000 --- a/.github/workflows/build-excalidraw.yml +++ /dev/null @@ -1,42 +0,0 @@ -name: Build excalidraw-library - -# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind -# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls -# ghcr:latest and rolls the deployment. 
Replaces the manual DockerHub pushes -# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image). -on: - push: - branches: [master] - paths: - - 'stacks/excalidraw/project/**' - workflow_dispatch: {} - -permissions: - contents: read - packages: write - -jobs: - build: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: actions/setup-go@v5 - with: - go-version: '1.21' - - run: go test ./... - working-directory: stacks/excalidraw/project - - uses: docker/setup-buildx-action@v3 - - uses: docker/login-action@v3 - with: - registry: ghcr.io - username: ${{ github.actor }} - password: ${{ secrets.GITHUB_TOKEN }} - - uses: docker/build-push-action@v6 - with: - context: stacks/excalidraw/project - platforms: linux/amd64 - provenance: false - push: true - tags: | - ghcr.io/viktorbarzin/excalidraw-library:latest - ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }} diff --git a/.github/workflows/build-valia-sites-sync.yml b/.github/workflows/build-valia-sites-sync.yml deleted file mode 100644 index 090b7f5c..00000000 --- a/.github/workflows/build-valia-sites-sync.yml +++ /dev/null @@ -1,39 +0,0 @@ -name: Build valia-sites-sync - -# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public). -# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob. -# Rebuilds are rare (tool pins only change deliberately) → dispatch + path. -# Security note: no untrusted event inputs are interpolated anywhere (only -# github.actor / github.sha / GITHUB_TOKEN — same shape as the other -# build-*.yml workflows in this repo). 
-on: - push: - branches: [master] - paths: - - 'stacks/valia-sites/sync-image/**' - workflow_dispatch: {} - -permissions: - contents: read - packages: write - -jobs: - build: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: docker/setup-buildx-action@v3 - - uses: docker/login-action@v3 - with: - registry: ghcr.io - username: ${{ github.actor }} - password: ${{ secrets.GITHUB_TOKEN }} - - uses: docker/build-push-action@v6 - with: - context: stacks/valia-sites/sync-image - platforms: linux/amd64 - provenance: false - push: true - tags: | - ghcr.io/viktorbarzin/valia-sites-sync:latest - ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }} diff --git a/AGENTS.md b/AGENTS.md index 43f06b8e..4e3ea2de 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -95,7 +95,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro ## Key Paths - `stacks//main.tf` — service definition - `stacks/platform/modules//` — core infra modules -- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"`, `"non-proxied"`, or `"internal"` — a public A record carrying the internal Traefik LB IP for household-only services; pair with the `home-lans-only` ipAllowList middleware, never with `"proxied"`) +- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"` or `"non-proxied"`) - `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount) - `config.tfvars` — non-secret configuration (plaintext) - `secrets.sops.json` — all secrets (SOPS-encrypted JSON) diff --git a/CONTEXT.md b/CONTEXT.md index 76b101d0..fa5113d5 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -118,14 +118,6 @@ _Avoid_: "external", "outside". `viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network. _Avoid_: bare "lan", "private", "intranet". 
-**Segment**: -One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q. -_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept). - -**CCTV segment**: -The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017). -_Avoid_: "camera VLAN", "CCTV LAN". - **Ingress auth**: The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed). _Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier. @@ -237,20 +229,6 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**. **Anubis**: A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW). -### Externally-authored sites - -**Valia site**: -A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. 
Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`. -_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**. - -**Content folder**: -The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site. -_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root). - -**Entry file**: -The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring. -_Avoid_: asking Valia to rename her files to fit hosting conventions. - ## Relationships - A **Service** is defined by exactly one **Stack** — **flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads. @@ -262,7 +240,6 @@ _Avoid_: asking Valia to rename her files to fit hosting conventions. - A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither. - An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**. - Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store. 
-- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra. ## Example dialogue diff --git a/cli/VERSION b/cli/VERSION index 87a1cf59..fd2726c9 100644 --- a/cli/VERSION +++ b/cli/VERSION @@ -1 +1 @@ -v0.12.0 +v0.11.0 diff --git a/cli/cmd_memory.go b/cli/cmd_memory.go index 129d07b2..7ae11ea0 100644 --- a/cli/cmd_memory.go +++ b/cli/cmd_memory.go @@ -30,21 +30,11 @@ func memoryCommands() []Command { } } -// printMemories renders a {memories:[…]} response as one line per memory, or raw JSON. +// printMemories renders a {memories:[…]} response as compact lines, or raw JSON. func printMemories(raw []byte, jsonOut bool) error { - fmt.Print(renderMemories(raw, jsonOut)) - return nil -} - -// renderMemories formats each memory as a single line with its FULL content -// (newlines flattened to spaces). Content is deliberately never truncated: the -// old 240-rune preview cut memories mid-sentence, misled agents into believing -// no full-content read-back existed, and made blind `update --content` from -// the preview silently destroy the stored tail. Full passthrough also can't -// produce invalid UTF-8 (the old mid-rune cut crashed the recall hook). 
-func renderMemories(raw []byte, jsonOut bool) string { if jsonOut { - return string(raw) + "\n" + fmt.Println(string(raw)) + return nil } var r struct { Memories []struct { @@ -56,20 +46,36 @@ func renderMemories(raw []byte, jsonOut bool) string { } `json:"memories"` } if err := json.Unmarshal(raw, &r); err != nil { - return string(raw) + "\n" + fmt.Println(string(raw)) + return nil } if len(r.Memories) == 0 { - return "(no memories)\n" + fmt.Println("(no memories)") + return nil } - var b strings.Builder for _, m := range r.Memories { - c := strings.ReplaceAll(m.Content, "\n", " ") - fmt.Fprintf(&b, "#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) + c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240) + fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) if m.Tags != "" { - fmt.Fprintf(&b, " tags: %s\n", m.Tags) + fmt.Printf(" tags: %s\n", m.Tags) } } - return b.String() + return nil +} + +// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it +// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240] +// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte +// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict +// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit +// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit +// hook error" for Cyrillic-language users. 
+func truncatePreview(s string, maxRunes int) string { + r := []rune(s) + if len(r) <= maxRunes { + return s + } + return string(r[:maxRunes]) + "…" } func memoryRecall(args []string) error { diff --git a/cli/memory_test.go b/cli/memory_test.go index ee21ad12..1c673c7b 100644 --- a/cli/memory_test.go +++ b/cli/memory_test.go @@ -8,53 +8,25 @@ import ( "unicode/utf8" ) -func TestRenderMemoriesFullContent(t *testing.T) { - // The pretty view must NOT truncate content: the old 240-rune preview cut - // memories mid-sentence, misled agents into thinking no full-content - // read-back existed, and made blind `update --content` from the preview - // destroy the stored tail. Full passthrough also removes the mid-rune-cut - // invalid-UTF-8 class by construction — nothing is ever sliced. - long := strings.Repeat("я", 300) + strings.Repeat("a", 300) - raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{ - {"id": 7, "content": long, "category": "facts", "tags": "t1,t2", "importance": 0.7}, - }}) - got := renderMemories(raw, false) - if !strings.Contains(got, long) { - t.Fatalf("content was truncated: %q", got) - } - if strings.Contains(got, "…") { - t.Fatalf("ellipsis in output — truncation still active: %q", got) - } +func TestTruncatePreviewKeepsValidUTF8(t *testing.T) { + // Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits + // invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must + // cut on a rune boundary and always stay valid UTF-8. 
+ long := strings.Repeat("я", 300) // 300 runes / 600 bytes + got := truncatePreview(long, 240) if !utf8.ValidString(got) { - t.Fatalf("invalid UTF-8 in output: %q", got) + t.Fatalf("truncatePreview produced invalid UTF-8: %q", got) } - if !strings.Contains(got, "#7 [facts] (0.70) ") || !strings.Contains(got, "tags: t1,t2") { - t.Fatalf("line format broken: %q", got) + if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' { + t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r)) } -} - -func TestRenderMemoriesFlattensNewlinesToOneLine(t *testing.T) { - // Consumers (the recall hook, terminal skims) rely on one memory per line; - // multi-line content is flattened, never split across lines. - raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{ - {"id": 1, "content": "line one\nline two\nline three", "category": "facts", "importance": 0.5}, - }}) - got := renderMemories(raw, false) - if !strings.Contains(got, "line one line two line three") { - t.Fatalf("newlines not flattened: %q", got) + // Short multibyte strings pass through untouched (no ellipsis). + if got := truncatePreview("кратко", 240); got != "кратко" { + t.Fatalf("short string altered: %q", got) } -} - -func TestRenderMemoriesEdgeCases(t *testing.T) { - if got := renderMemories([]byte(`{"memories":[]}`), false); got != "(no memories)\n" { - t.Fatalf("empty list: %q", got) - } - // --json and unparseable responses pass through raw. - if got := renderMemories([]byte(`{"x":1}`), true); got != "{\"x\":1}\n" { - t.Fatalf("json passthrough: %q", got) - } - if got := renderMemories([]byte(`not json`), false); got != "not json\n" { - t.Fatalf("unparseable passthrough: %q", got) + // ASCII boundary still works. 
+ if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" { + t.Fatalf("ascii truncation wrong: %q", got) } } diff --git a/config.tfvars b/config.tfvars index 9ce566ed..790a48ae 100644 Binary files a/config.tfvars and b/config.tfvars differ diff --git a/docs/adr/0017-cctv-physical-cabling.svg b/docs/adr/0017-cctv-physical-cabling.svg deleted file mode 100644 index 6088f9e3..00000000 --- a/docs/adr/0017-cctv-physical-cabling.svg +++ /dev/null @@ -1,126 +0,0 @@ - - - - - - - - - - - ADR-0017 — physical cabling (single-switch, rev 3) - wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio - - - - APARTMENT - - ☁ ISP (internet) - - - - AX6000 router - 192.168.1.1 · WAN←ISP · 8×LAN - - - Synology NAS · .13 - on an AX6000 LAN port - - - 📶 wifi clients (phones, laptops) - - - - - in-wall run → garage - - - - GARAGE — RACK - - - - TL-SG105PE · 5-port gigabit PoE switch - mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare) - - - P1 - ← apartment - - P2 - ← 4G router - - P3 - ← UPS mgmt - - P4 ⚡PoE - ← camera - - P5 - ← R730 eno1 - - every cable below re-plugs old-switch → PE on camera day (≈3 min) - - - - 4G router · 192.168.1.7 - ~cellular uplink (out-of-band) - - - 📡 cellular - - - - UPS (Huawei) - network mgmt card - - - - - Dell R730 · PVE host · 192.168.1.127 - - - eno1 · LAN1 - ← switch P5 · 1GbE - - eno2 · LAN2 - dark · fallback leg - - eno3 / eno4 - free, uncabled - - iDRAC · .4 - shared-LOM/eno1 - - no other network cables — everything else on this host is VIRTUAL: - pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM … - (power: host + switch fed from the UPS — power wiring not drawn) - - - LAN1 cable - - - - GARAGE ENTRANCE - - vermont-garage camera - HiLook IPC-T241H-C · 10.0.30.70 - powered over the data cable (PoE) - outdoor · armored conduit - - - single cat6 in conduit · data + PoE power (camera day) - - - - - copper, in place - - camera-day cable / dark 
port - - radio (wifi / cellular) - total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3 - - diff --git a/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md b/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md deleted file mode 100644 index d9de098d..00000000 --- a/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md +++ /dev/null @@ -1,99 +0,0 @@ -# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable - -Status: accepted (2026-07-02, rev 3 — single-switch) - -![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg) - -![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg) - -The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook -IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is -physically exposed outside the apartment, so anything plugged into that cable -must land in a segment that can reach nothing. The original design doc -(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk -to pfSense" — but nothing in this network terminates dot1q on pfSense; the -site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean -untagged pfSense interface per segment. - -**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old -garage TL-SG105E (Viktor prefers not running two switches; retired unit -becomes a cold spare, its 192.168.1.6 mgmt IP passes to the PE). Five ports, -all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged -VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1` -carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable. -pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site -idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged -vNIC; pfSense still terminates no dot1q itself). 
The earlier dedicated -`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving -net3 back to vmbr2 restores pure physical isolation in one `qm set`). -This narrows the earlier 802.1Q objection rather than contradicting it: the -rejection assumed *unmanaged* switches, where any LAN device could inject -tagged frames; with the managed PE as the only device on eno1, VLAN-30 -membership is {camera port, trunk port} only, so tag-30 ingress from every -other port — and from the exposed camera cable — is dropped or contained. -Cameras are untrusted: default-deny on dCCTV with a single -NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8) -may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static -route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the -10.0.20.0/22 trusted source-IP allowlist. - -## Traffic on the trunk — how one cable carries two networks - -The LAN1 cable is shared, but the two networks on it diverge at `vmbr0` -(the vlan-aware bridge on the PVE host), and only ONE of them ever touches -pfSense: - -- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it - between the trunk, the host's own IP (192.168.1.127) and pfSense `net0` — - where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home - LAN's gateway is and remains the AX6000; home-LAN traffic never transits - pfSense. Consequently a pfSense (or R730 VM-level) outage does not affect - the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave - the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the - 4G router survives the whole rack being down. -- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers - VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera - segment's gateway, firewall and sole exit. "Camera → AX6000 → internet" - is impossible by construction, not merely by firewall rule. 
-- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed - out of its WAN toward the AX6000. Load-wise the trunk gained only the - camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic. - -![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg) - -*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)* - -## Considered options - -- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan - read this way) — rejected: any LAN device could inject tagged frames into - vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is - undefined. Rev 3 adopts the tagged path ONLY because the managed PE now - polices VLAN-30 membership at the single entry point to eno1; no bridge - reconfiguration was needed (vmbr0 was already vlan-aware). -- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role** - (rev 1/2 as-built) — superseded by rev 3: it forced either a second switch - (6 connections vs 5 ports once the PE also replaced the old switch) or new - hardware. Strongest isolation of all options; kept dormant as the fallback. -- **AX6000 as the camera gateway** — rejected earlier in the design (consumer - router, no inter-VLAN firewall). - -## Consequences - -- The switch is now single-point and load-bearing for everything in the rack - (apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV) AND its VLAN - table + mgmt password are part of the isolation boundary — the Easy Smart - mgmt UI answers on every port, so the password is the gate between a - compromised camera and the switch config. All 5 ports are consumed: the - next camera forces an 8-port PoE upgrade (the wiring plan already fits it). -- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical - leg); eno3/eno4 remain free. -- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6 - (Kea reservation by MAC). 
-- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a - port-VLAN split (conflated the two devices); rev 2 split into two switches - after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3 - consolidated back to one switch — the PE replacing the SG105E — per - Viktor's preference, moving CCTV onto a managed tagged trunk. -- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra - NVDEC stream. diff --git a/docs/adr/0017-cctv-segment-topology.svg b/docs/adr/0017-cctv-segment-topology.svg deleted file mode 100644 index 007b7e16..00000000 --- a/docs/adr/0017-cctv-segment-topology.svg +++ /dev/null @@ -1,178 +0,0 @@ - - - - - - - - - - - - - - - - - ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable - Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1 - - - - - - - - DENY · camera → LAN / other segments / internet (default deny on dCCTV) - - - - GARAGE ENTRANCE - - vermont-garage - HiLook IPC-T241H-C · pure IR - 10.0.30.70 (Kea reservation) - DNS: garage-cam.viktorbarzin.lan - PoE from switch · cloud/P2P off - - - cat6 in conduit · PoE → P4 - - - - RACK — GARAGE · ONE SWITCH - - - TL-SG105PE replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used - - - P1 · V1 - apartment - uplink - - P2 · V1 - 4G router - 192.168.1.7 - - P3 · V1 - UPS mgmt - - P4 · V30 - camera - PoE ON - - P5 · trunk - V1 untagged - + V30 tagged - - 802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged} - tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path - old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports - - - - - LAN1 cable - - - - DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK) - - - - eno1 → vmbr0 - untag V1 + tag 30 - - - eno2 → vmbr2 - dormant fallback leg - - - vmbr1 - internal · tags 10/20 - - - - - pfSense (VM 101) - gateway + firewall for 
every segment - - - net0 · WAN 192.168.1.2 · vmbr0 untagged - - net1 · dManagementsVms 10.0.10.1 - - net2 · dKubernetes 10.0.20.1 - - net3 · dCCTV 10.0.30.1/24 · vmbr0 tag 30 - - - - - - - - - k8s VMs · 10.0.20.0/24 - vmbr1 tag 20 · pod egress SNATs - to node IPs - - Frigate · k8s-node1 (T4) - detect sub / record main - gpumem budget 2300 MiB - - go2rtc LB 10.0.20.204 - restream → HA live view (MSE/HLS) - - - - HOME LAN 192.168.1.0/24 - - AX6000 · .1 - + route 10.0.30.0/24 → .2 - - ha-sofia · .8 - Frigate card + hikvision_next - - apartment clients - laptops, phones - - CAMERA DAY: static route - 10.0.30.0/24 via 192.168.1.2 - - - apartment uplink · switch P1 · trunk · eno1 - - - - ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all) - - - ALLOW · ha-sofia → camera :80 ISAPI + :554 - enters pfSense WAN · reply-to off · needs the AX6000 route - - - ALLOW · camera → 10.0.30.1:123 (NTP) - - - - - home LAN / VLAN 1 - - CCTV / VLAN 30 / dCCTV 10.0.30.0/24 - - dKubernetes - - dManagementsVms - - allowed flow - - denied - - camera-day step - ADR-0017 · rev 3 - - diff --git a/docs/adr/0017-cctv-vlan-tagging.excalidraw b/docs/adr/0017-cctv-vlan-tagging.excalidraw deleted file mode 100644 index 26eb9abd..00000000 --- a/docs/adr/0017-cctv-vlan-tagging.excalidraw +++ /dev/null @@ -1,1771 +0,0 @@ -{ - "type": "excalidraw", - "version": 2, - "source": "https://excalidraw.viktorbarzin.me", - "elements": [ - { - "id": "el001", - "type": "text", - "x": 40, - "y": 20, - "width": 621.6, - "height": 35.0, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1778837932, - "version": 1, - "versionNonce": 1303193991, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "VLAN tagging \u2014 where traffic can flow", - "fontSize": 28, - 
"fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "VLAN tagging \u2014 where traffic can flow", - "lineHeight": 1.25, - "baseline": 28 - }, - { - "id": "el002", - "type": "text", - "x": 40, - "y": 62, - "width": 758.4, - "height": 20.0, - "angle": 0, - "strokeColor": "#868e96", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1570340888, - "version": 1, - "versionNonce": 1243931547, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "the 802.1Q tag exists only between switch P5 and vmbr0 \u2014 endpoints never see it", - "fontSize": 16, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "the 802.1Q tag exists only between switch P5 and vmbr0 \u2014 endpoints never see it", - "lineHeight": 1.25, - "baseline": 16 - }, - { - "id": "el003", - "type": "rectangle", - "x": 700, - "y": 110, - "width": 210, - "height": 560, - "angle": 0, - "strokeColor": "#868e96", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "dashed", - "roughness": 1, - "opacity": 100, - "seed": 750280512, - "version": 1, - "versionNonce": 1195188524, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el004", - "type": "text", - "x": 742, - "y": 122, - "width": 97.2, - "height": 22.5, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 473142373, - "version": 1, - "versionNonce": 115692583, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - 
"updated": 1, - "link": null, - "locked": false, - "text": "ONE CABLE", - "fontSize": 18, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "ONE CABLE", - "lineHeight": 1.25, - "baseline": 18 - }, - { - "id": "el005", - "type": "text", - "x": 716, - "y": 148, - "width": 171.6, - "height": 16.25, - "angle": 0, - "strokeColor": "#868e96", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1069030696, - "version": 1, - "versionNonce": 1650002323, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "the LAN1 run \u00b7 P5\u2194eno1", - "fontSize": 13, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "the LAN1 run \u00b7 P5\u2194eno1", - "lineHeight": 1.25, - "baseline": 13 - }, - { - "id": "el006", - "type": "text", - "x": 40, - "y": 120, - "width": 276.0, - "height": 25.0, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1802024079, - "version": 1, - "versionNonce": 1083980019, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "VLAN 30 \u00b7 CCTV (camera)", - "fontSize": 20, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "VLAN 30 \u00b7 CCTV (camera)", - "lineHeight": 1.25, - "baseline": 20 - }, - { - "id": "el007", - "type": "rectangle", - "x": 40, - "y": 160, - "width": 170, - "height": 100, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "#d0bfff", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - 
"opacity": 100, - "seed": 1363373344, - "version": 1, - "versionNonce": 1724819963, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el008", - "type": "text", - "x": 56, - "y": 172, - "width": 126.0, - "height": 56.25, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 590735843, - "version": 1, - "versionNonce": 267116025, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "camera\n10.0.30.70\nsends untagged", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "camera\n10.0.30.70\nsends untagged", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el009", - "type": "arrow", - "x": 210, - "y": 210, - "width": 50, - "height": 0, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 600787264, - "version": 1, - "versionNonce": 844240212, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 50, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el010", - "type": "rectangle", - "x": 260, - "y": 160, - "width": 190, - "height": 100, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 
648177040, - "version": 1, - "versionNonce": 901986117, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el011", - "type": "text", - "x": 274, - "y": 170, - "width": 153.0, - "height": 37.5, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1421789145, - "version": 1, - "versionNonce": 530430174, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "P4 ingress\nPVID 30 \u2192 VLAN 30", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "P4 ingress\nPVID 30 \u2192 VLAN 30", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el012", - "type": "text", - "x": 274, - "y": 226, - "width": 126.0, - "height": 17.5, - "angle": 0, - "strokeColor": "#e03131", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 297119438, - "version": 1, - "versionNonce": 1328001885, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "\u2717 not in VLAN 1", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "\u2717 not in VLAN 1", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el013", - "type": "arrow", - "x": 450, - "y": 210, - "width": 50, - "height": 0, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1759537933, - "version": 
1, - "versionNonce": 351602578, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 50, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el014", - "type": "rectangle", - "x": 500, - "y": 160, - "width": 170, - "height": 100, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 2036237420, - "version": 1, - "versionNonce": 608198039, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el015", - "type": "text", - "x": 514, - "y": 172, - "width": 99.0, - "height": 37.5, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1755241687, - "version": 1, - "versionNonce": 1444750360, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "P5 egress\nadds 802.1Q", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "P5 egress\nadds 802.1Q", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el016", - "type": "text", - "x": 514, - "y": 226, - "width": 81.6, - "height": 21.25, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 76597799, - "version": 1, - "versionNonce": 1858784829, - "isDeleted": 
false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "+ tag 30", - "fontSize": 17, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "+ tag 30", - "lineHeight": 1.25, - "baseline": 17 - }, - { - "id": "el017", - "type": "arrow", - "x": 670, - "y": 200, - "width": 270, - "height": 0, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 3, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1598556093, - "version": 1, - "versionNonce": 221916615, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 270, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el018", - "type": "arrow", - "x": 670, - "y": 222, - "width": 270, - "height": 0, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 3, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1523174671, - "version": 1, - "versionNonce": 216018217, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 270, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el019", - "type": "text", - "x": 724, - "y": 172, - "width": 126.0, - "height": 18.75, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, 
- "opacity": 100, - "seed": 2049719155, - "version": 1, - "versionNonce": 1609878353, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "carries tag 30", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "carries tag 30", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el020", - "type": "rectangle", - "x": 940, - "y": 160, - "width": 180, - "height": 100, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 22152744, - "version": 1, - "versionNonce": 1741428563, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el021", - "type": "text", - "x": 954, - "y": 170, - "width": 144.0, - "height": 37.5, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1026267703, - "version": 1, - "versionNonce": 502895922, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "vmbr0 vlan-aware\nVID 30 \u2192 net3", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "vmbr0 vlan-aware\nVID 30 \u2192 net3", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el022", - "type": "text", - "x": 954, - "y": 226, - "width": 151.2, - "height": 17.5, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - 
"seed": 918449769, - "version": 1, - "versionNonce": 1067599022, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "ONLY, nowhere else", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "ONLY, nowhere else", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el023", - "type": "arrow", - "x": 1120, - "y": 210, - "width": 50, - "height": 0, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1544933330, - "version": 1, - "versionNonce": 249589260, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 50, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el024", - "type": "rectangle", - "x": 1170, - "y": 130, - "width": 300, - "height": 190, - "angle": 0, - "strokeColor": "#7048e8", - "backgroundColor": "#d0bfff", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1147616804, - "version": 1, - "versionNonce": 275900123, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el025", - "type": "text", - "x": 1186, - "y": 142, - "width": 198.0, - "height": 56.25, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1183197673, - "version": 1, - "versionNonce": 
827844211, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "pfSense net3 \u00b7 dCCTV\n10.0.30.1/24\ntag stripped by bridge", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "pfSense net3 \u00b7 dCCTV\n10.0.30.1/24\ntag stripped by bridge", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el026", - "type": "text", - "x": 1186, - "y": 212, - "width": 268.8, - "height": 35.0, - "angle": 0, - "strokeColor": "#2f9e44", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 556137867, - "version": 1, - "versionNonce": 1074481459, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "\u2713 in: Frigate :554 \u00b7 HA :80+:554\n\u2713 out: NTP :123 only", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "\u2713 in: Frigate :554 \u00b7 HA :80+:554\n\u2713 out: NTP :123 only", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el027", - "type": "text", - "x": 1186, - "y": 268, - "width": 193.2, - "height": 17.5, - "angle": 0, - "strokeColor": "#e03131", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1321167842, - "version": 1, - "versionNonce": 1493882225, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "\u2717 everything else: DENY", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "\u2717 everything else: DENY", - "lineHeight": 1.25, - "baseline": 
14 - }, - { - "id": "el028", - "type": "text", - "x": 40, - "y": 380, - "width": 480.0, - "height": 25.0, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1369574852, - "version": 1, - "versionNonce": 733267986, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "VLAN 1 \u00b7 home LAN (the rest of the rack)", - "fontSize": 20, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "VLAN 1 \u00b7 home LAN (the rest of the rack)", - "lineHeight": 1.25, - "baseline": 20 - }, - { - "id": "el029", - "type": "rectangle", - "x": 40, - "y": 420, - "width": 170, - "height": 120, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "#a5d8ff", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1426243518, - "version": 1, - "versionNonce": 404213796, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el030", - "type": "text", - "x": 54, - "y": 432, - "width": 142.79999999999998, - "height": 70.0, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1170712377, - "version": 1, - "versionNonce": 1439293404, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "apartment uplink\n4G router \u00b7 .7\nUPS \u00b7 switch mgmt\nall untagged", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - 
"originalText": "apartment uplink\n4G router \u00b7 .7\nUPS \u00b7 switch mgmt\nall untagged", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el031", - "type": "arrow", - "x": 210, - "y": 480, - "width": 50, - "height": 0, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 41933292, - "version": 1, - "versionNonce": 217435681, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 50, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el032", - "type": "rectangle", - "x": 260, - "y": 420, - "width": 190, - "height": 120, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1494665817, - "version": 1, - "versionNonce": 82528369, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el033", - "type": "text", - "x": 274, - "y": 430, - "width": 135.0, - "height": 37.5, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 2006432221, - "version": 1, - "versionNonce": 1170391402, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "P1 / P2 / P3\nPVID 1 \u2192 VLAN 1", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": 
null, - "originalText": "P1 / P2 / P3\nPVID 1 \u2192 VLAN 1", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el034", - "type": "text", - "x": 274, - "y": 488, - "width": 142.79999999999998, - "height": 35.0, - "angle": 0, - "strokeColor": "#e03131", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 2035003054, - "version": 1, - "versionNonce": 231739024, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "\u2717 tag-30 arriving\nhere is DROPPED", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "\u2717 tag-30 arriving\nhere is DROPPED", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el035", - "type": "arrow", - "x": 450, - "y": 480, - "width": 50, - "height": 0, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 851649342, - "version": 1, - "versionNonce": 1330529717, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 50, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el036", - "type": "rectangle", - "x": 500, - "y": 420, - "width": 170, - "height": 120, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1108429504, - "version": 1, - "versionNonce": 322250604, - "isDeleted": false, - "groupIds": [], - "frameId": null, - 
"boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el037", - "type": "text", - "x": 514, - "y": 434, - "width": 117.0, - "height": 37.5, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 2082654793, - "version": 1, - "versionNonce": 88739979, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "P5 egress\nnative VLAN 1", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "P5 egress\nnative VLAN 1", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el038", - "type": "text", - "x": 514, - "y": 496, - "width": 108.0, - "height": 18.75, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 594390025, - "version": 1, - "versionNonce": 1730926570, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "no tag added", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "no tag added", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el039", - "type": "arrow", - "x": 670, - "y": 480, - "width": 270, - "height": 0, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 3, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 2082581262, - "version": 1, - "versionNonce": 1681796809, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - 
"locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 270, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el040", - "type": "text", - "x": 716, - "y": 452, - "width": 189.0, - "height": 18.75, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 787209477, - "version": 1, - "versionNonce": 840302416, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "plain untagged frames", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "plain untagged frames", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el041", - "type": "rectangle", - "x": 940, - "y": 420, - "width": 180, - "height": 120, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1079834069, - "version": 1, - "versionNonce": 647687454, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el042", - "type": "text", - "x": 954, - "y": 432, - "width": 168.0, - "height": 70.0, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 474197814, - "version": 1, - "versionNonce": 912206893, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "vmbr0 untagged\n= plain 
L2 switching\nhost .127 + pfSense\nWAN \u2014 no routing", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "vmbr0 untagged\n= plain L2 switching\nhost .127 + pfSense\nWAN \u2014 no routing", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el043", - "type": "arrow", - "x": 1120, - "y": 480, - "width": 50, - "height": 0, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 215726947, - "version": 1, - "versionNonce": 1310489154, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "points": [ - [ - 0, - 0 - ], - [ - 50, - 0 - ] - ], - "lastCommittedPoint": null, - "startBinding": null, - "endBinding": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "roundness": { - "type": 2 - } - }, - { - "id": "el044", - "type": "rectangle", - "x": 1170, - "y": 410, - "width": 300, - "height": 160, - "angle": 0, - "strokeColor": "#1971c2", - "backgroundColor": "#a5d8ff", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1355096973, - "version": 1, - "versionNonce": 1357902601, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el045", - "type": "text", - "x": 1186, - "y": 422, - "width": 218.4, - "height": 52.5, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 212355785, - "version": 1, - "versionNonce": 693422793, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": 
null, - "locked": false, - "text": "pfSense net0 \u00b7 WAN .2\njust a LAN client \u2014\nhome LAN never transits it", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "pfSense net0 \u00b7 WAN .2\njust a LAN client \u2014\nhome LAN never transits it", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el046", - "type": "text", - "x": 1186, - "y": 494, - "width": 201.6, - "height": 35.0, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1799580904, - "version": 1, - "versionNonce": 398539541, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "gateway = AX6000\npfSense NATs only 10.0.x", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "gateway = AX6000\npfSense NATs only 10.0.x", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el047", - "type": "rectangle", - "x": 40, - "y": 600, - "width": 630, - "height": 90, - "angle": 0, - "strokeColor": "#868e96", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "dashed", - "roughness": 1, - "opacity": 100, - "seed": 1339321764, - "version": 1, - "versionNonce": 1076065263, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "roundness": { - "type": 3 - } - }, - { - "id": "el048", - "type": "text", - "x": 56, - "y": 612, - "width": 554.4, - "height": 35.0, - "angle": 0, - "strokeColor": "#868e96", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1733803932, - "version": 1, - "versionNonce": 
2062677415, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "not on this cable: vmbr1 tag 10 \u2192 dMgmt \u00b7 tag 20 \u2192 dK8s (Frigate)\ndormant fallback: eno2 \u2192 vmbr2 (revert = one qm set)", - "fontSize": 14, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "not on this cable: vmbr1 tag 10 \u2192 dMgmt \u00b7 tag 20 \u2192 dK8s (Frigate)\ndormant fallback: eno2 \u2192 vmbr2 (revert = one qm set)", - "lineHeight": 1.25, - "baseline": 14 - }, - { - "id": "el049", - "type": "text", - "x": 940, - "y": 620, - "width": 396.0, - "height": 37.5, - "angle": 0, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 322195856, - "version": 1, - "versionNonce": 365731358, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "L2 drops (membership) happen in the switch \u2014\nL3 allow/deny happens in pfSense", - "fontSize": 15, - "fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "L2 drops (membership) happen in the switch \u2014\nL3 allow/deny happens in pfSense", - "lineHeight": 1.25, - "baseline": 15 - }, - { - "id": "el050", - "type": "text", - "x": 940, - "y": 676, - "width": 109.2, - "height": 16.25, - "angle": 0, - "strokeColor": "#868e96", - "backgroundColor": "transparent", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "seed": 1038112083, - "version": 1, - "versionNonce": 966092898, - "isDeleted": false, - "groupIds": [], - "frameId": null, - "boundElements": null, - "updated": 1, - "link": null, - "locked": false, - "text": "ADR-0017 rev 3", - "fontSize": 13, - 
"fontFamily": 1, - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "ADR-0017 rev 3", - "lineHeight": 1.25, - "baseline": 13 - } - ], - "appState": { - "gridSize": null, - "viewBackgroundColor": "#ffffff" - }, - "files": {} -} \ No newline at end of file diff --git a/docs/adr/0017-cctv-vlan-tagging.svg b/docs/adr/0017-cctv-vlan-tagging.svg deleted file mode 100644 index 868aa746..00000000 --- a/docs/adr/0017-cctv-vlan-tagging.svg +++ /dev/null @@ -1 +0,0 @@ -VLAN tagging — where traffic can flowthe 802.1Q tag exists only between switch P5 and vmbr0 — endpoints never see itONE CABLEthe LAN1 run - P5 to eno1VLAN 30 - CCTV (camera)camera10.0.30.70sends untaggedP4 ingressPVID 30 -> VLAN 30x not in VLAN 1P5 egressadds 802.1Q:+ tag 30carries tag 30vmbr0 vlan-awareVID 30 -> net3ONLY, nowhere elsepfSense net3 - dCCTV 10.0.30.1/24tag stripped by the bridgeok in: Frigate :554 - HA :80 + :554ok out: NTP :123 onlyx everything else: DENYVLAN 1 - home LAN (the rest of the rack)apartment uplink4G router - .7UPS - switch mgmtall untaggedP1 / P2 / P3PVID 1 -> VLAN 1x tag-30 arrivinghere is DROPPEDP5 egressnative VLAN 1:no tag addedplain untagged framesvmbr0 untagged =plain L2 switching:host .127 + pfSenseWAN - no routingpfSense net0 - WAN 192.168.1.2just a LAN client - home LANnever transits pfSensegateway = AX6000 - pfSense NATs only 10.0.xnot on this cable: vmbr1 tag 10 -> dMgmt - tag 20 -> dK8s (Frigate)dormant fallback: eno2 -> vmbr2 (revert = one qm set)L2 drops (membership) happen in the switch,L3 allow/deny happens in pfSenseADR-0017 rev 3 - editable source: 0017-cctv-vlan-tagging.excalidraw \ No newline at end of file diff --git a/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md b/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md deleted file mode 100644 index 5344382a..00000000 --- a/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md +++ /dev/null @@ -1,47 +0,0 @@ -# Valia sites are served off-infra 
(Cloudflare Pages), synced in-cluster - -Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she -shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`) -and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare -Pages** under `.viktorbarzin.me`, kept fresh by **one shared in-cluster -CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes -(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The -existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync) -migrates onto this and is retired. - -Why off-infra serving: these are her sites, shown to teachers/parents — they must survive -homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster -site down). With Pages, a homelab outage degrades to "content frozen until we're back", -never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/ -Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA -secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never -wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The -deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an -accident. - -## Considered options - -- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no - Cloudflare Pages dependency — but her sites share the homelab's fate and each site - spends cluster resources to serve static files a free CDN serves better. -- **Pages for new sites only**: less work now, two patterns and two runbooks forever. -- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but - Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault. 
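A minimal sketch of the "re-deploys only on change" rule in the decision above, assuming change is detected by a content digest over the mirrored Content folder. The ADR does not say how the CronJob detects change, so `folder_digest`, `should_deploy`, and the stored `last_digest` are hypothetical names used only for illustration; the rclone mirror and the wrangler upload are not modelled. The empty-folder guard follows the "never wipe a live site on an empty/partial folder" pattern the ADR says carries over from stem95su.

```python
import hashlib
from pathlib import Path
from typing import Optional


def folder_digest(root: Path) -> str:
    """Return a stable SHA-256 over the relative paths and bytes of every file under root."""
    h = hashlib.sha256()
    # Sort so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def should_deploy(root: Path, last_digest: Optional[str]) -> bool:
    """Deploy only when the mirror holds files and its digest differs from the last deploy."""
    if not any(p.is_file() for p in root.rglob("*")):
        # Guard (assumed form): an empty mirror must never replace a live site.
        return False
    return folder_digest(root) != last_digest
```

A caller would evaluate `should_deploy(mirror_dir, last_digest)` after each 10-minute mirror and run the upload only when it returns `True`, so unchanged runs consume no Pages deployments.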
- -## Consequences - -- Registration is one entry in the `sites` map (name, Content folder, optional Entry - file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config - together. Names are English, picked by Viktor (most → bridge set the precedent). -- The internal split-horizon zone learns Valia sites from a ConfigMap the - `technitium-ingress-dns-sync` script consumes — declaratively, including **removal** - (the previous static-CNAME approach was add-only; a retired site left a stale record). -- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on - the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs - deployed. -- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no - per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't - update" reports, consistent with the alert-noise-reduction posture. Revisit if a - silent stall actually bites. -- If the homelab is down, content updates pause; the sites keep serving last-deployed - content. Accepted degradation. diff --git a/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md b/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md deleted file mode 100644 index 708d8624..00000000 --- a/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md +++ /dev/null @@ -1,97 +0,0 @@ -# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free - -`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12 -inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only -outage protection — a documented "No Backup MX" decision made after ForwardEmail's -forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email -Routing proved pass-through-only. 
Viktor now wants inbound mail to survive -homelab outages **without loss** (2026-07-04): delayed delivery is fine, -mid-outage reading is not required, and the budget is **$0** — a hard -constraint that eliminated every managed option (see below). - -We run a minimal **Postfix store-and-forward relay on an Oracle Cloud -Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved** -public IP, MX preference 20; primary untouched at 1). It accepts everything -for the domain (catch-all — every RCPT is valid; reputation may only ever -4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM — -never 5xx: a backup MX that hard-rejects manufactures the loss it exists to -prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never -deliver a DSN, its only egress is the drain), and drains to the primary over -**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy -frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is -tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as -mid-outage break-glass since headscale itself lives in the cluster); TLS via -certbot HTTP-01 (port 80 permanently open — LE validation is -multi-perspective and unscopeable); the VM is a cattle-rebuild from a new -`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must -also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT). -On the primary, the drain stream (one /32) is enabled at the layers that -actually bite — `check_client_access` permits past -`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit -exception, and rspamd `external_relay` (score against the *original* sender -IP) with the reject action capped to tag/fold so drained spam can never force -the VM to emit backscatter. 
Go-live is gated on empirical checks: inbound-25 -reachability (recurring probe — Oracle publishes no commitment), drain -end-to-end, and a live failover test that includes a high-spam-score and a ->10 MB message. Two independent adversarial reviews (2026-07-04) shaped this -final form. Design: -[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md). - -## Considered options - -- **Roller Network free Secondary MX** — v1 of this decision, killed at the - validation gates the same day: free tier caps at 200 relayed messages or - 10 MB per rolling 7 days, and overage suspends the domain for 48 h - answering **SMTP 5xx** (permanent bounces) — since spammers target backup - MXes even while the primary is up, background spam alone can hold it - suspended, making it *worse than no backup MX*. Free accounts are also - being discontinued. (Their TLS checked out; their paid Basic at $30/yr is - the documented fallback if the OCI route sours.) -- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints - 12–24 h, barely beating sender retry); filtering black-box; not free. -- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal - inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148). -- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro - blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free" - plan is a 6-month credit; Azure has no always-free VM and blocks 25; - Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are - trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI - is the only standing free option. -- **Harden-only** (5xx-misconfig guards + paging) — does not address - multi-day outages or short-retry senders; deferred as a complementary - track. 
- -## Consequences - -- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from - Terraform + cloud-init, patched by unattended-upgrades, scraped by the - cluster's Prometheus (exporters on the reserved public IP, allowlisted to - the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet - scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts - besides). Never a backup target itself. -- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1 - free allowance in June 2026 and terminated over-limit instances, and - publishes no commitment that inbound 25 stays open. Mitigations: - **Pay-As-You-Go conversion is a required prerequisite** (exempts idle - reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and - the queue being empty outside outages (a surprise reclamation loses - coverage, never mail). Home region is fixed at signup — Frankfurt, chosen - once. -- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits, - and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against - the original IP via `external_relay`), and content scoring stay on — spam - arriving via the backup is tagged and folded to Junk, never bounced. The VM - is deliberately NOT in the primary's `mynetworks` (a compromised VM must - not relay through us). -- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the - VM. Stated and accepted (6× better than the status quo). -- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but - off-premises; accepted (same class as Brevo holding outbound today). -- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy - host found dangling during design — inert today; must list `mx2` when - fixed) needs 1–2 more → schedule the next record purge proactively. -- `architecture/mailserver.md` §"No Backup MX" superseded at implementation; - new runbook `docs/runbooks/backup-mx.md` (incl. 
OCI console break-glass); - `vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's - failure semantics change (a "failing" probe may now mean "delayed via mx2, - drains shortly" — noted in alert description). diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md index f77518b4..118c0895 100644 --- a/docs/architecture/chrome-service.md +++ b/docs/architecture/chrome-service.md @@ -329,12 +329,6 @@ Two independent grants make up "browser access" for a user: the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate a token by deleting its `-browser-token` Secret). -Because the SA is the user's DEFAULT kubectl credential, other per-namespace -port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf` -grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's -agent can upload drawings via the port-forward + `X-Authentik-Username` recipe -in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too. - ## Limits + risks - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index 17d0859f..b8cfcdd5 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin. 
| Visibility | Packages | Pull mechanism | |------------|----------|----------------| | **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous | -| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson | +| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson | Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit @@ -188,8 +188,6 @@ reconciled — the workflows were added to the GitHub lineage via PR): | android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` | | infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` | | infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` | -| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) | -| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) | **`infra-ci`** is the image the `.woodpecker/default.yml` apply step and `drift-detection.yml` run in (proven by pipelines 165/166). 
`chatterbox-tts` is diff --git a/docs/architecture/dns.md b/docs/architecture/dns.md index e3fe6ee5..6150d226 100644 --- a/docs/architecture/dns.md +++ b/docs/architecture/dns.md @@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h). -**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched). +**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. 
The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. ## NodeLocal DNSCache @@ -368,7 +368,6 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`) | TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement | | TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting | | A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver | -| CNAME (CF Pages) | 2 | `.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` | ### Proxied vs Non-Proxied @@ -514,7 +513,6 @@ For external `.viktorbarzin.me` records: 1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack 2. Run `scripts/tg apply` on the service stack — DNS record is auto-created 3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf` -4. 
For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`) ## Incident History diff --git a/docs/architecture/mailserver.md b/docs/architecture/mailserver.md index a7925849..0edeffb4 100644 --- a/docs/architecture/mailserver.md +++ b/docs/architecture/mailserver.md @@ -161,17 +161,6 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail DB: MySQL (mysql.dbaas.svc.cluster.local) ``` -### Paperless ingest mailbox (docs@) - -`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in -`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that -paperless-ngx polls over IMAP; family members forward document emails to it -and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve -(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap, -mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`) -discards mail from non-allowlisted senders at delivery. Full flow, sender map, -and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md). - ## DNS Records All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`. @@ -311,21 +300,6 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External ## Troubleshooting -### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin) - -Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`: -`postfix/cleanup: warning: tcp:localhost:10001 lookup error` + -`sender_canonical_maps map lookup problem ... message not accepted, try again later`. -Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`) -came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it -`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. 
Postfix then -tempfails every message (inbound AND submission); senders retry so nothing is -lost, and the roundtrip probe alerts within the hour. -Fix: `supervisorctl restart postsrsd` inside the container; if the fresh -process spins again (it did once), `kubectl -n mailserver delete pod` for a -full re-init — that healed it. Root cause not pinned down (one-off bad init; -postsrsd 1.10). - ### Inbound mail not arriving 1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me` 2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md index 1e17d95d..b93df195 100644 --- a/docs/architecture/networking.md +++ b/docs/architecture/networking.md @@ -1,10 +1,10 @@ # Networking Architecture -Last updated: 2026-07-02 (dCCTV segment added — dedicated pfSense leg for the garage camera, ADR-0017) +Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed) ## Overview -The homelab network is built on three isolated segments behind pfSense (management VLAN 10, Kubernetes VLAN 20, and the physically-legged dCCTV camera segment — see ADR-0017) with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. +The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. 
Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. ## Architecture Diagram @@ -24,14 +24,9 @@ graph TB CSdrop[CrowdSec drop
nftables / CF edge
out-of-band, pre-Traefik] - subgraph "Proxmox Host (eno1, eno2)" + subgraph "Proxmox Host (eno1)" vmbr0[vmbr0 Bridge
192.168.1.127/24] vmbr1[vmbr1 Internal
VLAN-aware] - vmbr2[vmbr2 Bridge
eno2 → TL-SG105PE] - - subgraph "dCCTV - 10.0.30.0/24
ADR-0017" - Camera[vermont-garage
10.0.30.70] - end subgraph "VLAN 10 - Management
10.0.10.0/24" Proxmox[Proxmox Host
10.0.10.1] @@ -76,9 +71,6 @@ graph TB vmbr1 -.VLAN 20.- Tech vmbr1 -.VLAN 20.- Master vmbr1 -.VLAN 20.- Node1 - vmbr2 -.physical link.- eno2 - vmbr2 -.untagged.- Camera - vmbr2 -.pfSense net3 = dCCTV 10.0.30.1.- pfSense ``` ## Components @@ -89,7 +81,6 @@ graph TB | phpIPAM | v1.7.0 | phpipam.viktorbarzin.me | IP address management, device inventory, DNS sync | | vmbr0 | Linux bridge | 192.168.1.127/24 | Physical bridge on eno1, uplink to LAN | | vmbr1 | Linux bridge (VLAN-aware) | Internal | VLAN trunk for VM isolation | -| vmbr2 | Linux bridge | Physical (eno2) | DORMANT fallback leg for dCCTV (ADR-0017 rev 3) — live dCCTV rides vmbr0 tag 30 over the LAN1 trunk | | Technitium DNS | Container | 10.0.20.201 (LB) / 10.96.0.53 (ClusterIP) | Internal DNS (viktorbarzin.lan) + full recursive resolver | | Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me | | Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding | @@ -99,22 +90,6 @@ graph TB | MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 | | Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 | -## CCTV Segment (dCCTV) — as-built 2026-07-02 - -Isolated camera segment for owned cameras at the Sofia site (first: `vermont-garage`, HiLook IPC-T241H-C at the garage entrance). Decision + rejected alternatives: `docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md`. - -**Physical path (rev 3, single switch)**: camera → TL-SG105PE PoE port (untagged VLAN 30) → trunk port (home LAN untagged + CCTV **tagged 30**) → the existing LAN1 cable → R730 `eno1` → `vmbr0` (vlan-aware) → pfSense `net3`/vtnet3 = `vmbr0 tag=30` = interface **dCCTV `10.0.30.1/24`**. The TL-SG105PE **replaces** the old garage TL-SG105E (retired to cold spare) and carries everything: apartment uplink, 4G router `192.168.1.7`, UPS mgmt (VLAN 1), camera (VLAN 30), trunk — all 5 ports used. 
VLAN-30 membership is {camera port, trunk port} only, so tagged injection from other ports is dropped. `eno2`/`vmbr2` remain dormant as the fallback physical leg (rev 2). - -**Addressing**: Kea DHCP pool `10.0.30.100-199`; devices get MAC reservations (camera `10.0.30.70`; the PE switch mgmt inherits the retired switch's `192.168.1.6` on the home LAN). Kea DDNS auto-registers names in Technitium; `phpipam-pfsense-import` picks up leases hourly. - -**Firewall** (all on pfSense): -- dCCTV in: pass `udp OPT4-net → 10.0.30.1:123` (NTP) — everything else hits the interface's default deny. Cameras cannot reach LAN, other segments, or the internet. -- WAN in (home LAN side): pass `192.168.1.8` (ha-sofia) → `10.0.30.70:80` (ISAPI/hikvision_next) and `:554` (RTSP), reply-to disabled on both. -- dKubernetes is allow-all, so cluster Frigate/go2rtc pulls RTSP with no extra rule (pod egress SNATs to node IPs). -- Home-LAN clients need the **AX6000 static route** `10.0.30.0/24 via 192.168.1.2` (camera-day step) to reach the camera UI. - -**Consumers**: cluster Frigate (`/srv/nfs/frigate/config/config.yml` — NOT Terraform) pulls `rtsp://10.0.30.70:554` main+sub as `vermont-garage`; HA integrates via Frigate plus direct hikvision_next for tamper events. - ## IPAM & DNS Auto-Registration Devices are automatically discovered, named, and registered in DNS without manual intervention. @@ -232,8 +207,6 @@ VMs tag traffic on vmbr1 to isolate workloads. 
pfSense bridges VLAN 20 to the up - blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, travel, netbox - **Non-proxied domains** (grey cloud, direct IP resolution): - mail, wg, headscale, immich, calibre, vaultwarden, and other services requiring direct connections -- **Internal-IP domains** (grey cloud, A → `10.0.20.203` Traefik LB, `ingress_factory` `dns_type = "internal"`): - - highlights-immich, highlights-immich-emo — publicly *resolvable* but only *routable* from home LANs / WG sites / VPN (spokes policy-route `10.0.0.0/8` down the tunnel, so kiosk devices with baked-in URLs need no per-site DNS overrides). The record is reachability, not a gate — enforcement is the `home-lans-only` Traefik ipAllowList (Sofia/London/Valchedrym LANs + 10/8) on the ingress. See `docs/plans/2026-07-04-immich-frame-lan-only-design.md`. - CNAME records for proxied domains point to Cloudflared tunnel FQDNs ### Ingress Flow @@ -288,7 +261,7 @@ Traefik chain: 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`). 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend. -3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). 
Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients), tripit (`tripit-rate-limit`, 100/1000, photo-tab thumbnail bursts), health (`health-rate-limit`, 100/1000, SPA shell + API burst per page), and dawarich (`dawarich-rate-limit`, 100/1000 — the Rails app self-serves all fingerprinted assets and the map adds an API burst per load; the default burst 429'd the asset tail and risked dropping OwnTracks/mobile location POSTs on the same host). +3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients). 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors). 
Additional middleware: @@ -579,7 +552,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che **Diagnosis**: Check Traefik middleware config for the affected IngressRoute. -**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik--rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, and tripit/health/authentik/dawarich each 100/1000 (SPA or asset-heavy page loads bursting past the default from one client IP). +**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik--rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen). 
### Large Downloads or Uploads Truncate / Fail Partway diff --git a/docs/plans/2026-07-03-vault-token-self-heal-design.md b/docs/plans/2026-07-03-vault-token-self-heal-design.md deleted file mode 100644 index a88aff46..00000000 --- a/docs/plans/2026-07-03-vault-token-self-heal-design.md +++ /dev/null @@ -1,103 +0,0 @@ -# Vault Token Renewer Self-Heal Design - -**Date**: 2026-07-03 -**Status**: Approved (brainstorm complete; implementation pending) -**Owner**: wizard@devvm -**Supersedes**: the "version-only, no self-heal" scope choice recorded in -`docs/runbooks/vault-token-renew-devvm.md` (2026-06-07) - -## Problem - -`wizard@devvm` holds a maintenance-free periodic Vault token -(`token-devvm-wizard`, `period=768h`, renewed daily by the -`vault-token-renew` user timer) precisely so no weekly re-login is needed. -But `~/.vault-token` is the Vault CLI's default token sink, so any -`vault login -method=oidc` — which the infra docs themselves instruct before -applies — overwrites it with a 7-day OIDC token. The renewer's drift guard -(deliberately detect-only) then refuses to renew the foreign token and fails -the unit daily, into a log nobody watches. - -Observed consequence: a self-perpetuating weekly-expiry loop. The OIDC token -expires after 7 days → Vault 403s → the natural response is another -`vault login -method=oidc` → clobbers again. Drift persisted unnoticed -2026-06-18 → 06-26 and 2026-06-29 → 07-03 (memory #7121); Viktor experienced -it as "the token expires maybe once a week". - -**Goal**: `vault login -method=oidc` becomes harmless on devvm. The renewer -converts any admin-capable clobber back into the permanent periodic token, -unattended. (Chosen over "never log in" doc-fixes and over instant path-unit -healing — see Alternatives.) - -## Decisions - -| # | Decision | Notes | -|---|----------|-------| -| 1 | Heal in the existing renewer's drift branch, at its nightly run | ~20-line diff to an already-tested script; no new units. 
A few-hours window holding the 7-day OIDC token is harmless (heal window 24h ≪ 7d TTL) | -| 2 | Heal = *attempt* re-mint using the foreign token itself; let Vault's 403 decide | No policy-list guessing — identity-vs-token-policies burned us before (memory #4211). OIDC tokens carry `vault-admin` via `identity_policies`, so the create succeeds | -| 3 | Weak foreign token (create denied) → keep today's loud DRIFT failure | A read-only clobber (e.g. the 2026-06-05 `kubernetes-woodpecker-default` incident) signals a misbehaving agent flow; auto-papering over it would hide the offender. Log gains a "heal denied — investigate what wrote it" suffix | -| 4 | Do NOT revoke the clobbering OIDC token | It may still back the user's live login session; it ages out in 7 days on its own | -| 5 | After a successful heal, revoke stale `token-devvm-wizard` accessors | Anti-sprawl: each heal would otherwise strand the previous periodic **admin** token server-side for up to 32 days. Walk `auth/token/accessors`, revoke every `display_name=token-devvm-wizard` except the just-minted one. Runs only on heal (rare), never on the happy path | -| 6 | Minted-token sanity check before writing the file | Look up the new token; require `display_name=token-devvm-wizard`. Write via temp file + `mv` + `chmod 600` so a failed mint can never truncate `~/.vault-token` | -| 7 | Keep timer cadence (daily) and all happy-path behavior unchanged | | -| 8 | No notification plumbing in this change | devvm alerting is tracked separately (beads `code-aslh`). 
Heal events are logged; heal-denied/FAIL still fail the unit | - -## Behavior matrix - -| Token found in `~/.vault-token` | Before | After | -|---|---|---| -| Our periodic token | renew-self, log `OK` | unchanged | -| Foreign, admin-capable (OIDC login) | log `DRIFT`, exit 1 | re-mint periodic token with it, sanity-check, atomic write, revoke stale periodic accessors, log `HEALED: re-minted from foreign dn= (revoked N stale)`, exit 0 | -| Foreign, weak (read-only k8s clobber) | log `DRIFT`, exit 1 | log `DRIFT … heal denied — foreign token lacks create authority; investigate what wrote it`, exit 1 | -| Vault unreachable / lookup fails | log `FAIL`, exit 1 | unchanged | - -Re-mint command (identical to the manual recovery the DRIFT log already -prescribes): - -``` -vault token create -orphan -period=768h \ - -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -``` - -## Testing - -- **Unit** (`scripts/test-vault-token-renew.sh`, existing source-the-functions - harness): new pure functions for (a) the stale-accessor revoke filter - (match on `display_name`, exclude the current accessor) and (b) the - minted-token sanity predicate; regression cases for the existing drift - predicate stay green. -- **Live, post-deploy** (on devvm): - 1. Mint a fake 1h admin token (`-display-name=fake-oidc`, - `-policy=vault-admin -policy=sops-admin`), write to `~/.vault-token`, - start the service → expect `HEALED`, file holds `token-devvm-wizard`. - 2. Mint a fake 10m no-privilege token (`-policy=default`), write it, start - the service → expect `DRIFT … heal denied`, unit `failed`; restore real - token. - 3. Revoke both fakes; one-off sweep of stale periodic accessors left by the - June 26 / July 3 manual re-mints. 
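
The behavior matrix above reduces to three facts the renewer learns at run time: whether the lookup worked, whether the token is ours, and whether Vault accepted the re-mint. A minimal sketch of that mapping follows. The function name `expected_outcome` and its string arguments are hypothetical and exist only for illustration; the real script derives these facts from `vault token lookup` and the mint attempt.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of vault-token-renew.sh): maps the three
# run-time facts onto the log tag and exit code from the behavior matrix.
#   $1 lookup = ok | failed        (could the token be looked up at all?)
#   $2 owner  = ours | foreign     (does it pass the drift guard?)
#   $3 mint   = allowed | denied   (did Vault accept the re-mint attempt?)
expected_outcome() {
  local lookup="$1" owner="$2" mint="$3"
  if [ "$lookup" != "ok" ]; then
    echo "FAIL 1"      # Vault unreachable or lookup failed: unchanged behavior
  elif [ "$owner" = "ours" ]; then
    echo "OK 0"        # happy path: renew-self
  elif [ "$mint" = "allowed" ]; then
    echo "HEALED 0"    # admin-capable clobber: re-mint and exit 0
  else
    echo "DRIFT 1"     # weak clobber: stay loud
  fi
}

expected_outcome ok foreign allowed   # prints "HEALED 0"
expected_outcome ok foreign denied    # prints "DRIFT 1"
```

Note that the mint result only matters on the foreign branch, which mirrors decision 2: the script never guesses from policy lists and lets Vault's answer decide.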
- -## Docs & rollout - -- Same commit rewrites the runbook's "Drift guard & recovery" section: - self-heal is the recovery for admin-capable clobbers; manual re-mint remains - only for weak clobbers (or a dead token with no admin-capable replacement in - the file). -- `vault login -method=oidc` instructions across the docs stay as-is — the - login is now harmless by design. -- Deploy per the runbook's manual model: `install -m 0755` to - `~/.local/bin/vault-token-renew`. Units unchanged — no daemon-reload. -- After landing: update memories #4204/#4211 (gotcha now self-healing). - -## Alternatives considered - -- **Instant heal** (systemd path unit + protected source-copy of the token): - strictly more capable (seconds-latency, heals weak clobbers too, zero - re-minting), but 2 new units + a second secret file + inotify re-trigger - edge cases — machinery disproportionate to the residual risk. Revisit only - if the few-hour heal window ever bites. -- **Vault CLI `token_helper` interception**: right interception point in - theory, but a helper bug breaks every `vault` CLI call, Terraform reads - `~/.vault-token` natively anyway, and it adds latency inside login. Rejected. -- **Docs-only ("never log in")**: rejected by user — the login should keep - working, not become forbidden knowledge. -- **Raise the OIDC role's 7-day `token_max_ttl`**: shared role, affects every - OIDC user; rejected previously for the same reason (memory #4205). diff --git a/docs/plans/2026-07-03-vault-token-self-heal-plan.md b/docs/plans/2026-07-03-vault-token-self-heal-plan.md deleted file mode 100644 index 1bfd7978..00000000 --- a/docs/plans/2026-07-03-vault-token-self-heal-plan.md +++ /dev/null @@ -1,443 +0,0 @@ -# Vault Token Renewer Self-Heal Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. 
- -**Goal:** Make `vault login -method=oidc` harmless on devvm — the nightly renewer re-mints the permanent periodic token from any admin-capable clobber of `~/.vault-token`, unattended. - -**Architecture:** Extend the drift branch of `scripts/vault-token-renew.sh` (deployed to `~/.local/bin/vault-token-renew`, driven by an existing systemd user timer). On drift, *attempt* the re-mint with the clobbering token itself and let Vault's 403 be the authority; sanity-check the minted token, replace the file atomically, then revoke stale `token-devvm-wizard` leftovers. Weak clobbers keep today's loud failure. Design: `docs/plans/2026-07-03-vault-token-self-heal-design.md`. - -**Tech Stack:** bash + jq + vault CLI; existing test harness `scripts/test-vault-token-renew.sh` (sources the script, `vtr_main` is guarded). - -**Working copy:** everything below runs in the worktree -`~/code/infra/.worktrees/vault-token-self-heal` on branch `wizard/vault-token-self-heal`. -Per repo policy, EVERY git command in this git-crypt repo worktree carries: -`-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false` -(abbreviated as `$GCFLAGS` below; define once per shell: -`GCFLAGS="-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false"` -and use it unquoted: `git $GCFLAGS …`). 
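
The instruction to use `$GCFLAGS` unquoted relies on shell word splitting: the variable holds three `-c key=value` pairs in one string, and only an unquoted expansion turns them into separate arguments for `git`. The sketch below shows the difference with a hypothetical `count_args` helper (not part of the repo), so it runs without git.

```bash
#!/usr/bin/env bash
GCFLAGS="-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false"

# Hypothetical helper for illustration: prints how many arguments it received.
count_args() { echo "$#"; }

count_args $GCFLAGS      # unquoted: word-split into 6 arguments (3 x "-c" + value)
count_args "$GCFLAGS"    # quoted: 1 argument, which git would not parse as intended

# Alternative that survives quoting (and shellcheck's SC2086): a bash array.
GCFLAGS_ARR=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
count_args "${GCFLAGS_ARR[@]}"   # 6 arguments
```

The plain-string form is the one the plan uses; the array form is shown only as an option if the flags ever gain values containing spaces.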
- ---- - -### Task 1: Unit tests for the two new pure functions (RED) - -**Files:** -- Modify: `scripts/test-vault-token-renew.sh` (append before the final `printf`/exit lines) - -- [ ] **Step 1: Append the failing tests** - -Insert this block immediately after the existing "parse + decide end-to-end" section (after the line `no "oidc: parse+decide refused" …`, before the final `printf '\n%d passed…'`): - -```bash -# --- vtr_accessor: parse accessor out of lookup JSON --- -LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}' -eq "accessor parsed" "acc-new" "$(vtr_accessor "$LOOKUP_NEW")" -eq "accessor absent -> empty" "" "$(vtr_accessor '{"data":{"display_name":"x"}}')" - -# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard -# --- tokens are swept; the just-minted token, foreign tokens, and anything with an -# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe). -STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}' -ok "older periodic token is stale" vtr_is_stale_periodic "$STALE_OURS" "acc-new" -no "the just-minted token is kept" vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new" -no "foreign oidc token never swept" vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new" -no "woodpecker token never swept" vtr_is_stale_periodic "$LOOKUP_WP" "acc-new" -no "missing accessor never swept" vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new" -no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" "" -``` - -(`LOOKUP_OIDC` / `LOOKUP_WP` and the `ok`/`no`/`eq` helpers already exist in the file.) 
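
For readers without the test file open, the following is a rough sketch of what `ok`/`no`/`eq` helpers of this shape could look like. It is an assumption inferred purely from the call sites above (`eq <desc> <expected> <actual>`, `ok|no <desc> <command...>`) and the final `passed`/`failed` summary; the real helpers in `scripts/test-vault-token-renew.sh` may differ.

```bash
#!/usr/bin/env bash
# ASSUMPTION: illustrative re-implementation, not the repo's actual helpers.
pass=0
fail=0

# _record <description> <status>: counts a pass (status 0) or a failure.
_record() {
  if [ "$2" -eq 0 ]; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
    printf 'FAIL: %s\n' "$1"
  fi
}

# eq <description> <expected> <actual>: passes when the strings are equal.
eq() { local rc=0; [ "$2" = "$3" ] || rc=1; _record "$1" "$rc"; }

# ok <description> <command...>: passes when the command exits 0.
ok() { local d="$1" rc=0; shift; "$@" || rc=1; _record "$d" "$rc"; }

# no <description> <command...>: passes when the command exits non-zero.
no() { local d="$1" rc=1; shift; "$@" || rc=0; _record "$d" "$rc"; }

eq "strings match" "a" "a"
ok "true succeeds" true
no "false fails" false
printf '%d passed, %d failed\n' "$pass" "$fail"   # prints "3 passed, 0 failed"
```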
- -- [ ] **Step 2: Run tests, verify they fail** - -Run: `bash scripts/test-vault-token-renew.sh` -Expected: FAILs / `command not found` for `vtr_accessor` and `vtr_is_stale_periodic`; the 17 pre-existing tests stay green. - -### Task 2: Implement the pure functions (GREEN) - -**Files:** -- Modify: `scripts/vault-token-renew.sh` (insert after `vtr_drift_ok()`, before `vtr_main()`) - -- [ ] **Step 1: Add the two functions** - -```bash -# vtr_accessor -> the token accessor (empty if absent). -vtr_accessor() { - printf '%s' "$1" | jq -r '.data.accessor // ""' -} - -# vtr_is_stale_periodic -> 0 if this lookup -# describes one of OUR periodic tokens (display name matches) that is NOT the -# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise. -# Name-only on purpose (no policy check): anything named token-devvm-wizard -# that isn't the current token is garbage from a previous mint. An empty -# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know -# which token is current). -vtr_is_stale_periodic() { - local dn acc - [ -n "${2:-}" ] || return 1 - dn=$(vtr_display_name "$1") - acc=$(vtr_accessor "$1") - [ "$dn" = "$EXPECTED_DN" ] || return 1 - [ -n "$acc" ] || return 1 - [ "$acc" != "$2" ] -} -``` - -- [ ] **Step 2: Run tests, verify all pass** - -Run: `bash scripts/test-vault-token-renew.sh` -Expected: `25 passed, 0 failed`, exit 0. - -- [ ] **Step 3: Commit** - -```bash -cd ~/code/infra/.worktrees/vault-token-self-heal -git $GCFLAGS add scripts/vault-token-renew.sh scripts/test-vault-token-renew.sh -git $GCFLAGS commit -m "vault-token-renew: pure helpers for the self-heal revoke filter - -vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic -decides which old token-devvm-wizard tokens a heal may revoke (never the -just-minted one, never foreign tokens, nothing when the keeper is unknown). -TDD red-green for the heal branch that lands next." 
-``` - -### Task 3: The heal branch (`vtr_heal` + `vtr_main` wiring) - -**Files:** -- Modify: `scripts/vault-token-renew.sh` - -- [ ] **Step 1: Add `vtr_heal` after `vtr_is_stale_periodic()`, before `vtr_main()`** - -```bash -# vtr_heal -> 0 if ~/.vault-token was re-minted back to -# our periodic admin token using the foreign token's own authority, 1 if the -# heal was denied or failed (caller exits non-zero; the unit goes failed). -# -# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md): -# an OIDC login — which the infra docs prescribe before applies — clobbers -# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed -# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the -# clobbering token itself and let Vault's authz decide — a read-only clobber -# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud -# failure, because it signals a misbehaving flow that someone should look at. -vtr_heal() { - local foreign_dn="$1" log="$2" - local errf new_token new_info new_dn new_pols new_acc tmp - errf=$(mktemp) - if ! new_token=$(vault token create -orphan -period=768h \ - -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \ - -field=token 2>"$errf") || [ -z "$new_token" ]; then - printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \ - "$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log" - rm -f "$errf" - return 1 - fi - rm -f "$errf" - - # Sanity: the minted token must itself pass the drift guard before it may - # replace ~/.vault-token. - if ! 
new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then - printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \ - "$(date -Is)" "$new_info" >>"$log" - return 1 - fi - new_dn=$(vtr_display_name "$new_info") - new_pols=$(vtr_policies_csv "$new_info") - if ! vtr_drift_ok "$new_dn" "$new_pols"; then - printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \ - "$(date -Is)" "$new_dn" "$new_pols" >>"$log" - return 1 - fi - - # Atomic replace: mktemp files are 0600 from birth; same-filesystem mv. - tmp=$(mktemp "$HOME/.vault-token.XXXXXX") - printf '%s' "$new_token" >"$tmp" - mv "$tmp" "$HOME/.vault-token" - - # Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would - # otherwise strand the prior periodic ADMIN token server-side for up to 32d. - # The clobbering foreign token is deliberately NOT revoked: it may still back - # the user's live login session, and it ages out on its own (7d for OIDC). - local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0 - new_acc=$(vtr_accessor "$new_info") - if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then - while IFS= read -r a; do - [ -n "$a" ] || continue - a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue - if vtr_is_stale_periodic "$a_info" "$new_acc"; then - VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1)) - fi - done < <(printf '%s' "$accessors" | jq -r '.[]') - sweep="revoked $revoked stale periodic token(s)" - fi - - printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \ - "$(date -Is)" "$foreign_dn" "$sweep" >>"$log" -} -``` - -- [ ] **Step 2: Rewire the drift branch in `vtr_main`** - -Replace this exact block (comment + if): - -```bash - # Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token 
alive. - # On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token - # with a read-only woodpecker token, and this script then silently renewed THAT - # for two days — masking the loss of write access. So before renewing, confirm - # the token is our periodic admin token; if it has drifted, fail loudly (systemd - # marks the unit failed) instead of keeping someone else's token alive. - if ! vtr_drift_ok "$dn" "$pols"; then - printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \ - "$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log" - exit 1 - fi -``` - -with: - -```bash - # Drift guard (2026-06-07) + self-heal (2026-07-03): the renewer must not - # keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was - # silently renewed for two days, masking lost write access). But detect-only - # drift proved worse in practice: an OIDC login — which the infra docs - # prescribe before applies — clobbers this file too, and the resulting DRIFT - # failures went unnoticed for weeks while access degraded to a 7-day token - # (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal): - # re-mint the periodic token with the clobbering token's own authority. - # Vault's authz keeps the old guarantee — a token that couldn't legitimately - # hold vault-admin is denied the mint, and we still fail loud. - if ! 
vtr_drift_ok "$dn" "$pols"; then - vtr_heal "$dn" "$log" || exit 1 - exit 0 - fi -``` - -- [ ] **Step 3: Syntax + lint + regression check** - -Run: `bash -n scripts/vault-token-renew.sh && bash scripts/test-vault-token-renew.sh; command -v shellcheck >/dev/null && shellcheck scripts/vault-token-renew.sh` -Expected: syntax OK, `25 passed, 0 failed`; shellcheck (if installed) reports nothing new. - -- [ ] **Step 4: Commit** - -```bash -git $GCFLAGS add scripts/vault-token-renew.sh -git $GCFLAGS commit -m "vault-token-renew: self-heal the periodic token on admin-capable clobber - -Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC -login the docs prescribe kept clobbering ~/.vault-token with a 7-day token, -and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry -loop, twice in June). On drift the renewer now re-mints the periodic token -with the clobbering token's own authority (Vault's 403 is the judge — no -policy guessing), sanity-checks it, replaces the file atomically, and -revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still -fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md" -``` - -### Task 4: Docs — runbook + test-file header - -**Files:** -- Modify: `docs/runbooks/vault-token-renew-devvm.md` (the `## Drift guard & recovery` section + the healthy-log-line note + `## Tests`) -- Modify: `scripts/test-vault-token-renew.sh` (header comment only) - -- [ ] **Step 1: Replace the runbook's `## Drift guard & recovery` section with:** - -```markdown -## Drift guard & self-heal - -`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login` -overwrites it. Two confirmed clobber vectors: - -1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer - can't push past the OIDC role's 7-day `token_max_ttl`). 
The infra docs - prescribe this login before applies, so it recurs — it went unnoticed for - weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires - weekly". -2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) → - writes a read-only `kubernetes-woodpecker-default` token (can read Vault but - **cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days. - -Since 2026-07-03 the renewer **self-heals** -(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token -it attempts the re-mint **with the clobbering token's own authority** and lets -Vault's authz decide: - -- **Admin-capable clobber (OIDC login)** → re-mints the periodic token, - sanity-checks it against the drift guard, atomically replaces - `~/.vault-token`, revokes stale `token-devvm-wizard` leftovers - (anti-sprawl), logs - `HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))` - and exits 0. The clobbering token is NOT revoked — it may still back a live - login session; it ages out on its own. -- **Weak clobber (read-only k8s token)** → the mint is denied; logs - `DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it` - and exits non-zero (unit `failed`). Deliberately loud: this signals a - misbehaving agent flow — exactly the 2026-06-05 case. - -**Manual recovery** is only needed for the weak-clobber case (the DRIFT log -line still contains the exact command) — run the -[mint/re-mint](#mint--re-mint-the-token) block. -``` - -- [ ] **Step 2: In the runbook's `## Health check` section**, after the "A healthy log line looks like…" sentence, add: - -```markdown -After an OIDC login you'll instead see, at the next nightly run: -` HEALED: re-minted periodic token from foreign dn="oidc-…" (revoked N stale periodic token(s))` — that's the self-heal working as designed. 
-``` - -- [ ] **Step 3: In the runbook's `## Tests` section**, replace the first sentence with: - -```markdown -`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision, -the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber -case), and the self-heal's revoke filter (which stale periodic tokens a heal -may sweep). -``` - -- [ ] **Step 4: Update the test file's header comment** (lines 2–7) to: - -```bash -# Unit tests for the pure functions in vault-token-renew.sh. -# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard -# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign -# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker -# clobber be silently renewed for two days, and (b) the self-heal's revoke -# filter — which stale token-devvm-wizard tokens a heal may sweep. -# Run: bash infra/scripts/test-vault-token-renew.sh -``` - -- [ ] **Step 5: Run tests once more, then commit** - -Run: `bash scripts/test-vault-token-renew.sh` -Expected: `25 passed, 0 failed`. - -```bash -git $GCFLAGS add docs/runbooks/vault-token-renew-devvm.md scripts/test-vault-token-renew.sh -git $GCFLAGS commit -m "vault-token-renew runbook: document the self-heal behavior - -Drift guard section rewritten: admin-capable clobbers now self-heal at the -nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure; -manual re-mint is only the weak-clobber recovery now." -``` - -### Task 5: Deploy + live verification (on devvm, as wizard) - -**Files:** none (host deploy + live checks) - -- [ ] **Step 1: Install from the worktree** - -```bash -install -m 0755 ~/code/infra/.worktrees/vault-token-self-heal/scripts/vault-token-renew.sh ~/.local/bin/vault-token-renew -``` - -(Units unchanged — no `daemon-reload` needed.) 
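
Because the deploy is a manual copy, a quick check that the installed file really is the worktree version can save a confusing live test. The sketch below is an optional addition, not part of the plan; `verify_deploy` is a hypothetical helper that takes the source and destination paths as arguments.

```bash
#!/usr/bin/env bash
# Hypothetical post-install check: succeeds only when the deployed copy is
# byte-identical to the source and carries the executable bit.
verify_deploy() {
  local src="$1" dst="$2"
  cmp -s "$src" "$dst" || { echo "MISMATCH: $dst differs from $src"; return 1; }
  [ -x "$dst" ] || { echo "NOT EXECUTABLE: $dst"; return 1; }
  echo "deployed copy matches"
}
```

On devvm this would be called as `verify_deploy ~/code/infra/.worktrees/vault-token-self-heal/scripts/vault-token-renew.sh ~/.local/bin/vault-token-renew` right after the `install` command.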
- -- [ ] **Step 2: Live case 1 — admin-capable clobber heals** - -```bash -export VAULT_ADDR=https://vault.viktorbarzin.me -export XDG_RUNTIME_DIR=/run/user/$(id -u) -FAKE_ADMIN=$(vault token create -ttl=1h -policy=vault-admin -policy=sops-admin -display-name=fake-oidc -field=token) -printf '%s' "$FAKE_ADMIN" > ~/.vault-token -systemctl --user start vault-token-renew.service; echo "exit=$?" -tail -1 ~/.local/state/vault-token-renew.log -vault token lookup | grep -E 'display_name|period' -``` - -Expected: `exit=0`; log line `HEALED: re-minted periodic token from foreign dn="token-fake-oidc" (revoked N stale periodic token(s))` with N ≥ 1 (the pre-clobber periodic token is itself swept as stale — by design — along with any strays from the June 26 / July 3 manual re-mints); lookup shows `display_name token-devvm-wizard`, `period 768h`. Note: `FAKE_ADMIN` is a child of the swept old token, so the cascade revokes it too — no cleanup needed. - -- [ ] **Step 3: Verify exactly ONE periodic token remains server-side** - -```bash -for a in $(vault list -format=json auth/token/accessors | jq -r '.[]'); do - vault token lookup -format=json -accessor "$a" 2>/dev/null \ - | jq -r 'select(.data.display_name=="token-devvm-wizard") | .data.accessor' -done -``` - -Expected: exactly one line, matching `vault token lookup -format=json | jq -r .data.accessor`. - -- [ ] **Step 4: Live case 2 — weak clobber stays a loud failure** - -```bash -GOOD=$(cat ~/.vault-token) -FAKE_WEAK=$(vault token create -ttl=10m -policy=default -display-name=fake-weak -field=token) -printf '%s' "$FAKE_WEAK" > ~/.vault-token -systemctl --user start vault-token-renew.service; echo "exit=$?" 
-systemctl --user is-failed vault-token-renew.service -tail -1 ~/.local/state/vault-token-renew.log -printf '%s' "$GOOD" > ~/.vault-token && chmod 600 ~/.vault-token -vault token revoke "$FAKE_WEAK" >/dev/null -``` - -Expected: `exit=1` (start reports the oneshot failure), `is-failed` prints `failed`, log line `DRIFT: ~/.vault-token is dn="token-fake-weak" — heal denied, foreign token lacks create authority (… permission denied …); investigate what wrote it. Manual re-mint: …`. - -- [ ] **Step 5: Happy path still green** - -```bash -systemctl --user start vault-token-renew.service; echo "exit=$?" -tail -1 ~/.local/state/vault-token-renew.log -``` - -Expected: `exit=0`, log `OK renewed (dn=token-devvm-wizard ttl=2764800s)`. - -### Task 6: Land on master + cleanup - -- [ ] **Step 1: Merge latest master into the branch, re-verify, push** - -```bash -cd ~/code/infra/.worktrees/vault-token-self-heal -git $GCFLAGS fetch forgejo -git $GCFLAGS merge forgejo/master -bash scripts/test-vault-token-renew.sh -git $GCFLAGS push forgejo HEAD:master -``` - -Expected: clean merge (or already up to date), `25 passed, 0 failed`, push accepted. Non-fast-forward → fetch, merge, push again. - -- [ ] **Step 2: Watch CI to completion** - -The push fires the infra Woodpecker `default.yml` (terragrunt apply for changed stacks). This change touches only `scripts/` + `docs/` → expect a fast success / no-op apply. Check (Forgejo-forge infra repo = Woodpecker repo id 82): - -```bash -export VAULT_ADDR=https://vault.viktorbarzin.me -vault kv get -format=json secret/ci/global | jq -r '.data.data | keys[]' # find the woodpecker admin token key -WP_TOKEN=$(vault kv get -field= secret/ci/global) -curl -s -H "Authorization: Bearer $WP_TOKEN" 'https://ci.viktorbarzin.me/api/repos/82/pipelines?perPage=1' | jq '.[0] | {number, status, commit: .commit[0:8]}' -``` - -Expected: the pipeline for the pushed commit reaches `status: "success"` (poll until terminal). If it fails, fix before proceeding. 
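
"Poll until terminal" can be sketched as a small loop around whatever command prints the latest pipeline status. The helper names below are hypothetical, and the set of terminal statuses (`success`, `failure`, `error`, `killed`) is an assumption to check against the Woodpecker version in use.

```bash
#!/usr/bin/env bash
# ASSUMPTION: these four statuses are treated as terminal; adjust as needed.
is_terminal() {
  case "$1" in
    success|failure|error|killed) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_pipeline <max_attempts> <sleep_seconds> <status_command...>
# Runs the status command until it prints a terminal status, echoes that
# status, and returns 0 only for "success". Echoes "timeout" if it gives up.
wait_for_pipeline() {
  local max="$1" pause="$2" status i
  shift 2
  for ((i = 0; i < max; i++)); do
    status=$("$@") || status=""
    if is_terminal "$status"; then
      echo "$status"
      if [ "$status" = "success" ]; then return 0; else return 1; fi
    fi
    sleep "$pause"
  done
  echo "timeout"
  return 1
}
```

The status command would typically wrap the `curl` call above with `jq -r '.[0].status'` in place of the object filter, for example `wait_for_pipeline 60 10 latest_status`, where `latest_status` is a function you define around that call.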
- -- [ ] **Step 3: Remove worktree + branch, reconcile main checkout** - -```bash -git -C ~/code/infra $GCFLAGS worktree remove .worktrees/vault-token-self-heal -git -C ~/code/infra $GCFLAGS branch -d wizard/vault-token-self-heal -git -C ~/code/infra status --porcelain # expect clean before pulling -git -C ~/code/infra $GCFLAGS pull --ff-only forgejo master -``` - -Expected: worktree gone, branch deleted (already merged), main checkout fast-forwards to the landed commit. - -### Task 7: Memory + wrap-up - -- [ ] **Step 1: Update the stale memories** (they say the drift guard is detect-only / recovery is manual): - -```bash -homelab memory recall "vault periodic token renewer drift" # confirm ids 4204, 4211, 7121 still say detect-only -homelab memory update 4211 "" -homelab memory update 7121 "" -``` - -(Fetch each memory's current text first and preserve it — amend, don't replace wholesale.) - -- [ ] **Step 2: End-of-task extraction** — dispatch the standard M.3 memory-mining subagent per `~/.claude/rules/execution.md`, then give the final summary. - ---- - -## Plan self-review (done at write time) - -- **Spec coverage**: heal-on-admin-clobber (T3), loud-fail-on-weak (T3 + live T5.4), no-revoke-foreign (T3 comment + design decision 4), anti-sprawl sweep + fail-safe filter (T2/T3, live T5.3), minted-token sanity + atomic write (T3), unit tests (T1/T2), runbook (T4), deploy + live sim (T5), memory updates (T7). ✓ -- **Placeholders**: `` in T6.2 is a deliberate discovery step (key name verified live from Vault, not invented). No other TBDs. ✓ -- **Name consistency**: `vtr_accessor`, `vtr_is_stale_periodic`, `vtr_heal`, `EXPECTED_DN` match across tasks; test count 17→25 consistent (8 new cases). 
✓ diff --git a/docs/plans/2026-07-04-backup-mx-design.md b/docs/plans/2026-07-04-backup-mx-design.md deleted file mode 100644 index fe54af61..00000000 --- a/docs/plans/2026-07-04-backup-mx-design.md +++ /dev/null @@ -1,335 +0,0 @@ -# Backup MX — self-hosted store-and-forward relay on Oracle Always-Free — design - -Date: 2026-07-04 (v3 — post-challenge; v2 Oracle pivot same day) · Status: design, -pre-implementation · ADR: [0019](../adr/0019-backup-mx-self-hosted-oracle-relay.md) - -v3 incorporates two independent adversarial-challenge reviews (same day). Their -material corrections are marked **[CH]** throughout — the largest: the v2 drain -path would never have drained (primary-side smtpd rejects), monitoring-over- -tailnet was fiction (no cluster→tailnet route exists), and the VM's bounce -model was wrong (it can never deliver a DSN). - -## Goal - -Inbound mail for `viktorbarzin.me` must survive homelab outages without loss. -Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is -acceptable; budget is $0** (hard constraint — reaffirmed after the Rollernet -gates failed). A store-and-forward backup MX queues mail while the homelab is -down and re-delivers when it returns. - -Out of scope, explicitly: - -- Reading new mail *during* an outage. -- Outbound mail during outages. -- The "primary up but hard-bouncing 5xx" misconfig class — a backup MX is - never consulted when the primary answers. Separate hardening/alerting track. - -Known residual limit (state it plainly): an outage **longer than 30 days** -loses the queued mail *silently* — the VM cannot emit a bounce to anyone -(egress 25 blocked), so no sender ever learns. Accepted; 30 days is already -6× the sender-retry status quo. - -## v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04) - -v1 selected Roller Network's free Secondary MX. 
The validation gates killed it -before any DNS change: - -- **G2 FAILED**: the [free-accounts policy](https://rollernet.us/policy/free-accounts.html) - caps free mail service at **200 relayed messages or 10 MB per rolling 7 - days**; overage → domain suspended **48 h answering SMTP 5xx** (permanent - bounces), repeatable. Spammers deliberately target backup MXes even while - the primary is up, so background spam alone can hold the domain suspended — - worse than no backup MX. -- **G1 SHAKY**: same policy page says free accounts are being discontinued. -- **G3 PASSED** (for posterity): `mail{,2}.rollernet.us` present valid LE - certs over STARTTLS. -- Signup is Cloudflare-Turnstile-gated — moot given G1/G2. - -Viktor's decision: stay free → self-host on Oracle Always-Free. **[CH]** The -external challenger re-searched the free landscape (DNSExit, KisoLabs, -DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed: -no credible free managed backup-MX or free VM with a usable port-25 story -exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and -is US-regions-only (wrong continent). - -## Decision - -A minimal **Postfix store-and-forward relay** (`mx2.viktorbarzin.me`) on an -Oracle Cloud **Always-Free** compute instance, published as a lower-preference -MX. It accepts mail for `viktorbarzin.me` when the primary is unreachable, -queues up to 30 days, and drains to the primary when it returns. No mailboxes, -no third-party terms — the queue-lifetime and reject-behavior knobs are ours. 
- -## Architecture - -``` - ┌── pri 1 mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod -sender MTA ──► MX lookup ┤ ▲ - └── pri 20 mx2.viktorbarzin.me │ drain: smtp to - (Oracle VM, Postfix relay, │ mail.viktorbarzin.me:2526 - queue ≤ 30 days) ───────────────────┘ (pfSense WAN NAT rdr - 2526 → 10.0.20.1:25, - existing HAProxy frontend) -``` - -- **Normal operation**: senders use pri 1; the VM idles (spammers targeting - the backup + transient-blip retries get relayed onward immediately). -- **Outage**: senders fall back to pri 20 → VM accepts + queues → Postfix - retries the primary on its native schedule → queue drains after recovery - through the standard external ingress path (PROXY v2 → :2525 → rspamd → - Dovecot). -- **Custom drain port**: Oracle blocks **egress TCP 25** tenancy-wide - (post-2021; exemptions unreliable) — the VM cannot reach - `mail.viktorbarzin.me:25`. One pfSense WAN NAT rule `TCP 2526 → - 10.0.20.1:25` reuses the existing HAProxy frontend unchanged. **[CH] - Verified against the runbook**: the frontend binds `*:25` on pfSense (not - strictly 10.0.20.1), rdr dst-port rewrite is the existing production - pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides - with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 **to** - the VM is unaffected by Oracle's egress-only block per practitioner - evidence (iRedMail/mailcow on OCI: receive works, send doesn't) — **to be - proven at gate O2 before any DNS change** (Oracle publishes no positive - commitment). - -## Oracle account & instance - -- **Account**: Viktor creates it (human signup; card for identity, $0 - charged). **Home region is fixed at signup and Always-Free compute exists - only there — choose `eu-frankfurt-1` deliberately; there is no - try-another-region fallback without a new account. 
[CH]** -- **[CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation**: - Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days — an - idle Postfix box qualifies) and demonstrably changes free-tier terms without - notice, enforcing by termination (June 2026: A1 allowance silently halved, - over-limit instances shut down). PAYG keeps Always-Free resources free and - exempts them from idle reclamation. -- **Shape**: `VM.Standard.E2.1.Micro` (x86, 1/8 OCPU burst, 1 GB RAM; 2 - always-free instances allowed; ample for queue-only Postfix — and untouched - by the 2026 A1 cuts). ARM A1 fallback is **unreliable** (halved quota, - chronic Frankfurt capacity) — treat E2.1.Micro availability as the gate. -- **[CH] Reserved public IP is mandatory** (`oci_core_public_ip`, reserved): - an ephemeral IP rotates on stop/start and would silently break all four - IP-keyed controls at once (pfSense NAT source-restriction, the primary's - smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape - allowlist) — discovered only at the next outage's drain. -- **OS**: Ubuntu 24.04. **[CH] OCI Ubuntu images ship an OS-level iptables - ruleset (`/etc/iptables/rules.v4`) that ACCEPTs 22 and REJECTs everything - else, independent of security lists** — cloud-init must insert ACCEPT rules - for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2 - fails on day 1 with a correct security list. -- **Credentials**: OCI API key for Terraform → Vault `secret/viktor` - (`oci_*`); web login → Vaultwarden item `Oracle Cloud (backup MX)`. - -## Networking & security posture - -- **Ingress on the VM**: TCP 25 world-open (the service). **[CH] TCP 80 - world-open permanently** — Let's Encrypt validation is multi-perspective - with no published source IPs, so it cannot be source-scoped, and a - "open-only-during-renewal" toggle is unspecified automation whose realistic - failure mode is an expired cert at day ~90. 
Nothing listens on 80 outside - certbot's seconds-long renewal windows; connection-refused surface is - negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32 - (176.12.22.76) in both the Oracle security list and the VM firewall. -- **No public SSH**: management rides the headscale tailnet — cloud-init - enrolls via a **preauth key for a dedicated non-OIDC headscale user** with - node tag `tag:backup-mx` (headscale 0.28.0 file-mode ACL, content in Vault - `secret/headscale` → `headscale_acl`); SSH bound to the tailnet interface. - ACL grant: `group:admin → tag:backup-mx:22` (cluster pods are NOT tailnet - members — see monitoring). **[CH] Outage caveat**: headscale's control - plane + DERP live in the cluster, so mid-outage tailnet reachability is - cached-netmap best-effort — the runbook documents the **OCI instance - console connection as break-glass** management. (Also fix `vpn.md`'s stale - "0.23.x / OIDC-only" claims while in there.) -- **VM compromise blast radius**: plaintext of outage-queued mail + a relay - surface contained by `relay_domains = viktorbarzin.me` only, no submission - ports, no SASL, no local delivery. The VM is deliberately NOT added to the - primary's `mynetworks` (that would let a compromised VM relay arbitrary - mail *through* the primary) — per-stage exemptions instead, below. - -## Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene) - -- `relay_domains = viktorbarzin.me`; `mydestination =` (empty). -- **[CH]** `smtpd_relay_restrictions = permit_mynetworks, - reject_unauth_destination` — explicit 5xx for foreign-domain RCPTs (the - default tail is `defer_unauth_destination`, whose 4xx invites every relay - probe to retry forever). -- **[CH]** `relay_recipient_maps` explicitly set to the wildcard form - (`@viktorbarzin.me OK`) — documents accept-all-recipients as a decision - (the domain is catch-all; every RCPT is valid by definition). 
-- `transport_maps`: `viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526`. -- `maximal_queue_lifetime = 30d`. **[CH]** `bounce_queue_lifetime = 1d` and - `delay_warning_time = 0` — this host can never deliver a DSN to anyone - (egress 25 blocked; its only egress is 2526 to the primary), so undeliverable - bounces must be discarded quickly or they rot in the queue for a month and - permanently poison the queue-depth alert. -- **[CH]** `message_size_limit = 209715200` — exactly the primary's 200 MB - (`POSTFIX_MESSAGE_SIZE_LIMIT`, mailserver main.tf:88). The stock 10 MB - default would 552-reject large legitimate mail during outages — the exact - loss mode this project exists to prevent. Equal, never higher (higher - recreates drain-time rejects). -- **[CH] postscreen on the VM in 4xx-only posture**: pregreet test ON - (fire-and-forget bots don't retry; real MTAs do — the whole design already - rests on sender retry, so 4xx filtering is loss-free by construction), - optionally `postscreen_dnsbl_action = defer` with a conservative threshold. - v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly banned) - with 4xx tempfail (harmless); without any hygiene the backup is a 24/7 - spam backdoor since spammers deliberately deliver to the highest-numbered - MX. Zero 5xx from reputation, ever. -- `inet_protocols = ipv4` **[CH]** — the primary publishes an AAAA (HE - tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted - v6 attempt per delivery. -- `smtpd_tls_cert_file` = LE cert for `mx2.viktorbarzin.me` (opportunistic - STARTTLS inbound; `smtp_tls_security_level = may` on the drain leg). -- Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day - accumulation for a personal domain. - -## TLS - -certbot standalone HTTP-01 for `mx2.viktorbarzin.me` (no Cloudflare API token -on an internet-facing VM). Port 80 permanently open (see above); certbot renew -timer. 
The MTA-STS follow-up (separate task; policy host currently dangling — -below) must list `mx2.viktorbarzin.me` when implemented. - -## Primary-side drain enablement **[CH — this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]** - -The v2 exemptions targeted postscreen DNSBL (which is **off** on the primary — -`ENABLE_DNSBL` unset) and rspamd SPF/DMARC scoring — but missed the three -mechanisms that would actually break the drain. All are keyed on the VM's -reserved /32 (the PROXY-v2-recovered client IP): - -1. **`reject_unknown_client_hostname` bypass** — the primary sets - `POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1` (main.tf:89); an Oracle IP - without full FCrDNS (PTR needs an Oracle SR; limited on free accounts) - would be **450-deferred on every drain attempt → the queue never drains → - mass-bounces at day 30**. Fix: `check_client_access` permit for the VM /32 - early in `smtpd_client_restrictions`, and a matching permit at the sender - stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope - senders — drained self-addressed/bounced mail would 5xx). Attempt the - Oracle PTR anyway (belt and braces). -2. **Anvil rate-limit exception** — `smtpd_client_message_rate_limit = 30`/min - keys on the VM's IP at drain; a >3,600-message backlog would throttle for - hours and false-fire the queue alert. Add the VM /32 to - `smtpd_client_event_limit_exceptions`. -3. 
**rspamd: evaluate the original sender, never 5xx the drain stream** — via - the existing override.d ConfigMap pattern (same mount as - `dkim_signing.conf`): (a) configure rspamd's **`external_relay`** module - (ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the - *original* client IP parsed from the VM's Received header — this keeps - DMARC protection for the entire drain stream instead of v2's blanket - disable; (b) cap rspamd's **action at the VM /32 to tag/fold — never - milter-reject**: the primary's default reject tier (DMS default, active - since only dkim_signing is overridden today) would 5xx high-score spam at - DATA, forcing the VM to generate DSNs to forged senders = classic - backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in - the catch-all's Junk instead. Validate the external_relay ↔ settings-rule - interplay at gate O5 with a high-spam-score message. -4. postscreen permit for the /32 (harmless; pregreet never trips a real - Postfix client and DNSBL is off — kept for future-proofing only). - -## Our-side changes (Terraform unless noted) - -1. **New stack `stacks/backup-mx/`** (Tier 1): OCI provider (creds from - Vault), VCN + subnet + security list + **reserved public IP** + - `VM.Standard.E2.1.Micro` + cloud-init (`templatefile`): **OS iptables - ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule - (persisted)**, postfix + config above, certbot, tailscale→headscale - enrollment (preauth key from Vault), node_exporter, postfix_exporter, - unattended-upgrades. -2. **DNS** — `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: A - `mx2.viktorbarzin.me` → reserved IP (non-proxied), MX pref 20 → `mx2`. - **[CH] Live zone count verified: 195/200 → 197/200 after this change; only - 3 slots remain and the MTA-STS follow-up needs 1–2 → plan the next - record-purge now, not at collision time.** -3. 
**pfSense (live network device — approved as part of this plan)**: WAN NAT - rdr `TCP 2526 → 10.0.20.1:25` + firewall rule, source-restricted to the - reserved IP. **[CH] Scripted** (extend the existing - `scripts/pfsense-*-haproxy*.php` bootstrap-script family), not - hand-clicked — keeps the git-rebuildable parity the rest of the pfSense - mail config has. Config.xml rides the nightly backup. -4. **Mailserver stack**: the four-layer drain enablement above (client+sender - `check_client_access` permits, anvil exception, rspamd external_relay + - action cap, postscreen permit) — all keyed to one /32, via the existing - `postfix_cf` / `user-patches.sh` / rspamd-override hook points (verified - present: main.tf:129-144, 222-281, 467-474). -5. **Monitoring [CH — replaces v2's tailnet scraping, which had no transport: - no cluster→tailnet route exists and no existing target is scraped that - way]**: Prometheus scrapes `node_exporter`/`postfix_exporter` on the VM's - **public reserved IP**, allowed only from the homelab WAN /32 (Oracle SL + - VM firewall); blackbox TCP:25 from the cluster (`BackupMxDown`, warning); - MX-set drift assertion (both MX records present). Alerts: - `BackupMxQueueStuck` = **non-bounce** queue depth > 0 for 2 h while the - primary is healthy (gate on the existing `MailServerDown`/roundtrip - series, machine-readable — not prose); bounce residue is excluded by the - 1-day bounce lifetime. Note: during a full homelab outage Prometheus - itself is down — queue growth is unobservable live under ANY transport; - what we actually watch is the post-recovery drain. A WAN-IP change stales - the Oracle allowlist → visible as ScrapeTargetDown (self-signaling). - **Probe semantics note**: once mx2 exists, the Brevo roundtrip probe's - mail fails over to mx2 on transient primary blips and arrives minutes late - via the drain — `EmailRoundtripFailing` may then mean "delayed via mx2", - not "lost"; note in the alert description and runbook. -6. 
**Docs (same commit as implementation)**: rewrite `mailserver.md` §"No - Backup MX", new runbook `docs/runbooks/backup-mx.md` (`postqueue -p`, - forced drain `postqueue -f`, cert renewal, **OCI console break-glass**, VM - rebuild from stack, Oracle account facts incl. PAYG + home-region lock), - `vpn.md` headscale-version/OIDC staleness fix, monitoring rows. - -### MTA-STS finding (unchanged; no action in this change) - -`_mta-sts` TXT is published but `mta-sts.viktorbarzin.me` has no record and -nothing serves the policy — MTA-STS is inert today. When fixed, the policy -MUST include `mx: mx2.viktorbarzin.me` (and budget its DNS records against the -3 remaining zone slots). - -## Validation gates (in order; any failure → stop and report) - -| # | Gate | Method | Failure handling | -|---|------|--------|------------------| -| O1 | Oracle account (home region `eu-frankfurt-1`, **fixed forever at signup**), **PAYG conversion done**, E2.1.Micro capacity | Viktor signs up + converts; TF apply | A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor | -| O2 | Inbound TCP 25 reachable from the internet (after the OS-iptables fix) | `nc -zv 25` from outside + recurring Uptime-Kuma TCP monitor (keeps proving it — Oracle publishes no commitment) | Stop; decision returns to Viktor | -| O3 | Drain works: VM → `mail.viktorbarzin.me:2526` delivers end-to-end | Test message injected on the VM | Debug pfSense NAT / HAProxy path | -| O4 | LE cert issued | certbot standalone | STARTTLS is opportunistic — non-blocking for go-live; fix before MTA-STS | -| O5 | Live failover test — **hardened [CH]** | presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo **plus a high-spam-score message and a >10 MB message** → confirm queued (`postqueue -p`) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on 
the VM, roundtrip probe recovers | Debug or roll back (remove MX record) | - -## Failure modes - -Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP -changes, short-retry senders. If pfSense is down the drain waits — Postfix -retries until it heals. - -Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox -access; **outages > 30 days lose queued mail silently (no DSN possible)**. -Simultaneous Oracle+homelab outage = status quo ante (sender retries). - -Newly introduced, accepted: - -- **A pet outside the cluster** — deliberately cattle: rebuilt from TF + - cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a - backup target. -- **Oracle free-tier caprice [CH — upgraded from v2's framing]**: Oracle has - silently cut Always-Free allowances and terminated over-limit instances - (June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe, - `BackupMxDown`, and the fact that outside an active outage the queue is - empty — a surprise reclamation loses nothing, only coverage until rebuilt. - Rollernet Basic ($30/yr) stays the documented fallback if OCI sours. -- **Spam hygiene**: 4xx-only postscreen on the VM (pregreet + conservative - DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by - rspamd, never bounced. -- Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant; - accepted). - -## Rollback - -Remove the MX + A records; wait for `postqueue -p` empty; `terraform destroy` -on `backup-mx`; delete the pfSense NAT rule (scripted); drop the mailserver -/32 exemptions. Order matters: MX record first. - -## Viktor's manual steps (everything else is mine) - -1. Create the Oracle Cloud account — **home region `eu-frankfurt-1`** (fixed - forever), card for identity, $0 charged. -2. **Convert the tenancy to Pay-As-You-Go** (required — idle-reclamation - exemption; Always-Free stays $0). -3. 
Hand me the tenancy OCID + a console user → I mint the API key, store - creds (Vault + Vaultwarden), and build the stack. -4. Approve the (scripted) pfSense NAT rule when I reach that step. diff --git a/docs/plans/2026-07-04-drone-logbook-design.md b/docs/plans/2026-07-04-drone-logbook-design.md deleted file mode 100644 index 78e3b469..00000000 --- a/docs/plans/2026-07-04-drone-logbook-design.md +++ /dev/null @@ -1,89 +0,0 @@ -# Drone Logbook (Open DroneLog) — Design - -**Date:** 2026-07-04 -**Status:** Approved (Viktor, 2026-07-04) -**Owner request:** "I have a DJI Mini 4 Pro. I'm interested in github.com/ViktorBarzin/drone-logbook" → self-host it in the cluster. - -## Goal - -Self-host [Open DroneLog](https://github.com/arpanghosh8453/open-dronelog) (upstream of the -`ViktorBarzin/drone-logbook` fork) at **https://dronelog.viktorbarzin.me** so Viktor can import -DJI Fly flight logs from his DJI Mini 4 Pro and analyze them privately: telemetry charts, 3D map -replay, per-flight and lifetime stats. All data stays in the cluster (single DuckDB database). - -## Decisions (interview, 2026-07-04) - -| Question | Decision | -|---|---| -| Deployment form | Self-hosted Docker web app in k8s (not desktop app, not hosted webapp) | -| Exposure | Public `dronelog.viktorbarzin.me`, **Authentik forward-auth** (`auth = "required"`) | -| Log ingestion | **Both** manual web upload *and* a server-side auto-import drop folder from day one | -| Image source | **Upstream** `ghcr.io/arpanghosh8453/open-dronelog:latest` — NOT the fork | -| Fork disposition | Fork is 0 ahead / 372 behind, adds nothing; delete or park it. 
Only revive (sync + ADR-0002 GHA build) if Viktor starts modifying the code | - -## Architecture - -New Tier-1 stack `stacks/drone-logbook/`, modeled line-by-line on `stacks/freshrss/` -(the closest existing shape: single upstream-image app, own data volume, Keel-updated): - -- **Namespace** `drone-logbook`, tier `4-aux`, label `keel.sh/enrolled=true` → Kyverno injects - Keel poll annotations → auto-upgrades as upstream releases (project is actively maintained). -- **Deployment** (1 replica, `Recreate` — DuckDB is single-writer/embedded): - - image `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx frontend + Axum REST backend, port 80) - - memory request=limit **512Mi** (DuckDB import/analytics spikes), cpu request 25m, no cpu limit - - standard `KYVERNO_LIFECYCLE_V1` / `KEEL_IGNORE_IMAGE` / `KEEL_LIFECYCLE_V1` lifecycle ignores -- **App data** `/data/drone-logbook` (DuckDB db, cached DJI decryption keys, uploaded originals): - **`proxmox-lvm-encrypted` block PVC** `drone-logbook-data-encrypted`, 2Gi, topolvm autoresize → - 10Gi ceiling. Encrypted class because flight logs are GPS traces of home/travel — sensitive data - defaults to `proxmox-lvm-encrypted` per the storage decision rule (`.claude/CLAUDE.md`). - Embedded DBs stay off NFS (same rationale documented in the freshrss stack: NFS only for static files). -- **Backup CronJob** `drone-logbook-backup` (mandatory for every proxmox-lvm app): daily 01:30 - file copy of the data volume → NFS `/srv/nfs/drone-logbook-backup` (dated dirs, 30-day retention, - Pushgateway metrics), pod-affinity co-scheduled with the app pod (RWO volume). 01:30 sits outside - the 00:00/08:00/16:00 sync-import windows so the DuckDB file is quiescent; retained upload - originals make even a torn copy recoverable by re-import. `nfs-mirror` (02:00) ships it to sda → - Synology offsite. Vaultwarden pattern. 
-- **Sync drop folder**: static NFS volume (`modules/kubernetes/nfs_volume`) - `192.168.1.127:/srv/nfs/drone-logbook/sync-logs`, mounted **read-only** at `/sync-logs`; - `SYNC_LOGS_PATH=/sync-logs`, `SYNC_INTERVAL="0 0 */8 * * *"` (every 8 h). - Any producer (Nextcloud sync, scp, a future phone pipeline) drops `.txt` logs there; the app - imports them automatically. `KEEP_UPLOADED_FILES=true` keeps re-importable originals in the PVC. -- **Ingress** via `ingress_factory`: `name = "dronelog"`, `auth = "required"` (Authentik - forward-auth), `dns_type = "proxied"`. External Uptime Kuma HTTPS monitor comes automatically - with the ingress annotation. Homepage tile (group "Media & Entertainment", icon `mdi-quadcopter`). -- **Secrets**: Vault KV `secret/drone-logbook` (`profile_creation_pass`) → ExternalSecret - (`vault-kv` ClusterSecretStore) → k8s secret `drone-logbook-secrets` → env - `PROFILE_CREATION_PASS`. Gates profile create/delete even for other Authentik-logged-in users. - No plan-time secret reads needed (no `data "kubernetes_secret"`). - No `DJI_API_KEY` — bundled default is fine at personal import volume; add later if rate-limited. - -## Operational notes - -- **DJI egress dependency**: importing a *new* log file requires the pod to reach DJI's servers - once (flight-log decryption key fetch; keys are then cached in the data dir). Remember this when - egress enforcement lands (Security wave 1, beads `code-8ywc`). -- The web UI is desktop-first; mobile is functional but basic. -- NFS host prerequisite: `/srv/nfs/drone-logbook/sync-logs` (root:www-data, 2775 — same shape as - sibling dirs) and `/srv/nfs/drone-logbook-backup` created on 192.168.1.127 and recorded in - `secrets/nfs_directories.txt`. `/srv/nfs` is exported whole-tree, so no `/etc/exports` - (`scripts/pve-nfs-exports`) change. 
-- Backup story = the daily app-level backup CronJob (above) + the host `daily-backup` LVM-snapshot - leg + original log files retained both in the drop folder and in the data volume - (`KEEP_UPLOADED_FILES=true`). - -## Alternatives considered - -- **Build from the fork** (`ghcr.io/viktorbarzin/...` via GHA, ADR-0002): rejected for now — fork - has zero custom commits; a build chain adds maintenance for no benefit. Revisit if code changes - are wanted. -- **`auth = "app"` + app profile passwords** (would enable the `opendronelog-sync` native uploader - from anywhere): rejected — a single app password guarding GPS traces of home/travel on the open - internet is weaker than Authentik; the sync drop folder covers automated ingestion instead. -- **Internal-only (.lan + VPN)**: rejected — Authentik-gated public matches the rest of the - homelab and works without VPN while traveling. -- **NFS for the DuckDB data**: rejected — embedded-DB-on-NFS locking risk; freshrss precedent - keeps app DB data on proxmox-lvm. - -## Implementation - -See `2026-07-04-drone-logbook-plan.md`. diff --git a/docs/plans/2026-07-04-drone-logbook-plan.md b/docs/plans/2026-07-04-drone-logbook-plan.md deleted file mode 100644 index 588c7ab1..00000000 --- a/docs/plans/2026-07-04-drone-logbook-plan.md +++ /dev/null @@ -1,542 +0,0 @@ -# Drone Logbook (Open DroneLog) Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Deploy Open DroneLog (DJI flight-log analyzer) at https://dronelog.viktorbarzin.me — new Tier-1 stack `stacks/drone-logbook/`, upstream image, Authentik-gated, with a DuckDB data PVC and an NFS auto-import drop folder. 
-
-**Architecture:** Single Deployment running `ghcr.io/arpanghosh8453/open-dronelog:latest` (nginx + Axum + DuckDB, port 80) in namespace `drone-logbook`; data on a `proxmox-lvm-encrypted` PVC (GPS logs = sensitive data), `/sync-logs` drop folder on static NFS, daily backup CronJob to `/srv/nfs/drone-logbook-backup` (vaultwarden pattern), `ingress_factory` with `auth = "required"`, Keel auto-upgrades via namespace enrollment. Modeled line-by-line on `stacks/freshrss/`. Design: `2026-07-04-drone-logbook-design.md`.
-
-**Tech Stack:** Terraform/Terragrunt (Tier-1 PG state), Vault KV + ESO, ingress_factory, nfs_volume module, Keel/Kyverno.
-
-Terraform is exempt from TDD (execution.md); each task ends with a concrete verification instead.
-
----
-
-### Task 1: Vault secret
-
-**Files:** none (Vault KV only)
-
-- [ ] **Step 1.1: Create `secret/drone-logbook` with a generated profile-creation password**
-
-```bash
-vault kv put secret/drone-logbook profile_creation_pass="$(openssl rand -base64 24)"
-```
-
-- [ ] **Step 1.2: Verify**
-
-```bash
-vault kv get -field=profile_creation_pass secret/drone-logbook | wc -c
-```
-
-Expected: `32` (24 random bytes encode to 32 base64 characters; `-field` output has no trailing newline when piped). Never echo the value itself.
-
-### Task 2: NFS drop folder on 192.168.1.127
-
-**Files:**
-- Modify: `secrets/nfs_directories.txt` (git-crypt'd — **edit from the MAIN checkout only**, never the worktree; sorted list, add `drone-logbook-backup` and `drone-logbook/sync-logs`)
-
-- [ ] **Step 2.1: Create the directories** — world-writable + setgid like `vaultwarden-backup` (the `/srv/nfs` export root-squashes, so pod-root writes land as `nobody`):
-
-```bash
-ssh root@192.168.1.127 'mkdir -p /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && chown -R root:www-data /srv/nfs/drone-logbook /srv/nfs/drone-logbook-backup && chmod 2777 /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup && ls -ld /srv/nfs/drone-logbook/sync-logs /srv/nfs/drone-logbook-backup'
-```
-
-Expected: `drwxrwsrwx ... root www-data ...` for both.
-No `/etc/exports` (`scripts/pve-nfs-exports`) change — `/srv/nfs` is exported whole-tree.
-
-- [ ] **Step 2.2: Record them in the declarative list (MAIN checkout, plaintext there)** — insert `drone-logbook-backup` and `drone-logbook/sync-logs` (after `diun`, before `etcd-backup`) in `~/code/infra/secrets/nfs_directories.txt`, then commit that single file to master:
-
-```bash
-git -C ~/code/infra add secrets/nfs_directories.txt
-git -C ~/code/infra commit -m "nfs_directories: add drone-logbook dirs
-
-Drop folder for the new drone-logbook stack's auto-import (SYNC_LOGS_PATH)
-plus the backup target (drone-logbook-backup).
-Directories created on 192.168.1.127 root:www-data 2777."
-git -C ~/code/infra push forgejo master
-```
-
-(Trivial single-file exception per execution.md; encrypted files cannot be edited from the worktree.)
-
-### Task 3: Stack files (in the `wizard/drone-logbook` worktree)
-
-**Files:**
-- Create: `stacks/drone-logbook/main.tf` (content below)
-- Create: `stacks/drone-logbook/terragrunt.hcl` (content below)
-- Create: `stacks/drone-logbook/secrets` → symlink to `../../secrets`
-- (`backend.tf`, `tiers.tf`, `cloudflare_provider.tf`, `providers.tf`, `.terraform.lock.hcl` are terragrunt-generated and **gitignored** — do NOT create or commit them; the tracked copies in old stacks like freshrss predate the ignore rule)
-
-- [ ] **Step 3.1: `terragrunt.hcl`**
-
-```hcl
-include "root" {
-  path = find_in_parent_folders()
-}
-
-dependency "platform" {
-  config_path  = "../platform"
-  skip_outputs = true
-}
-```
-
-- [ ] **Step 3.2: `main.tf`** — exact content:
-
-```hcl
-variable "tls_secret_name" {
-  type      = string
-  sensitive = true
-}
-variable "nfs_server" { type = string }
-
-# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog) — self-hosted
-# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the
-# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest.
-# Design: docs/plans/2026-07-04-drone-logbook-design.md -resource "kubernetes_namespace" "drone_logbook" { - metadata { - name = "drone-logbook" - labels = { - tier = local.tiers.aux - "keel.sh/enrolled" = "true" - } - } - lifecycle { - # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace - ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] - } -} - -resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } - manifest = { - apiVersion = "external-secrets.io/v1" - kind = "ExternalSecret" - metadata = { - name = "drone-logbook-secrets" - namespace = "drone-logbook" - } - spec = { - refreshInterval = "15m" - secretStoreRef = { - name = "vault-kv" - kind = "ClusterSecretStore" - } - target = { - name = "drone-logbook-secrets" - } - dataFrom = [{ - extract = { - key = "drone-logbook" - } - }] - } - } - depends_on = [kubernetes_namespace.drone_logbook] -} - -module "tls_secret" { - source = "../../modules/kubernetes/setup_tls_secret" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - tls_secret_name = var.tls_secret_name -} - -# DuckDB database + cached DJI decryption keys + uploaded originals. -# Embedded DB -> block storage, not NFS (same rationale as freshrss data). -# Encrypted class: flight logs are GPS traces of home/travel (sensitive data -# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md). 
-resource "kubernetes_persistent_volume_claim" "data" { - wait_until_bound = false - metadata { - name = "drone-logbook-data-encrypted" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - annotations = { - "resize.topolvm.io/threshold" = "10%" - "resize.topolvm.io/increase" = "100%" - "resize.topolvm.io/storage_limit" = "10Gi" - } - } - spec { - access_modes = ["ReadWriteOnce"] - storage_class_name = "proxmox-lvm-encrypted" - resources { - requests = { - storage = "2Gi" - } - } - } - lifecycle { - # The autoresizer expands requests.storage up to storage_limit and PVCs - # can't shrink; without this every apply tries to revert the size. - ignore_changes = [spec[0].resources[0].requests] - } -} - -# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands -# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL. -module "nfs_sync_logs" { - source = "../../modules/kubernetes/nfs_volume" - name = "drone-logbook-sync-logs" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - nfs_server = var.nfs_server - nfs_path = "/srv/nfs/drone-logbook/sync-logs" - storage = "5Gi" -} - -resource "kubernetes_deployment" "drone_logbook" { - metadata { - name = "drone-logbook" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - labels = { - app = "drone-logbook" - "kubernetes.io/cluster-service" = "true" - tier = local.tiers.aux - } - } - spec { - replicas = 1 - strategy { - # DuckDB is single-writer; never overlap two pods on the same volume - type = "Recreate" - } - selector { - match_labels = { - app = "drone-logbook" - } - } - template { - metadata { - labels = { - app = "drone-logbook" - "kubernetes.io/cluster-service" = "true" - } - } - spec { - container { - name = "drone-logbook" - image = "ghcr.io/arpanghosh8453/open-dronelog:latest" - env { - name = "RUST_LOG" - value = "info" - } - env { - # keep re-importable originals under /data/drone-logbook/uploaded - name = "KEEP_UPLOADED_FILES" - value = 
"true" - } - env { - name = "SYNC_LOGS_PATH" - value = "/sync-logs" - } - env { - # 6-field cron (sec min hour dom mon dow): scan drop folder every 8h - name = "SYNC_INTERVAL" - value = "0 0 */8 * * *" - } - env { - name = "PROFILE_CREATION_PASS" - value_from { - secret_key_ref { - name = "drone-logbook-secrets" - key = "profile_creation_pass" - } - } - } - volume_mount { - name = "data" - mount_path = "/data/drone-logbook" - } - volume_mount { - name = "sync-logs" - mount_path = "/sync-logs" - read_only = true - } - port { - name = "http" - container_port = 80 - protocol = "TCP" - } - resources { - requests = { - cpu = "25m" - memory = "512Mi" - } - limits = { - memory = "512Mi" - } - } - } - volume { - name = "data" - persistent_volume_claim { - claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name - } - } - volume { - name = "sync-logs" - persistent_volume_claim { - claim_name = module.nfs_sync_logs.claim_name - } - } - } - } - } - depends_on = [kubernetes_manifest.external_secret] - lifecycle { - ignore_changes = [ - spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 - metadata[0].annotations["keel.sh/policy"], - metadata[0].annotations["keel.sh/trigger"], - metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 - metadata[0].annotations["keel.sh/match-tag"], - spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates - metadata[0].annotations["kubernetes.io/change-cause"], - metadata[0].annotations["deployment.kubernetes.io/revision"], - spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 - ] - } -} - -resource "kubernetes_service" "drone_logbook" { - metadata { - name = "drone-logbook" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - labels = { - "app" = "drone-logbook" - } - } - - spec { - selector = { - app = "drone-logbook" - } - port { - port = "80" - target_port = "80" - } - } -} - -# 
----------------------------------------------------------------------------- -# Backup — required for every proxmox-lvm(-encrypted) app: daily copy of the -# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror -> -# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import -# windows, so the DuckDB file is quiescent; uploaded originals make even a -# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the -# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern. -# ----------------------------------------------------------------------------- - -module "nfs_backup" { - source = "../../modules/kubernetes/nfs_volume" - name = "drone-logbook-backup-host" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - nfs_server = var.nfs_server - nfs_path = "/srv/nfs/drone-logbook-backup" -} - -resource "kubernetes_cron_job_v1" "backup" { - metadata { - name = "drone-logbook-backup" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - } - spec { - concurrency_policy = "Replace" - failed_jobs_history_limit = 5 - schedule = "30 1 * * *" - starting_deadline_seconds = 300 - successful_jobs_history_limit = 3 - job_template { - metadata {} - spec { - backoff_limit = 3 - ttl_seconds_after_finished = 10 - template { - metadata {} - spec { - affinity { - pod_affinity { - required_during_scheduling_ignored_during_execution { - label_selector { - match_labels = { - app = "drone-logbook" - } - } - topology_key = "kubernetes.io/hostname" - } - } - } - container { - name = "drone-logbook-backup" - image = "docker.io/library/alpine" - command = ["/bin/sh", "-c", <<-EOT - set -euxo pipefail - _t0=$(date +%s) - now=$(date +"%Y_%m_%d_%H_%M") - mkdir -p /backup/$now - cp -a /data/. 
/backup/$now/ - # Rotate — 30 day retention - find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} + - _dur=$(($(date +%s) - _t0)) - _out_bytes=$(du -sb /backup/$now | awk '{print $1}') - wget -qO- --post-data "backup_duration_seconds $${_dur} - backup_output_bytes $${_out_bytes} - backup_last_success_timestamp $(date +%s) - " "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true - EOT - ] - volume_mount { - name = "data" - mount_path = "/data" - read_only = true - } - volume_mount { - name = "backup" - mount_path = "/backup" - } - } - volume { - name = "data" - persistent_volume_claim { - claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name - } - } - volume { - name = "backup" - persistent_volume_claim { - claim_name = module.nfs_backup.claim_name - } - } - dns_config { - option { - name = "ndots" - value = "2" - } - } - } - } - } - } - } - lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] - } -} - -# https://dronelog.viktorbarzin.me -module "ingress" { - source = "../../modules/kubernetes/ingress_factory" - auth = "required" # Authentik forward-auth — flight logs are GPS traces of home/travel - dns_type = "proxied" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - name = "dronelog" - service_name = "drone-logbook" - tls_secret_name = var.tls_secret_name - extra_annotations = { - "gethomepage.dev/enabled" = "true" - "gethomepage.dev/name" = "Drone Logbook" - "gethomepage.dev/description" = "DJI flight log analyzer" - "gethomepage.dev/icon" = "mdi-quadcopter" - "gethomepage.dev/group" = "Media & Entertainment" - "gethomepage.dev/pod-selector" = "" - } -} -``` - -- [ ] **Step 3.3: Boilerplate** - -```bash -ln -s ../../secrets ~/code/infra/.worktrees/drone-logbook/stacks/drone-logbook/secrets -``` - -- [ ] **Step 3.4: Format check** - 
-```bash -terraform fmt -check -diff $WT/stacks/drone-logbook/ || terraform fmt $WT/stacks/drone-logbook/ -``` - -Expected: no diff (or auto-fixed). - -- [ ] **Step 3.5: Commit on the branch (files by name, git-crypt filter flags per execution.md)** - -```bash -git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \ - add docs/plans/2026-07-04-drone-logbook-design.md docs/plans/2026-07-04-drone-logbook-plan.md \ - stacks/drone-logbook/main.tf stacks/drone-logbook/terragrunt.hcl stacks/drone-logbook/secrets \ - .claude/reference/service-catalog.md -git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false \ - commit -m "drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me - -Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro -(fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog). -Upstream ghcr image with Keel auto-upgrade, DuckDB data on proxmox-lvm PVC, -NFS /sync-logs drop folder auto-imported every 8h, Authentik-gated ingress, -PROFILE_CREATION_PASS from Vault via ESO. Design + plan in docs/plans/." -``` - -### Task 4: Land and apply - -- [ ] **Step 4.1: Presence claim** (CI apply mutates shared infra) - -```bash -~/code/scripts/presence claim infra:drone-logbook --purpose "deploy new drone-logbook stack (Open DroneLog) via CI apply" -``` - -- [ ] **Step 4.2: Merge latest master into the branch, push to master** - -```bash -git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false fetch forgejo -git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false merge forgejo/master -git -C $WT -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master -``` - -Non-fast-forward → another agent landed first: fetch, merge, push again. 
Branch-protection rejection → fall back to PR via Forgejo API (token = password in `~/.git-credentials`). - -- [ ] **Step 4.3: Watch the CI apply to completion** — Woodpecker pipeline on the infra repo (`ci.viktorbarzin.me`), then confirm live: - -```bash -kubectl get ns drone-logbook && kubectl -n drone-logbook get deploy,pvc,pods,externalsecret,cronjob -kubectl -n drone-logbook rollout status deploy/drone-logbook --timeout=300s -``` - -Expected: namespace present, ExternalSecret `SecretSynced`, data PVC `Bound` (the NFS PVCs bind on first pod/job use), CronJob `drone-logbook-backup` scheduled `30 1 * * *`, pod `Running 1/1`. - -- [ ] **Step 4.4: Cleanup worktree + branch; release presence** - -```bash -git -C ~/code/infra worktree remove .worktrees/drone-logbook -git -C ~/code/infra branch -d wizard/drone-logbook -git -C ~/code/infra pull --ff-only # only if main checkout clean/quiescent -~/code/scripts/presence release infra:drone-logbook -``` - -### Task 5: End-to-end verification - -- [ ] **Step 5.1: Ingress + Authentik gate** - -```bash -curl -sI https://dronelog.viktorbarzin.me | head -5 -``` - -Expected: `302` redirect into Authentik (NOT `200`, NOT `404`). - -- [ ] **Step 5.2: App alive behind the gate** (bypass ingress via port-forward, read-only debug) - -```bash -kubectl -n drone-logbook port-forward svc/drone-logbook 18080:80 & -sleep 2 && curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:18080/ && kill %1 -``` - -Expected: `200`. - -- [ ] **Step 5.3: Sync folder visible in-pod** - -```bash -kubectl -n drone-logbook exec deploy/drone-logbook -- ls -ld /sync-logs /data/drone-logbook -``` - -Expected: both directories listed; `/sync-logs` read-only mount. - -- [ ] **Step 5.4: Monitor + homepage** — Uptime Kuma external monitor for `dronelog.viktorbarzin.me` auto-created (ingress annotation); homepage tile under "Media & Entertainment". 
- -- [ ] **Step 5.5: Functional import** — Viktor uploads a real Mini 4 Pro `.txt` log via the web UI (or drops it in `/srv/nfs/drone-logbook/sync-logs`); confirms flight appears with charts/map. Requires pod egress to DJI once per new log (decryption key). If an upstream sample log is available, the agent may pre-verify import via the REST API through the port-forward. diff --git a/docs/plans/2026-07-04-immich-frame-lan-only-design.md b/docs/plans/2026-07-04-immich-frame-lan-only-design.md deleted file mode 100644 index 199316cf..00000000 --- a/docs/plans/2026-07-04-immich-frame-lan-only-design.md +++ /dev/null @@ -1,125 +0,0 @@ -# immich-frame: LAN-only access, Portals untouched (2026-07-04) - -## Goal - -Strangers must no longer be able to view `highlights-immich.viktorbarzin.me` -(Viktor's London Portal Plus frame) or `highlights-immich-emo.viktorbarzin.me` -(Emo's Sofia Portal Mini frame) — pages or ImmichFrame API. Both were -`auth = "none"`, Cloudflare-proxied, fully public. - -Who keeps access (per Viktor, this session): the two Portals plus **any -household device on the Sofia, London, or Valchedrym home networks**. No -public access, no tailnet requirement. Hard constraint: the Portal app is a -WebView with the URL **baked in at APK build time** (`portal-immich-frame`, -`-PframeUrl`), so the exact URLs must keep loading from where the Portals sit -— zero app rebuilds, zero device touches, zero router changes. - -## Design - -Two cooperating pieces — the gate and the reachability pointer: - -1. 
**The gate — `home-lans-only` Traefik middleware** (traefik stack, next to - `local-only`): `ipAllowList` of `192.168.1.0/24` (Sofia LAN), `10.0.0.0/8` - (VLANs, K8s pods `10.10.0.0/16`, services `10.96.0.0/12`, WG tunnel - `10.3.2.0/24`), `192.168.8.0/24` (London LAN), `192.168.9.0/24` (London - GUEST net — post-rollout discovery: the Portal Plus actually leases here, - `Portal-75AE8F9C2A8A` = `192.168.9.198`, added same day), `192.168.0.0/24` - (Valchedrym LAN), `fc00::/7`, `fe80::/10`. Attached to both frame - ingresses via `extra_middlewares`. Everyone else gets a Traefik 403 — - including direct-to-WAN-IP requests carrying the right SNI, which DNS - changes alone cannot stop. A **separate** middleware rather than a widened - `local-only`, because widening would silently grant the remote LANs access - to the 9 admin surfaces using it (Prometheus, iDRAC, Loki, …). - -2. **The pointer — `dns_type = "internal"`** (new `ingress_factory` tier, - Viktor's idea): a **non-proxied public A record → `10.0.20.203`** (module - var `internal_lb_ip`). Outsiders resolve it but get an unroutable RFC1918 - address; every household resolver path delivers a working answer with no - config anywhere: Sofia LAN already gets the internal CNAME from Technitium, - London/Valchedrym resolve the public record via any upstream and - policy-route `10.0.0.0/8` down the WireGuard tunnel. IPv4-only (spokes - route no internal v6 range). - -Interlock (the reason both flip together): with a *proxied* record, public -traffic arrives from cloudflared **pod IPs inside 10/8** and would sail -through the allowlist. `internal` removes the Cloudflare path entirely (CF -edge stops serving the hostname), so every request reaches Traefik with its -real source IP (ETP=Local). Verified: no wildcard `*.viktorbarzin.me` record -exists to resurrect public resolution. 
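The interlock described above can be illustrated with a small sketch of the IPv4 half of the allowlist decision. This is not the Traefik implementation: the CIDR list is copied from the design text, while the helper names (`ip_to_int`, `in_cidr`, `gate`) and the sample addresses are assumptions made for illustration. It shows why a request arriving from a pod address inside `10.0.0.0/8` would pass the gate, which is the reason the allowlist must not be combined with a proxied record.

```shell
#!/bin/sh
# Sketch only: emulate the IPv4 part of an ipAllowList decision.
# CIDRs are taken from the design text; all function names are hypothetical.

ALLOWLIST="192.168.1.0/24 10.0.0.0/8 192.168.8.0/24 192.168.9.0/24 192.168.0.0/24"

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into four positional parameters
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed when address $1 falls inside CIDR $2.
in_cidr() {
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Print "allow" when the source IP matches any allowlisted range, else "403".
gate() {
  for cidr in $ALLOWLIST; do
    if in_cidr "$1" "$cidr"; then
      echo allow
      return 0
    fi
  done
  echo 403
}

gate 192.168.9.198   # the London guest-net Portal address quoted in the design
gate 10.10.4.7       # illustrative pod IP inside 10.10.0.0/16
gate 203.0.113.9     # illustrative outside address (documentation range)
```

The second call is the interesting one: the gate only sees a source address, so anything that presents a 10/8 address is indistinguishable from an internal client.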
- -`auth` stays `"none"` — there is still no *user* auth by design (kiosk -WebView; forward-auth would 302 the device to a login it can't complete, and -emo's Google-only account can't log in inside a WebView at all); the -convention comment now names the ipAllowList as the gate. - -### Resulting flows - -| Client | Path | Result | -|---|---|---| -| Emo's Portal Mini (Sofia LAN) | Technitium CNAME → `.203` direct (unchanged) | allowed (`192.168.1.x`) | -| Viktor's Portal Plus (London GUEST net) | public A → `10.0.20.203` → WG tunnel | allowed (`192.168.9.x`) | -| Household browsers (any of the 3 LANs) | same as above | allowed | -| In-cluster checks (`homelab browser`, blackbox) | CoreDNS → Technitium → `.203` | allowed (pod IP in 10/8) | -| Stranger, resolves hostname | gets `10.0.20.203` | unroutable | -| Stranger, hits WAN IP with SNI | pfSense NAT → Traefik (real source IP) | **403** | -| Stranger, via Cloudflare | no proxied record | CF edge won't serve the host | - -### Rejected alternatives - -- **ImmichFrame `AuthenticationSecret`** (supported upstream: web input field - or `?authsecret=` param + bearer API): real auth from anywhere, but family - browsers would face a secret prompt (fails "household devices just work"), - the secret leaks into URLs/analytics/APK, and robust rollout needs APK - rebuild + USB-adb sideload on both Portals (the Sofia one is high-friction). -- **Authentik forward-auth / `auth = "public"`**: WebView can't complete SSO - (Google blocks WebView logins; session expiry silently bricks an appliance); - the anonymous outpost is an audit trail, not a gate. -- **Remove DNS + London router AdGuardHome rewrites**: works, but adds an - out-of-band, un-IaC'd router dependency the internal-IP record makes - unnecessary. Kept as documented fallback if resolver-side private-IP - filtering ever appears in the London path. 
- -## Pre-verified facts (2026-07-04) - -- London Flint 2 DNS chain returns RFC1918 answers unfiltered - (`nslookup 10.0.20.203.nip.io 127.0.0.1` on the router → `10.0.20.203`; - dnsmasq `rebind_protection '0'`, no AdGuardHome rebind filtering). -- Technitium already CNAMEs both hostnames → apex → `10.0.20.203` - (`technitium-ingress-dns-sync` is ingress-driven, not DNS-record-driven, so - the internal answer survives the Cloudflare record swap). -- Pod CIDR `10.10.0.0/16`, service CIDR `10.96.0.0/12` — inside `10.0.0.0/8`. -- No public wildcard record in the zone. - -## Blast radius & cleanups - -- `external_monitor = false` set explicitly on both ingresses: the - external-monitor-sync default opt-in would otherwise keep the now-doomed - `[External] highlights-immich*` uptime-kuma monitors alive and red. Verify - the sync drops them post-apply. -- rybbit CF worker: `highlights-immich` removed from `SITE_IDS` (`index.js`) - and `wrangler.toml` routes — off Cloudflare the route can never fire. - Requires a `wrangler deploy` to take effect (route removal is hygiene, not - functional). -- Homepage dashboard link keeps working from LANs (hostname unchanged). -- Docs updated in the same change: `.claude/CLAUDE.md` (DNS tier + - external-monitor mechanism), `AGENTS.md`, `docs/architecture/networking.md` - (Internal-IP domains category). The `portal-immich-frame` repo's glossary - ("public, login-less URL") updated separately in that repo. - -## Failure-mode delta - -London frame now depends on the WG tunnel instead of Cloudflare+cloudflared -(the app self-heals with 5s retries; tunnel-flap modes documented in -`docs/architecture/vpn.md`). A Traefik LB renumber must update -`internal_lb_ip` in the module alongside the split-horizon apex record. -Cutover window: cached proxied answers keep working ≤ ~5 min TTL, then the -WebView's own retry picks up the new path. 
- -## Verification & rollback - -Verify: public dig → `10.0.20.203` (both hosts); Technitium dig → `.203`; -curl from devvm (10/8) → 200; external vantage (WebFetch/cloud) → unreachable -or 403; middleware attached on both ingresses; Emo's frame renders via -`homelab browser`; London Portal image fetches visible in Traefik access logs -from `192.168.8.x`. Rollback: `git revert` + apply traefik/immich — records -and middleware chain restore (`allow_overwrite = true` re-adopts the records). diff --git a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md index 27a4484a..664869fa 100644 --- a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md +++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md @@ -129,40 +129,3 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites. storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the correct pairing. A famous tool that "does OOM" still has to be proven to fire under *your* configuration. - -## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed - -The soft-cap layer of this design was falsified in production on 2026-07-02 -(~15:42–16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide -alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside -t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With -`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked -every allocating task of the cgroup in `mem_cgroup_handle_over_high` -(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`) -— including the `t3 serve` event loop (~0.5G RSS, pure collateral). 
The accept -queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104] -Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`, -and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by -hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G→1.4G -and the service recovered in seconds with no restart). - -The Verification bullet above — a soft-capped balloon "throttled to a crawl, -making no progress and **harming nothing**" — holds only when the hog is alone -in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl -IS the harm: a hog that stabilises below `MemoryMax` never triggers the local -OOM the design counted on, so the band converts "runaway dies" into "everyone -in the cgroup stalls forever". - -**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work -cgroup definitions** — `scripts/t3-serve@.service`, the `user-.slice.d` -drop-in, and `docker.slice` (`setup-devvm.sh` §10a/§10c). A runaway now runs -unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately -(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills -the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom — the layers -the stress tests actually validated — are unchanged. Applied live via -`daemon-reload` + runtime `set-property` on the running cgroups; no session -restarts. - -Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max` — it is -an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill -beats throttle-and-pray for multi-tenant interactive services. 
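The signature reported in this addendum (a large `high` count, `oom_kill=0`, and sustained `full` memory pressure) can be checked mechanically. The following is a minimal sketch, assuming the cgroup v2 `memory.events` and `memory.pressure` file formats; the function names, the 50% threshold, and the example path are assumptions for illustration, not part of the tooling described here.

```shell
#!/bin/sh
# Sketch only: flag a cgroup that is being throttled in the high band
# without ever being OOM-killed. Threshold and names are hypothetical.

# Print the counter named $2 from a memory.events-style file $1.
events_field() {
  awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Print the integer part of "full avg60" from a memory.pressure-style file $1.
pressure_full_avg60() {
  awk '$1 == "full" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg60=/) { sub(/^avg60=/, "", $i); print int($i) }
  }' "$1"
}

# Succeed when throttle events were recorded, nothing was OOM-killed,
# and full memory pressure over the last 60s is at or above 50%.
looks_livelocked() {
  high=$(events_field "$1" high)
  kills=$(events_field "$1" oom_kill)
  full=$(pressure_full_avg60 "$2")
  [ "${high:-0}" -gt 0 ] && [ "${kills:-0}" -eq 0 ] && [ "${full:-0}" -ge 50 ]
}

# Usage against a live unit (the path is a placeholder for the local layout):
#   cg=/sys/fs/cgroup/path/to/unit.service
#   looks_livelocked "$cg/memory.events" "$cg/memory.pressure" && echo "stalled in high band"
```

With `MemoryHigh=infinity` the `high` counter should stay flat, so a check like this would mainly serve to confirm that no throttle band has been reintroduced.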
diff --git a/docs/runbooks/paperless-mail-ingest.md b/docs/runbooks/paperless-mail-ingest.md deleted file mode 100644 index 50c404be..00000000 --- a/docs/runbooks/paperless-mail-ingest.md +++ /dev/null @@ -1,135 +0,0 @@ -# Paperless-ngx Mail Ingest (docs@viktorbarzin.me) - -Last updated: 2026-07-03 (initial build) - -Forward any email with document attachments to **`docs@viktorbarzin.me`** and -paperless-ngx ingests the attachments, owned by the paperless account mapped -from the **sender** (From) address. Built entirely from existing parts: a -docker-mailserver mailbox + Dovecot sieve, and paperless-ngx's native mail -consumer (the same machinery as the `utility:` rules). - -## Flow - -``` -family member forwards email ──> MX ──> docker-mailserver - │ postfix virtual: docs@ has an explicit self-alias (extra/aliases.txt), - │ so the @domain catch-all (→ spam@, swept by TripIt) does NOT apply - ▼ -Dovecot LMTP delivery to docs@ - │ per-user sieve (docs@viktorbarzin.me.dovecot.sieve): sender NOT in - │ allowlist → discard (decision 2026-07-03: unmatched = ignore & delete) - ▼ -docs@ INBOX ── paperless-ngx mail task (every 10 min, PAPERLESS_EMAIL_TASK_CRON - │ default) applies mail rules in order: filter_from = - │ → consume attachments (ALL parts incl. 
inline — see design - │ notes: Apple Mail marks real PDFs inline), owner = mapped user, - │ tag = email-ingest, title = mail subject - ▼ -consumed mail is MOVED to the "Processed" IMAP folder (audit trail); -INBOX stays empty in steady state -``` - -## Sender → paperless account map (as built) - -| Sender (From) | Paperless user | Rule | -|--------------------------|----------------|-----------------| -| me@viktorbarzin.me | root (id 3) | forward: Viktor (me@) | -| vbarzin@gmail.com | root (id 3) | forward: Viktor (gmail) | -| viktorbarzin@meta.com | root (id 3) | forward: Viktor (meta) | -| ancaelena98@gmail.com | anca (id 4) | forward: Anca | -| emil.barzin@gmail.com | emo (id 7) | forward: Emo | - -The map lives in **two places by design** — keep them in sync: - -1. **Delivery gate (infra, Terraform):** - `stacks/mailserver/modules/mailserver/extra/docs-at-viktorbarzin.me.dovecot.sieve` - — senders not listed here are discarded at delivery (spam control + the - "ignore and delete unmatched" behaviour; paperless cannot express - "delete without ingesting", so this must happen before the mailbox). -2. **Owner map (paperless DB, via API/UI):** one mail rule per sender on the - `docs@viktorbarzin.me` mail account. DB-state like workflows — NOT - Terraform. - -## Add a family member / sender - -1. Add the address to the sieve allowlist file above; commit; apply the - `mailserver` stack (normal apply is enough — the sieve CM key is not under - `ignore_changes`; Reloader restarts the pod). -2. Clone an existing `forward:` mail rule in the paperless admin UI - (Mail → Rules) or via API, changing `filter_from` and the rule **owner** - (documents are owned by the rule owner — `assign_owner_from_rule=true`). - Keep: action = Move to `Processed`, attachment type = **process all files - including inline** (`attachment_type=2` — NOT attachments-only, see design - notes), consumption scope = attachments only, tag `email-ingest`, order - after the existing rules. 
- -## Operations - -- **Trigger a fetch immediately** (instead of waiting ≤10 min): - `kubectl -n paperless-ngx exec deploy/paperless-ngx -c paperless-ngx -- s6-setuidgid paperless python3 manage.py mail_fetcher` - The `s6-setuidgid paperless` is **required**: `kubectl exec` runs as root, and a - root-run fetcher downloads attachments root-owned into the scratch dir, which - the celery consumer (uid 1000) then can't read — `PermissionError` on - `/tmp/paperless/paperless-mail-*/...`, consume task FAILURE (hit during the - 2026-07-03 build E2E). The mail correctly stays in INBOX for retry (the move - action is a chord callback on successful consumption). Recover: `rm -rf - /tmp/paperless/paperless-mail-*` (as root) and let the next scheduled fetch - re-process. -- **Mailbox credentials:** Vault `secret/platform` → `mailserver_accounts` - JSON, key `docs@viktorbarzin.me` (also used by the paperless mail account). -- **Inspect the mailbox:** - `python3 -c` IMAP to `mailserver.mailserver.svc.cluster.local:993` (in-cluster, - from a pod) or `mail.viktorbarzin.me:993` (externally / devvm). -- **Paperless-side logs:** `kubectl -n paperless-ngx logs deploy/paperless-ngx | grep -i mail` - (also Loki, ns `paperless-ngx`). Rule/account state: `GET /api/mail_rules/`, - `GET /api/mail_accounts/` with the admin token - (k8s secret `paperless-ngx-secrets`, field `api_token`). -- **Account/mailbox provisioning:** adding/rotating anything in - `mailserver_accounts` requires the ConfigMap replace workaround — - `scripts/tg apply mailserver -- -replace=module.mailserver.kubernetes_config_map.mailserver_config` - — because `postfix-accounts.cf` is under `ignore_changes` - (non-deterministic bcrypt; see the module comment). 
- -## Design notes / caveats - -- **Why not the catch-all?** Mail to unknown `@viktorbarzin.me` addresses - lands in `spam@`, which the TripIt `ingest-plans` CronJob sweeps every - 15 min: it marks everything `\Seen`, LLM-parses mail from linked senders and - replies with ack/failure emails. Forwarded bank statements would get - "couldn't parse a trip" replies. `docs@` being a real mailbox bypasses that - path entirely; TripIt, the `smoke-test@` roundtrip probe, and `dmarc@` are - untouched. -- **Spoofing:** the sender match is on the From header. Rspamd verifies - SPF/DKIM/DMARC on inbound mail, but gmail.com publishes `p=none`, so a - crafted spoof could ingest documents into a family member's account. Accepted - risk (worst case: unwanted documents appear, visible + deletable in - paperless). -- **Not PDF-only:** any attachment type paperless supports is consumed - (PDF, images, Office via the existing tika+gotenberg pipeline). -- **Inline attachments ARE processed (`attachment_type=2`, flipped - 2026-07-03):** the rules originally used attachments-only (1) to skip - signature logos, but the very first real forward (Apple Mail, Viktor's - client) attached the invoice PDF with `Content-Disposition: inline` — - paperless matched the rule, consumed nothing, and recorded - `PROCESSED_WO_CONSUMPTION` (which, like any ProcessedMail row, blocks that - UID from ever being re-processed — delete the row via `manage.py shell` to - retry). Trade-off: signature/inline images in forwards may be ingested as - junk docs (tagged `email-ingest`, easy to spot). If that gets noisy, add - `filter_attachment_filename_exclude` patterns to the rules using the - actually-observed junk filenames — do NOT flip back to attachments-only. -- **No dedicated alerting** (deliberate, 2026-07-03): mail-task errors surface - in paperless logs; the mailserver inbound path is covered by - `email-roundtrip-monitor`. Revisit if forwards start silently failing. 
-- **Workflows:** the global `payslip-webhook` + `claude-mcp-readers - auto-permission` workflows fire for mail-ingested docs like any other - consumption source (verified pre-build; payslip receiver does its own - filtering). - -## Rollback - -1. Disable/delete the 5 `forward:` mail rules + the `docs@` mail account - (paperless admin UI or API). -2. Revert the infra commit (aliases.txt entry, sieve file, CM key + mount). -3. Remove `docs@viktorbarzin.me` from Vault `mailserver_accounts`, then apply - with the `-replace` workaround above. Mail to docs@ then falls back to the - catch-all (spam@) like any unknown address. diff --git a/docs/runbooks/t3-drop-attribution.md b/docs/runbooks/t3-drop-attribution.md index e05f163b..df4cef09 100644 --- a/docs/runbooks/t3-drop-attribution.md +++ b/docs/runbooks/t3-drop-attribution.md @@ -109,17 +109,10 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m]) node_memory_SwapFree_bytes{instance="devvm"} ``` -Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`): -per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and -`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog -plateauing between high and max never OOMs and the kernel high-throttle stalls -the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on -2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch -`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`, -`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable). -A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling -the WS server with it. Post-mortem addendum: -`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`. 
+Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit +`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` — +a runaway agent now OOMs alone inside the cgroup instead of taking the box +(and the WS server) with it. ## 4. Known root causes (2026-06-10 investigation) diff --git a/docs/runbooks/valia-sites.md b/docs/runbooks/valia-sites.md deleted file mode 100644 index ee10a866..00000000 --- a/docs/runbooks/valia-sites.md +++ /dev/null @@ -1,98 +0,0 @@ -# Valia sites — add / update / retire - -Off-infra static sites authored by Valia (ADR-0018, CONTEXT.md "Valia site"). -Serving: Cloudflare Pages. Freshness: the `valia-sites-sync` CronJob -(`valia-sites` ns) mirrors each Content folder every 10 minutes and deploys -only when the folder's manifest hash changed. Registry: `local.sites` in -`stacks/valia-sites/main.tf` — one entry per site drives everything (Pages -project, custom domain, public CNAME, internal split-horizon CNAME, sync). - -Current sites: `bridge` (ОбУ „Отец Паисий“ — "мост"), `stem95su` (95. СУ STEM -board). - -## Add a site - -1. Valia shares the Drive folder with **vbarzin@gmail.com** (viewer is enough — - the pipeline is strictly read-only towards Drive). -2. Get the folder id from its URL (`drive.google.com/drive/folders/`). -3. Pick the **English** subdomain name (Viktor's call — CONTEXT.md naming rule). -4. Add one entry to `local.sites` in `stacks/valia-sites/main.tf`: - - ```hcl - = { - folder_id = "" - src_path = "" # or "sub/folder" if servable files live deeper - entry_file = "index.html" # or whatever her main HTML file is called - manage_dns = true - } - ``` - -5. Commit + push; CI applies. Within ~10 min the sync deploys content and the - site serves at `https://.viktorbarzin.me` (custom-domain TLS takes - ~5–10 min extra on first attach — CF returns 522 for the hostname until - then). 
Internal LAN/VLAN/pod resolution appears when the hourly - `technitium-ingress-dns-sync` next runs — trigger it early with: - `kubectl create job --from=cronjob/technitium-ingress-dns-sync valia-dns-now -n technitium` - -## Content rules (what Valia's folder must look like) - -- The **entry file** must exist — the sync stages a copy as `index.html` at - deploy time, so `/` works; the original filename keeps working too (deep - links survive). If the folder is empty or the entry file is missing, the - sync **skips the site and leaves it as-is** (never wipes a live site). -- Google-native files (Docs/Sheets) are **ignored** (`--drive-skip-gdocs`) — - only real files (`.html`, images, …) deploy. Gemini's HTML exports are fine. -- Per-file limit 25 MB (Cloudflare Pages), 20k files max — far beyond a - 1-page site. - -## Update a site - -Nothing to do: Valia edits the folder, the site follows within ~10 minutes. -Force it early: `kubectl create job --from=cronjob/valia-sites-sync sync-now -n valia-sites` - -## Rename / retire a site - -Rename = retire + add (Pages projects can't be renamed). Retire: - -1. Delete the entry from `local.sites`; commit + push. TF destroys the public - CNAME + custom domain + Pages project; the internal record is removed by - the next `technitium-ingress-dns-sync` run (its deletion pass drops any - internal `*.pages.dev` CNAME that left the `valia-sites-dns` ConfigMap — - scoped so it can never touch non-Pages records). -2. That's all — no manual DNS cleanup (the pre-ADR-0018 add-only gotcha is - fixed by the deletion pass). - -## Failure modes / debugging - -- **Visibility is failed-Job-only by choice** (ADR-0018): no alerts, no - notifications. Check: `kubectl get jobs -n valia-sites | tail`, logs of the - last `valia-sites-sync-*` pod. -- **Drive auth broken** (`FATAL … Drive list failed`): the shared - `secret/valia-sites.rclone_conf` token died. 
The GCP OAuth app - (`home-lab-1700868541205`) must stay published to "Production" or refresh - tokens expire weekly (same constraint as the old stem95su conf, which this - one was copied from). Re-mint and `vault kv patch secret/valia-sites - rclone_conf=@…`. -- **Wrangler auth broken**: `secret/valia-sites.cloudflare_pages_token` is a - SCOPED token (Pages Read+Write on the account, id - `355d2c9d11579bdad1e9498dafca30d5`) — re-mint via - `POST /user/tokens` with the Global API Key (`secret/platform`), patch - Vault. Do NOT put the Global API Key in the pod. -- **Site serves stale content**: check the state CM - (`kubectl get cm valia-sites-state -n valia-sites -o yaml`) — deleting a - site's key forces a redeploy on the next run. -- **`GUARD … skipping`** in logs: Valia's folder is empty or renamed the - entry file — the site deliberately kept its last content. Fix the folder or - update `entry_file`. - -## History - -- stem95su served in-cluster (nginx + NFS + its own rclone CronJob) until - 2026-07-03, when it was cut over to this pattern and the old stack retired - (ADR-0018). The blocking 42.9 MB `stem_video.mp4` was compressed to 21.4 MB - (same 1080p, ~2.5 Mbps H.264) and replaced in Valia's folder with Viktor's - explicit one-time OK. `secret/stem95su` is superseded by - `secret/valia-sites`; `/srv/nfs/stem-site` on the PVE host is a harmless - leftover. -- bridge started as a hand-deployed wrangler experiment (2026-07-03, memory - id 7085) and was adopted into the stack the same day. diff --git a/docs/runbooks/vault-token-renew-devvm.md b/docs/runbooks/vault-token-renew-devvm.md index 2ccddb8e..2dc4d35b 100644 --- a/docs/runbooks/vault-token-renew-devvm.md +++ b/docs/runbooks/vault-token-renew-devvm.md @@ -82,48 +82,33 @@ tail -5 ~/.local/state/vault-token-renew.log # recent results A healthy log line looks like: ` OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h). 
-After an OIDC login you'll instead see, at the next nightly run: -` HEALED: re-minted periodic token from foreign dn=oidc-… (revoked N stale periodic token(s))` -— that's the self-heal working as designed. - -## Drift guard & self-heal +## Drift guard & recovery `~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login` overwrites it. Two confirmed clobber vectors: 1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer - can't push past the OIDC role's 7-day `token_max_ttl`). The infra docs - prescribe this login before applies, so it recurs — it went unnoticed for - weeks twice (2026-06-18→26, 2026-06-29→07-03) and read as "Vault expires - weekly". + can't push past the OIDC role's 7-day `token_max_ttl`). 2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) → writes a read-only `kubernetes-woodpecker-default` token (can read Vault but - **cannot** write `secret/*`). Happened 2026-06-05, unnoticed for two days. + **cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for + two days — reads worked, writes silently 403'd. -Since 2026-07-03 the renewer **self-heals** -(`docs/plans/2026-07-03-vault-token-self-heal-design.md`). On a foreign token -it attempts the re-mint **with the clobbering token's own authority** and lets -Vault's authz decide: +To stop the renewer from silently keeping a foreign token alive, it runs a +**drift guard** first: it refuses to renew unless the token is +`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and +exits non-zero (the systemd unit goes `failed`) rather than renewing someone +else's token. 
Symptom in the log: -- **Admin-capable clobber (OIDC login)** → re-mints the periodic token, - sanity-checks it against the drift guard, atomically replaces - `~/.vault-token`, revokes stale `token-devvm-wizard` leftovers - (anti-sprawl), logs - `HEALED: re-minted periodic token from foreign dn=… (revoked N stale periodic token(s))` - and exits 0. The clobbering token is NOT revoked — it may still back a live - login session; it ages out on its own. -- **Weak clobber (read-only k8s token)** → the mint is denied; logs - `DRIFT: … heal denied, foreign token lacks create authority …; investigate what wrote it` - and exits non-zero (unit `failed`). Deliberately loud: this signals a - misbehaving agent flow — exactly the 2026-06-05 case. +` DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...` -**Manual recovery** is only needed for the weak-clobber case (the DRIFT log -line still contains the exact command) — run the -[mint/re-mint](#mint--re-mint-the-token) block. +**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the +[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does +**not** auto-recover (a deliberate scope choice — version-only, no self-heal); +recovery is the manual re-mint above. ## Tests -`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision, -the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber -case), and the self-heal's revoke filter (which stale periodic tokens a heal -may sweep). Run: `bash infra/scripts/test-vault-token-renew.sh`. +`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision +and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber +case). Run: `bash infra/scripts/test-vault-token-renew.sh`. 
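The drift-guard rule stated in this runbook (renew only when the token is `token-devvm-wizard` and carries `vault-admin`) can be sketched as a pure decision function. This is an illustration of the rule, not the renewer script itself: the function names, the comma-separated policy format, and the sample display names and policies in the calls are all assumptions.

```shell
#!/bin/sh
# Sketch only: the drift-guard decision in isolation.
# Inputs stand in for fields of a token lookup; names are hypothetical.

EXPECTED_DN="token-devvm-wizard"
REQUIRED_POLICY="vault-admin"

# Succeed only when the display name matches AND the policy list
# (comma-separated) contains the required policy as a whole entry.
drift_guard_ok() {
  dn=$1
  policies=$2
  [ "$dn" = "$EXPECTED_DN" ] || return 1
  case ",$policies," in
    *",$REQUIRED_POLICY,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the action the guard would take for a given token.
decide() {
  if drift_guard_ok "$1" "$2"; then
    echo "renew"
  else
    echo "DRIFT: dn=$1 policies=$2"
  fi
}

decide token-devvm-wizard "default,vault-admin"
decide kubernetes-woodpecker-default "default"      # made-up policy list
decide oidc-example "default,vault-admin"           # made-up display name
```

Matching on both fields is the point of the sketch: the third call carries the admin policy but is still treated as drift because the display name is foreign.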
diff --git a/modules/kubernetes/ingress_factory/main.tf b/modules/kubernetes/ingress_factory/main.tf index ddcc7105..fc9bc9f5 100644 --- a/modules/kubernetes/ingress_factory/main.tf +++ b/modules/kubernetes/ingress_factory/main.tf @@ -127,29 +127,20 @@ variable "anti_ai_scraping" { variable "dns_type" { type = string default = "none" - description = <<-EOT - Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to - public IP), 'internal' (A to the internal Traefik LB IP — resolvable from - any resolver but only ROUTABLE from home LANs / WG sites / VPN; the record - is a reachability pointer, NOT a gate: pair it with an ipAllowList via - extra_middlewares, e.g. traefik-home-lans-only@kubernetescrd, because - direct-to-WAN-IP requests with the right SNI still hit Traefik), or 'none'. - EOT + description = "Cloudflare DNS: 'proxied' (CNAME to tunnel), 'non-proxied' (A/AAAA to public IP), or 'none'" validation { - condition = contains(["proxied", "non-proxied", "internal", "none"], var.dns_type) - error_message = "dns_type must be 'proxied', 'non-proxied', 'internal', or 'none'." + condition = contains(["proxied", "non-proxied", "none"], var.dns_type) + error_message = "dns_type must be 'proxied', 'non-proxied', or 'none'." } } # Uptime Kuma external monitor: when true, annotate the ingress so the # external-monitor-sync CronJob creates a `[External] ` monitor pointing -# at https://. Null means "follow dns_type" — enabled when the ingress -# has a PUBLIC DNS record (proxied or non-proxied; 'internal' records are not -# externally reachable, so no external monitor). +# at https://. Null means "follow dns_type" — enabled when proxied. variable "external_monitor" { type = bool default = null - description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type is 'proxied' or 'non-proxied')." + description = "Enable Uptime Kuma external monitor. null = auto (enabled when dns_type == 'proxied')." 
} variable "external_monitor_name" { @@ -180,15 +171,6 @@ variable "public_ipv6" { default = "2001:470:6e:43d::2" } -# Internal Traefik LB IP used by dns_type = "internal" records. Tracks the -# dedicated MetalLB IP from stacks/traefik (ETP=Local). A future LB renumber -# must update this default alongside the split-horizon apex record — see -# docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*. -variable "internal_lb_ip" { - type = string - default = "10.0.20.203" -} - variable "homepage_group" { type = string default = null # auto-detect from namespace @@ -219,10 +201,8 @@ locals { ) # External monitor enabled by default when the ingress has a public DNS - # record (either CF-proxied or direct A/AAAA). 'internal' records resolve - # publicly but are unroutable from outside, so they get no external monitor. - # Explicit bool overrides. - effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type == "proxied" || var.dns_type == "non-proxied") + # record (either CF-proxied or direct A/AAAA). Explicit bool overrides. + effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type != "none") # Emit the annotation when effective is true (positive signal), or when the # caller explicitly set external_monitor=false (opt-out). When the caller @@ -444,19 +424,3 @@ resource "cloudflare_record" "non_proxied_aaaa" { zone_id = var.cloudflare_zone_id allow_overwrite = true } - -# 'internal': a publicly-resolvable A record carrying the INTERNAL Traefik LB -# IP. Outsiders resolve it but can't route to it; home-LAN/WG-site/VPN clients -# reach Traefik directly (the WG spokes policy-route 10.0.0.0/8 through the -# tunnel), so kiosk devices with baked-in URLs need no DNS overrides anywhere. -# IPv4-only on purpose: the spokes route no internal IPv6 range. -resource "cloudflare_record" "internal_a" { - count = var.dns_type == "internal" ? 
1 : 0 - name = local.dns_name - content = var.internal_lb_ip - proxied = false - ttl = 1 - type = "A" - zone_id = var.cloudflare_zone_id - allow_overwrite = true -} diff --git a/scripts/t3-serve@.service b/scripts/t3-serve@.service index 0ab84e74..7f3d765d 100644 --- a/scripts/t3-serve@.service +++ b/scripts/t3-serve@.service @@ -21,19 +21,12 @@ WorkingDirectory=/home/%i ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3 Restart=on-failure RestartSec=5 -# Memory containment (2026-06-10, amended 2026-07-02): agent children live in -# this cgroup; a runaway agent (10.8G anon on a 23G host) swap-thrashed the -# whole devvm — every >20s stall fires the t3 client watchdog (visible -# "disconnects") — then global-OOMed. Cap the cgroup so a runaway OOMs early -# and locally, and forbid swap so stalls can't smear into minutes-long freezes. -# MemoryHigh is DELIBERATELY infinity — do not add a soft band below MemoryMax: -# with swap=0 a hog that plateaus between high and max is unreclaimable but -# never OOMs, and the kernel's high-throttle stalls EVERY task in the cgroup -# (the t3 event loop included) indefinitely. A 12.3G agent ugrep livelocked -# this unit for ~50min on 2026-07-02 exactly this way. Straight-to-OOM at -# MemoryMax is the containment; OOMPolicy=continue below keeps the server up. -# See docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md addendum. -MemoryHigh=infinity +# Memory containment (2026-06-10): agent children live in this cgroup; a +# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm — +# every >20s stall fires the t3 client watchdog (visible "disconnects") — +# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally, +# and forbid swap so stalls can't smear into minutes-long freezes. 
+MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 # Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10 diff --git a/scripts/test-vault-token-renew.sh b/scripts/test-vault-token-renew.sh index 313ff362..d64d02b4 100644 --- a/scripts/test-vault-token-renew.sh +++ b/scripts/test-vault-token-renew.sh @@ -1,11 +1,10 @@ #!/usr/bin/env bash -# Unit tests for the pure functions in vault-token-renew.sh. -# Sources the script (vtr_main is guarded) and exercises (a) the drift-guard -# decision — is ~/.vault-token OUR periodic admin token (renew) or a foreign -# clobber (heal / fail loud)? — whose ABSENCE let the 2026-06-05 woodpecker -# clobber be silently renewed for two days, and (b) the self-heal's revoke -# filter — which stale token-devvm-wizard tokens a heal may sweep. -# Run: bash infra/scripts/test-vault-token-renew.sh +# Unit tests for the pure drift-guard functions in vault-token-renew.sh. +# Sources the script (vtr_main is guarded) and exercises the decision logic that +# decides whether ~/.vault-token is OUR periodic admin token (renew) or a foreign +# token that clobbered the file (refuse, fail loud). This is exactly the logic +# whose ABSENCE let the 2026-06-05 woodpecker-token clobber be silently renewed +# for two days. 
Run: bash infra/scripts/test-vault-token-renew.sh set -uo pipefail DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" # shellcheck source=/dev/null @@ -54,21 +53,5 @@ ok "ours: parse+decide renews" vtr_drift_ok "$(vtr_display_name "$LOOKUP_ no "woodpecker: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_WP")" "$(vtr_policies_csv "$LOOKUP_WP")" no "oidc: parse+decide refused" vtr_drift_ok "$(vtr_display_name "$LOOKUP_OIDC")" "$(vtr_policies_csv "$LOOKUP_OIDC")" -# --- vtr_accessor: parse accessor out of lookup JSON --- -LOOKUP_NEW='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-new","policies":["default","sops-admin","vault-admin"],"identity_policies":null}}' -eq "accessor parsed" "acc-new" "$(vtr_accessor "$LOOKUP_NEW")" -eq "accessor absent -> empty" "" "$(vtr_accessor '{"data":{"display_name":"x"}}')" - -# --- vtr_is_stale_periodic: the heal's revoke filter — ONLY old token-devvm-wizard -# --- tokens are swept; the just-minted token, foreign tokens, and anything with an -# --- unknown accessor are kept. An empty keep-accessor sweeps NOTHING (fail-safe). 
-STALE_OURS='{"data":{"display_name":"token-devvm-wizard","accessor":"acc-old","policies":["default","sops-admin","vault-admin"]}}' -ok "older periodic token is stale" vtr_is_stale_periodic "$STALE_OURS" "acc-new" -no "the just-minted token is kept" vtr_is_stale_periodic "$LOOKUP_NEW" "acc-new" -no "foreign oidc token never swept" vtr_is_stale_periodic "$LOOKUP_OIDC" "acc-new" -no "woodpecker token never swept" vtr_is_stale_periodic "$LOOKUP_WP" "acc-new" -no "missing accessor never swept" vtr_is_stale_periodic '{"data":{"display_name":"token-devvm-wizard"}}' "acc-new" -no "empty keep-accessor sweeps nothing" vtr_is_stale_periodic "$STALE_OURS" "" - printf '\n%d passed, %d failed\n' "$pass" "$fail" (( fail == 0 )) diff --git a/scripts/vault-token-renew.sh b/scripts/vault-token-renew.sh index 42e78603..2d73c862 100644 --- a/scripts/vault-token-renew.sh +++ b/scripts/vault-token-renew.sh @@ -45,94 +45,6 @@ vtr_drift_ok() { printf ',%s,' "$pols" | grep -q ",$REQUIRED_POLICY," || return 1 } -# vtr_accessor -> the token accessor (empty if absent). -vtr_accessor() { - printf '%s' "$1" | jq -r '.data.accessor // ""' -} - -# vtr_is_stale_periodic -> 0 if this lookup -# describes one of OUR periodic tokens (display name matches) that is NOT the -# one to keep — i.e. a stale leftover a heal should revoke. 1 otherwise. -# Name-only on purpose (no policy check): anything named token-devvm-wizard -# that isn't the current token is garbage from a previous mint. An empty -# keep-accessor sweeps NOTHING (fail-safe: never revoke when we don't know -# which token is current). 
-vtr_is_stale_periodic() { - local dn acc - [ -n "${2:-}" ] || return 1 - dn=$(vtr_display_name "$1") - acc=$(vtr_accessor "$1") - [ "$dn" = "$EXPECTED_DN" ] || return 1 - [ -n "$acc" ] || return 1 - [ "$acc" != "$2" ] -} - -# vtr_heal -> 0 if ~/.vault-token was re-minted back to -# our periodic admin token using the foreign token's own authority, 1 if the -# heal was denied or failed (caller exits non-zero; the unit goes failed). -# -# Self-heal added 2026-07-03 (docs/plans/2026-07-03-vault-token-self-heal-design.md): -# an OIDC login — which the infra docs prescribe before applies — clobbers -# ~/.vault-token with a 7-day token, and detect-only drift left that unnoticed -# for weeks (the weekly-expiry loop). We ATTEMPT the re-mint with the -# clobbering token itself and let Vault's authz decide — a read-only clobber -# (the 2026-06-05 woodpecker incident) is denied the mint and stays a loud -# failure, because it signals a misbehaving flow that someone should look at. -vtr_heal() { - local foreign_dn="$1" log="$2" - local errf new_token new_info new_dn new_pols new_acc tmp - errf=$(mktemp) - if ! new_token=$(vault token create -orphan -period=768h \ - -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \ - -field=token 2>"$errf") || [ -z "$new_token" ]; then - printf '%s DRIFT: ~/.vault-token is dn=%q — heal denied, foreign token lacks create authority (%s); investigate what wrote it. Manual re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \ - "$(date -Is)" "$foreign_dn" "$(tr '\n' ' ' <"$errf")" >>"$log" - rm -f "$errf" - return 1 - fi - rm -f "$errf" - - # Sanity: the minted token must itself pass the drift guard before it may - # replace ~/.vault-token. - if ! 
new_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json 2>&1); then - printf '%s FAIL: heal minted a token but its lookup failed: %s\n' \ - "$(date -Is)" "$new_info" >>"$log" - return 1 - fi - new_dn=$(vtr_display_name "$new_info") - new_pols=$(vtr_policies_csv "$new_info") - if ! vtr_drift_ok "$new_dn" "$new_pols"; then - printf '%s FAIL: heal minted an unexpected token (dn=%q policies=%q) — not writing it\n' \ - "$(date -Is)" "$new_dn" "$new_pols" >>"$log" - return 1 - fi - - # Atomic replace: mktemp files are 0600 from birth; same-filesystem mv. - tmp=$(mktemp "$HOME/.vault-token.XXXXXX") - printf '%s' "$new_token" >"$tmp" - mv "$tmp" "$HOME/.vault-token" - - # Anti-sprawl: revoke previous token-devvm-wizard tokens — each heal would - # otherwise strand the prior periodic ADMIN token server-side for up to 32d. - # The clobbering foreign token is deliberately NOT revoked: it may still back - # the user's live login session, and it ages out on its own (7d for OIDC). - local sweep="accessor sweep skipped (list denied)" accessors a a_info revoked=0 - new_acc=$(vtr_accessor "$new_info") - if [ -n "$new_acc" ] && accessors=$(VAULT_TOKEN="$new_token" vault list -format=json auth/token/accessors 2>/dev/null); then - while IFS= read -r a; do - [ -n "$a" ] || continue - a_info=$(VAULT_TOKEN="$new_token" vault token lookup -format=json -accessor "$a" 2>/dev/null) || continue - if vtr_is_stale_periodic "$a_info" "$new_acc"; then - VAULT_TOKEN="$new_token" vault token revoke -accessor "$a" >/dev/null 2>&1 && revoked=$((revoked + 1)) - fi - done < <(printf '%s' "$accessors" | jq -r '.[]') - sweep="revoked $revoked stale periodic token(s)" - fi - - printf '%s HEALED: re-minted periodic token from foreign dn=%q (%s)\n' \ - "$(date -Is)" "$foreign_dn" "$sweep" >>"$log" -} - vtr_main() { set -euo pipefail export PATH="/usr/local/bin:/usr/bin:/bin:${PATH:-}" @@ -149,19 +61,16 @@ vtr_main() { dn=$(vtr_display_name "$info") pols=$(vtr_policies_csv "$info") - # Drift 
guard (2026-06-07) + self-heal (2026-07-03): the renewer must not - # keep a FOREIGN token alive (on 2026-06-05 a stray kubernetes login was - # silently renewed for two days, masking lost write access). But detect-only - # drift proved worse in practice: an OIDC login — which the infra docs - # prescribe before applies — clobbers this file too, and the resulting DRIFT - # failures went unnoticed for weeks while access degraded to a 7-day token - # (the weekly-expiry loop). On drift we now ATTEMPT to heal (see vtr_heal): - # re-mint the periodic token with the clobbering token's own authority. - # Vault's authz keeps the old guarantee — a token that couldn't legitimately - # hold vault-admin is denied the mint, and we still fail loud. + # Drift guard (added 2026-06-07): the renewer must NOT keep a FOREIGN token alive. + # On 2026-06-05 a stray `vault login -method=kubernetes` overwrote ~/.vault-token + # with a read-only woodpecker token, and this script then silently renewed THAT + # for two days — masking the loss of write access. So before renewing, confirm + # the token is our periodic admin token; if it has drifted, fail loudly (systemd + # marks the unit failed) instead of keeping someone else's token alive. if ! vtr_drift_ok "$dn" "$pols"; then - vtr_heal "$dn" "$log" || exit 1 - exit 0 + printf '%s DRIFT: ~/.vault-token is dn=%q policies=%q (expected dn=%q with %q). Refusing to renew a foreign token. Re-mint: vault login -method=oidc && vault token create -orphan -period=768h -policy=vault-admin -policy=sops-admin -display-name=devvm-wizard -field=token > ~/.vault-token && chmod 600 ~/.vault-token\n' \ + "$(date -Is)" "$dn" "$pols" "$EXPECTED_DN" "$REQUIRED_POLICY" >>"$log" + exit 1 fi # `vault token renew` with no argument renews the calling token (renew-self). 
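Note on `vtr_drift_ok` above: the context line `printf ',%s,' "$pols" | grep -q ",$REQUIRED_POLICY,"` wraps the policy CSV in commas before matching, presumably so that the required policy only matches as a whole entry. The sketch below contrasts that with a naive substring match. The function names and the `vault-admin-readonly` policy are made up for the demo, and `-F` (treat the pattern as a literal string) is an addition that the original line does not use.

```shell
#!/usr/bin/env bash
# Sketch: whole-entry matching in a comma-separated policy list.
# has_policy_naive / has_policy are hypothetical names, not from the script.

# Naive substring match: also succeeds on longer names that contain the needle.
has_policy_naive() {
  printf '%s' "$1" | grep -q "$2"
}

# Delimiter-wrapped match: ",a,b," contains ",b," only when b is a whole entry.
has_policy() {
  printf ',%s,' "$1" | grep -qF ",$2,"
}

pols="default,vault-admin-readonly"   # made-up policy list for the demo

if has_policy_naive "$pols" "vault-admin"; then echo "naive: match"; else echo "naive: no match"; fi
if has_policy "$pols" "vault-admin"; then echo "wrapped: match"; else echo "wrapped: no match"; fi
```

Under these assumptions the naive check accepts the token and the wrapped check refuses it, which is the behaviour a guard that gates on `vault-admin` needs.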
diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh index 3e05b8a0..02bd9257 100755 --- a/scripts/workstation/setup-devvm.sh +++ b/scripts/workstation/setup-devvm.sh @@ -244,15 +244,9 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us # virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22). # t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped # user-.slice (all ssh/tmux work). Design — per user, on BOTH trees: -# MemoryMax=16G hard + MemorySwapMax=0 (work never touches disk swap → no -# thrash; a runaway is cgroup-OOM-killed locally at the ceiling), plus -# fair-share CPU/IO weights. -# NO MemoryHigh soft band (removed 2026-07-02; was 12G "throttle to a crawl"): -# with swap=0, a hog that PLATEAUS between high and max is unreclaimable but -# never OOMs — the kernel parks every task of the cgroup in -# mem_cgroup_handle_over_high and the whole tree stalls indefinitely. A 12.3G -# agent ugrep livelocked t3-serve@wizard (t3 down ~50min) exactly this way. -# Cap-and-kill, never throttle-and-pray — see the post-mortem addendum. +# MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard, +# MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at +# the ceiling instead), plus fair-share CPU/IO weights. # BACKSTOP = earlyoom, NOT systemd-oomd. 
We first shipped systemd-oomd but it is # INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim # (pgscan rising), and a no-swap anon workload never reclaims — verified live, a @@ -266,16 +260,12 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us # 10a) per-user caps + fair-share weights on EVERY user-.slice (ssh/tmux) install -d -m 0755 /etc/systemd/system/user-.slice.d cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF' -# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22; -# MemoryHigh dropped 2026-07-02). Applies to EACH user-.slice = all of one -# user's ssh/tmux work. Mirrors the t3-serve@.service caps so a user is bounded -# in whichever surface they work in. MemoryHigh stays infinity: with swap=0 a -# hog plateauing in a high..max band livelocks the entire slice (every ssh/tmux -# session of that user) instead of dying — straight-to-OOM at MemoryMax is the -# containment (see post-mortem addendum 2026-07-02). +# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22). +# Applies to EACH user-.slice = all of one user's ssh/tmux work. Mirrors the +# t3-serve@.service caps so a user is bounded in whichever surface they work in. [Slice] MemoryAccounting=yes -MemoryHigh=infinity +MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 CPUAccounting=yes @@ -304,14 +294,12 @@ cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF' # All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so # they share one bounded budget and a runaway container is capped at MemoryMax # (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice. -# setup-devvm.sh §10, 2026-06-22; MemoryHigh dropped 2026-07-02 — a container -# plateauing in the high..max band would throttle-livelock EVERY container in -# the slice (see post-mortem addendum); MemoryMax OOM is the containment. +# setup-devvm.sh §10, 2026-06-22. 
[Unit] Description=Docker containers slice (capped) [Slice] MemoryAccounting=yes -MemoryHigh=infinity +MemoryHigh=6G MemoryMax=8G MemorySwapMax=0 CPUAccounting=yes diff --git a/secrets/nfs_directories.txt b/secrets/nfs_directories.txt index cc89391f..51e11aad 100644 Binary files a/secrets/nfs_directories.txt and b/secrets/nfs_directories.txt differ diff --git a/stacks/cloudflared/modules/cloudflared/cloudflare.tf b/stacks/cloudflared/modules/cloudflared/cloudflare.tf index 59e748ae..ad4d9de8 100644 --- a/stacks/cloudflared/modules/cloudflared/cloudflare.tf +++ b/stacks/cloudflared/modules/cloudflared/cloudflare.tf @@ -235,12 +235,6 @@ resource "cloudflare_record" "keyserver" { zone_id = var.cloudflare_zone_id } -# bridge.viktorbarzin.me (Cloudflare Pages, "мост" school site) moved to -# stacks/valia-sites (ADR-0018) — all Valia-site records live there now. -# State handoff was a manual `tg state rm` (2026-07-03): the CI terraform -# (<1.7) rejects removed{} blocks even at the stack root, so declarative -# forget wasn't available. valia-sites imported the live record by id. 
- # Enable HTTP/3 (QUIC) for Cloudflare-proxied domains resource "cloudflare_zone_settings_override" "http3" { zone_id = var.cloudflare_zone_id diff --git a/stacks/dawarich/main.tf b/stacks/dawarich/main.tf index 1d2d1f81..3eeb1540 100644 --- a/stacks/dawarich/main.tf +++ b/stacks/dawarich/main.tf @@ -16,7 +16,7 @@ resource "kubernetes_namespace" "dawarich" { name = "dawarich" labels = { "istio-injection" : "disabled" - tier = local.tiers.edge + tier = local.tiers.edge "keel.sh/enrolled" = "true" } } @@ -330,7 +330,7 @@ resource "kubernetes_deployment" "dawarich" { } lifecycle { ignore_changes = [ - spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates metadata[0].annotations["keel.sh/policy"], metadata[0].annotations["keel.sh/trigger"], @@ -458,13 +458,6 @@ module "ingress" { namespace = kubernetes_namespace.dawarich.metadata[0].name name = "dawarich" tls_secret_name = var.tls_secret_name - # Rails serves all its fingerprinted assets itself and the map view adds an - # API burst per page load — the default 10/50 limiter 429s the asset tail - # from a single client IP (and risks dropping OwnTracks/mobile ingestion - # POSTs on the same host). Dedicated 100/1000 limiter defined in - # stacks/traefik/modules/traefik/middleware.tf. 
- skip_default_rate_limit = true - extra_middlewares = ["traefik-dawarich-rate-limit@kubernetescrd"] extra_annotations = { "gethomepage.dev/enabled" = "true" "gethomepage.dev/name" = "Dawarich" diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf index 5f86110a..bd380fe1 100644 --- a/stacks/dbaas/modules/dbaas/main.tf +++ b/stacks/dbaas/modules/dbaas/main.tf @@ -1511,34 +1511,6 @@ resource "null_resource" "pg_instagram_poster_db" { } } -# Create tasks database for the tasks PWA (Reminders-style front-end over -# Nextcloud CalDAV; FastAPI + SvelteKit SPA — see ~/code/tasks). Stores -# Connected Accounts (Fernet-encrypted Nextcloud app passwords) + sync state. -# Role password is managed by Vault Database Secrets Engine (static role -# `pg-tasks`, 7d rotation). Tables are created by alembic on app startup. -resource "null_resource" "pg_tasks_db" { - depends_on = [null_resource.pg_cluster] - - triggers = { - db_name = "tasks" - username = "tasks" - } - - provisioner "local-exec" { - command = <<-EOT - PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}') - kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \ - bash -c ' - psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'tasks'"'"'" | grep -q 1 || \ - psql -U postgres -c "CREATE ROLE tasks WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'" - psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'tasks'"'"'" | grep -q 1 || \ - psql -U postgres -c "CREATE DATABASE tasks OWNER tasks" - psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE tasks TO tasks" - ' - EOT - } -} - # Old PostgreSQL deployment — kept commented for rollback reference # resource "kubernetes_deployment" "postgres" { # metadata { diff --git a/stacks/drone-logbook/main.tf b/stacks/drone-logbook/main.tf deleted file mode 100644 index e5f8b219..00000000 --- 
a/stacks/drone-logbook/main.tf +++ /dev/null @@ -1,360 +0,0 @@ -variable "tls_secret_name" { - type = string - sensitive = true -} -variable "nfs_server" { type = string } - -# Open DroneLog (https://github.com/arpanghosh8453/open-dronelog) — self-hosted -# DJI flight-log analyzer for the DJI Mini 4 Pro. Runs the UPSTREAM image (the -# ViktorBarzin/drone-logbook fork has no custom commits); Keel tracks :latest. -# Design: docs/plans/2026-07-04-drone-logbook-design.md -resource "kubernetes_namespace" "drone_logbook" { - metadata { - name = "drone-logbook" - labels = { - tier = local.tiers.aux - "keel.sh/enrolled" = "true" - } - } - lifecycle { - # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace - ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] - } -} - -resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } - manifest = { - apiVersion = "external-secrets.io/v1" - kind = "ExternalSecret" - metadata = { - name = "drone-logbook-secrets" - namespace = "drone-logbook" - } - spec = { - refreshInterval = "15m" - secretStoreRef = { - name = "vault-kv" - kind = "ClusterSecretStore" - } - target = { - name = "drone-logbook-secrets" - } - dataFrom = [{ - extract = { - key = "drone-logbook" - } - }] - } - } - depends_on = [kubernetes_namespace.drone_logbook] -} - -module "tls_secret" { - source = "../../modules/kubernetes/setup_tls_secret" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - tls_secret_name = var.tls_secret_name -} - -# DuckDB database + cached DJI decryption keys + uploaded originals. -# Embedded DB -> block storage, not NFS (same rationale as freshrss data). -# Encrypted class: flight logs are GPS traces of home/travel (sensitive data -# -> proxmox-lvm-encrypted per the storage decision rule in .claude/CLAUDE.md). 
-resource "kubernetes_persistent_volume_claim" "data" { - wait_until_bound = false - metadata { - name = "drone-logbook-data-encrypted" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - annotations = { - "resize.topolvm.io/threshold" = "10%" - "resize.topolvm.io/increase" = "100%" - "resize.topolvm.io/storage_limit" = "10Gi" - } - } - spec { - access_modes = ["ReadWriteOnce"] - storage_class_name = "proxmox-lvm-encrypted" - resources { - requests = { - storage = "2Gi" - } - } - } - lifecycle { - # The autoresizer expands requests.storage up to storage_limit and PVCs - # can't shrink; without this every apply tries to revert the size. - ignore_changes = [spec[0].resources[0].requests] - } -} - -# Drop folder: any producer (Nextcloud sync, scp, future phone pipeline) lands -# DJI .txt logs here over NFS; the app auto-imports on SYNC_INTERVAL. -module "nfs_sync_logs" { - source = "../../modules/kubernetes/nfs_volume" - name = "drone-logbook-sync-logs" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - nfs_server = var.nfs_server - nfs_path = "/srv/nfs/drone-logbook/sync-logs" - storage = "5Gi" -} - -resource "kubernetes_deployment" "drone_logbook" { - metadata { - name = "drone-logbook" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - labels = { - app = "drone-logbook" - "kubernetes.io/cluster-service" = "true" - tier = local.tiers.aux - } - } - spec { - replicas = 1 - strategy { - # DuckDB is single-writer; never overlap two pods on the same volume - type = "Recreate" - } - selector { - match_labels = { - app = "drone-logbook" - } - } - template { - metadata { - labels = { - app = "drone-logbook" - "kubernetes.io/cluster-service" = "true" - } - } - spec { - container { - name = "drone-logbook" - image = "ghcr.io/arpanghosh8453/open-dronelog:latest" - env { - name = "RUST_LOG" - value = "info" - } - env { - # keep re-importable originals under /data/drone-logbook/uploaded - name = "KEEP_UPLOADED_FILES" - value = 
"true" - } - env { - name = "SYNC_LOGS_PATH" - value = "/sync-logs" - } - env { - # 6-field cron (sec min hour dom mon dow): scan drop folder every 8h - name = "SYNC_INTERVAL" - value = "0 0 */8 * * *" - } - env { - name = "PROFILE_CREATION_PASS" - value_from { - secret_key_ref { - name = "drone-logbook-secrets" - key = "profile_creation_pass" - } - } - } - volume_mount { - name = "data" - mount_path = "/data/drone-logbook" - } - volume_mount { - name = "sync-logs" - mount_path = "/sync-logs" - read_only = true - } - port { - name = "http" - container_port = 80 - protocol = "TCP" - } - resources { - requests = { - cpu = "25m" - memory = "512Mi" - } - limits = { - memory = "512Mi" - } - } - } - volume { - name = "data" - persistent_volume_claim { - claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name - } - } - volume { - name = "sync-logs" - persistent_volume_claim { - claim_name = module.nfs_sync_logs.claim_name - } - } - } - } - } - depends_on = [kubernetes_manifest.external_secret] - lifecycle { - ignore_changes = [ - spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 - metadata[0].annotations["keel.sh/policy"], - metadata[0].annotations["keel.sh/trigger"], - metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 - metadata[0].annotations["keel.sh/match-tag"], - spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates - metadata[0].annotations["kubernetes.io/change-cause"], - metadata[0].annotations["deployment.kubernetes.io/revision"], - spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 - ] - } -} - -resource "kubernetes_service" "drone_logbook" { - metadata { - name = "drone-logbook" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - labels = { - "app" = "drone-logbook" - } - } - - spec { - selector = { - app = "drone-logbook" - } - port { - port = "80" - target_port = "80" - } - } -} - -# 
----------------------------------------------------------------------------- -# Backup — required for every proxmox-lvm(-encrypted) app: daily copy of the -# data volume to NFS /srv/nfs/drone-logbook-backup (picked up by nfs-mirror -> -# sda -> Synology offsite). 01:30 = outside the 00:00/08:00/16:00 sync-import -# windows, so the DuckDB file is quiescent; uploaded originals make even a -# mid-write copy recoverable by re-import. Pod-affinity co-schedules with the -# app pod (RWO volume mounts twice only on the same node). Vaultwarden pattern. -# ----------------------------------------------------------------------------- - -module "nfs_backup" { - source = "../../modules/kubernetes/nfs_volume" - name = "drone-logbook-backup-host" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - nfs_server = var.nfs_server - nfs_path = "/srv/nfs/drone-logbook-backup" -} - -resource "kubernetes_cron_job_v1" "backup" { - metadata { - name = "drone-logbook-backup" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - } - spec { - concurrency_policy = "Replace" - failed_jobs_history_limit = 5 - schedule = "30 1 * * *" - starting_deadline_seconds = 300 - successful_jobs_history_limit = 3 - job_template { - metadata {} - spec { - backoff_limit = 3 - ttl_seconds_after_finished = 10 - template { - metadata {} - spec { - affinity { - pod_affinity { - required_during_scheduling_ignored_during_execution { - label_selector { - match_labels = { - app = "drone-logbook" - } - } - topology_key = "kubernetes.io/hostname" - } - } - } - container { - name = "drone-logbook-backup" - image = "docker.io/library/alpine" - command = ["/bin/sh", "-c", <<-EOT - set -euxo pipefail - _t0=$(date +%s) - now=$(date +"%Y_%m_%d_%H_%M") - mkdir -p /backup/$now - cp -a /data/. 
/backup/$now/ - # Rotate — 30 day retention - find /backup -maxdepth 1 -mindepth 1 -type d -mtime +30 -exec rm -rf {} + - _dur=$(($(date +%s) - _t0)) - _out_bytes=$(du -sb /backup/$now | awk '{print $1}') - wget -qO- --post-data "backup_duration_seconds $${_dur} - backup_output_bytes $${_out_bytes} - backup_last_success_timestamp $(date +%s) - " "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drone-logbook-backup" || true - EOT - ] - volume_mount { - name = "data" - mount_path = "/data" - read_only = true - } - volume_mount { - name = "backup" - mount_path = "/backup" - } - } - volume { - name = "data" - persistent_volume_claim { - claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name - } - } - volume { - name = "backup" - persistent_volume_claim { - claim_name = module.nfs_backup.claim_name - } - } - dns_config { - option { - name = "ndots" - value = "2" - } - } - } - } - } - } - } - lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] - } -} - -# https://dronelog.viktorbarzin.me -module "ingress" { - source = "../../modules/kubernetes/ingress_factory" - auth = "required" # Authentik forward-auth — flight logs are GPS traces of home/travel - dns_type = "proxied" - namespace = kubernetes_namespace.drone_logbook.metadata[0].name - name = "dronelog" - service_name = "drone-logbook" - tls_secret_name = var.tls_secret_name - extra_annotations = { - "gethomepage.dev/enabled" = "true" - "gethomepage.dev/name" = "Drone Logbook" - "gethomepage.dev/description" = "DJI flight log analyzer" - "gethomepage.dev/icon" = "mdi-quadcopter" - "gethomepage.dev/group" = "Media & Entertainment" - "gethomepage.dev/pod-selector" = "" - } -} diff --git a/stacks/drone-logbook/secrets b/stacks/drone-logbook/secrets deleted file mode 120000 index ca54a7cf..00000000 --- a/stacks/drone-logbook/secrets +++ /dev/null @@ -1 +0,0 @@ 
-../../secrets \ No newline at end of file diff --git a/stacks/drone-logbook/terragrunt.hcl b/stacks/drone-logbook/terragrunt.hcl deleted file mode 100644 index 0d1c8e53..00000000 --- a/stacks/drone-logbook/terragrunt.hcl +++ /dev/null @@ -1,8 +0,0 @@ -include "root" { - path = find_in_parent_folders() -} - -dependency "platform" { - config_path = "../platform" - skip_outputs = true -} diff --git a/stacks/excalidraw/main.tf b/stacks/excalidraw/main.tf index b7a33117..41ab48a0 100644 --- a/stacks/excalidraw/main.tf +++ b/stacks/excalidraw/main.tf @@ -10,7 +10,7 @@ resource "kubernetes_namespace" "excalidraw" { name = "excalidraw" labels = { "istio-injection" : "disabled" - tier = local.tiers.aux + tier = local.tiers.aux "keel.sh/enrolled" = "true" } } @@ -45,15 +45,6 @@ resource "kubernetes_deployment" "excalidraw" { app = "excalidraw" tier = local.tiers.aux } - # Keel rolls new ghcr:latest digests (k8s-portal pattern). Values here are - # recreate-correct seeds only — the keys are in ignore_changes below, so - # the live annotations win on an existing deployment. - annotations = { - "keel.sh/policy" = "force" - "keel.sh/trigger" = "poll" - "keel.sh/match-tag" = "true" - "keel.sh/pollSchedule" = "@every 5m" - } } spec { replicas = 1 @@ -76,19 +67,9 @@ resource "kubernetes_deployment" "excalidraw" { } } spec { - # GHCR pull secret: the ghcr-credentials Secret in this namespace is - # cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy - # (allowlisted private-ghcr namespaces only — ADR-0002). Source of - # truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf. - image_pull_secrets { - name = "ghcr-credentials" - } container { - # ADR-0002: GHA-built (.github/workflows/build-excalidraw.yml), - # PRIVATE ghcr; Keel rolls new :latest digests. DockerHub - # viktorbarzin/excalidraw-library:v4 is the frozen rollback image. 
- image = "ghcr.io/viktorbarzin/excalidraw-library:latest" - image_pull_policy = "Always" + image = "viktorbarzin/excalidraw-library:v4" + image_pull_policy = "IfNotPresent" name = "excalidraw" port { container_port = 8080 @@ -126,7 +107,7 @@ resource "kubernetes_deployment" "excalidraw" { } lifecycle { ignore_changes = [ - spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates metadata[0].annotations["keel.sh/policy"], metadata[0].annotations["keel.sh/trigger"], diff --git a/stacks/excalidraw/project/README.md b/stacks/excalidraw/project/README.md index c9c95078..0f017e85 100644 --- a/stacks/excalidraw/project/README.md +++ b/stacks/excalidraw/project/README.md @@ -4,28 +4,18 @@ A self-hosted Excalidraw library with per-user drawing storage and management. ## Features -- Dashboard to manage all your drawings (create, open, rename, delete) +- Dashboard to manage all your drawings - Per-user storage (via Authentik SSO headers) -- Rename drawings from the dashboard or by clicking the drawing name in the editor -- Native Excalidraw export via the editor's hamburger menu: "Save to..." - (.excalidraw file) and "Export image..." (PNG / SVG / clipboard) -- Autosave (2s debounce) + manual save (Ctrl+S or menu "Save now") +- Create, edit, and delete drawings - Persistent storage via NFS ## Docker Image ``` -ghcr.io/viktorbarzin/excalidraw-library:latest +viktorbarzin/excalidraw-library:v4 ``` -Built by GitHub Actions (`.github/workflows/build-excalidraw.yml` in the infra -repo, ADR-0002) on every master push touching `stacks/excalidraw/project/**`; -tags `:latest` + `:`. The package is PRIVATE — cluster pulls use the -Kyverno-synced `ghcr-credentials` secret. Keel polls `:latest` and rolls the -deployment on digest change. 
- -The legacy manually-built DockerHub image `viktorbarzin/excalidraw-library:v4` -is frozen as the rollback target; nothing pushes to it anymore. +Available on Docker Hub: https://hub.docker.com/r/viktorbarzin/excalidraw-library ## Configuration @@ -49,13 +39,54 @@ Mount a persistent volume to the `DATA_DIR` path. Drawings are stored as `.excal └── my-diagram.excalidraw ``` -The filename (without extension) is both the drawing ID and its display name; -renaming a drawing renames the file (`os.Rename`, mtime preserved). - ## Deployment -Deployed by the `stacks/excalidraw` Terraform stack (namespace `excalidraw`, -service `draw`, ingress `draw.viktorbarzin.me` with `auth = "required"`). +### Docker + +```bash +docker run -d \ + --name excalidraw-rooms \ + -p 8080:8080 \ + -v /path/to/storage:/data \ + viktorbarzin/excalidraw-library:v4 +``` + +### Kubernetes + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: excalidraw +spec: + replicas: 1 + selector: + matchLabels: + app: excalidraw + template: + metadata: + labels: + app: excalidraw + spec: + containers: + - name: excalidraw + image: viktorbarzin/excalidraw-library:v4 + ports: + - containerPort: 8080 + env: + - name: DATA_DIR + value: /data + - name: PORT + value: "8080" + volumeMounts: + - name: data + mountPath: /data + volumes: + - name: data + nfs: + server: 192.168.1.127 + path: /srv/nfs/excalidraw +``` ### With Authentik SSO @@ -65,7 +96,23 @@ The application reads user identity from Authentik headers: - `X-Authentik-Email` - Displayed in UI - `X-Authentik-Name` - Displayed in UI -Requests without `X-Authentik-Username` fall back to the `anonymous` user. +Configure your ingress to pass these headers: + +```yaml +annotations: + nginx.ingress.kubernetes.io/auth-response-headers: "X-authentik-username,X-authentik-email,X-authentik-name" +``` + +## Building + +```bash +# Build the Docker image +docker build -t excalidraw-library . + +# Or build locally +go build -o excalidraw-library . 
+./excalidraw-library +``` ## API Endpoints @@ -75,25 +122,10 @@ Requests without `X-Authentik-Username` fall back to the `anonymous` user. | GET | `/api/drawings` | List all drawings for current user | | GET | `/api/drawings/:id` | Get drawing data | | PUT | `/api/drawings/:id` | Save drawing | -| PATCH | `/api/drawings/:id` | Rename drawing — body `{"name": ""}`; returns `{"status":"renamed","id":""}`; 409 if the target name exists | | DELETE | `/api/drawings/:id` | Delete drawing | | GET | `/api/user` | Get current user info | | GET | `/draw/:id` | Open drawing in editor | -Rename names are sanitized server-side to `[a-zA-Z0-9-_]` (other characters -become `-`; a trailing `.excalidraw` is stripped). Existing IDs are accepted -as-is for backward compatibility with API clients. - -## Development - -```bash -# Run tests -go test ./... - -# Run locally -DATA_DIR=/tmp/excalidraw-data go run . -``` - ## License MIT diff --git a/stacks/excalidraw/project/main.go b/stacks/excalidraw/project/main.go index b444f6cf..e6dfbd83 100644 --- a/stacks/excalidraw/project/main.go +++ b/stacks/excalidraw/project/main.go @@ -9,7 +9,6 @@ import ( "net/http" "os" "path/filepath" - "regexp" "sort" "strings" "time" @@ -64,21 +63,6 @@ func getUsername(r *http.Request) string { return username } -var invalidNameChars = regexp.MustCompile(`[^a-zA-Z0-9-_]`) - -// sanitizeName normalizes a user-supplied drawing name into a safe file ID -// (same charset the dashboard applies on create). Returns "" if nothing -// meaningful remains. 
-func sanitizeName(name string) string { - name = strings.TrimSpace(name) - name = strings.TrimSuffix(name, ".excalidraw") - name = invalidNameChars.ReplaceAllString(name, "-") - if strings.Trim(name, "-") == "" { - return "" - } - return name -} - // getUserDataDir returns the data directory for a specific user and ensures it exists func getUserDataDir(username string) string { userDir := filepath.Join(dataDir, username) @@ -184,41 +168,6 @@ func handleDrawing(w http.ResponseWriter, r *http.Request) { w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(map[string]string{"status": "saved", "id": id}) - case http.MethodPatch: - var req struct { - Name string `json:"name"` - } - if err := json.NewDecoder(r.Body).Decode(&req); err != nil { - http.Error(w, "Invalid JSON body", http.StatusBadRequest) - return - } - newID := sanitizeName(req.Name) - if newID == "" { - http.Error(w, "Invalid name", http.StatusBadRequest) - return - } - if _, err := os.Stat(filePath); err != nil { - if os.IsNotExist(err) { - http.Error(w, "Drawing not found", http.StatusNotFound) - } else { - http.Error(w, err.Error(), http.StatusInternalServerError) - } - return - } - if newID != id { - newPath := filepath.Join(userDataDir, newID+".excalidraw") - if _, err := os.Stat(newPath); err == nil { - http.Error(w, "A drawing with that name already exists", http.StatusConflict) - return - } - if err := os.Rename(filePath, newPath); err != nil { - http.Error(w, err.Error(), http.StatusInternalServerError) - return - } - } - w.Header().Set("Content-Type", "application/json") - json.NewEncoder(w).Encode(map[string]string{"status": "renamed", "id": newID}) - case http.MethodDelete: if err := os.Remove(filePath); err != nil { if os.IsNotExist(err) { @@ -315,8 +264,6 @@ const dashboardHTML = ` .btn:hover { background: #5b4cdb; } .btn-danger { background: #e74c3c; } .btn-danger:hover { background: #c0392b; } - .btn-secondary { background: #3d3d5c; } - .btn-secondary:hover { 
background: #4a4a70; } .btn-small { padding: 0.4rem 0.8rem; font-size: 0.85rem; } .drawings { display: grid; gap: 1rem; } .drawing { @@ -395,11 +342,11 @@ const dashboardHTML = ` @@ -422,63 +369,31 @@ const dashboardHTML = ` } } - function drawingRow(d) { - var row = document.createElement('div'); - row.className = 'drawing'; - - var info = document.createElement('div'); - info.className = 'drawing-info'; - var nameLink = document.createElement('a'); - nameLink.className = 'drawing-name'; - nameLink.href = '/draw/' + encodeURIComponent(d.id); - nameLink.textContent = d.name; - var meta = document.createElement('div'); - meta.className = 'drawing-meta'; - meta.textContent = 'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' + - new Date(d.modified).toLocaleTimeString() + ' - ' + formatSize(d.size); - info.appendChild(nameLink); - info.appendChild(meta); - - var actions = document.createElement('div'); - actions.className = 'drawing-actions'; - var open = document.createElement('a'); - open.className = 'btn btn-small'; - open.href = '/draw/' + encodeURIComponent(d.id); - open.textContent = 'Open'; - var rename = document.createElement('button'); - rename.className = 'btn btn-small btn-secondary'; - rename.textContent = 'Rename'; - rename.onclick = function() { showRenameModal(d.id); }; - var del = document.createElement('button'); - del.className = 'btn btn-small btn-danger'; - del.textContent = 'Delete'; - del.onclick = function() { deleteDrawing(d.id); }; - actions.appendChild(open); - actions.appendChild(rename); - actions.appendChild(del); - - row.appendChild(info); - row.appendChild(actions); - return row; - } - async function loadDrawings() { const resp = await fetch('/api/drawings'); const drawings = await resp.json(); const container = document.getElementById('drawings'); - container.replaceChildren(); if (!drawings || drawings.length === 0) { - var empty = document.createElement('div'); - empty.className = 'empty'; - empty.textContent = 'No 
drawings yet. Create your first one!'; - container.appendChild(empty); + container.innerHTML = '
<div class="empty">No drawings yet. Create your first one!</div>
'; return; } - drawings.forEach(function(d) { - container.appendChild(drawingRow(d)); - }); + container.innerHTML = drawings.map(function(d) { + return '
' + + '
' + + '' + d.name + '' + + '
' + + 'Modified: ' + new Date(d.modified).toLocaleDateString() + ' ' + new Date(d.modified).toLocaleTimeString() + + ' - ' + formatSize(d.size) + + '
' + + '
' + + '
' + + 'Open' + + '' + + '
' + + '
'; + }).join(''); } function formatSize(bytes) { @@ -487,64 +402,18 @@ const dashboardHTML = ` return (bytes / (1024 * 1024)).toFixed(1) + ' MB'; } - var modalAction = null; // invoked with the input value on confirm - - function showModal(title, confirmLabel, initialValue, action) { - document.getElementById('modal-title').textContent = title; - document.getElementById('modal-confirm').textContent = confirmLabel; - var input = document.getElementById('drawingName'); - input.value = initialValue || ''; - modalAction = action; - document.getElementById('modal').classList.add('active'); - input.focus(); - input.select(); - } - function showNewModal() { - showModal('New Drawing', 'Create', '', createDrawing); - } - - function showRenameModal(id) { - showModal('Rename Drawing', 'Rename', id, function(value) { - renameDrawing(id, value); - }); + document.getElementById('modal').classList.add('active'); + document.getElementById('drawingName').focus(); } function hideModal() { document.getElementById('modal').classList.remove('active'); document.getElementById('drawingName').value = ''; - modalAction = null; } - function confirmModal() { - if (modalAction) modalAction(document.getElementById('drawingName').value); - } - - async function renameDrawing(id, newName) { - newName = (newName || '').trim(); - if (!newName || newName === id) { - hideModal(); - return; - } - var resp = await fetch('/api/drawings/' + encodeURIComponent(id), { - method: 'PATCH', - headers: { 'Content-Type': 'application/json' }, - body: JSON.stringify({ name: newName }) - }); - if (resp.status === 409) { - alert('A drawing with that name already exists.'); - return; // keep the modal open so the user can pick another name - } - if (!resp.ok) { - alert('Rename failed: ' + await resp.text()); - return; - } - hideModal(); - loadDrawings(); - } - - async function createDrawing(name) { - name = (name || '').trim(); + async function createDrawing() { + var name = 
document.getElementById('drawingName').value.trim(); if (!name) { name = 'drawing-' + Date.now(); } @@ -577,7 +446,7 @@ const dashboardHTML = ` } document.getElementById('drawingName').addEventListener('keypress', function(e) { - if (e.key === 'Enter') confirmModal(); + if (e.key === 'Enter') createDrawing(); }); document.getElementById('modal').addEventListener('click', function(e) { diff --git a/stacks/excalidraw/project/main_test.go b/stacks/excalidraw/project/main_test.go deleted file mode 100644 index b4ab14f8..00000000 --- a/stacks/excalidraw/project/main_test.go +++ /dev/null @@ -1,249 +0,0 @@ -package main - -import ( - "encoding/json" - "net/http" - "net/http/httptest" - "os" - "path/filepath" - "strings" - "testing" -) - -const testDrawing = `{"type":"excalidraw","version":2,"source":"excalidraw-library","elements":[{"id":"e1"}],"appState":{"viewBackgroundColor":"#ffffff"}}` - -func setupDataDir(t *testing.T) { - t.Helper() - dataDir = t.TempDir() -} - -// doDrawing sends a request to handleDrawing for the given user and returns the recorder. 
-func doDrawing(t *testing.T, method, id, body, user string) *httptest.ResponseRecorder { - t.Helper() - var reader *strings.Reader - if body == "" { - reader = strings.NewReader("") - } else { - reader = strings.NewReader(body) - } - req := httptest.NewRequest(method, "/api/drawings/"+id, reader) - if user != "" { - req.Header.Set("X-Authentik-Username", user) - } - w := httptest.NewRecorder() - handleDrawing(w, req) - return w -} - -func listDrawings(t *testing.T, user string) []Drawing { - t.Helper() - req := httptest.NewRequest(http.MethodGet, "/api/drawings", nil) - if user != "" { - req.Header.Set("X-Authentik-Username", user) - } - w := httptest.NewRecorder() - handleListDrawings(w, req) - if w.Code != http.StatusOK { - t.Fatalf("list: expected 200, got %d", w.Code) - } - var drawings []Drawing - if err := json.Unmarshal(w.Body.Bytes(), &drawings); err != nil { - t.Fatalf("list: bad JSON: %v", err) - } - return drawings -} - -func TestPutGetRoundtrip(t *testing.T) { - setupDataDir(t) - if w := doDrawing(t, http.MethodPut, "foo", testDrawing, "alice"); w.Code != http.StatusOK { - t.Fatalf("PUT: expected 200, got %d: %s", w.Code, w.Body.String()) - } - w := doDrawing(t, http.MethodGet, "foo", "", "alice") - if w.Code != http.StatusOK { - t.Fatalf("GET: expected 200, got %d", w.Code) - } - if w.Body.String() != testDrawing { - t.Errorf("GET: content mismatch: %s", w.Body.String()) - } -} - -func TestGetMissing(t *testing.T) { - setupDataDir(t) - if w := doDrawing(t, http.MethodGet, "nope", "", "alice"); w.Code != http.StatusNotFound { - t.Fatalf("expected 404, got %d", w.Code) - } -} - -func TestListDrawings(t *testing.T) { - setupDataDir(t) - doDrawing(t, http.MethodPut, "one", testDrawing, "alice") - doDrawing(t, http.MethodPut, "two", testDrawing, "alice") - drawings := listDrawings(t, "alice") - if len(drawings) != 2 { - t.Fatalf("expected 2 drawings, got %d", len(drawings)) - } - ids := map[string]bool{drawings[0].ID: true, drawings[1].ID: true} - if 
!ids["one"] || !ids["two"] { - t.Errorf("unexpected ids: %v", ids) - } - for _, d := range drawings { - if d.Name != d.ID { - t.Errorf("name should equal id: %+v", d) - } - } -} - -func TestDelete(t *testing.T) { - setupDataDir(t) - doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") - if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusOK { - t.Fatalf("DELETE: expected 200, got %d", w.Code) - } - if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound { - t.Fatalf("GET after delete: expected 404, got %d", w.Code) - } - if w := doDrawing(t, http.MethodDelete, "foo", "", "alice"); w.Code != http.StatusNotFound { - t.Fatalf("second DELETE: expected 404, got %d", w.Code) - } -} - -func TestPerUserIsolation(t *testing.T) { - setupDataDir(t) - doDrawing(t, http.MethodPut, "secret", testDrawing, "alice") - if w := doDrawing(t, http.MethodGet, "secret", "", "bob"); w.Code != http.StatusNotFound { - t.Fatalf("bob should not see alice's drawing, got %d", w.Code) - } - if drawings := listDrawings(t, "bob"); len(drawings) != 0 { - t.Fatalf("bob's list should be empty, got %d", len(drawings)) - } -} - -// --- rename (PATCH) --- - -func renameReq(t *testing.T, id, newName, user string) *httptest.ResponseRecorder { - t.Helper() - return doDrawing(t, http.MethodPatch, id, `{"name":`+strconv(newName)+`}`, user) -} - -// strconv JSON-quotes a string without importing encoding/json for a one-liner. 
-func strconv(s string) string { - b, _ := json.Marshal(s) - return string(b) -} - -func TestRenameSuccess(t *testing.T) { - setupDataDir(t) - doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") - w := renameReq(t, "foo", "bar", "alice") - if w.Code != http.StatusOK { - t.Fatalf("PATCH: expected 200, got %d: %s", w.Code, w.Body.String()) - } - var resp map[string]string - if err := json.Unmarshal(w.Body.Bytes(), &resp); err != nil { - t.Fatalf("PATCH: bad JSON: %v", err) - } - if resp["id"] != "bar" || resp["status"] != "renamed" { - t.Errorf("unexpected response: %v", resp) - } - if w := doDrawing(t, http.MethodGet, "bar", "", "alice"); w.Code != http.StatusOK || w.Body.String() != testDrawing { - t.Errorf("GET new id: code=%d content=%q", w.Code, w.Body.String()) - } - if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); w.Code != http.StatusNotFound { - t.Errorf("GET old id: expected 404, got %d", w.Code) - } -} - -func TestRenameConflict(t *testing.T) { - setupDataDir(t) - doDrawing(t, http.MethodPut, "a", testDrawing, "alice") - doDrawing(t, http.MethodPut, "b", testDrawing, "alice") - if w := renameReq(t, "a", "b", "alice"); w.Code != http.StatusConflict { - t.Fatalf("expected 409, got %d", w.Code) - } - // both drawings intact - for _, id := range []string{"a", "b"} { - if w := doDrawing(t, http.MethodGet, id, "", "alice"); w.Code != http.StatusOK { - t.Errorf("drawing %q should be intact, got %d", id, w.Code) - } - } -} - -func TestRenameMissing(t *testing.T) { - setupDataDir(t) - if w := renameReq(t, "nope", "new", "alice"); w.Code != http.StatusNotFound { - t.Fatalf("expected 404, got %d", w.Code) - } -} - -func TestRenameSameName(t *testing.T) { - setupDataDir(t) - doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") - w := renameReq(t, "foo", "foo", "alice") - if w.Code != http.StatusOK { - t.Fatalf("same-name rename: expected 200, got %d: %s", w.Code, w.Body.String()) - } - if w := doDrawing(t, http.MethodGet, "foo", "", "alice"); 
w.Code != http.StatusOK { - t.Errorf("drawing should be intact, got %d", w.Code) - } -} - -func TestRenameInvalidNames(t *testing.T) { - setupDataDir(t) - doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") - for _, name := range []string{"", " ", "../..", "---"} { - if w := renameReq(t, "foo", name, "alice"); w.Code != http.StatusBadRequest { - t.Errorf("rename to %q: expected 400, got %d", name, w.Code) - } - } - // malformed body - if w := doDrawing(t, http.MethodPatch, "foo", `{not json`, "alice"); w.Code != http.StatusBadRequest { - t.Errorf("malformed body: expected 400, got %d", w.Code) - } -} - -func TestRenameSanitization(t *testing.T) { - setupDataDir(t) - cases := []struct{ in, want string }{ - {"My Drawing!", "My-Drawing-"}, - {"net diag.excalidraw", "net-diag"}, // .excalidraw suffix stripped, not mangled - {"a/b\\c", "a-b-c"}, - } - for _, c := range cases { - doDrawing(t, http.MethodPut, "src", testDrawing, "alice") - w := renameReq(t, "src", c.in, "alice") - if w.Code != http.StatusOK { - t.Errorf("rename to %q: expected 200, got %d: %s", c.in, w.Code, w.Body.String()) - continue - } - var resp map[string]string - json.Unmarshal(w.Body.Bytes(), &resp) - if resp["id"] != c.want { - t.Errorf("rename to %q: expected id %q, got %q", c.in, c.want, resp["id"]) - } - // file must be inside the user dir under the sanitized name - if _, err := os.Stat(filepath.Join(dataDir, "alice", c.want+".excalidraw")); err != nil { - t.Errorf("rename to %q: expected file %q on disk: %v", c.in, c.want, err) - } - doDrawing(t, http.MethodDelete, resp["id"], "", "alice") - } -} - -func TestRenameTraversalStaysInUserDir(t *testing.T) { - setupDataDir(t) - doDrawing(t, http.MethodPut, "foo", testDrawing, "alice") - w := renameReq(t, "foo", "../../../etc/passwd", "alice") - if w.Code == http.StatusOK { - var resp map[string]string - json.Unmarshal(w.Body.Bytes(), &resp) - if strings.Contains(resp["id"], "/") || strings.Contains(resp["id"], "..") { - t.Fatalf("traversal 
characters survived: %q", resp["id"]) - } - if _, err := os.Stat(filepath.Join(dataDir, "alice", resp["id"]+".excalidraw")); err != nil { - t.Fatalf("renamed file escaped user dir: %v", err) - } - } - // nothing outside the data dir - if _, err := os.Stat(filepath.Join(dataDir, "..", "etc")); err == nil { - t.Fatal("file escaped the data dir") - } -} diff --git a/stacks/excalidraw/project/static/editor.html b/stacks/excalidraw/project/static/editor.html index f374c115..aba6390b 100644 --- a/stacks/excalidraw/project/static/editor.html +++ b/stacks/excalidraw/project/static/editor.html @@ -8,41 +8,41 @@ * { margin: 0; padding: 0; } html, body { width: 100%; height: 100%; overflow: hidden; } #root { width: 100%; height: 100%; } - .top-right-ui { + .toolbar { + position: fixed; + top: 10px; + left: 10px; + z-index: 1000; display: flex; - align-items: center; gap: 8px; - font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; - } - .top-right-ui a, .top-right-ui button { - display: inline-flex; - align-items: center; - gap: 6px; + background: rgba(255,255,255,0.95); padding: 8px 12px; - border: 1px solid transparent; border-radius: 8px; + box-shadow: 0 2px 8px rgba(0,0,0,0.15); + } + .toolbar button, .toolbar a { + padding: 6px 14px; + border: none; + border-radius: 6px; cursor: pointer; - font-size: 13px; + font-size: 14px; + background: #6c5ce7; + color: white; text-decoration: none; - box-shadow: 0 1px 4px rgba(0,0,0,0.12); - max-width: 40vw; - white-space: nowrap; - overflow: hidden; - text-overflow: ellipsis; + display: inline-block; } - .top-right-ui.theme-light a, .top-right-ui.theme-light button { - background: #ffffff; - color: #1b1b1f; + .toolbar button:hover, .toolbar a:hover { background: #5b4cdb; } + .toolbar .secondary { background: #ddd; color: #333; } + .toolbar .secondary:hover { background: #ccc; } + .toolbar .title { + font-weight: 600; + padding: 6px 0; + color: #333; } - .top-right-ui.theme-dark a, .top-right-ui.theme-dark 
button { - background: #232329; - color: #e9ecef; - } - .top-right-ui button:hover, .top-right-ui a:hover { border-color: #a29bfe; } .status { position: fixed; bottom: 10px; - right: 60px; + right: 10px; padding: 6px 12px; background: rgba(0,0,0,0.7); color: white; @@ -51,7 +51,6 @@ z-index: 1000; opacity: 0; transition: opacity 0.3s; - pointer-events: none; } .status.show { opacity: 1; } .loading { @@ -68,6 +67,11 @@ +
+ Back to Library + Loading... + +
Loading Excalidraw...
@@ -77,33 +81,16 @@
Saved