Merge origin/master into wizard/cctv-adr-trunk

2026-07-03 12:32:00 +00:00 · 126cf4c88e
commit 126cf4c88e
parent 5d16a18cf4 695e020111
11 changed files with 550 additions and 1 deletion
--- a/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md
+++ b/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md
@ -0,0 +1,47 @@
+# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster
+
+Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she
+shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`)
+and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare
+Pages** under `<english-name>.viktorbarzin.me`, kept fresh by **one shared in-cluster
+CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes
+(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The
+existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync)
+migrates onto this and is retired.
+
+Why off-infra serving: these are her sites, shown to teachers/parents — they must survive
+homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster
+site down). With Pages, a homelab outage degrades to "content frozen until we're back",
+never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/
+Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA
+secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never
+wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The
+deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an
+accident.
+
+## Considered options
+
+- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no
+  Cloudflare Pages dependency — but her sites share the homelab's fate and each site
+  spends cluster resources to serve static files a free CDN serves better.
+- **Pages for new sites only**: less work now, two patterns and two runbooks forever.
+- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but
+  Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault.
+
+## Consequences
+
+- Registration is one entry in the `sites` map (name, Content folder, optional Entry
+  file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config
+  together. Names are English, picked by Viktor (most → bridge set the precedent).
+- The internal split-horizon zone learns Valia sites from a ConfigMap the
+  `technitium-ingress-dns-sync` script consumes — declaratively, including **removal**
+  (the previous static-CNAME approach was add-only; a retired site left a stale record).
+- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on
+  the free tier, and a 10-minute cadence would burn ~4,300 deployments/month
+  (6 runs/hour × 24 h × 30 days = 4,320) if unchanged runs deployed.
+- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no
+  per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't
+  update" reports, consistent with the alert-noise-reduction posture. Revisit if a
+  silent stall actually bites.
+- If the homelab is down, content updates pause; the sites keep serving last-deployed
+  content. Accepted degradation.
--- a/docs/architecture/dns.md
+++ b/docs/architecture/dns.md
@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons

 Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).

-**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
+**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. The same applies to **off-infra sites** (e.g. `bridge` → CNAME `bridge-cv2.pages.dev`, Cloudflare Pages): any public-only name with no Traefik ingress must be added as a static record in the sync script, or internal clients NXDOMAIN on it while it works fine externally.

 ## NodeLocal DNSCache

@ -368,6 +368,7 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
 | TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
 | TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
 | A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
+| CNAME (CF Pages) | 1 | `bridge-cv2.pages.dev` (Cloudflare Pages) | `bridge` — static site hosted off-infra on CF Pages, content deployed via wrangler |

 ### Proxied vs Non-Proxied