docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Runbook covers add/update/retire (one map entry; internal DNS now cleans up after itself), content rules for Valia's folders, and the failure modes incl. both token re-mint paths. dns.md superset-rule paragraph now describes the declarative ConfigMap reconcile instead of hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row notes its Pages cutover is parked on the 42.9MB stem_video.mp4 exceeding the 25MB Pages per-file cap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
parent 4a3c8287c3
commit 316cdb7441
3 changed files with 104 additions and 3 deletions
@ -120,7 +120,8 @@
| status-page | Status page | status-page |
| plotting-book | Book plotting/world-building app | plotting-book |
| tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). `auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit |
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su |
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). **Pages cutover PARKED** (ADR-0018): `stem_board.html` embeds the 42.9MB `stem_video.mp4` > the 25MB CF Pages per-file cap — stays on this stack until the video shrinks (parked as `manage_dns=false` in stacks/valia-sites; see docs/runbooks/valia-sites.md). | stem95su |
| valia-sites | **Valia-site registry + sync** (ADR-0018): all sites authored by Valia serve OFF-INFRA on Cloudflare Pages (`bridge` live; `stem95su` parked, see above). One map entry in `stacks/valia-sites/main.tf` per site fans out Pages project + custom domain + public CNAME + internal split-horizon CNAME (ConfigMap `valia-sites-dns` → technitium sync, declarative incl. removal). CronJob `valia-sites-sync` (`*/10`, image ghcr `valia-sites-sync`) mirrors each Drive Content folder (rclone `drive.readonly`, stem95su-style guards + 25MB Pages-cap guard) and wrangler-deploys ONLY on manifest change (free-tier deploy cap). Secrets `secret/valia-sites` (shared rclone conf + SCOPED CF Pages token — Global API Key never in pods). Failed-Job-only visibility by choice. Runbook: docs/runbooks/valia-sites.md. | valia-sites |
| trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek |
## Cloudflare Domains
@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. The same applies to **off-infra sites** (e.g. `bridge` → CNAME `bridge-cv2.pages.dev`, Cloudflare Pages): any public-only name with no Traefik ingress must be added as a static record in the sync script, or internal clients NXDOMAIN on it while it works fine externally.
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).
## NodeLocal DNSCache
@ -368,7 +368,7 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
| TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
| TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
| A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
| CNAME (CF Pages) | 1 | `bridge-cv2.pages.dev` (Cloudflare Pages) | `bridge` — static site hosted off-infra on CF Pages, content deployed via wrangler |
| CNAME (CF Pages) | 2 | `<project>.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` |
### Proxied vs Non-Proxied
@ -514,6 +514,7 @@ For external `.viktorbarzin.me` records:
1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)
## Incident History
99
docs/runbooks/valia-sites.md
Normal file
@ -0,0 +1,99 @@
# Valia sites: add / update / retire

Off-infra static sites authored by Valia (ADR-0018, CONTEXT.md "Valia site").
Serving: Cloudflare Pages. Freshness: the `valia-sites-sync` CronJob
(`valia-sites` ns) mirrors each Content folder every 10 minutes and deploys
only when the folder's manifest hash changed. Registry: `local.sites` in
`stacks/valia-sites/main.tf`, where one entry per site drives everything (Pages
project, custom domain, public CNAME, internal split-horizon CNAME, sync).

Current sites: `bridge` (ОбУ „Отец Паисий“, "мост", Bulgarian for "bridge"),
`stem95su` (95. СУ STEM board; Pages cutover parked, see History).

## Add a site

1. Valia shares the Drive folder with **vbarzin@gmail.com** (viewer is enough:
   the pipeline is strictly read-only towards Drive).
2. Get the folder id from its URL (`drive.google.com/drive/folders/<ID>`).
3. Pick the **English** subdomain name (Viktor's call; see the CONTEXT.md naming rule).
4. Add one entry to `local.sites` in `stacks/valia-sites/main.tf`:

   ```hcl
   <name> = {
     folder_id  = "<ID>"
     src_path   = ""            # or "sub/folder" if servable files live deeper
     entry_file = "index.html"  # or whatever her main HTML file is called
     manage_dns = true
   }
   ```

5. Commit + push; CI applies. Within ~10 min the sync deploys content and the
   site serves at `https://<name>.viktorbarzin.me` (custom-domain TLS takes
   ~5–10 min extra on first attach; CF returns 522 for the hostname until
   then). Internal LAN/VLAN/pod resolution appears when the hourly
   `technitium-ingress-dns-sync` next runs. Trigger it early with:
   `kubectl create job --from=cronjob/technitium-ingress-dns-sync valia-dns-now -n technitium`

## Content rules (what Valia's folder must look like)

- The **entry file** must exist. The sync stages a copy as `index.html` at
  deploy time, so `/` works; the original filename keeps working too (deep
  links survive). If the folder is empty or the entry file is missing, the
  sync **skips the site and leaves it as-is** (never wipes a live site).
- Google-native files (Docs/Sheets) are **ignored** (`--drive-skip-gdocs`);
  only real files (`.html`, images, …) deploy. Gemini's HTML exports are fine.
- Per-file limit 25 MB (Cloudflare Pages), 20k files max, far beyond a
  1-page site.
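
The content rules above amount to a pre-deploy guard. The following is a minimal Python sketch of that check, not the sync script itself: `guard_reason` is a hypothetical helper, the folder is modelled as an in-memory map of relative path to size in bytes, and reading "25 MB" as 25 MiB is an assumption.

```python
from typing import Optional

# Cloudflare Pages per-file cap from the rules above; treating "25 MB" as
# 25 MiB is an assumption of this sketch.
MAX_FILE_BYTES = 25 * 1024 * 1024


def guard_reason(files: dict[str, int], entry_file: str) -> Optional[str]:
    """Return why a site must be skipped, or None if it is safe to deploy.

    `files` maps relative path -> size in bytes for the mirrored folder.
    """
    if not files:
        return "folder is empty"
    if entry_file not in files:
        return f"entry file {entry_file!r} is missing"
    oversized = sorted(p for p, size in files.items() if size > MAX_FILE_BYTES)
    if oversized:
        return f"over the per-file cap: {', '.join(oversized)}"
    return None
```

Any non-`None` result means "skip and keep the last deployed content", which is the never-wipe behaviour the rules describe. A folder holding the 42.9 MB `stem_video.mp4` would trip the last check.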

## Update a site

Nothing to do: Valia edits the folder, the site follows within ~10 minutes.
Force it early: `kubectl create job --from=cronjob/valia-sites-sync sync-now -n valia-sites`
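
The "deploy only on manifest change" behaviour can be sketched as follows. This is a minimal Python illustration under stated assumptions: `manifest_hash` and `needs_deploy` are hypothetical names, and hashing sorted path/size/checksum triples is an assumed manifest format, not necessarily what the sync script does.

```python
import hashlib


def manifest_hash(entries: list[tuple[str, int, str]]) -> str:
    """Hash a folder listing of (relative path, size, checksum) triples.

    Sorting first makes the result independent of listing order.
    """
    digest = hashlib.sha256()
    for path, size, checksum in sorted(entries):
        digest.update(f"{path}\x00{size}\x00{checksum}\n".encode("utf-8"))
    return digest.hexdigest()


def needs_deploy(site: str, current: str, state: dict[str, str]) -> bool:
    """Deploy when the stored hash for the site is absent or differs."""
    return state.get(site) != current
```

In this sketch a missing key counts as a change, so removing a site's entry from the stored state triggers a deploy on the next run, while an unchanged folder costs no deploy.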

## Rename / retire a site

Rename = retire + add (Pages projects can't be renamed). Retire:

1. Delete the entry from `local.sites`; commit + push. TF destroys the public
   CNAME + custom domain + Pages project; the internal record is removed by
   the next `technitium-ingress-dns-sync` run (its deletion pass drops any
   internal `*.pages.dev` CNAME that left the `valia-sites-dns` ConfigMap,
   scoped so it can never touch non-Pages records).
2. That's all: no manual DNS cleanup (the pre-ADR-0018 add-only gotcha is
   fixed by the deletion pass).
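
The reconcile described in step 1 can be sketched as below. This is a minimal Python illustration, not the actual sync script: `plan_reconcile` is a hypothetical name, `desired` stands in for the entries of the `valia-sites-dns` ConfigMap, and `existing` for the CNAMEs currently in the internal zone. How the real script talks to Technitium is not covered here.

```python
PAGES_SUFFIX = ".pages.dev"


def plan_reconcile(
    desired: dict[str, str], existing: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Return (CNAMEs to create or update, names to delete).

    Both maps are record name -> CNAME target.
    """
    upserts = {
        name: target
        for name, target in desired.items()
        if existing.get(name) != target
    }
    # Deletion is suffix-scoped: only CNAMEs that point at *.pages.dev and
    # are no longer in the desired map are candidates.
    deletes = sorted(
        name
        for name, target in existing.items()
        if target.endswith(PAGES_SUFFIX) and name not in desired
    )
    return upserts, deletes
```

Because the delete set is filtered on the target suffix, a record such as a mail CNAME can never be selected, even when it is absent from the map.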

## Failure modes / debugging

- **Visibility is failed-Job-only by choice** (ADR-0018): no alerts, no
  notifications. Check: `kubectl get jobs -n valia-sites | tail`, logs of the
  last `valia-sites-sync-*` pod.
- **Drive auth broken** (`FATAL … Drive list failed`): the shared
  `secret/valia-sites.rclone_conf` token died. The GCP OAuth app
  (`home-lab-1700868541205`) must stay published to "Production" or refresh
  tokens expire weekly (same constraint as the old stem95su conf, which this
  one was copied from). Re-mint and
  `vault kv patch secret/valia-sites rclone_conf=@…`.
- **Wrangler auth broken**: `secret/valia-sites.cloudflare_pages_token` is a
  SCOPED token (Pages Read+Write on the account, id
  `355d2c9d11579bdad1e9498dafca30d5`). Re-mint via
  `POST /user/tokens` with the Global API Key (`secret/platform`), then patch
  Vault. Do NOT put the Global API Key in the pod.
- **Site serves stale content**: check the state CM
  (`kubectl get cm valia-sites-state -n valia-sites -o yaml`); deleting a
  site's key forces a redeploy on the next run.
- **`GUARD … skipping`** in logs: Valia's folder is empty, or she renamed the
  entry file, so the site deliberately kept its last content. Fix the folder or
  update `entry_file`.

## History

- stem95su still serves from its ORIGINAL in-cluster stack (nginx + NFS +
  its own rclone CronJob): its Pages cutover is **parked**
  (`manage_dns = false`) because `stem_board.html` embeds the 42.9 MB
  `stem_video.mp4`, over the 25 MB Pages per-file cap. The sync guard-skips
  it until the video shrinks below 25 MB (or the site is deliberately kept
  in-cluster and removed from the map). Once cut over: flip `manage_dns = true`,
  set `dns_type = "none"` in `stacks/stem95su`, then retire that stack;
  `secret/stem95su` becomes superseded by `secret/valia-sites`.
- bridge started as a hand-deployed wrangler experiment (2026-07-03, memory
  id 7085) and was adopted into the stack the same day.