docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]

Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
This commit is contained in:
Viktor Barzin 2026-04-06 13:21:05 +03:00
parent 06359aa3fa
commit fc233bd27f
14 changed files with 152 additions and 142 deletions

View file

@ -54,7 +54,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
- **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
- **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
- **Database rotation**: Vault DB engine rotates passwords every 24h. MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, technitium. PostgreSQL: trading, health, linkwarden, affine, woodpecker, claude_memory. Excluded: authentik (PgBouncer), crowdsec (Helm-baked), root users. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API.
- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, technitium. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory. Excluded: authentik (PgBouncer), crowdsec (Helm-baked), root users. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API.
- **K8s credentials**: Vault K8s secrets engine. Roles: `dashboard-admin`, `ci-deployer`, `openclaw`, `local-admin`. Use `vault write kubernetes/creds/ROLE kubernetes_namespace=NS`. Helper: `scripts/vault-kubeconfig`.
- **CI/CD (GHA + Woodpecker)**: Docker builds run on **GitHub Actions** (free on public repos). Woodpecker is **deploy-only** — receives image tag via API POST, runs `kubectl set image`. Woodpecker authenticates via K8s SA JWT → Vault K8s auth. Sync CronJob pushes `secret/ci/global` → Woodpecker API every 6h. Shell scripts in HCL heredocs: escape `$``$$`, `%{}``%%{}`.
- **Platform cannot depend on vault** (circular). Apply order: vault first, then platform. Platform has 48 vault refs, all in module inputs — no ESO migration possible.
@ -79,7 +79,7 @@ Violations cause state drift, which causes future applies to break or silently r
**Flow**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image`
**Migrated to GHA** (7): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book
**Migrated to GHA** (9): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search
**Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access)
**Per-project files**:
@ -97,7 +97,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
**GitHub repo secrets** (set on all repos): `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`, `WOODPECKER_TOKEN`
**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build) — all stay on Woodpecker.
**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build), `provision-user.yml` — all stay on Woodpecker.
## Database Host
@ -121,7 +121,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL InnoDB | Enable auto-recovery, anti-affinity excludes node2 (SIGBUS), 4.4Gi req but ~1Gi used |
| MySQL InnoDB | Enable auto-recovery, anti-affinity excludes k8s-node1 (GPU), 2Gi req / 3Gi limit |
## Monitoring & Alerting
- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.

View file

@ -5,33 +5,33 @@
## Critical - Network & Auth (Tier: core)
| Service | Description | Stack |
|---------|-------------|-------|
| wireguard | VPN server | platform |
| technitium | DNS server (10.0.20.101) | platform |
| headscale | Tailscale control server | platform |
| traefik | Ingress controller (Helm) | platform |
| wireguard | VPN server | wireguard |
| technitium | DNS server (10.0.20.101) | technitium |
| headscale | Tailscale control server | headscale |
| traefik | Ingress controller (Helm) | traefik |
| xray | Proxy/tunnel | platform |
| authentik | Identity provider (SSO) | platform |
| cloudflared | Cloudflare tunnel | platform |
| authelia | Auth middleware | platform |
| monitoring | Prometheus/Grafana/Loki stack | platform |
| authentik | Identity provider (SSO) | authentik |
| cloudflared | Cloudflare tunnel | cloudflared |
| authelia | Auth middleware (may be merged into ebooks or removed) | platform |
| monitoring | Prometheus/Grafana/Loki stack | monitoring |
## Storage & Security (Tier: cluster)
| Service | Description | Stack |
|---------|-------------|-------|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | redis |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | platform |
| metrics-server | K8s metrics | platform |
| uptime-kuma | Status monitoring | platform |
| crowdsec | Security/WAF | platform |
| kyverno | Policy engine | platform |
| nvidia | GPU device plugin | nvidia |
| metrics-server | K8s metrics | metrics-server |
| uptime-kuma | Status monitoring | uptime-kuma |
| crowdsec | Security/WAF | crowdsec |
| kyverno | Policy engine | kyverno |
## Admin
| Service | Description | Stack |
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard | platform |
| reverse-proxy | Generic reverse proxy | platform |
| k8s-dashboard | Kubernetes dashboard | k8s-dashboard |
| reverse-proxy | Generic reverse proxy | reverse-proxy |
## Active Use
| Service | Description | Stack |
@ -43,12 +43,15 @@
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management | calibre |
| calibre | E-book management (may be merged into ebooks stack) | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |
| insta2spotify | Instagram reel song ID to Spotify playlist | insta2spotify |
| trading-bot | Event-driven trading with sentiment analysis | trading-bot |
| claude-memory | Persistent memory MCP server | claude-memory |
## Optional
| Service | Description | Stack |
@ -69,7 +72,7 @@
| send | Firefox Send | send |
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server | audiobookshelf |
| audiobookshelf | Audiobook server (may be merged into ebooks stack) | audiobookshelf |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
@ -103,6 +106,9 @@
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |
| poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain |
| priority-pass | Boarding pass color transformer | priority-pass |
| status-page | Status page | status-page |
| plotting-book | Book plotting/world-building app | plotting-book |
## Cloudflare Domains