docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h), CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage, correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
parent 06359aa3fa
commit fc233bd27f
14 changed files with 152 additions and 142 deletions
@@ -54,7 +54,7 @@ Violations cause state drift, which causes future applies to break or silently r
 - **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
 - **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
 - **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
-- **Database rotation**: Vault DB engine rotates passwords every 24h. MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, technitium. PostgreSQL: trading, health, linkwarden, affine, woodpecker, claude_memory. Excluded: authentik (PgBouncer), crowdsec (Helm-baked), root users. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API.
+- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, technitium. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory. Excluded: authentik (PgBouncer), crowdsec (Helm-baked), root users. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API.
 - **K8s credentials**: Vault K8s secrets engine. Roles: `dashboard-admin`, `ci-deployer`, `openclaw`, `local-admin`. Use `vault write kubernetes/creds/ROLE kubernetes_namespace=NS`. Helper: `scripts/vault-kubeconfig`.
 - **CI/CD (GHA + Woodpecker)**: Docker builds run on **GitHub Actions** (free on public repos). Woodpecker is **deploy-only** — receives image tag via API POST, runs `kubectl set image`. Woodpecker authenticates via K8s SA JWT → Vault K8s auth. Sync CronJob pushes `secret/ci/global` → Woodpecker API every 6h. Shell scripts in HCL heredocs: escape `$` → `$$`, `%{}` → `%%{}`.
 - **Platform cannot depend on vault** (circular). Apply order: vault first, then platform. Platform has 48 vault refs, all in module inputs — no ESO migration possible.
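Reviewer note on the HCL heredoc escaping rule in the CI/CD bullet above: the following is a minimal Python sketch of that rule, not code from this repository. It assumes the note refers to the `${` and `%{` sequences, since those are the ones Terraform's template syntax treats as special; a bare `$VAR` is left untouched. The function name `escape_for_hcl_heredoc` is hypothetical.

```python
def escape_for_hcl_heredoc(script: str) -> str:
    """Escape template sequences so a shell script survives an HCL heredoc.

    Assumption: only "${" (interpolation) and "%{" (directive) need escaping;
    they become "$${" and "%%{". Input is assumed to be unescaped.
    """
    # The two sequences cannot overlap, so the replacement order is irrelevant.
    return script.replace("${", "$${").replace("%{", "%%{")


if __name__ == "__main__":
    raw = 'echo "${HOME}" && curl -w "%{http_code}" "$URL"'
    print(escape_for_hcl_heredoc(raw))
```

Running the sketch on a script that is already escaped would double the escapes, so it is meant for raw shell text only.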
@@ -79,7 +79,7 @@ Violations cause state drift, which causes future applies to break or silently r
 
 **Flow**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image`
 
-**Migrated to GHA** (7): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book
+**Migrated to GHA** (9): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search
 **Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access)
 
 **Per-project files**:
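Reviewer note on the **Flow** line above: the hand-off from GitHub Actions to Woodpecker can be sketched roughly as below. This is an illustration under assumptions, not the workflow used in these repos. The endpoint path `/api/repos/{repo_id}/pipelines`, the `IMAGE_TAG` variable name, the branch name, and the host `woodpecker.example.invalid` are all assumed for the example. The request is only built, never sent.

```python
import json
import urllib.request


def image_tag(commit_sha: str) -> str:
    """Return the 8-character short SHA used as the image tag."""
    return commit_sha[:8]


def build_deploy_request(base_url: str, repo_id: int, token: str, tag: str,
                         branch: str = "master") -> urllib.request.Request:
    """Build (but do not send) the POST that asks Woodpecker to deploy a tag.

    Assumptions: the pipeline-trigger path and the IMAGE_TAG variable name
    are placeholders chosen for this sketch.
    """
    body = json.dumps({"branch": branch, "variables": {"IMAGE_TAG": tag}})
    return urllib.request.Request(
        f"{base_url}/api/repos/{repo_id}/pipelines",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    tag = image_tag("0123456789abcdef0123456789abcdef01234567")
    req = build_deploy_request("https://woodpecker.example.invalid", 2, "TOKEN", tag)
    print(req.get_method(), req.full_url, req.data.decode())
```

In a real workflow the token would come from the `WOODPECKER_TOKEN` repo secret and the request would be sent with `urllib.request.urlopen(req)`.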
@@ -97,7 +97,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 
 **GitHub repo secrets** (set on all repos): `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`, `WOODPECKER_TOKEN`
 
-**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build) — all stay on Woodpecker.
+**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build), `provision-user.yml` — all stay on Woodpecker.
 
 ## Database Host
 
@@ -121,7 +121,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
 | Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
 | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
-| MySQL InnoDB | Enable auto-recovery, anti-affinity excludes node2 (SIGBUS), 4.4Gi req but ~1Gi used |
+| MySQL InnoDB | Enable auto-recovery, anti-affinity excludes k8s-node1 (GPU), 2Gi req / 3Gi limit |
 
 ## Monitoring & Alerting
 - Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
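Reviewer note on the alert cascade inhibition bullet above: the rule "if node is down, suppress pod alerts on that node" amounts to matching a source alert against target alerts on a shared label. The sketch below models that logic in plain Python. It is not the cluster's Alertmanager configuration, and the alert names `NodeDown` and `PodCrashLooping` and the `node` label are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    labels: dict = field(default_factory=dict)


def inhibit(alerts, source_name="NodeDown", equal_label="node"):
    """Drop alerts that share `equal_label` with a firing source alert.

    Source alerts themselves are always kept, and so are alerts that do
    not carry the label at all.
    """
    # Nodes for which the source (node-down) alert is currently firing.
    down = {
        a.labels[equal_label]
        for a in alerts
        if a.name == source_name and equal_label in a.labels
    }
    return [
        a for a in alerts
        if a.name == source_name or a.labels.get(equal_label) not in down
    ]


if __name__ == "__main__":
    firing = [
        Alert("NodeDown", {"node": "k8s-node2"}),
        Alert("PodCrashLooping", {"node": "k8s-node2", "pod": "a"}),
        Alert("PodCrashLooping", {"node": "k8s-node3", "pod": "b"}),
    ]
    for alert in inhibit(firing):
        print(alert.name, alert.labels)
```

With the sample input, the pod alert on `k8s-node2` is suppressed while the node alert and the pod alert on `k8s-node3` remain.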
@@ -5,33 +5,33 @@
 ## Critical - Network & Auth (Tier: core)
 | Service | Description | Stack |
 |---------|-------------|-------|
-| wireguard | VPN server | platform |
-| technitium | DNS server (10.0.20.101) | platform |
-| headscale | Tailscale control server | platform |
-| traefik | Ingress controller (Helm) | platform |
+| wireguard | VPN server | wireguard |
+| technitium | DNS server (10.0.20.101) | technitium |
+| headscale | Tailscale control server | headscale |
+| traefik | Ingress controller (Helm) | traefik |
 | xray | Proxy/tunnel | platform |
-| authentik | Identity provider (SSO) | platform |
-| cloudflared | Cloudflare tunnel | platform |
-| authelia | Auth middleware | platform |
-| monitoring | Prometheus/Grafana/Loki stack | platform |
+| authentik | Identity provider (SSO) | authentik |
+| cloudflared | Cloudflare tunnel | cloudflared |
+| authelia | Auth middleware (may be merged into ebooks or removed) | platform |
+| monitoring | Prometheus/Grafana/Loki stack | monitoring |
 
 ## Storage & Security (Tier: cluster)
 | Service | Description | Stack |
 |---------|-------------|-------|
 | vaultwarden | Bitwarden-compatible password manager | platform |
-| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform |
+| redis | Shared Redis at `redis.redis.svc.cluster.local` | redis |
 | immich | Photo management (GPU) | immich |
-| nvidia | GPU device plugin | platform |
-| metrics-server | K8s metrics | platform |
-| uptime-kuma | Status monitoring | platform |
-| crowdsec | Security/WAF | platform |
-| kyverno | Policy engine | platform |
+| nvidia | GPU device plugin | nvidia |
+| metrics-server | K8s metrics | metrics-server |
+| uptime-kuma | Status monitoring | uptime-kuma |
+| crowdsec | Security/WAF | crowdsec |
+| kyverno | Policy engine | kyverno |
 
 ## Admin
 | Service | Description | Stack |
 |---------|-------------|-------|
-| k8s-dashboard | Kubernetes dashboard | platform |
-| reverse-proxy | Generic reverse proxy | platform |
+| k8s-dashboard | Kubernetes dashboard | k8s-dashboard |
+| reverse-proxy | Generic reverse proxy | reverse-proxy |
 
 ## Active Use
 | Service | Description | Stack |
@@ -43,12 +43,15 @@
 | dawarich | Location history | dawarich |
 | owntracks | Location tracking | owntracks |
 | nextcloud | File sync/share | nextcloud |
-| calibre | E-book management | calibre |
+| calibre | E-book management (may be merged into ebooks stack) | calibre |
 | onlyoffice | Document editing | onlyoffice |
 | f1-stream | F1 streaming | f1-stream |
 | rybbit | Analytics | rybbit |
 | isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
 | actualbudget | Budgeting (factory pattern) | actualbudget |
+| insta2spotify | Instagram reel song ID to Spotify playlist | insta2spotify |
+| trading-bot | Event-driven trading with sentiment analysis | trading-bot |
+| claude-memory | Persistent memory MCP server | claude-memory |
 
 ## Optional
 | Service | Description | Stack |
@@ -69,7 +72,7 @@
 | send | Firefox Send | send |
 | ytdlp | YouTube downloader | ytdlp |
 | wealthfolio | Finance tracking | wealthfolio |
-| audiobookshelf | Audiobook server | audiobookshelf |
+| audiobookshelf | Audiobook server (may be merged into ebooks stack) | audiobookshelf |
 | paperless-ngx | Document management | paperless-ngx |
 | jsoncrack | JSON visualizer | jsoncrack |
 | servarr | Media automation (Sonarr/Radarr/etc) | servarr |
@@ -103,6 +106,9 @@
 | grampsweb | Genealogy web app (Gramps Web) | grampsweb |
 | openclaw | AI agent gateway (OpenClaw) | openclaw |
 | poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain |
+| priority-pass | Boarding pass color transformer | priority-pass |
+| status-page | Status page | status-page |
+| plotting-book | Book plotting/world-building app | plotting-book |
 
 ## Cloudflare Domains
 