diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 9c873a07..0f93847e 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -24,21 +24,13 @@
 Violations cause state drift, which causes future applies to break or silently revert changes.
 ## Instructions
-- **"remember X"**: store to the remote claude-memory store via the **`homelab memory` CLI**: `homelab memory store "content" --category facts --tags "tag1,tag2"` (also `recall "query"` / `update ` / `list` / `delete `). For shared knowledge, also update the relevant CLAUDE.md / `AGENTS.md`. (Supersedes the old `memory-tool` CLI **and** the claude-memory MCP — both retired 2026-06-21; the homelab CLI hits the same remote HTTP API. Recall also runs automatically each turn via a UserPromptSubmit hook.)
-- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies, and `-lock-timeout` (default `5m`, override via `TG_LOCK_TIMEOUT`) on every state-locking verb (`plan`/`apply`/`destroy`/`refresh`) so a contended state lock **waits** instead of failing instantly with `Error acquiring the state lock`.
-- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build.
+- **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete `. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec.
+- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
+- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma)
 - **New service**: Use `setup-project` skill for full workflow
-- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
-  - `auth = "required"` — Authentik forward-auth gates every request. Use when the backend has **no built-in user auth** and Authentik is the only thing standing between strangers and the app (prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, any admin UI shipped without its own login).
-  - `auth = "app"` — the backend handles its own user authentication (NextAuth, Django, OAuth, bearer-token API, etc.); Authentik would only break it. No middleware attached; the app's own login is the gate. Examples: immich, linkwarden, tandoor, freshrss, affine, actualbudget, audiobookshelf, novelapp. **Functionally identical to `"none"`** — the distinct name exists to record intent at the call site.
-  - `auth = "public"` — Authentik anonymous binding via the dedicated `public` outpost (routes via `traefik-authentik-forward-auth-public` → `ak-outpost-public.authentik.svc:9000`). Strangers auto-bound to `guest`; logged-in users keep their identity in `X-authentik-username`. **Only works for top-level browser navigation** — CORS preflight rejects XHR/fetch and automation can't replay the cookie dance. Audit trail, not a gate.
-  - `auth = "none"` — no Authentik, no own-auth claim. Use for Anubis-fronted content (Anubis is the gate), native-client APIs (Git, `/v2/`, WebDAV/CalDAV, CardDAV), webhook receivers, OAuth callbacks, and Authentik outposts themselves.
-  - **Anti-exposure rule** (the reason `"app"` exists): only pick `"app"` or `"none"` AFTER you've verified the app has its own user auth (`"app"`) OR the endpoint is intentionally public (`"none"`). Default is `"required"` so accidental omission fails closed. **Convention**: when using `"app"` or `"none"`, add a comment line above the `auth = "..."` line stating what gates the app or why it's public. **Enforced by `scripts/tg`**: every `tg plan/apply/destroy/refresh` runs `scripts/check-ingress-auth-comments.py` against the current stack and aborts if any `auth = "app|none"` line lacks the preceding `# auth = "": ...` comment. Stack-scoped — untouched stacks aren't blocked until they're next edited.
-  - **Anti-AI**: on by default when `auth = "none"` or `auth = "app"` (no Authentik to discourage bots); redundant on `"required"` and `"public"`.
-  - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
-- **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://..svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW.
**Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
-- **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
-- **Image registry**: **Owned images now live on `ghcr.io/viktorbarzin/`** (ADR-0002, built by GHA — see the CI/CD Architecture section). The **Forgejo container registry is FROZEN + emptied** (break-glass only — `docs/runbooks/forgejo-registry-breakglass.md`); nothing pushes to it. The rest of this bullet documents the **still-live forgejo-pull DNS/mirror machinery** (it remains in place for the break-glass path + because `registry-credentials` is still Kyverno-synced; the hairpin lessons apply to any internal-registry pull). Historical usage was `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`.
**Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **In-cluster pods are ordinary internal clients too** (since 2026-06-10 evening) — CoreDNS's dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`) forwards to the Technitium ClusterIP `10.96.0.53`, so pods get the same split-horizon answers as everyone else; forgejo stays pinned to Traefik's **ClusterIP** in that block (TF-interpolated from the live Service) so CI pushes survive a Technitium outage. 
This relies on a k8s-1.34 behavior verified 2026-06-10: **pods CAN reach the ETP=Local Traefik LB IP** (kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path) — re-verify after major k8s upgrades; canary = the uptime-kuma `[External]` fleet going red. (The block briefly forwarded to `8.8.8.8/1.1.1.1` earlier that day, which kept pods on the WAN IP and the broken TP-Link NAT loopback — 27 non-proxied `[External]` monitors dark; beads code-yh33.) **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). 
Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
+- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
+- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
+- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected).
 - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
 - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`.
Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
 - **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
@@ -47,7 +39,7 @@ Violations cause state drift, which causes future applies to break or silently r
 ## Terraform State — Two-Tier Backend
 - **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable.
-- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. **Lock contention is non-fatal**: `scripts/tg` passes `-lock-timeout` (default `5m`) so a contended lock waits rather than hard-failing — this was the #1 cause of infra CI failures (a Woodpecker-killed run's unreaped PG lock, a concurrent local apply, or the daily drift `plan`; Tier-1 stacks have no Vault advisory-lock skip to fall back on, unlike Tier-0).
+- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema.
 - **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`).
 - **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent.
 - **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative.
@@ -56,17 +48,16 @@ Violations cause state drift, which causes future applies to break or silently r
 - **Tier 0 details**: Decrypt priority: Vault Transit (primary) → age key fallback. Encrypt: both Vault Transit + age recipients. Scripts: `scripts/state-sync {encrypt|decrypt|commit} [stack]`.
 - **Adding operator**: Generate age key (`age-keygen`), add pubkey to `.sops.yaml`, run `sops updatekeys` on Tier 0 `.enc` files. For Tier 1, only Vault access is needed.
 - **Migration script**: `scripts/migrate-state-to-pg` (one-shot, idempotent) migrates Tier 1 stacks from local to PG.
-- **Adopting existing resources**: use HCL `import {}` blocks (TF 1.5+), not `terraform import` CLI. Commit stanza → plan-to-zero → apply → delete stanza. Canonical reason: reviewable in PR, plan-safe, idempotent, tier-agnostic. Full rules + per-provider ID formats in `AGENTS.md` → "Adopting Existing Resources".
 ## Secrets Management — Vault KV
 - **Vault is the sole source of truth** for secrets.
 - **`secret/viktor`** — go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`.
 - **Auth**: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider.
 - **Vault stack self-reads**: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`.
-- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — chart **2.6.0 / app v2.6.0** (migrated 0.12.1→2.6.0 on 2026-06-22, one minor at a time; helm_release has `atomic=true`). **~104 ExternalSecrets across 73 files**, all on **API version `v1`** (migrated v1beta1→v1 on 2026-06-22 — there is NO v1beta1→v1 conversion webhook, so all CRs were rewritten to v1 on chart 0.16.2 before 0.17 removed v1beta1; see `docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md`). Two ClusterSecretStores: `vault-kv` and `vault-database`.
(2 pre-existing dead ESs — instagram-poster, payslip-ingest — fail "cannot find secret data" on missing Vault keys, unrelated.)
+- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
 - **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
 - **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
-- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: `) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr); matrix has since migrated to tuwunel (RocksDB, no Postgres) on 2026-06-08 and is no longer in the rotation list above. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
+- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
 - **K8s credentials**: Vault K8s secrets engine. Roles: `dashboard-admin`, `ci-deployer`, `openclaw`, `local-admin`. Use `vault write kubernetes/creds/ROLE kubernetes_namespace=NS`. Helper: `scripts/vault-kubeconfig`.
 - **CI/CD (GHA + Woodpecker)**: Docker builds run on **GitHub Actions** (free on public repos). Woodpecker is **deploy-only** — receives image tag via API POST, runs `kubectl set image`. Woodpecker authenticates via K8s SA JWT → Vault K8s auth. Sync CronJob pushes `secret/ci/global` → Woodpecker API every 6h. Shell scripts in HCL heredocs: escape `$` → `$$`, `%{}` → `%%{}`.
 - **Platform cannot depend on vault** (circular). Apply order: vault first, then platform. Platform has 48 vault refs, all in module inputs — no ESO migration possible.
@@ -78,123 +69,42 @@ Violations cause state drift, which causes future applies to break or silently r
 ## Resource Management Patterns
 - **CPU**: All CPU limits removed cluster-wide (CFS throttling). Only set CPU requests based on actual usage.
 - **Memory**: Set explicit `requests=limits` based on VPA upperBound. Target: upperBound x 1.2 for stable services, x 1.3 for GPU/volatile workloads.
-- **Right-sizing**: VPA/Goldilocks was **REMOVED 2026-06-12** (etcd-load-reduction — 349 VPAs all ran `updateMode=Off`, costing ~800 etcd objects + continuous recommender writes + a pod-creation admission webhook for dashboard-only value).
Right-size **on demand with `krr`** (Robusta, Dockerized from the devvm — no cluster install, no admission webhook, no eviction risk; reads Prometheus). Set container resources explicitly in TF from krr output.
+- **VPA (Goldilocks)**: Must be `Initial` mode (not `Auto`) — Auto conflicts with Terraform's declarative resource management.
 - **LimitRange**: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than defaults. Tier 3-edge and 4-aux now use Burstable QoS (request < limit) to reduce scheduler pressure.
 - **Democratic-CSI sidecars**: Must set explicit resources (32-80Mi) in Helm values — 17 sidecars default to 256Mi each via LimitRange. `csiProxy` is a TOP-LEVEL chart key, not nested under controller/node.
 - **ResourceQuota blocks rolling updates**: When quota is tight, scale to 0 then back to 1 instead of RollingUpdate. Or use Recreate strategy.
-- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Every `kubernetes_deployment`, `kubernetes_stateful_set`, and `kubernetes_cron_job_v1` MUST include `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 }` (use `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` for CronJobs). The `# KYVERNO_LIFECYCLE_V1` marker is the canonical discoverability tag — grep for it to locate every site. A shared Terraform module was considered but `ignore_changes` only accepts static attribute paths (not module outputs, locals, or expressions), so the snippet convention is the only viable path. Full rationale and copy-paste snippets in `AGENTS.md` → "Kyverno Drift Suppression".
+- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Add `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] }` to kubernetes_deployment resources to prevent perpetual TF plan drift.
 - **NVIDIA GPU operator resources**: dcgm-exporter and cuda-validator resources configurable via `dcgmExporter.resources` and `validator.resources` in nvidia values.yaml.
 - **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
-- **Quarterly right-sizing**: Run `krr` (Dockerized, against Prometheus) for recommendations; compare to current requests and adjust in TF. (Goldilocks dashboard removed 2026-06-12.)
+- **Quarterly right-sizing**: Check Goldilocks dashboard. Compare VPA upperBound to current request. Also check for under-provisioned (VPA upper > request x 0.8).

-## CI/CD Architecture — GHA Builds → ghcr + Woodpecker Deploy
+## CI/CD Architecture — GHA Builds + Woodpecker Deploy

-**Doctrine (ADR-0002, fleet-wide as of 2026-06-13): ALL image builds + CI
-compute run OFF-infra.** Every owned image is built/linted/tested on GitHub
-Actions (public repos: free; private: 2000 free min/mo) and pushed to
-`ghcr.io/viktorbarzin/`. **No in-cluster image builds or CI test runs
-exist anywhere** — the in-cluster Woodpecker buildkit and the fallback-build
-pattern were removed (clean cut). Woodpecker is **deploy-only** (plus infra
-applies + maintenance crons). Canonical CI/CD reference:
-`docs/architecture/ci-cd.md`; decision: `docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md`.
-**Watch what you trigger**: after a push that fires a build chain, follow it to
-completion (GHA run → Woodpecker deploy → `rollout status`) and fix failures;
-verify via live state, not the checkmark.
+**Flow**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image`

-**The fleet pattern (every owned app):** Forgejo `viktor/` (canonical)
-push-mirrors (`sync_on_commit`) → GitHub `ViktorBarzin/` → GHA
-`.github/workflows/build.yml` (committed on Forgejo, mirrors over): `on: push:
-branches:[master]` ONLY (feature branches mirror but build/deploy nothing — the
-safety valve).
The `build` job: lint/test → `svu` cuts the next `vX.Y.Z` tag to
-CANONICAL Forgejo (GHA secret `FORGEJO_GIT_TOKEN` = write:repository PAT) + bakes
-`VERSION` → `buildx` `linux/amd64` `provenance:false` (single-manifest, dodges
-the orphaned-index-children class) → push `ghcr.io/viktorbarzin/:` +
-`:latest` → `delete-package-versions` keep-10. The `deploy` job POSTs
-`ci.viktorbarzin.me/api/repos//pipelines` (the GitHub-mirror's Woodpecker
-registration, github-forge; GHA secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` +
-`IMAGE_NAME` → `.woodpecker/deploy.yml` (event:**manual** ONLY, so the raw
-Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set
-image deployment/ …` in-cluster (woodpecker-agent SA = cluster-admin, no
-kubeconfig). Deployment image is `ignore_changes`/KEEL_IGNORE_IMAGE so the SHA
-sticks vs `terragrunt apply`; CronJobs track `:latest` + `imagePullPolicy:
-Always`. **Keel stays enrolled** as a redundant net (sees the SHA already
-running → no-op). **Never** `set image`/`rollout restart` operator-managed
-StatefulSets (memory id=740). Onboarding tool: `scripts/offinfra-onboard` +
-`scripts/offinfra-templates/`; mirror + workflow commits via the Forgejo API over
-the internal Traefik LB (`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`).
-Reference impls: tripit (the original pilot), f1-stream, job-hunter, tuya_bridge.
+**Migrated to GHA** (10): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints
+**Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access)

-**Migrated apps (issues #13–#27):** f1-stream, job-hunter, tuya_bridge,
-beadboard, nextcloud-todos, claude-agent-service, **claude-memory-mcp** (GHA →
-ghcr, NOT DockerHub), kms-website, Freedify, instagram-poster, payslip-ingest,
-broker-sync (image `wealthfolio-sync`), fire-planner, recruiter-responder,
-x402-gateway — plus tripit. Earlier public-repo apps already on GHA (Website,
-apple-health-data, audiblez-web, plotting-book, insta2spotify,
-audiobook-search) now also land on ghcr.
-- **PUBLIC ghcr packages:** beadboard, nextcloud-todos, claude-agent-service,
-  claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway,
-  chrome-service-novnc, android-emulator.
-- **PRIVATE ghcr:** f1-stream, job-hunter, instagram-poster, payslip-ingest,
-  wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli,
-  infra-ci, k8s-portal. Pulled via the Kyverno-synced `ghcr-credentials` allowlist
-  (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; NOT cluster-wide; cred
-  = Vault `secret/viktor/ghcr_pull_token`, a dedicated classic PAT scoped to
-  `read:packages` (UI-minted 2026-06-15; no longer the admin `github_pat`
-  alias). GitHub has no token-mint API, so rotation is manual: re-mint →
-  `vault kv patch secret/viktor ghcr_pull_token=…` → targeted apply
-  `module.kyverno.kubernetes_secret.ghcr_credentials` (reads Vault, dodges the
-  git-crypt tls-secret-sync landmine), Kyverno re-syncs the allowlist).
+**Per-project files**:
+- `.github/workflows/build-and-deploy.yml` — GHA: checkout, build, push DockerHub, POST Woodpecker API
+- `.woodpecker/deploy.yml` — Woodpecker: `kubectl set image` + Slack notify (event: `[manual, push]`)
+- `.woodpecker/build-fallback.yml` — Old full build pipeline preserved (event: `deployment` — never auto-fires)

-**Infra-owned images (issues #29/#30)** build on GHA workflows IN the infra
-repo's own `.github/workflows/` (added to the GitHub lineage via PR; the
-github↔forgejo divergence was deliberately NOT reconciled):
-`build-chrome-service-novnc.yml` + `build-android-emulator.yml` → public ghcr;
-`build-cli.yml` → DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli`;
-`build-infra-ci.yml` → `ghcr.io/viktorbarzin/infra-ci`; `build-k8s-portal.yml` →
-PRIVATE `ghcr.io/viktorbarzin/k8s-portal` (Keel-deployed; the LAST in-cluster
-Woodpecker build, migrated 2026-06-13 — completes "no local builds"). **infra-ci**
-is the image the `.woodpecker/default.yml` apply step + `drift-detection.yml` run
-in (proven by pipelines 165/166). chatterbox-tts is already built by tripit's GHA → ghcr.
-The Woodpecker `build-ci-image.yml` + `build-cli.yml` pipelines were REMOVED;
-infra-ci break-glass is a manual `.woodpecker/breakglass-infra-ci.yml` (ghcr
-pull-and-save to the registry VM).
-
-**Forgejo container registry: FROZEN + emptied** (issue #32 wiped all `viktor/*`
-container packages). Break-glass-only now; nothing pushes. `forgejo-cleanup`
-stays DRY_RUN. Pull-through caches on `10.0.20.10` are unchanged. Runbook:
-`docs/runbooks/forgejo-registry-breakglass.md`.
-
-**Woodpecker now runs only:** per-app `deploy.yml` (manual, `kubectl set
-image`), `default.yml` (terragrunt apply), `renew-tls.yml` (certbot),
-maintenance crons (drift-detection, provision-user, registry-config-sync,
-pve-nfs-exports-sync, issue-automation, postmortem-todos), and the
-manual `breakglass-infra-ci.yml`.
**No build/test pipeline on any repo — do not
-(re)introduce one.** (`.woodpecker/k8s-portal.yml`, the last in-cluster image
-build, was removed 2026-06-13 — k8s-portal now builds on GHA → ghcr, see
-Infra-owned images above.)
-
-**Decommissioned (issue #31):** travel_blog (stack destroyed + dir removed), 6
-dead builders' pipelines (terminal-lobby, webhook-handler, hmrc-sync,
-trading-bot, travel-agent, trip-planner), and all `build-fallback.yml` files
-(only Website had one).
-
-**Woodpecker API**: numeric repo IDs (`/api/repos//pipelines`), NOT
-owner/name (those return HTML). The deploy registration for each app is the
-**GitHub mirror** repo (github-forge). Infra: Forgejo forge = repo 82, legacy
-GitHub forge = repo 1.
+**Woodpecker API**: Uses **numeric repo IDs** (`/api/repos/2/pipelines`), NOT owner/name paths (those return HTML).
+Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, f1-stream=10, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79, council-complaints=TBD
 **Woodpecker YAML gotchas**:
 - Commands with `${VAR}:${VAR}` must be **quoted** — unquoted `:` triggers YAML map parsing when vars are empty
 - Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility issues)
 - Global secrets must have `manual` in their events list for API-triggered pipelines
-**GitHub repo secrets** (per repo): `WOODPECKER_TOKEN` (POST deploy pipeline),
-`FORGEJO_GIT_TOKEN` (write:repository PAT for the svu tag push). ghcr push uses
-the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
+**GitHub repo secrets** (set on all repos): `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`, `WOODPECKER_TOKEN`
+
+**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build), `provision-user.yml` — all stay on Woodpecker.
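Reviewer note: the numeric-repo-ID rule in the **Woodpecker API** lines on the added side can be sketched as a small request builder. This is an illustration only, not the workflow's actual code. The ID table is copied from the "Repo IDs" line (council-complaints is omitted because it is still TBD there); the JSON body shape (`branch` plus `variables`), the `IMAGE_TAG` variable name, the `master` default, the bearer-token header, and the helper name `build_deploy_request` are assumptions to check against the Woodpecker API reference. Nothing is sent over the network.

```python
import json
from urllib.request import Request

# Numeric repo IDs as listed on the added side of this diff.
REPO_IDS = {
    "infra": 1, "Website": 2, "finance": 3, "health": 4, "travel_blog": 5,
    "webhook-handler": 6, "audiblez-web": 9, "f1-stream": 10,
    "plotting-book": 43, "claude-memory-mcp": 78, "infra-onboarding": 79,
}


def build_deploy_request(base_url, repo, image_tag, token, branch="master"):
    """Build (but do not send) the POST that triggers a deploy pipeline.

    Hypothetical helper: body fields and variable name are assumptions.
    """
    repo_id = REPO_IDS[repo]  # KeyError for unknown names, e.g. "owner/name"
    body = {"branch": branch, "variables": {"IMAGE_TAG": image_tag}}
    return Request(
        url=f"{base_url}/api/repos/{repo_id}/pipelines",  # numeric ID in the path
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Looking the ID up in a table, rather than formatting an owner/name string into the path, makes the "returns HTML" failure mode surface as a `KeyError` before any request is made.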
 ## Database Host
-**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service is a live compatibility alias (selector `cnpg.io/instanceRole=primary`, so it also reaches the primary — authentik's PgBouncer still points at it) — but use `pg-cluster-rw` for anything new. This variable is shared by ~12 stacks.
+**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service has no endpoints — never use it. This variable is shared by ~12 stacks.
 **CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=512MB`, `work_mem=16MB`, `wal_compression=on`, `effective_cache_size=1536MB`, pod memory 2Gi.
@@ -202,49 +112,31 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
 - **PDBs**: minAvailable=2 on Traefik and Authentik.
 - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
-- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
-- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
+- **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
+- **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
-- **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
-- **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
-- **Traefik LB IP = `10.0.20.203`, `externalTrafficPolicy: Local`** (dedicated, NOT the shared `.200`). Moved off the shared `.200` on 2026-05-30 so direct/non-proxied apps preserve the **real client IP for CrowdSec** (ETP=Cluster SNAT'd them to the node IP) and so QUIC works. **The shared `10.0.20.200` keeps the other 10 LB services** (PG state-backend `postgresql-lb`, headscale, wireguard, coturn, xray, etc. — all ETP=Cluster; MetalLB forbids mixed ETP on a shared IP, hence Traefik's own IP). **cloudflared targets the in-cluster Traefik Service** (`https://traefik.traefik.svc.cluster.local:443`, remote/dashboard tunnel config — edit via CF Global API Key in `secret/platform`), so proxied apps are decoupled from the LB IP. pfSense WAN 443 (tcp+udp) NAT → alias `traefik_lb` (`.203`). Internal split-horizon apex `viktorbarzin.me A` → `.203`. Full runbook + post-mortem: `docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*`.
-- **IPv6 ingress** = HE 6in4 tunnel (`2001:470:6e:43d::2`) → **standalone HAProxy on pfSense** (`/usr/local/etc/ipv6-haproxy.cfg`, NOT the HAProxy package) using `send-proxy-v2` → Traefik `.203` (web 443/80) + mail NodePorts `30125-30128` (25/465/587/993) — so **real IPv6 client IPs reach CrowdSec**. Traefik trusts PROXY-v2 **only from `10.0.20.1`** (`entryPoints.web/websecure.proxyProtocol.trustedIPs`); real IPv4 clients (own source IP) unaffected. **No QUIC over IPv6** (bridge is TCP/h2). Replaced socat 2026-05-30 (socat masked every v6 client as `10.0.20.1`). Boot/persistence: config.xml `` → `ipv6_proxy.sh` (patches nginx off `[::]:443/:80` to free the tunnel IPv6, then `service ipv6proxy onestart`); `rc.d/ipv6proxy` manages HAProxy. Backends use **no health `check`** (a plain TCP check false-DOWNs the PROXY-expecting listeners). As-built: `docs/architecture/networking.md` → "IPv6 Ingress".
-- **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (hourly) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
+- **HTTP/3 (QUIC)**: Enabled cluster-wide via Traefik.
+- **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (5min) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
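The entrypoint timeout note in the list above says a finite `writeTimeout` is a hard total-duration cap rather than a per-read idle timeout. A rough way to reason about that is to compute how many bytes a single response can deliver before the cap hits. The sketch below assumes a constant throughput, which real transfers will not have, and treats `0` as "no limit" as the notes describe. The function names and the example throughput are invented for illustration.

```python
from typing import Optional


def max_transfer_bytes(write_timeout_s: float, throughput_bytes_per_s: float) -> Optional[int]:
    """Upper bound on bytes one response can deliver before the cap hits.

    Assumes the timeout is a wall-clock limit on the whole response and
    that a value of 0 disables the limit (both taken from the notes above).
    """
    if write_timeout_s < 0 or throughput_bytes_per_s < 0:
        raise ValueError("timeout and throughput must be non-negative")
    if write_timeout_s == 0:
        return None  # unlimited: the transfer may run as long as it needs
    return int(write_timeout_s * throughput_bytes_per_s)


def is_truncated(file_size_bytes: int, write_timeout_s: float, throughput_bytes_per_s: float) -> bool:
    """True when a file of this size cannot finish inside the cap."""
    limit = max_transfer_bytes(write_timeout_s, throughput_bytes_per_s)
    return limit is not None and file_size_bytes > limit
```

Under these assumptions, a 60 s cap at a hypothetical 2.5 MB/s allows at most 150 MB per response, so any larger video is cut off regardless of how healthy the connection is. With the cap set to 0 the same file is never truncated by this setting.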
## Service-Specific Notes | Service | Key Operational Knowledge | |---------|--------------------------| | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe | -| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). 
The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. 
To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | -| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. 
**Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` | +| Immich | ML on SSD, disable ModSecurity (breaks streaming), CUDA for ML, frequent upgrades | +| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob | | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU | -| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. | +| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding | | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version | -| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). 
`skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. | -| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). | +| MySQL Standalone | Raw `kubernetes_stateful_set_v1` with `mysql:8.4` (migrated from InnoDB Cluster 2026-04-16). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (15Gi, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Old InnoDB Cluster + operator still in TF (Phase 4 cleanup pending). Bitnami charts deprecated (Broadcom Aug 2025) — use official images. | +| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (5min) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). | ## Monitoring & Alerting -- **Alert-on-change routing** (alert-noise-reduction 2026-06-12, `route` block in `prometheus_chart_values.tpl`): warning/info notify ONCE then stay quiet while firing (`repeat_interval: 8760h` ≈ off); criticals re-ping every 6h (was 1h); `send_resolved` on. Standing state is reviewed via the daily digest, not re-pings. -- **Daily alert digest**: CronJob `alert-digest` (monitoring ns, `alert_digest.tf` + `alert_digest.py`) posts the full current board grouped by severity + resolved-in-24h to `#alerts` at 08:00 Europe/London. 
Stock `python:3.12-alpine`, pure-stdlib (no pip/apk at runtime — avoids the status-page-pusher disk anti-pattern, id=559); reads Alertmanager v2 + Prometheus; reuses the Alertmanager Slack webhook via the `alert-digest` Secret. Safety net for alert-on-change. -- **Cascade inhibitions** (`inhibit_rules`): `NodeDown` AND `NodeConditionBad`/`NodeDiskPressure` suppress downstream pod-churn alerts (PodCrashLooping/PodImagePullBackOff/PodsStuckContainerCreating/ScrapeTargetDown/*ReplicasMismatch); `T3ProbeLegDown` suppresses `T3ProbeDropBurst` for the same `leg`; plus existing NFS/Traefik/Authentik/Power/Tuya/iDRAC cascades. No `equal` on the node rules (pod alerts carry no `node` label → cluster-wide, like NodeDown). -- **ScrapeTargetDown scrapes only Ready endpoints** (relabel `keep __meta_kubernetes_endpoint_ready=true` on both `kubernetes-service-endpoints` jobs) — completed CronJob pods lingering as NotReady EndpointSlice addresses no longer fire phantom "down" alerts (tts/tripit/beads, id=4895). Replaces the old "exclude completed CronJob pods" guidance; a Ready pod with a broken metrics endpoint still fires. -- Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable. -- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" 
questions. +- Alert cascade inhibitions: if node is down, suppress pod alerts on that node. +- Exclude completed CronJob pods from "pod not ready" alerts. +- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). - **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param). -- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction). -- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. 
Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay). -- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI ''` must NOT show a Location to `authentik.viktorbarzin.me` before adding. - -## Security Posture (Wave 1 — locked 2026-05-18) - -Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`. - -- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming. -- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. 
**One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.) -- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging. -- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred. -- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. -- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2). +- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence. +- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Mailgun API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). 
Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Mailserver on dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` for CrowdSec real-IP detection. Vault: `mailgun_api_key` in `secret/viktor` (probe), `brevo_api_key` in `secret/viktor` (relay). ## Storage & Backup Architecture @@ -252,20 +144,19 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se Choose storage class based on workload type: -| Use **proxmox-lvm-encrypted** when | Use **proxmox-lvm** when | Use **NFS** (`nfs_volume` module) when | -|------------------------------------|--------------------------|----------------------------------------| -| **Any service storing sensitive data** | Non-sensitive app state (configs, caches) | Shared data across multiple pods (RWX) | -| Databases (user data, credentials) | Media indexes, search caches | Media libraries (music, ebooks, photos) | -| Auth/identity services | Monitoring data (Prometheus) | Backup destinations (cloud sync picks up from NFS) | -| Password managers, email, git repos | Tools with no user secrets | Large datasets (>10Gi) where snapshots matter | -| Health/financial data | | Data you want to browse/inspect from outside k8s | +| Use **proxmox-lvm-encrypted** when | Use **proxmox-lvm** when | Use **NFS** (`nfs_volume` module) when | Use **nfs-proxmox** SC when | +|------------------------------------|--------------------------|----------------------------------------|-----------------------------| +| **Any service storing sensitive data** | Non-sensitive app state (configs, caches) | Shared data across multiple pods (RWX) | Dynamic provisioning on Proxmox host NFS | +| Databases (user data, credentials) | Media indexes, search caches | Media libraries (music, ebooks, photos) | Vault (dynamic PVC creation) | +| Auth/identity services | Monitoring data (Prometheus) | Backup destinations (cloud sync picks up from NFS) | | +| Password managers, email, git 
repos | Tools with no user secrets | Large datasets (>10Gi) where snapshots matter | | +| Health/financial data | | Data you want to browse/inspect from outside k8s | | **Default for sensitive data is proxmox-lvm-encrypted.** Use plain `proxmox-lvm` only for non-sensitive workloads. Use NFS when you need RWX, backup pipeline integration, or it's a large shared media library. -**NFS server:** -- **Proxmox host** (192.168.1.127): Sole NFS for all workloads. HDD at `/srv/nfs` (ext4 thin LV `pve/nfs-data`, 3 TB). SSD at `/srv/nfs-ssd` (ext4 LV `ssd/nfs-ssd-data`, 100GB). Exports use `async,insecure` options (`async` — safe with UPS + Vault Raft replication + databases on block storage; `insecure` — pfSense NATs source ports >1024 between VLANs). -- **Nextcloud as NFS browser**: Nextcloud (`nextcloud.viktorbarzin.me`) mounts the PVE NFS roots (`/srv/nfs`, `/srv/nfs-ssd`) inside the NC pod at `/mnt/pve-nfs` + `/mnt/pve-nfs-ssd`. Surfaced to users via two ACL patterns: (1) admin-only root browsers `PVE NFS Pool` + `PVE NFS-SSD Pool` (scoped to NC group `admin`); (2) per-archive mounts (e.g. `/anca-elements`) with `applicable_users` set to the owners. ACL is at the mount level via `occ files_external:applicable` — Files Access Control is NOT used (NC 30/31's workflow engine lacks FilePath / UserId checks). Manifest lives in `kubernetes_config_map_v1.nextcloud_external_storage_manifest` (`stacks/nextcloud/external_storage.tf`); a one-shot K8s Job applies it idempotently. -- **`nfs-truenas` StorageClass**: Historical name retained only because SC names are immutable on PVs (48 bound PVs reference it — renaming would require mass PV churn, not worth it). Now points to the Proxmox host (`nfs.csi.k8s.io` dynamic provisioning on `192.168.1.127:/srv/nfs`). TrueNAS (VM 9000, 10.0.10.15) operationally decommissioned 2026-04-13; VM still exists in stopped state on PVE pending user decision on deletion. 
+**NFS servers:** +- **Proxmox host** (192.168.1.127): Primary NFS for all workloads. HDD at `/srv/nfs` (ext4 thin LV `pve/nfs-data`, 1TB). SSD at `/srv/nfs-ssd` (ext4 LV `ssd/nfs-ssd-data`, 100GB). Exports use `async,insecure` options (`async` — safe with UPS + Vault Raft replication + databases on block storage; `insecure` — pfSense NATs source ports >1024 between VLANs). +- **TrueNAS** (10.0.10.15): **Immich only** (8 PVCs). `nfs-truenas` StorageClass retained exclusively for Immich. **Migration note**: CSI PV `volumeAttributes` are immutable — cannot update NFS server in place. New PV/PVC pairs required (convention: append `-host` to PV name). @@ -284,7 +175,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" { name = "-data-proxmox" namespace = kubernetes_namespace..metadata[0].name annotations = { - "resize.topolvm.io/threshold" = "10%" + "resize.topolvm.io/threshold" = "80%" "resize.topolvm.io/increase" = "100%" "resize.topolvm.io/storage_limit" = "5Gi" } @@ -296,20 +187,11 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" { requests = { storage = "1Gi" } } } - lifecycle { - # pvc-autoresizer expands this PVC up to storage_limit; ignore drift on - # requests.storage so the next TF apply doesn't try to shrink it back - # (K8s rejects shrinks → apply fails). To bump the floor manually: - # temporarily remove this block, apply the new size, re-add the block, - # apply again. 
- ignore_changes = [spec[0].resources[0].requests] - } } ``` - `wait_until_bound = false` is **required** (WaitForFirstConsumer binding) - Deployment strategy **must be Recreate** (RWO volumes) - Autoresizer annotations are **required** on all proxmox-lvm PVCs -- `lifecycle.ignore_changes` on `requests` is **required** to coexist with the autoresizer - Every proxmox-lvm app **MUST** add a backup CronJob writing to NFS `/mnt/main/-backup/` **proxmox-lvm-encrypted PVC template** (Terraform) — use for all sensitive data: @@ -320,7 +202,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" { name = "-data-encrypted" namespace = kubernetes_namespace..metadata[0].name annotations = { - "resize.topolvm.io/threshold" = "10%" + "resize.topolvm.io/threshold" = "80%" "resize.topolvm.io/increase" = "100%" "resize.topolvm.io/storage_limit" = "5Gi" } @@ -332,13 +214,9 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" { requests = { storage = "1Gi" } } } - lifecycle { - # See data_proxmox above — required for autoresizer coexistence. 
- ignore_changes = [spec[0].resources[0].requests] - } } ``` -- Same rules as `proxmox-lvm` (wait_until_bound, Recreate strategy, autoresizer, backup CronJob, `lifecycle.ignore_changes`) +- Same rules as `proxmox-lvm` (wait_until_bound, Recreate strategy, autoresizer, backup CronJob) - Uses LUKS2 encryption with Argon2id key derivation via Proxmox CSI plugin - Encryption passphrase stored in Vault KV (`secret/viktor/proxmox_csi_encryption_passphrase`), synced to K8s Secret `proxmox-csi-encryption` in `kube-system` via ExternalSecret - Backup key at `/root/.luks-backup-key` on PVE host (chmod 600) @@ -350,17 +228,15 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" { **Copy 2**: sda backup disk (`/mnt/backup`, 1.1TB ext4, VG `backup`) **Copy 3**: Synology NAS offsite (two-tier: sda + NFS) -**PVE host scripts** (source: `infra/scripts/`; deployed manually via `scp` to `/usr/local/bin/` — strip the `.sh`): -- `/usr/local/bin/nfs-mirror` — Daily 02:00. `rsync --delete /srv/nfs// → /mnt/backup//` (sda leg 1), appends transferred paths to `/mnt/backup/.changed-files` for offsite Step 1. **EXCLUDES**: immich (too big — direct leg), frigate/temp (no backup), anca-elements (in Immich), and **(2026-06-01) ollama, prometheus-backup, audiblez, ebook2audiobook** — regenerable, live-only on sdc, kept off the space-constrained offsite. Does NOT mirror `/srv/nfs-ssd`. -- `/usr/local/bin/daily-backup` — Daily 05:00. Mounts LVM thin snapshots ro → rsyncs FILES to `/mnt/backup/pvc-data////` with `--link-dest` versioning (4 weeks). Auto SQLite backup (magic number check, `?mode=ro`). Also backs up pfSense (config.xml + tar), PVE config. Prunes snapshots >7d. **Skip-list (2026-06-01)**: `nextcloud/nextcloud-data-proxmox` (orphaned pre-encryption PV). -- `/usr/local/bin/offsite-sync-backup` — Daily 06:00 (After=daily-backup). Step 1: sda → Synology `pve-backup/` (incremental via manifest; monthly full `rsync --delete` days 1–7). 
Step 2: NFS direct → Synology — **immich-only on BOTH `nfs/` and `nfs-ssd/` (2026-06-01)**; ollama/llamacpp on the SSD no longer ship offsite. +**PVE host scripts** (source: `infra/scripts/`): +- `/usr/local/bin/daily-backup` — Daily 05:00. Mounts LVM thin snapshots ro → rsyncs FILES to `/mnt/backup/pvc-data////` with `--link-dest` versioning (4 weeks). Auto SQLite backup (magic number check, `?mode=ro`). Auto-discovered BACKUP_DIRS (glob, not hardcoded). Also backs up pfSense (config.xml + tar), PVE config. Prunes snapshots >7d. +- `/usr/local/bin/offsite-sync-backup` — Daily 06:00 (After=daily-backup). Step 1: sda → Synology `pve-backup/` (PVC snapshots, pfSense, PVE config). Step 2: NFS → Synology `nfs/` + `nfs-ssd/` via inotify change-tracked `rsync --files-from`. Monthly full `rsync --delete` on 1st Sunday. - `/usr/local/bin/lvm-pvc-snapshot` — Daily 03:00. Thin snapshots of all PVCs except dbaas+monitoring. 7-day retention. Instant restore: `lvm-pvc-snapshot restore `. -- `/usr/local/bin/vzdump-vms` — Daily 01:00. Live `vzdump --mode snapshot` of hand-managed VMs (the ones NOT in Terraform) → `/mnt/backup/vzdump/`, keep 3 per VMID. `VZDUMP_VMIDS` default `102` (devvm) — **the only VM imaged today** (its per-user home dirs + local-only git repos, incl. the no-remote monorepo root, are otherwise irreplaceable). devvm has the guest agent (`agent: 1`) so dumps are fs-consistent. Deliberately NOT in the incremental offsite manifest (would balloon Synology); the monthly offsite full pass (days 1-7) mirrors `/mnt/backup/vzdump/`. Pushgateway job `vzdump-backup`. Added 2026-06-09 (closed the silent "VMs never imaged" DR gap). Restore: `qmrestore /mnt/backup/vzdump/vzdump-qemu--.vma.zst `. - `nfs-change-tracker.service` — Continuous inotifywait on `/srv/nfs` + `/srv/nfs-ssd`. Logs changed file paths to `/mnt/backup/.nfs-changes.log`. Consumed by offsite-sync-backup for incremental rsync (completes in seconds instead of 30+ minutes). 
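The change-tracked incremental sync described above (a watcher logs changed paths, and the offsite job feeds them to `rsync --files-from`) can be sketched as a small transformation from log lines to a de-duplicated list of paths relative to one NFS root. The log format is an assumption here: one absolute path per line. The real `/mnt/backup/.nfs-changes.log` may carry timestamps or event names that would need stripping first. The function name `files_from_entries` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def files_from_entries(log_lines: Iterable[str], root: str) -> List[str]:
    """Turn a change log into sorted, unique paths relative to `root`.

    Assumed input: one absolute path per line. Paths outside `root`
    (for example entries for a sibling export) are skipped.
    """
    root_path = PurePosixPath(root)
    seen = set()
    for line in log_lines:
        raw = line.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            rel = PurePosixPath(raw).relative_to(root_path)
        except ValueError:
            continue  # not under this root
        if str(rel) != ".":  # skip the root directory itself
            seen.add(str(rel))
    return sorted(seen)
```

The relative form matters because `rsync --files-from` interprets each entry relative to the source directory. Comparing whole path components, rather than string prefixes, keeps `/srv/nfs-ssd/...` entries from being mistaken for children of `/srv/nfs`.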
**Synology layout** (`192.168.1.13:/volume1/Backup/Viki/`): - `pve-backup/` — PVC file backups (`pvc-data/`), SQLite backups (`sqlite-backup/`), pfSense, PVE config (synced from sda) -- `nfs/` — mirrors `/srv/nfs` on Proxmox (inotify change-tracked rsync) +- `nfs/` — mirrors `/srv/nfs` on Proxmox (inotify change-tracked rsync, renamed from `truenas/`) - `nfs-ssd/` — mirrors `/srv/nfs-ssd` on Proxmox (inotify change-tracked rsync) **App-level CronJobs** (write to Proxmox host NFS, synced to Synology via inotify): @@ -378,9 +254,8 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" { ## Known Issues - **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation. -- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`. **`mcp.servers` baked into the ConfigMap-loaded openclaw.json gets stripped by `doctor --fix`** — register MCP servers via `openclaw mcp set ` in the container startup command instead (CLI-written entries persist across doctor runs). Current servers wired this way: `ha`, `context7`, `playwright` (sidecar at `localhost:3000/mcp`). -- **OpenClaw memory-core indexes `/workspace/memory/`, not `/home/node/.openclaw/memory/`**: `/home/node/.openclaw/memory/main.sqlite` is the index store, NOT a content source. Files written under `/home/node/.openclaw/memory/projects//*.md` will NOT be indexed. To populate memory-core, write Markdown under `/workspace/memory/projects//` and run `openclaw memory index --force`. 
This is what the daily `memory-sync` CronJob in `stacks/openclaw/` does for claude-memory → OpenClaw sync. -- **(Obsolete 2026-06-12) Goldilocks VPA**: VPA/Goldilocks was uninstalled (etcd-load-reduction); the old "Goldilocks may have added a limit that blocks the change" gotcha no longer applies. Use `krr` for right-sizing. +- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`. +- **Goldilocks VPA sets limits**: When increasing memory requests, always set explicit `limits` too — Goldilocks may have added a limit that blocks the change. ## User Preferences - **Calendar**: Nextcloud at `nextcloud.viktorbarzin.me` diff --git a/.claude/agents/k8s-version-upgrade.deprecated.md b/.claude/agents/k8s-version-upgrade.deprecated.md deleted file mode 100644 index fd0f774b..00000000 --- a/.claude/agents/k8s-version-upgrade.deprecated.md +++ /dev/null @@ -1,543 +0,0 @@ ---- -name: k8s-version-upgrade-DEPRECATED -description: "DEPRECATED 2026-05-11 — replaced by the Job-chain in stacks/k8s-version-upgrade. See header below." -tools: Read, Write, Edit, Bash, Grep, Glob -model: opus ---- - -# DEPRECATED — Do NOT invoke this agent - -Retired **2026-05-11** after a self-preemption incident: this agent ran inside -the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was -scheduled onto k8s-node4. When the agent tried to `kubectl drain k8s-node4` -(Stage 6, first worker), it evicted itself. The bash process died mid-SSH, -leaving node4 cordoned and the cluster half-upgraded (master at v1.34.7, -workers at v1.34.2). - -## Replaced by - -A chain of small Kubernetes Jobs, each pinned (via `nodeSelector` + -`kubernetes.io/hostname`) to a node that is NOT its drain target. 
No pod can -preempt itself because each Job's pod and its target node are always -different. - -| Old | New | -|-----|-----| -| Single agent run in claude-agent-service pod | Chain of 7 phase Jobs (preflight → master → worker × 4 → postflight) | -| Whole pipeline in one prompt | Phase body in `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`, dispatched per-phase via `case $PHASE` | -| Detection CronJob POSTs to `claude-agent-service` | Detection CronJob renders Job 0 from `job-template.yaml` via `envsubst` + `kubectl apply` | -| Drain blocks indefinitely on PDB=0 (e.g. single-replica Anubis) | New `predrain_unstick` deletes PDB-blocked pods so drain proceeds | -| `K8sVersionSkew` + `EtcdPreUpgradeSnapshotMissing` alerts | Above + `K8sUpgradeStalled` (in_flight=1 and time()-started_timestamp > 5400s) | - -## Where the logic lives now - -- **`infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`** — universal - phase body. Dispatches on `$PHASE`. Each phase spawns the next Job. -- **`infra/stacks/k8s-version-upgrade/job-template.yaml`** — Job template - rendered by `envsubst` at runtime. ConfigMap-mounted at `/template` in - every Job pod. -- **`infra/stacks/k8s-version-upgrade/main.tf`** — Terraform stack: ConfigMaps, - unified `k8s-upgrade-job` ServiceAccount + RBAC, detection CronJob. -- **`infra/docs/runbooks/k8s-version-upgrade.md`** — operator runbook (kill a - stuck Job, skip a phase, manually re-trigger from a specific phase). - -## Why kept (not deleted) - -Documents the prompted-agent design and is useful as historical reference when -reading post-mortem discussions or comparing approaches. The `name` field has -been suffixed with `-DEPRECATED` so the agent cannot be invoked by name from -`claude-agent-service`. - ---- - -# Original prompt — DO NOT EXECUTE (reference only) - -You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA). 
- -## Your Job - -Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked. - -The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die. - -## Inputs - -The user prompt contains a JSON object with these fields: - -```json -{ - "target_version": "1.34.5", - "kind": "patch", - "dry_run": false, - "stages": "all" -} -``` - -| Field | Required | Description | -|---|---|---| -| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. | -| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). | -| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. | -| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. | - -Parse the prompt's first JSON block to extract these. If anything is missing, abort with a Slack notification ("malformed payload"). 
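The "abort if anything is missing" rule above can be sketched as a small validation step. This is a minimal sketch under stated assumptions: the four fields have already been extracted from the JSON block into shell variables (for example with `jq`), and the function name `validate_payload` and its return-code convention are hypothetical, not part of the original prompt.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the original prompt): validate the payload
# fields after they have been extracted from the JSON block.
# Usage: validate_payload <target_version> <kind> [dry_run] [stages]
# Returns 0 when the payload is well-formed, 1 otherwise.
validate_payload() {
  local target_version="${1:-}" kind="${2:-}"
  local dry_run="${3:-false}" stages="${4:-all}"

  # target_version is required and must be an exact X.Y.Z
  [[ "$target_version" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]] || return 1

  # kind is required and must be "patch" or "minor"
  case "$kind" in
    patch|minor) ;;
    *) return 1 ;;
  esac

  # dry_run is optional (default false) but must be a boolean literal
  case "$dry_run" in
    true|false) ;;
    *) return 1 ;;
  esac

  # stages is "all" or a comma-separated subset of the known stage names
  [ "$stages" = "all" ] && return 0
  local s
  local IFS=','
  for s in $stages; do
    case "$s" in
      preflight|snapshot|containerd|repo|master|workers|postflight) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

# Possible call site, reusing the slack helper defined later in the prompt:
#   validate_payload "$target_version" "$kind" "$dry_run" "$stages" \
#     || { slack "ABORT: malformed payload"; exit 1; }
```

Validating before any SSH or kubectl call keeps a malformed payload from reaching a mutating stage; the stage names checked here are the seven listed in the `stages` row of the table.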
- -## Environment - -- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var) -- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call) -- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth) -- **Etcd snapshot**: triggered as a one-shot Job from the existing `default/backup-etcd` CronJob (defined in `stacks/infra-maintenance/`). The Job runs on `k8s-master` with hostNetwork (so etcdctl reaches etcd at 127.0.0.1:2379), mounts the PV-backed NFS export `192.168.1.127:/srv/nfs/etcd-backup`, and writes `etcd-snapshot-.db` there. Do NOT shell into master with etcdctl directly — the cert paths + NFS mount are already wired into the CronJob. -- **Library script**: `/workspace/infra/scripts/update_k8s.sh` — pipe via SSH to each node, do NOT modify on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh --role --release `. - -### Credentials — fetched at startup - -The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run: - -```bash -KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config" - -# SSH private key — mode 0400 required by openssh -$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \ - -o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key -chmod 400 /tmp/k8s-upgrade-ssh-key - -# Slack webhook (URL string) -SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \ - -o jsonpath='{.data.slack_webhook}' | base64 -d) -``` - -The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template: - -```bash -SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts" -``` - -Every SSH call below uses `$SSH wizard@ ''`. 
`accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry. - -## NEVER do - -- Never bypass the halt-on-alert check — even if a single alert "looks unrelated" -- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed -- Never skip the etcd snapshot — even for patch -- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` only -- Never `apt-mark hold` something without unholding it first, and vice versa — the script handles this; don't do it manually -- Never run two stages in parallel — sequential only -- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing -- Never push to git, never modify Terraform, never invoke claude-agent-service recursively - -## Slack + Pushgateway helpers - -Every transition posts to Slack: - -```bash -slack() { - local msg="$1" - local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}" - curl -sS -X POST -H 'Content-Type: application/json' \ - --data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \ - "$hook" -} -``` - -Start every message with `[k8s-upgrade]` so it's grep-able. 
- -Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics: - -```bash -PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade' - -push_metric() { - # push_metric - local name="$1" val="$2" - printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \ - | curl -sS --data-binary @- "$PG" -} -``` - -Pushes you must make at specific stages (skipped in dry_run): -| When | Metric | Value | -|---|---|---| -| Stage 0 start | `k8s_upgrade_in_flight` | `1` | -| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` | -| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` | -| Stage 7 clean | `k8s_upgrade_in_flight` | `0` | -| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` | - -If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state. - -## Stage 0: Parse inputs + announce - -1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON. -2. Derive `target_minor` from `target_version` (split on `.`). -3. Mark the in-flight annotation on the namespace AND push Pushgateway in-flight gauge: - ```bash - if [ "$dry_run" = "false" ]; then - kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \ - viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \ - viktorbarzin.me/k8s-upgrade-target="$target_version" \ - --overwrite - - push_metric k8s_upgrade_in_flight 1 - push_metric k8s_upgrade_snapshot_taken 0 - fi - ``` -4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`. - -## Stage 1: Pre-flight (`stages` includes `preflight`) - -Skip if `stages` excludes `preflight`. 
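The `target_minor` derivation from Stage 0 and the per-stage "skip if `stages` excludes ..." gate could be sketched as two small helpers. This is an illustrative sketch only: `derive_minor` and `stage_enabled` are hypothetical names that do not appear in the original prompt, and the exact-entry matching is an assumption about how the comma-separated list should be interpreted.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not from the original prompt).

# derive_minor <X.Y.Z> prints X.Y by dropping the last dot-separated component.
derive_minor() {
  local version="$1"
  printf '%s\n' "${version%.*}"
}

# stage_enabled <stage> <stages> succeeds when <stages> is "all" or when its
# comma-separated list contains <stage> as an exact entry.
stage_enabled() {
  local stage="$1" stages="${2:-all}"
  [ "$stages" = "all" ] && return 0
  # Wrap both sides in commas so "flight" does not match "preflight".
  case ",$stages," in
    *",$stage,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible use at the top of a stage:
#   target_minor=$(derive_minor "$target_version")
#   stage_enabled preflight "$stages" || echo "skipping preflight"
```

Wrapping the list in commas before matching avoids substring false positives, which matters because several stage names share fragments.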
- -### Check 1.1 — All nodes Ready, no pressure - -```bash -kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \ - | jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"' -``` - -Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True. - -### Check 1.2 — Halt-on-alert (same query kured uses) - -```bash -ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \ - | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \ - | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \ - | sort -u) - -if [ -n "$ALERTS" ]; then - slack "ABORT preflight — firing alerts:\n$ALERTS" - exit 1 -fi -``` - -### Check 1.3 — 24h-quiet baseline - -Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout. 
- -```bash -RECENT_REBOOT=0 -while IFS= read -r ts; do - [ -z "$ts" ] && continue - diff=$(( $(date +%s) - $(date -d "$ts" +%s) )) - [ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break -done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}') - -if [ "$RECENT_REBOOT" -eq 1 ]; then - slack "ABORT preflight — node transitioned Ready <24h ago (soak window)" - exit 1 -fi -``` - -### Check 1.4 — kubeadm upgrade plan reports our target - -```bash -PLAN_TARGET=$($SSH \ - wizard@k8s-master 'sudo kubeadm upgrade plan' \ - | grep -oE 'You can now apply the upgrade by executing the following command:.*v[0-9]+\.[0-9]+\.[0-9]+' \ - | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v) -``` - -If `$PLAN_TARGET` does not start with the requested `target_version`, slack-abort: -"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting." - -Slack: `Pre-flight clean. Proceeding to etcd snapshot.` - -## Stage 2: Etcd snapshot (`stages` includes `snapshot`) - -Always run — patch OR minor. Triggers a one-shot Job from the existing `default/backup-etcd` CronJob and waits for it to complete. 
- -```bash -JOB_NAME="pre-upgrade-etcd-${target_version}-$(date +%s)" - -if [ "$dry_run" = "false" ]; then - $KUBECTL -n default create job --from=cronjob/backup-etcd "$JOB_NAME" - - # Wait up to 10 min for snapshot Job to complete - $KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$JOB_NAME" || { - slack "ABORT Stage 2 — etcd snapshot Job did not complete in 10 min" - $KUBECTL -n default describe "job/$JOB_NAME" | tail -30 - exit 1 - } - - # Parse the Job's pod log for "Backup done: ( bytes)" - LOG=$($KUBECTL -n default logs "job/$JOB_NAME" -c backup-manage --tail=20) - echo "$LOG" - SNAPSHOT_LINE=$(echo "$LOG" | grep -E '^Backup done:') - SIZE=$(echo "$SNAPSHOT_LINE" | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+') - SNAPSHOT_FILE=$(echo "$SNAPSHOT_LINE" | awk '{print $3}') - - if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then - slack "ABORT Stage 2 — etcd snapshot empty or missing (size='$SIZE' line='$SNAPSHOT_LINE')" - exit 1 - fi - - TARGET_PATH="nfs://192.168.1.127:/srv/nfs/etcd-backup/$SNAPSHOT_FILE" - $KUBECTL annotate ns k8s-upgrade \ - viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite - - push_metric k8s_upgrade_snapshot_taken 1 -else - TARGET_PATH="WOULD: trigger default/backup-etcd Job, wait, verify size" - SIZE="dry-run" -fi - -slack "Etcd snapshot saved at $TARGET_PATH (size=$SIZE)" -``` - -## Stage 3: Master containerd skew fix (`stages` includes `containerd`) - -Only run if master containerd version < highest worker containerd version. 
- -```bash -get_ctr_version() { - $SSH \ - "wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v' -} - -MASTER_CTR=$(get_ctr_version k8s-master) -WORKER_MAX="0.0.0" -for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do - v=$(get_ctr_version "$n") - # Compare semver-ish - if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then - WORKER_MAX="$v" - fi -done - -if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \ - && [ "$MASTER_CTR" != "$WORKER_MAX" ]; then - # Master is behind — bump - slack "Master containerd $MASTER_CTR < workers $WORKER_MAX — bumping master" - - if [ "$dry_run" = "false" ]; then - $SSH \ - wizard@k8s-master "sudo apt-mark unhold containerd.io \ - && sudo apt-get install -y containerd.io='$WORKER_MAX-1' \ - && sudo apt-mark hold containerd.io \ - && sudo systemctl restart containerd" - - # Wait until kubelet on master is Ready again - for i in $(seq 1 60); do - STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \ - -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') - [ "$STATUS" = "True" ] && break - sleep 10 - done - [ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; } - fi - - slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready." -else - echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix" -fi -``` - -## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`) - -Only run if `kind=minor`. - -For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`: - -```bash -target_minor="$(echo "$target_version" | awk -F. 
'{print $1"."$2}')" - -if [ "$dry_run" = "false" ]; then - $SSH \ - "wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \ - && curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \ - && sudo apt-get update" -fi -``` - -Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.` - -## Stage 5: Master upgrade (`stages` includes `master`) - -```bash -# 5.1 Drain -if [ "$dry_run" = "false" ]; then - kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \ - --ignore-daemonsets --delete-emptydir-data --force --grace-period=300 -fi - -# 5.2 Run the library script via SSH pipe -if [ "$dry_run" = "false" ]; then - $SSH \ - wizard@k8s-master 'bash -s' \ - < $WORKSPACE_DIR/scripts/update_k8s.sh \ - -- --role master --release "$target_version" -fi - -# 5.3 Uncordon + wait Ready -if [ "$dry_run" = "false" ]; then - kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master -fi - -for i in $(seq 1 60); do - STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \ - -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') - KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \ - -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v) - [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break - sleep 15 -done - -[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \ - || { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; } - -# 5.4 All control-plane pods Running -NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \ - -l 'tier=control-plane' --no-headers | grep -v Running | wc -l) -[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; } - -# 5.5 
Re-check halt-on-alert -# (re-run the Check 1.2 query, abort if anything new fires) - -slack "Master upgrade complete. Cluster on v$target_version. Healthy." -``` - -## Stage 6: Workers sequentially (`stages` includes `workers`) - -Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 last because it hosts GPU + Immich and benefits from the longest soak before any other worker is touched (ref: post-mortem-2026-03-16, memory id=570). - -For each worker `$node`: - -1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort. -2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300` -3. SSH pipe `update_k8s.sh --role worker --release $target_version` -4. `kubectl uncordon $node` -5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running. -6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed. -7. Slack: `Worker $node complete ($i/4)`. 
- -```bash -WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1" -i=0 -for node in $WORKERS; do - i=$((i+1)) - - # Halt-on-alert recheck with retry - for attempt in $(seq 1 30); do - ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \ - | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \ - | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \ - | sort -u) - [ -z "$ALERTS" ] && break - echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS" - sleep 60 - done - [ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; } - - if [ "$dry_run" = "false" ]; then - kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \ - --ignore-daemonsets --delete-emptydir-data --force --grace-period=300 - - $SSH \ - "wizard@$node" 'bash -s' \ - < $WORKSPACE_DIR/scripts/update_k8s.sh \ - -- --role worker --release "$target_version" - - kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node" - fi - - # Wait Ready + version match - for w in $(seq 1 60); do - STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \ - -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') - KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \ - -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v) - [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break - sleep 15 - done - [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \ - || { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; } - - # 10-min soak with halt-on-alert - echo "Soaking $node for 10 min..." 
- for sec in $(seq 1 10); do - ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \ - | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \ - | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \ - | sort -u) - [ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; } - sleep 60 - done - - slack "Worker $node upgrade complete ($i/4). Soaked clean." -done -``` - -Note: during the soak we add `RecentNodeReboot` to the ignore-list because we KNOW we just rebooted-as-it-were that node (kubelet restart counts). - -## Stage 7: Post-flight (`stages` includes `postflight`) - -```bash -# All 5 nodes at target -VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \ - -o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}') -echo "$VERSIONS" -WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l) -[ "$WRONG" -ne 0 ] && { slack "ABORT post-flight — $WRONG node(s) not on v$target_version:\n$VERSIONS"; exit 1; } - -# Upgrade Gates all inactive -FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \ - | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \ - | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \ - | sort -u) -[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check):\n$FIRING" - -# pod-ready ratio >= 0.9 -RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \ - --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \ - | jq -r '.data.result[0].value[1] // "0"') -slack "Pod-ready ratio: $RATIO (target ≥ 0.9)" - -# Clear the in-flight annotation + Pushgateway gauges -if [ "$dry_run" = "false" ]; then - kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns 
k8s-upgrade \ - viktorbarzin.me/k8s-upgrade-in-flight- \ - viktorbarzin.me/k8s-upgrade-target- \ - viktorbarzin.me/k8s-upgrade-snapshot-path- || true - - push_metric k8s_upgrade_in_flight 0 - push_metric k8s_upgrade_snapshot_taken 0 -fi - -slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version." -``` - -## Rollback - -This agent does NOT auto-rollback. If anything aborts mid-flight: - -1. Slack the failure with the last known stage + node. -2. Leave the in-flight annotation in place (the operator clears it manually after triage). -3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section. - -The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery. - -## Notes for tests - -- **Test 1 (CronJob dry-run)**: The CronJob has its own `--dry-run` env var that short-circuits before POST. This agent is not invoked. -- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: " for each skipped mutation. -- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that. -- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence. -- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation. - -## Edge cases - -- **Slack down**: Don't block the upgrade — continue, log to stderr. -- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry. -- **kubectl drain hangs on a PDB-violating pod**: 5-min grace-period is hard. If drain fails, `kubectl drain --disable-eviction --force` is NOT a valid escalation here — slack-abort and let the operator investigate. 
-- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort. -- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook. - -## Verification claims you must make - -When you `slack` a SUCCESS message, you must have actually verified: -- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath` -- No alerts firing outside the ignore-list -- pod-ready ratio computed from Prometheus - -Do not declare success without those three confirmations. diff --git a/.claude/agents/payslip-extractor.md b/.claude/agents/payslip-extractor.md deleted file mode 100644 index 4471421c..00000000 --- a/.claude/agents/payslip-extractor.md +++ /dev/null @@ -1,194 +0,0 @@ ---- -name: payslip-extractor -description: "Extract structured UK payslip fields from already-extracted text (preferred) or a base64 PDF (fallback) into strict JSON." -model: haiku -allowedTools: - - Bash - - Read ---- - -You are a headless payslip-field extractor. You receive a prompt containing a UK payslip (either as pre-extracted text or as a base64-encoded PDF) plus a target JSON schema, and you produce exactly one JSON object that matches the schema. - -## Your single job - -Given a prompt that contains EITHER: -- A line `PAYSLIP_TEXT:` followed by already-extracted text (preferred path — use it directly, skip to Step 3). -- OR a line `PDF_BASE64:` followed by a base64 blob (fallback path — decode then extract text first). - -Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No markdown fences. No preamble. No trailing commentary. The final message content must be a single valid JSON object and nothing else. - -## RSU handling (important — Meta UK payslips) - -UK payslips for equity-compensated employees (e.g. 
Meta) report RSU vests as NOTIONAL pay for HMRC reporting only — the broker (Schwab) sells shares to cover US-side withholding but the UK payslip ALSO runs the vest through PAYE via a grossed-up Taxable Pay line. Meta UK template: - -- EARNINGS lines: `RSU Tax Offset` (grossed-up vest value) and optionally `RSU Excs Refund` (over-withheld amount returned). SUM BOTH into `rsu_vest`. Other labels seen on non-Meta templates: `RSU Vest`, `Restricted Stock Units`, `Notional Pay`, `GSU Vest`. -- Meta's template does NOT use a matching offset deduction — `rsu_offset` should be 0. Taxable Pay is grossed up to (Total Payment + rsu_vest) so PAYE already includes the RSU share. -- For non-Meta templates that DO use an offset (`Shares Retained`, `Notional Pay Offset`), populate `rsu_offset` with the magnitude. - -If you see ANY of these lines, do NOT add them to `other_deductions` and do NOT let them count as regular income_tax/NI. - -If the payslip has no stock component, leave both as 0. - -## Earnings decomposition (v2) - -- `salary`: the basic salary/pay line (usually the first "Salary" or "Basic Pay" entry in the Earnings/Payments block). -- `bonus`: the bonus line (`Perform Bonus`, `Bonus`, `Performance Bonus`). If absent or 0, leave as 0 — that's meaningful signal (bonus-sacrifice months). Don't invent. -- `pension_sacrifice`: **ABSOLUTE VALUE** of any NEGATIVE pension line in the Payments block (e.g. `AE Pension EE -600.20` → `600.20`). This is salary-sacrifice and is ALREADY subtracted from Total Payment/gross. Do not also put it in `pension_employee`. -- `pension_employee`: use this ONLY when pension appears as a POSITIVE deduction on the Deductions side (legacy Meta variant A, or non-Meta templates). Never double-count. -- `taxable_pay`: the "Taxable Pay" line in the summary block, THIS PERIOD column. For Meta this is the post-sacrifice + RSU-grossed-up base that PAYE is computed on. If the payslip doesn't surface a summary block, null. 
-- `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`: YTD column values from the same summary block. Null if not present. - -## Fast path: PAYSLIP_TEXT is present - -If the prompt contains `PAYSLIP_TEXT:`, the caller has already run `pdftotext -layout`. Skip Steps 1-2 entirely — the text is already in your context. Go straight to Step 3. - -## Processing steps - -### Step 1. Extract and decode the base64 PDF - -The prompt will include a line that starts with `PDF_BASE64:` followed by the base64 blob. Decode it to `/tmp/payslip.pdf`. - -Preferred method (handles whitespace and very long blobs robustly): - -```bash -python3 - <<'PY' -import base64, re, pathlib, sys, os -prompt = os.environ.get("PAYSLIP_PROMPT", "") -# If the orchestrator didn't set an env var, fall back to reading the transcript via CWD stdin mechanism. -# In practice the agent receives the prompt in its conversation — you extract the PDF_BASE64 value -# from the prompt text you were given, strip whitespace, and base64-decode. -PY -``` - -In practice: read the `PDF_BASE64:` value out of the prompt you have been given (you can see the full prompt), then run: - -```bash -python3 -c " -import base64, sys -data = sys.stdin.read().strip() -open('/tmp/payslip.pdf','wb').write(base64.b64decode(data)) -print('decoded bytes:', len(base64.b64decode(data))) -" <<'B64' - -B64 -``` - -Or pipe via shell `base64 -d`: - -```bash -printf '%s' '' | base64 -d > /tmp/payslip.pdf -``` - -Verify the file looks like a PDF: - -```bash -head -c 8 /tmp/payslip.pdf | xxd -# Expected: 25 50 44 46 2d (i.e. "%PDF-") -``` - -### Step 2. Extract text from the PDF - -Try tools in this order. Use the first one that works; do not chain all of them. - -1. `pdftotext` from `poppler-utils` (preferred — fastest, most reliable on layout-preserving payslips): - ```bash - pdftotext -layout /tmp/payslip.pdf - 2>/dev/null - ``` - -2. 
Python `pypdf` fallback: - ```bash - python3 -c " - from pypdf import PdfReader - r = PdfReader('/tmp/payslip.pdf') - for p in r.pages: - print(p.extract_text() or '') - " - ``` - -3. Python `pdfplumber` fallback: - ```bash - python3 -c " - import pdfplumber - with pdfplumber.open('/tmp/payslip.pdf') as pdf: - for page in pdf.pages: - print(page.extract_text() or '') - " - ``` - -4. If none of those are installed, check what IS available: - ```bash - which pdftotext pdf2txt.py mutool - python3 -c "import pypdf, pdfplumber, pdfminer" 2>&1 - ``` - and use whatever you find (e.g. `mutool draw -F txt`). - -If every text-extraction tool fails, emit the failure JSON (see "Failure mode" below). - -### Step 3. Parse the extracted text - -UK payslips are laid out in a few common templates (Sage, Iris, QuickBooks, Xero, in-house ADP/Workday layouts). Common landmarks: - -- "Pay Date" / "Payment Date" / "Date Paid" — the date wages hit the account. Usually at the top or in a header box. -- "Tax Period" / "Period" / "Month" — e.g. "Month 1", "Week 12". -- Two numeric columns per line: "This Period" (or "Amount", "Current") and "Year to Date" (or "YTD"). **Always take the This Period column**, never YTD. -- Payments / Earnings block: "Basic Pay", "Salary", "Bonus", "Overtime", "Commission", "Holiday Pay". -- Deductions block: "Income Tax" / "PAYE", "National Insurance" / "NI" / "NIC", "Pension" / "Pension Contribution" / "Salary Sacrifice Pension", "Student Loan" / "SL", optional: "Union Dues", "Charity", "Season Ticket Loan", "Private Medical", etc. -- "Gross Pay" / "Total Gross" — sum of payments. -- "Net Pay" / "Take Home" / "Amount Payable" — the money actually paid. -- "Tax Code" — e.g. "1257L", "BR", "D0", "NT". -- "NI Number" / "National Insurance Number" — `AA123456A` format. Never invent one. -- "Employer" / "Company" — usually in the letterhead. "Employee" / "Name". -- Currency: almost always GBP / "£" for UK payslips. 
If the PDF is not in GBP or not a UK payslip, still return the numbers as-is but include a best-effort `currency` field. - -### Step 4. Map to the schema and emit JSON - -Rules that apply regardless of the caller's exact schema: - -- **Dates**: `pay_date` MUST be `YYYY-MM-DD`. If the PDF prints `12/03/2026`, interpret as `DD/MM/YYYY` (UK format) → `2026-03-12`. If ambiguous (`01/02/2026`), prefer UK ordering. If impossible to determine a year, use the pay_period year. -- **Money fields**: emit as JSON numbers, not strings. Two decimal places are acceptable (`2450.17`). Strip `£`, commas, and trailing spaces. Negative values stay negative. -- **Missing numeric fields**: emit `0` (zero), not `null`, not an empty string, not `"N/A"`. -- **`other_deductions`**: an object mapping `{ "