Compare commits


6 commits

Author SHA1 Message Date
Viktor Barzin
731de63150 fix(beads-server): disable Authentik + CrowdSec on Workbench
Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.

TODO: Create Authentik application for dolt-workbench.viktorbarzin.me

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 20:00:21 +00:00
Viktor Barzin
9ce9a9a7f7 Add broker-sync Terraform stack (pending apply)
Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.

This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
  ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
  auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
  cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
  DockerHub image; no pull secret):
    * `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
      version`), used to smoke-test each new image.
    * `broker-sync-trading212` — daily 02:00 `broker-sync trading212
      --mode steady`.
    * `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
    * `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
    * `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
      (Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
  NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
  the convention in infra/.claude/CLAUDE.md §3-2-1.
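For orientation, a rough sketch of what the backup job's snapshot and
retention boil down to (the real image/entrypoint lives in the stack and
is not reproduced here; paths assume the NFS export is mounted at /backup):

    DEST=/backup/$(date +%F)
    mkdir -p "$DEST" && cp -a /data/. "$DEST"/
    # 30-day retention
    find /backup -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +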

NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
  stacks/wealthfolio/main.tf — happens in the same commit that first
  applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
  required keys documented in the ExternalSecret comment block.
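  Illustrative seed sketch; the key names below are placeholders and the
  authoritative list is the ExternalSecret comment block in main.tf:

    $ vault kv put secret/broker-sync \
        trading212_api_key=<redacted> \
        wealthfolio_cookie=<redacted>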

Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
  apply time.

## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
   ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
   `broker-sync 0.1.0`.
2026-04-17 19:52:36 +00:00
Viktor Barzin
277babc696 [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink
## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.

The 3 outliers shipped their keys in plaintext because the `secrets/**`
rule in `.gitattributes` matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).

Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
  SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; distinct cert
  material but equivalent coverage.
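Rough reproduction of those checks (exact invocations not recorded in
this message; paths are the repo-root wildcard):

  $ openssl x509 -in secrets/fullchain.pem -noout -subject -ext subjectAltName
  $ diff <(openssl x509 -in secrets/fullchain.pem -noout -pubkey | openssl sha256) \
         <(openssl pkey -in secrets/privkey.pem -pubout | openssl sha256)
  (empty diff output means the key and cert belong together)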

## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
  ../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
  .gitattributes as a regression guard — any future real file placed
  under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.

setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.

## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
  to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
  once the user's LE account is authenticated. Revocation must happen
  before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
  commit from 2026-03-11 forward. Scheduled for removal via
  `git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
  (and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
  this commit matches the existing symlink-to-wildcard pattern rather
  than introducing a new component.

## Test plan
### Automated
  $ readlink stacks/foolery/secrets
  ../../secrets
  (likewise for terminal, claude-memory)

  $ for s in foolery terminal claude-memory; do
      openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
    done
  subject=CN = viktorbarzin.me  (x3 — all resolve via symlink to root wildcard)

  $ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
  stacks/foolery/secrets/fullchain.pem: filter: git-crypt
  (now matched by the new rule; the symlink target was already covered
   by the repo-root rule)

### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
   shows only the K8s TLS secret being re-created with the root-wildcard
   material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
   <name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
   the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
   claude-memory) → cert chain presents the new serial, handshake OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:49:09 +00:00
Viktor Barzin
d91fbd4a60 [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds
## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.

The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.
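
For illustration only: the removed pattern vs. the one main.py follows
(the Redfish path below is a placeholder, not a quote from the deleted
script):

  # deleted main.sh: vendor default creds baked into the command line
  $ curl -sk -u root:calvin https://192.168.1.4/redfish/v1/...
  # main.py equivalent: creds come from env vars populated via ExternalSecret
  $ curl -sk -u "${idrac_user}:${idrac_password}" https://192.168.1.4/redfish/v1/...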

## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh

main.py is retained unchanged.

## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
  vendor default `calvin` regardless; rotation is tracked in the broader
  remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
  (the redfish-exporter ConfigMap has `default: username: root, password:
  calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
  addressed here — filed as its own task so the fix (drop the default
  block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.

## Test plan
### Automated
  $ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
       --include='*.tf' --include='*.hcl' --include='*.yaml' \
       --include='*.yml' --include='*.sh'
  (no consumer references)

### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
   the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
   it was never coupled to main.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:42:55 +00:00
Viktor Barzin
e81e836d3a [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token
## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.

The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.

A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.
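
Hedged sketch of the current renewal invocation; the Vault field names
below are guesses and only the env-var names come from renew2.sh:

  $ export TECHNITIUM_API_KEY=$(vault kv get -field=technitium_api_token secret/viktor)
  $ export CLOUDFLARE_TOKEN=$(vault kv get -field=cloudflare_dns_token secret/viktor)
  $ ./modules/kubernetes/setup_tls_secret/renew2.sh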

## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh

## What is NOT in this change
- Technitium token rotation. The leaked token still works against
  `technitium-web.technitium.svc.cluster.local:5380` until revoked in the
  Technitium admin UI. Rotation is a prerequisite for the upcoming
  git-history scrub, which will remove the token from every commit via
  `git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
  stacks keep working.

## Test plan
### Automated
  $ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms no consumer)
  $ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
  (no output in HEAD after this commit)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
     ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
   behavioral regression because renew.sh was never part of the automated
   flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:41:08 +00:00
Viktor Barzin
d3be9b50af [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.

Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.

## This change
- git rm modules/kubernetes/frigate/config.yaml

## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
  must be coordinated out-of-band with the camera operators. The DDNS
  camera (valchedrym.ddns.net:554) is internet-reachable, so rotating the
  leaked password on the device side is a high priority.
- Git-history rewrite. The file plus its leaked strings remain in all
  commits from bcad200a forward. Scheduled to be purged via
  `git filter-repo --path modules/kubernetes/frigate/config.yaml
  --invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
  source config from Git rather than the PVC, the replacement should go
  through ExternalSecret + env-var interpolation, not an inline YAML.
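
The `--replace-text <list>` above uses git-filter-repo's `old==>new`
line format. Sketch with placeholder values (not the real credentials):

  $ printf '%s\n' \
      'rtsp://admin:LEAKED_PW_1@192.168.1.10:554==>rtsp://admin:REDACTED@192.168.1.10:554' \
      'rtsp://admin:LEAKED_PW_2@valchedrym.ddns.net:554==>rtsp://admin:REDACTED@valchedrym.ddns.net:554' \
      > /tmp/frigate-replacements.txt
  $ git filter-repo --path modules/kubernetes/frigate/config.yaml --invert-paths \
      --replace-text /tmp/frigate-replacements.txt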

## Test plan
### Automated
  $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms orphan status)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
   PVC bound (unaffected by this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:39:35 +00:00
324 changed files with 8359 additions and 30812 deletions

View file

@ -30,7 +30,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **New service**: Use `setup-project` skill for full workflow
- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.200` to avoid hairpin NAT. Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected).
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
- **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
@ -48,7 +48,6 @@ Violations cause state drift, which causes future applies to break or silently r
- **Tier 0 details**: Decrypt priority: Vault Transit (primary) → age key fallback. Encrypt: both Vault Transit + age recipients. Scripts: `scripts/state-sync {encrypt|decrypt|commit} [stack]`.
- **Adding operator**: Generate age key (`age-keygen`), add pubkey to `.sops.yaml`, run `sops updatekeys` on Tier 0 `.enc` files. For Tier 1, only Vault access is needed.
- **Migration script**: `scripts/migrate-state-to-pg` (one-shot, idempotent) migrates Tier 1 stacks from local to PG.
- **Adopting existing resources**: use HCL `import {}` blocks (TF 1.5+), not `terraform import` CLI. Commit stanza → plan-to-zero → apply → delete stanza. Canonical reason: reviewable in PR, plan-safe, idempotent, tier-agnostic. Full rules + per-provider ID formats in `AGENTS.md` → "Adopting Existing Resources".
## Secrets Management — Vault KV
- **Vault is the sole source of truth** for secrets.
@ -74,7 +73,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **LimitRange**: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than defaults. Tier 3-edge and 4-aux now use Burstable QoS (request < limit) to reduce scheduler pressure.
- **Democratic-CSI sidecars**: Must set explicit resources (32-80Mi) in Helm values — 17 sidecars default to 256Mi each via LimitRange. `csiProxy` is a TOP-LEVEL chart key, not nested under controller/node.
- **ResourceQuota blocks rolling updates**: When quota is tight, scale to 0 then back to 1 instead of RollingUpdate. Or use Recreate strategy.
- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Every `kubernetes_deployment`, `kubernetes_stateful_set`, and `kubernetes_cron_job_v1` MUST include `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 }` (use `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` for CronJobs). The `# KYVERNO_LIFECYCLE_V1` marker is the canonical discoverability tag — grep for it to locate every site. A shared Terraform module was considered but `ignore_changes` only accepts static attribute paths (not module outputs, locals, or expressions), so the snippet convention is the only viable path. Full rationale and copy-paste snippets in `AGENTS.md` → "Kyverno Drift Suppression".
- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Add `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] }` to kubernetes_deployment resources to prevent perpetual TF plan drift.
- **NVIDIA GPU operator resources**: dcgm-exporter and cuda-validator resources configurable via `dcgmExporter.resources` and `validator.resources` in nvidia values.yaml.
- **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
- **Quarterly right-sizing**: Check Goldilocks dashboard. Compare VPA upperBound to current request. Also check for under-provisioned (VPA upper > request x 0.8).
@ -117,7 +116,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
- **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
- **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
- **HTTP/3 (QUIC)**: Enabled cluster-wide via Traefik.
- **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (hourly) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
- **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (5min) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
## Service-Specific Notes
| Service | Key Operational Knowledge |
@ -129,15 +128,15 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` with `mysql:8.4` (migrated from InnoDB Cluster 2026-04-16). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (15Gi, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Old InnoDB Cluster + operator still in TF (Phase 4 cleanup pending). Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (5min) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
## Monitoring & Alerting
- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
- Exclude completed CronJob pods from "pod not ready" alerts.
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns).
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Mailgun API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Mailserver on dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` for CrowdSec real-IP detection. Vault: `mailgun_api_key` in `secret/viktor` (probe), `brevo_api_key` in `secret/viktor` (relay).
## Storage & Backup Architecture
@ -155,9 +154,9 @@ Choose storage class based on workload type:
**Default for sensitive data is proxmox-lvm-encrypted.** Use plain `proxmox-lvm` only for non-sensitive workloads. Use NFS when you need RWX, backup pipeline integration, or it's a large shared media library.
**NFS server:**
- **Proxmox host** (192.168.1.127): Sole NFS for all workloads. HDD at `/srv/nfs` (ext4 thin LV `pve/nfs-data`, 1TB). SSD at `/srv/nfs-ssd` (ext4 LV `ssd/nfs-ssd-data`, 100GB). Exports use `async,insecure` options (`async` — safe with UPS + Vault Raft replication + databases on block storage; `insecure` — pfSense NATs source ports >1024 between VLANs).
- **`nfs-truenas` StorageClass**: Historical name retained only because SC names are immutable on PVs (48 bound PVs reference it — renaming would require mass PV churn, not worth it). Now points to the Proxmox host, identical to `nfs-proxmox`. TrueNAS (VM 9000, 10.0.10.15) operationally decommissioned 2026-04-13; VM still exists in stopped state on PVE pending user decision on deletion.
**NFS servers:**
- **Proxmox host** (192.168.1.127): Primary NFS for all workloads. HDD at `/srv/nfs` (ext4 thin LV `pve/nfs-data`, 1TB). SSD at `/srv/nfs-ssd` (ext4 LV `ssd/nfs-ssd-data`, 100GB). Exports use `async,insecure` options (`async` — safe with UPS + Vault Raft replication + databases on block storage; `insecure` — pfSense NATs source ports >1024 between VLANs).
- **TrueNAS** (10.0.10.15): **Immich only** (8 PVCs). `nfs-truenas` StorageClass retained exclusively for Immich.
**Migration note**: CSI PV `volumeAttributes` are immutable — cannot update NFS server in place. New PV/PVC pairs required (convention: append `-host` to PV name).
@ -237,7 +236,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
**Synology layout** (`192.168.1.13:/volume1/Backup/Viki/`):
- `pve-backup/` — PVC file backups (`pvc-data/`), SQLite backups (`sqlite-backup/`), pfSense, PVE config (synced from sda)
- `nfs/` — mirrors `/srv/nfs` on Proxmox (inotify change-tracked rsync)
- `nfs/` — mirrors `/srv/nfs` on Proxmox (inotify change-tracked rsync, renamed from `truenas/`)
- `nfs-ssd/` — mirrors `/srv/nfs-ssd` on Proxmox (inotify change-tracked rsync)
**App-level CronJobs** (write to Proxmox host NFS, synced to Synology via inotify):

View file

@ -1,194 +0,0 @@
---
name: payslip-extractor
description: "Extract structured UK payslip fields from already-extracted text (preferred) or a base64 PDF (fallback) into strict JSON."
model: haiku
allowedTools:
- Bash
- Read
---
You are a headless payslip-field extractor. You receive a prompt containing a UK payslip (either as pre-extracted text or as a base64-encoded PDF) plus a target JSON schema, and you produce exactly one JSON object that matches the schema.
## Your single job
Given a prompt that contains EITHER:
- A line `PAYSLIP_TEXT:` followed by already-extracted text (preferred path — use it directly, skip to Step 3).
- OR a line `PDF_BASE64:` followed by a base64 blob (fallback path — decode then extract text first).
Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No markdown fences. No preamble. No trailing commentary. The final message content must be a single valid JSON object and nothing else.
## RSU handling (important — Meta UK payslips)
UK payslips for equity-compensated employees (e.g. Meta) report RSU vests as NOTIONAL pay for HMRC reporting only — the broker (Schwab) sells shares to cover US-side withholding but the UK payslip ALSO runs the vest through PAYE via a grossed-up Taxable Pay line. Meta UK template:
- EARNINGS lines: `RSU Tax Offset` (grossed-up vest value) and optionally `RSU Excs Refund` (over-withheld amount returned). SUM BOTH into `rsu_vest`. Other labels seen on non-Meta templates: `RSU Vest`, `Restricted Stock Units`, `Notional Pay`, `GSU Vest`.
- Meta's template does NOT use a matching offset deduction — `rsu_offset` should be 0. Taxable Pay is grossed up to (Total Payment + rsu_vest) so PAYE already includes the RSU share.
- For non-Meta templates that DO use an offset (`Shares Retained`, `Notional Pay Offset`), populate `rsu_offset` with the magnitude.
If you see ANY of these lines, do NOT add them to `other_deductions` and do NOT let them count as regular income_tax/NI.
If the payslip has no stock component, leave both as 0.
## Earnings decomposition (v2)
- `salary`: the basic salary/pay line (usually the first "Salary" or "Basic Pay" entry in the Earnings/Payments block).
- `bonus`: the bonus line (`Perform Bonus`, `Bonus`, `Performance Bonus`). If absent or 0, leave as 0 — that's meaningful signal (bonus-sacrifice months). Don't invent.
- `pension_sacrifice`: **ABSOLUTE VALUE** of any NEGATIVE pension line in the Payments block (e.g. `AE Pension EE -600.20` → `600.20`). This is salary-sacrifice and is ALREADY subtracted from Total Payment/gross. Do not also put it in `pension_employee`.
- `pension_employee`: use this ONLY when pension appears as a POSITIVE deduction on the Deductions side (legacy Meta variant A, or non-Meta templates). Never double-count.
- `taxable_pay`: the "Taxable Pay" line in the summary block, THIS PERIOD column. For Meta this is the post-sacrifice + RSU-grossed-up base that PAYE is computed on. If the payslip doesn't surface a summary block, null.
- `ytd_tax_paid`, `ytd_taxable_pay`, `ytd_gross`: YTD column values from the same summary block. Null if not present.
## Fast path: PAYSLIP_TEXT is present
If the prompt contains `PAYSLIP_TEXT:`, the caller has already run `pdftotext -layout`. Skip Steps 1-2 entirely — the text is already in your context. Go straight to Step 3.
## Processing steps
### Step 1. Extract and decode the base64 PDF
The prompt will include a line that starts with `PDF_BASE64:` followed by the base64 blob. Decode it to `/tmp/payslip.pdf`.
Preferred method (handles whitespace and very long blobs robustly):
```bash
python3 - <<'PY'
import base64, re, pathlib, sys, os
prompt = os.environ.get("PAYSLIP_PROMPT", "")
# If the orchestrator didn't set an env var, fall back to reading the transcript via CWD stdin mechanism.
# In practice the agent receives the prompt in its conversation — you extract the PDF_BASE64 value
# from the prompt text you were given, strip whitespace, and base64-decode.
PY
```
In practice: read the `PDF_BASE64:` value out of the prompt you have been given (you can see the full prompt), then run:
```bash
python3 -c "
import base64, sys
data = sys.stdin.read().strip()
open('/tmp/payslip.pdf','wb').write(base64.b64decode(data))
print('decoded bytes:', len(base64.b64decode(data)))
" <<'B64'
<paste-the-base64-here>
B64
```
Or pipe via shell `base64 -d`:
```bash
printf '%s' '<base64>' | base64 -d > /tmp/payslip.pdf
```
Verify the file looks like a PDF:
```bash
head -c 8 /tmp/payslip.pdf | xxd
# Expected: 25 50 44 46 2d (i.e. "%PDF-")
```
### Step 2. Extract text from the PDF
Try tools in this order. Use the first one that works; do not chain all of them.
1. `pdftotext` from `poppler-utils` (preferred — fastest, most reliable on layout-preserving payslips):
```bash
pdftotext -layout /tmp/payslip.pdf - 2>/dev/null
```
2. Python `pypdf` fallback:
```bash
python3 -c "
from pypdf import PdfReader
r = PdfReader('/tmp/payslip.pdf')
for p in r.pages:
    print(p.extract_text() or '')
"
```
3. Python `pdfplumber` fallback:
```bash
python3 -c "
import pdfplumber
with pdfplumber.open('/tmp/payslip.pdf') as pdf:
    for page in pdf.pages:
        print(page.extract_text() or '')
"
```
4. If none of those are installed, check what IS available:
```bash
which pdftotext pdf2txt.py mutool
python3 -c "import pypdf, pdfplumber, pdfminer" 2>&1
```
and use whatever you find (e.g. `mutool draw -F txt`).
If every text-extraction tool fails, emit the failure JSON (see "Failure mode" below).
### Step 3. Parse the extracted text
UK payslips are laid out in a few common templates (Sage, Iris, QuickBooks, Xero, in-house ADP/Workday layouts). Common landmarks:
- "Pay Date" / "Payment Date" / "Date Paid" — the date wages hit the account. Usually at the top or in a header box.
- "Tax Period" / "Period" / "Month" — e.g. "Month 1", "Week 12".
- Two numeric columns per line: "This Period" (or "Amount", "Current") and "Year to Date" (or "YTD"). **Always take the This Period column**, never YTD.
- Payments / Earnings block: "Basic Pay", "Salary", "Bonus", "Overtime", "Commission", "Holiday Pay".
- Deductions block: "Income Tax" / "PAYE", "National Insurance" / "NI" / "NIC", "Pension" / "Pension Contribution" / "Salary Sacrifice Pension", "Student Loan" / "SL", optional: "Union Dues", "Charity", "Season Ticket Loan", "Private Medical", etc.
- "Gross Pay" / "Total Gross" — sum of payments.
- "Net Pay" / "Take Home" / "Amount Payable" — the money actually paid.
- "Tax Code" — e.g. "1257L", "BR", "D0", "NT".
- "NI Number" / "National Insurance Number" — `AA123456A` format. Never invent one.
- "Employer" / "Company" — usually in the letterhead. "Employee" / "Name".
- Currency: almost always GBP / "£" for UK payslips. If the PDF is not in GBP or not a UK payslip, still return the numbers as-is but include a best-effort `currency` field.
### Step 4. Map to the schema and emit JSON
Rules that apply regardless of the caller's exact schema:
- **Dates**: `pay_date` MUST be `YYYY-MM-DD`. If the PDF prints `12/03/2026`, interpret as `DD/MM/YYYY` (UK format) → `2026-03-12`. If ambiguous (`01/02/2026`), prefer UK ordering. If impossible to determine a year, use the pay_period year.
- **Money fields**: emit as JSON numbers, not strings. Two decimal places are acceptable (`2450.17`). Strip `£`, commas, and trailing spaces. Negative values stay negative.
- **Missing numeric fields**: emit `0` (zero), not `null`, not an empty string, not `"N/A"`.
- **`other_deductions`**: an object mapping `{ "<label>": <number>, ... }` for any deduction that isn't one of the first-class fields in the schema (tax, NI, pension, student loan). Use the exact label from the payslip (e.g. `"Season Ticket Loan"`, `"Private Medical"`). If there are no other deductions, emit `{}` — NEVER `null` and NEVER omit the key.
- **Column discipline**: ALWAYS use the "This Period" column, NEVER the YTD column. If only one column exists, that's the period column.
- **Currency default**: `"GBP"` unless the payslip explicitly shows another currency symbol or ISO code.
- **No invented data**: If a field genuinely isn't on the payslip, use the documented default (`0` for money, `""` for strings, `{}` for objects). Do NOT make up names, NI numbers, tax codes, or employers.
Follow the exact field names and types given in the prompt's schema. If the prompt's schema adds fields not listed above, produce them too using the same discipline.
## Failure mode
If the PDF cannot be read at all — unreadable base64, not a PDF, encrypted PDF with no text layer, no text-extraction tool available, or clearly not a UK payslip — emit a single JSON object:
```json
{"error": "<short human reason>"}
```
Examples of acceptable error reasons:
- `"base64 did not decode to a valid PDF"`
- `"pdf has no extractable text layer (image-only scan)"`
- `"no pdf text extraction tool available (pdftotext/pypdf/pdfplumber all missing)"`
- `"document does not appear to be a UK payslip"`
- `"pay_date not found on document"`
The caller treats the `error` key as a non-retriable parse failure. Do not include any other keys when emitting an error object.
## Hard constraints — things you MUST NOT do
1. **No network calls.** Do not curl, wget, dig, or otherwise talk to the network. Everything you need is in the prompt.
2. **No modifications to `/workspace/infra/**`.** Do not edit, write, or commit any file under the infra repo. The only file you may create is the scratch PDF at `/tmp/payslip.pdf` (and intermediate text dumps under `/tmp/`).
3. **No git operations.** No `git add`, `git commit`, `git push`, nothing.
4. **No kubectl, no terraform, no vault.** You are not an infra agent — you are a narrow extractor.
5. **No markdown in output.** No ` ```json ` fences, no preamble like "Here's the extraction:", no trailing notes. The ENTIRE final assistant message is exactly one JSON object.
6. **No verbose logging in the final message.** It is fine to run bash commands and see their output during processing, but your final assistant message is JSON and nothing else.
7. **No hallucinated fields.** If the payslip does not show a pension line, do not invent one. Use the documented default instead.
## Output discipline — summary
- Exactly one JSON object, UTF-8, no BOM.
- Keys match the schema the caller gave you.
- Numeric fields are JSON numbers, not strings.
- `pay_date` is `YYYY-MM-DD`.
- `other_deductions` is always present and is an object (possibly `{}`).
- Missing money → `0`, missing string → `""`, missing object → `{}`.
- On unrecoverable failure, one JSON object with a single `error` key.
That's the whole job. Decode, extract, parse, emit JSON. Be boring and exact.

View file

@ -34,11 +34,7 @@ You receive these parameters in your invocation:
- **Infra repo**: `/home/wizard/code/infra`
- **Config**: `/home/wizard/code/infra/.claude/reference/upgrade-config.json`
- **Kubeconfig**: `/home/wizard/code/infra/config`
- **Secrets (env-var contract)**: You run in the `claude-agent-service` pod, which has NO Vault CLI auth — do NOT call `vault kv get`. The following env vars are pre-loaded via `envFrom: claude-agent-secrets`:
- `GITHUB_TOKEN` — PAT for GitHub API (changelog fetch) and `git push`
- `WOODPECKER_API_TOKEN` — bearer for `ci.viktorbarzin.me/api/...`
- `SLACK_WEBHOOK_URL` — full Slack webhook URL for status messages
- Anything else (e.g. `kubectl`) uses the pod's ServiceAccount or in-repo git-crypt-unlocked secrets.
- **Vault**: Authenticate with `vault login -method=oidc` if needed. Secrets at `secret/viktor` and `secret/platform`.
- **Git remote**: `origin` → `github.com/ViktorBarzin/infra.git`
## NEVER Do
@ -122,6 +118,7 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
3. **For Helm charts**: Check `helm_chart_repo_overrides` for the chart repository URL
4. If auto-detect fails, verify the repo exists:
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -sf -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/${DETECTED_REPO}" > /dev/null
```
@ -131,6 +128,7 @@ cat /home/wizard/code/infra/.claude/reference/upgrade-config.json
## Step 3: Fetch Changelogs via GitHub API
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/${GITHUB_REPO}/releases?per_page=100"
```
@ -173,9 +171,11 @@ Scan all intermediate release notes for breaking change indicators from the conf
## Step 5: Slack Notification — Starting
```bash
SLACK_WEBHOOK=$(vault kv get -field=alertmanager_slack_api_url secret/platform)
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] Starting: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION} (risk: ${RISK})\"}" \
"$SLACK_WEBHOOK_URL"
"$SLACK_WEBHOOK"
```
For CAUTION risk, include breaking change excerpts in the Slack message.
@ -266,28 +266,23 @@ UPGRADE_SHA=$(git rev-parse HEAD)
## Step 9: Wait for Woodpecker CI
The commit triggers one pipeline that runs multiple **workflows** in parallel — e.g. `default` (terragrunt apply) and `build-cli` (builds the infra CLI image). Only the `default` workflow gates your upgrade; the other workflows may be unrelated and sometimes fail without breaking anything on the cluster (current example: `build-cli` push to `registry.viktorbarzin.me:5050` is known-broken as of 2026-04-19).
**Do not read the overall pipeline `status`** — it reports `failure` whenever *any* workflow fails. Read the `default` workflow's `state` instead.
The commit triggers the `app-stacks.yml` pipeline (or `default.yml` for platform stacks).
```bash
# Find the pipeline for our commit
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=10" \
| jq --arg sha "$UPGRADE_SHA" '.[] | select(.commit==$sha) | .number'
# → $PIPELINE_NUMBER
# Fetch detail (includes workflows[])
curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines/$PIPELINE_NUMBER" \
| jq '.workflows[] | select(.name=="default") | .state'
# → "running" | "pending" | "success" | "failure" | "error" | "killed"
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_token secret/viktor)
```
Poll every 30 seconds until the `default` workflow's `state` is terminal (`success`, `failure`, `error`, `killed`). Timeout after 15 minutes.
Poll for the pipeline triggered by our commit:
```bash
# Get latest pipeline
curl -s -H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines?page=1&per_page=5"
```
**If `default` state is `success`** → proceed to Step 10 (verification), regardless of other workflows' state.
**If `default` state is terminal-and-not-success, or the poll times out** → proceed to Step 10b (rollback).
Find the pipeline matching our commit SHA. Poll every 30 seconds until status is `success`, `failure`, `error`, or `killed`. Timeout after 15 minutes.
**If CI fails** → proceed to Step 10 (rollback).
**If CI succeeds** → proceed to verification.
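A hedged sketch of that poll (assumes `$WOODPECKER_API_TOKEN` is set and `$PIPELINE_NUMBER` came from the lookup above):
```bash
# 30 attempts x 30s = 15-minute budget
for attempt in $(seq 1 30); do
  STATE=$(curl -s -H "Authorization: Bearer $WOODPECKER_API_TOKEN" \
    "https://ci.viktorbarzin.me/api/repos/1/pipelines/$PIPELINE_NUMBER" \
    | jq -r '.workflows[] | select(.name=="default") | .state')
  case "$STATE" in success|failure|error|killed) break ;; esac
  sleep 30
done
echo "default workflow state: $STATE"
```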
## Step 10: Verify
@ -346,7 +341,7 @@ Re-run verification checks to confirm rollback succeeded. If rollback verificati
```bash
curl -s -X POST -H 'Content-type: application/json' \
--data '{"text":"[Upgrade Agent] CRITICAL: Rollback of *${STACK}* also failed. Manual intervention required."}' \
"$SLACK_WEBHOOK_URL"
"$SLACK_WEBHOOK"
```
## Step 11: Report Results
@ -355,14 +350,14 @@ curl -s -X POST -H 'Content-type: application/json' \
```bash
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] SUCCESS: *${STACK}* upgraded ${OLD_VERSION} -> ${NEW_VERSION}\nVerification: pods ready, HTTP OK${UPTIME_KUMA_MSG}\nCommit: ${UPGRADE_SHA}\"}" \
"$SLACK_WEBHOOK_URL"
"$SLACK_WEBHOOK"
```
### On failure + rollback
```bash
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[Upgrade Agent] FAILED + ROLLED BACK: *${STACK}* ${OLD_VERSION} -> ${NEW_VERSION}\nReason: ${FAILURE_REASON}\nRollback commit: ${ROLLBACK_SHA}\nRollback status: ${ROLLBACK_STATUS}\"}" \
"$SLACK_WEBHOOK_URL"
"$SLACK_WEBHOOK"
```
## Edge Cases

1728  .claude/cluster-health.sh (new executable file)

File diff suppressed because it is too large.

View file

@ -119,18 +119,3 @@ Removed bindings from:
- `default-source-authentication` (PK: via policybindingmodel `1a779f24`) — Google/GitHub/Facebook OAuth
Policy still exists with 0 bindings. If brute-force protection is needed, bind to the **password stage** (not the flow level).
## Session Duration (2026-05-01)
Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
Notes:
- There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
- Embedded outpost session storage moved from `/dev/shm` → Postgres table `authentik_providers_proxy_proxysession` in authentik 2025.10. The 2026-04-18 `/dev/shm`-fill outage class is no longer load-bearing in 2026.2.2; the `unauthenticated_age` cap is still the right lever for anonymous-session bloat from external monitors.
- `ProxyProvider.access_token_validity` and `remember_me_offset` stay UI-managed via `ignore_changes`.
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).

View file

@ -26,12 +26,11 @@ module "nfs_data" {
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
## Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
## Anti-AI Scraping (5-Layer Defense)
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
1. Bot blocking (ForwardAuth → poison-fountain) 2. X-Robots-Tag noai 3. Tarpit/poison content (standalone at poison.viktorbarzin.me)
Trap links (formerly layer 3) removed April 2026 — rewrite-body plugin broken on Traefik v3.6.12 (Yaegi bugs). `strip-accept-encoding` and `anti-ai-trap-links` middlewares deleted.
Rybbit analytics injection now via Cloudflare Worker (`stacks/rybbit/worker/`, HTMLRewriter, wildcard route `*.viktorbarzin.me/*`, 28 site ID mappings).
Key files: `stacks/poison-fountain/`, `stacks/rybbit/worker/`, `stacks/platform/modules/traefik/middleware.tf`
1. Bot blocking (ForwardAuth → poison-fountain) 2. X-Robots-Tag noai 3. Trap links before `</body>`
4. Tarpit (~100 bytes/sec) 5. Poison content (CronJob every 6h, `--http1.1` required)
Key files: `stacks/poison-fountain/`, `stacks/platform/modules/traefik/middleware.tf`
## Terragrunt Architecture
- Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block

View file

@ -122,9 +122,8 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
| `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
| `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
## GPU Node (currently k8s-node1)
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) — physical passthrough, no Terraform pin
- **Taint**: `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to every NFD-discovered GPU node)
- **Label**: `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD)
- GPU workloads need: `node_selector = { "nvidia.com/gpu.present" : "true" }` + nvidia toleration
- Taint applied via `null_resource.gpu_node_config` in `stacks/nvidia/modules/nvidia/main.tf`; node discovery keyed on the NFD `pci-10de.present` label so the taint follows the card to whichever host is carrying it
## GPU Node (k8s-node1)
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration
- Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`

View file

@ -19,7 +19,7 @@
| Service | Description | Stack |
|---------|-------------|-------|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis 8.x via HAProxy at `redis-master.redis.svc.cluster.local` — 3-pod raw StatefulSet `redis-v2` (redis+sentinel+exporter per pod), quorum=2. Clients use HAProxy only, no sentinel fallback. | redis |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | redis |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | nvidia |
| metrics-server | K8s metrics | metrics-server |
@ -45,8 +45,7 @@
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management (may be merged into ebooks stack) | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier) | f1-stream |
| chrome-service | Headed Chromium WebSocket pool (`ws://chrome-service.chrome-service.svc:3000/<token>`) for sibling services driving anti-bot embeds | chrome-service |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |
@ -138,18 +137,3 @@ jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
- `*.viktor.actualbudget` - Actualbudget factory instances
- `*.freedify` - Freedify factory instances
- `mailserver.*` - Mail server components (antispam, admin)
## Key Runbooks
Operational surfaces that aren't k8s services (VMs, pipelines, host-side
procedures) are documented in `infra/docs/runbooks/`:
| Surface | Runbook |
|---|---|
| Private Docker registry VM (10.0.20.10) | [registry-vm.md](../../docs/runbooks/registry-vm.md) |
| Rebuild after orphan-index incident | [registry-rebuild-image.md](../../docs/runbooks/registry-rebuild-image.md) |
| PVE host operations (backups, LVM) | [proxmox-host.md](../../docs/runbooks/proxmox-host.md) |
| NFS prerequisites and CSI mount options | [nfs-prerequisites.md](../../docs/runbooks/nfs-prerequisites.md) |
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |

View file

@ -0,0 +1,102 @@
# Setup Shared Remote Executor
Skill for setting up Claude Code's shared remote executor in new projects.
## When to Use
- When adding Claude Code support to a new project
- When the user says "set up remote executor for this project"
- When working on a new project that needs remote command execution
## Prerequisites
- Shared executor already deployed at `~/.claude/` on wizard@10.0.10.10
- Project accessible via NFS from both macOS and the remote VM
## Setup Steps
### 1. Create .claude Directory
```bash
mkdir -p .claude/sessions
```
### 2. Create session-exec.sh Wrapper
Create `.claude/session-exec.sh` with the following content (adjust PROJECT_ROOT):
```bash
#!/bin/bash
# Project-Local Session Helper - Wrapper for shared executor
set -euo pipefail
SHARED_SESSION_EXEC="/home/wizard/.claude/session-exec.sh"
PROJECT_ROOT="/home/wizard/path/to/project" # UPDATE THIS
if [ -f "$SHARED_SESSION_EXEC" ]; then
if [ "${1:-}" = "create" ] || [ -z "${1:-}" ]; then
"$SHARED_SESSION_EXEC" create "$PROJECT_ROOT"
else
"$SHARED_SESSION_EXEC" "$@"
fi
else
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SESSIONS_DIR="$SCRIPT_DIR/sessions"
SESSION_ID="${1:-$(date +%s)-$$-$RANDOM}"
ACTION="${2:-create}"
SESSION_DIR="$SESSIONS_DIR/$SESSION_ID"
case "$ACTION" in
create|init|"")
mkdir -p "$SESSION_DIR"
echo "ready" > "$SESSION_DIR/cmd_status.txt"
echo "$PROJECT_ROOT" > "$SESSION_DIR/workdir.txt"
> "$SESSION_DIR/cmd_input.txt"
> "$SESSION_DIR/cmd_output.txt"
echo "$SESSION_ID"
;;
cleanup|remove|delete)
[ -d "$SESSION_DIR" ] && rm -rf "$SESSION_DIR"
;;
status)
[ -d "$SESSION_DIR" ] && cat "$SESSION_DIR/cmd_status.txt"
;;
list)
[ -d "$SESSIONS_DIR" ] && ls -1 "$SESSIONS_DIR" 2>/dev/null
;;
esac
fi
```
Make executable: `chmod +x .claude/session-exec.sh`
### 3. Link Sessions Directory (on remote VM)
Run on the remote VM to add project sessions to the shared executor:
```bash
# Option A: Symlink project sessions (if using project-local sessions)
ln -sfn /path/to/project/.claude/sessions ~/.claude/sessions
# Option B: Use shared sessions (all projects share one directory)
# Just ensure ~/.claude/sessions exists
```
### 4. Create CLAUDE.md
Add execution instructions to `.claude/CLAUDE.md`:
```markdown
## Remote Command Execution
Uses shared executor at `~/.claude/` on wizard@10.0.10.10.
### Usage
\```bash
SESSION_ID=$(.claude/session-exec.sh)
echo "command" > .claude/sessions/$SESSION_ID/cmd_input.txt
sleep 1 && cat .claude/sessions/$SESSION_ID/cmd_status.txt
cat .claude/sessions/$SESSION_ID/cmd_output.txt
\```
Start executor: `~/.claude/remote-executor.sh` (on remote VM)
```
## Shared Executor Location
- Scripts: `~/.claude/remote-executor.sh`, `~/.claude/session-exec.sh`
- Sessions: `~/.claude/sessions/`
- Remote VM: wizard@10.0.10.10

View file

@ -7,314 +7,339 @@ description: |
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 42 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability) with safe auto-fix for evicted pods.
Runs 8 standard K8s health checks with safe auto-fix for evicted pods
and stuck CrashLoopBackOff pods.
author: Claude Code
version: 2.0.0
date: 2026-04-19
version: 1.0.0
date: 2026-02-21
---
# Cluster Health Check
## MANDATORY: Run the script first
## Overview
When this skill is invoked, your **first action** must be to run the
cluster health check script and reason over its output before doing
anything else. Do not improvise individual `kubectl` calls — the
script is the authoritative surface.
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes in the `openclaw` namespace
- **Slack notifications**: Posts results to the webhook URL in `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Automatically deletes evicted/failed pods and CrashLoopBackOff pods with >10 restarts
- **Exit code**: 0 = healthy, 1 = issues found
## Quick Check
Run the health check interactively:
```bash
cd /home/wizard/code
bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json
# Report only, no Slack notification
bash /workspace/infra/.claude/cluster-health.sh --no-slack
# Full run with Slack notification
bash /workspace/infra/.claude/cluster-health.sh
# Report only, no auto-fix and no Slack
bash /workspace/infra/.claude/cluster-health.sh --no-fix --no-slack
```
If the session is rooted elsewhere, fall back to the absolute path:
## What It Checks
```bash
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --json
```
Then:
1. Parse the JSON. Report the PASS/WARN/FAIL counts + overall verdict.
2. Iterate every FAIL and WARN check, describe what tripped, and propose
the remediation path (use the recipes below).
3. Only reach for ad-hoc `kubectl` commands when investigating a
specific failure beyond what the script reported.
Exit codes: `0` = healthy, `1` = warnings only, `2` = failures.
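A hedged sketch of step 1; the JSON field names below are assumptions, so inspect the real output with `jq keys` first:
```bash
# Field names are assumptions: adjust to the script's actual JSON shape.
jq -r '"PASS=\(.summary.pass) WARN=\(.summary.warn) FAIL=\(.summary.fail) verdict=\(.summary.verdict)"' \
  /tmp/cluster-health.json
jq -r '.checks[] | select(.status != "PASS") | [.status, .name, .detail] | @tsv' \
  /tmp/cluster-health.json
```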
## Quick flags
```bash
# Human-readable report (default), no auto-fix
bash infra/scripts/cluster_healthcheck.sh
# Machine-readable JSON summary
bash infra/scripts/cluster_healthcheck.sh --json
# Only show WARN + FAIL (suppress PASS noise)
bash infra/scripts/cluster_healthcheck.sh --quiet
# Enable auto-fix (delete evicted pods, kick stuck CrashLoop pods)
bash infra/scripts/cluster_healthcheck.sh --fix
# Combined: quiet JSON without auto-fix
bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
# Custom kubeconfig
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
## What It Checks (42 checks)
| # | Check | Notes |
|---|-------|-------|
| 1 | Node Status | NotReady nodes, version drift |
| 2 | Node Resources | CPU/mem >80% (warn) / >90% (fail) |
| 3 | Node Conditions | MemoryPressure / DiskPressure / PIDPressure |
| 4 | Problematic Pods | CrashLoopBackOff / Error / ImagePullBackOff |
| 5 | Evicted/Failed Pods | `status.phase=Failed` |
| 6 | DaemonSets | desired == ready |
| 7 | Deployments | ready == desired replicas |
| 8 | PVC Status | all Bound |
| 9 | HPA Health | targets not `<unknown>`, utilization <100% |
| 10 | CronJob Failures | job conditions `Failed=True` in last 24h |
| 11 | CrowdSec Agents | all pods Running |
| 12 | Ingress Routes | every ingress has an LB IP + Traefik LB |
| 13 | Prometheus Alerts | count of firing alerts |
| 14 | Uptime Kuma Monitors | internal + external monitors up |
| 15 | ResourceQuota Pressure | any quota >80% used |
| 16 | StatefulSets | ready == desired |
| 17 | Node Disk Usage | ephemeral-storage <80% |
| 18 | Helm Release Health | all `deployed` (no `pending-*`) |
| 19 | Kyverno Policy Engine | all pods Running |
| 20 | NFS Connectivity | 192.168.1.127 showmount / port 2049 |
| 21 | DNS Resolution | Technitium resolves internal + external |
| 22 | TLS Certificate Expiry | TLS `Secret` certs >30d valid |
| 23 | GPU Health | nvidia namespace + device-plugin Running |
| 24 | Cloudflare Tunnel | pods Running |
| 25 | Resource Usage | node CPU/mem headroom |
| 26 | HA Sofia — Entity Availability | Home Assistant unavailable/unknown count |
| 27 | HA Sofia — Integration Health | config entries setup_error / not_loaded |
| 28 | HA Sofia — Automation Status | disabled / stale (>30d) automations |
| 29 | HA Sofia — System Resources | HA CPU / mem / disk |
| 30 | Hardware Exporters | snmp / idrac-redfish / proxmox / tuya pods + scrapes |
| 31 | cert-manager — Certificate Readiness | Certificate CRs with `Ready!=True` |
| 32 | cert-manager — Certificate Expiry (<14d) | notAfter within 14d |
| 33 | cert-manager — Failed CertificateRequests | `Ready=False, reason=Failed` |
| 34 | Backup Freshness — Per-DB Dumps | MySQL + PG dumps within 25h |
| 35 | Backup Freshness — Offsite Sync | Pushgateway `backup_last_success_timestamp` <27h |
| 36 | Backup Freshness — LVM PVC Snapshots | newest thin snapshot <25h (SSH PVE) |
| 37 | Monitoring — Prometheus + Alertmanager | `/-/ready` + AM pods Running |
| 38 | Monitoring — Vault Sealed Status | `vault status` reports `Sealed: false` |
| 39 | Monitoring — ClusterSecretStore Ready | `vault-kv` + `vault-database` Ready |
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
| # | Check | Auto-Fix | Alerts |
|---|-------|----------|--------|
| 1 | **Node Health** — NotReady nodes, MemoryPressure, DiskPressure, PIDPressure | No | Yes |
| 2 | **Pod Health** — CrashLoopBackOff, ImagePullBackOff, ErrImagePull, Error | Yes (CrashLoop >10 restarts) | Yes |
| 3 | **Evicted/Failed Pods** — Pods in `Failed` phase | Yes (deletes all) | Yes |
| 4 | **Failed Deployments** — Deployments with ready != desired replicas | No | Yes |
| 5 | **Pending PVCs** — PersistentVolumeClaims not in `Bound` state | No | Yes |
| 6 | **Resource Pressure** — Node CPU or memory >80% (warn) or >90% (issue) | No | Yes |
| 7 | **CronJob Failures** — Failed CronJob-owned Jobs in the last 24h | No | Yes |
| 8 | **DaemonSet Health** — DaemonSets with desired != ready | No | Yes |
## Safe Auto-Fix Rules
`--fix` only performs operations that are genuinely reversible and
observable. Nothing here rewrites Terraform state or mutates the cluster
beyond "delete pod".
### Safe to auto-fix (the script does these automatically)
### Done automatically by `--fix`
1. **Evicted/Failed pods** — These are already terminated and just cluttering the namespace:
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```
- **Evicted / Failed pods** — delete them; the controller recreates.
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```
- **CrashLoopBackOff pods with >10 restarts** — delete once to reset
backoff timer.
2. **CrashLoopBackOff pods with >10 restarts** — The pod is stuck in a crash loop; deleting lets the controller recreate it with a fresh backoff timer:
```bash
kubectl delete pod -n <namespace> <pod-name> --grace-period=0
```
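For reference, one way to enumerate the pods matching this rule before deleting them; this is an illustrative `jq` query, not necessarily what the script itself runs:
```bash
# namespace, pod and restart count for containers in CrashLoopBackOff with >10 restarts
kubectl get pods -A -o json | jq -r '
  .items[] | . as $p
  | .status.containerStatuses[]?
  | select(.state.waiting.reason == "CrashLoopBackOff" and .restartCount > 10)
  | "\($p.metadata.namespace)\t\($p.metadata.name)\t\(.restartCount)"'
```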
### NEVER auto-fix (requires human investigation)
- NotReady nodes
- MemoryPressure / DiskPressure / PIDPressure
- ImagePullBackOff (usually a bad tag / registry credential)
- Deployment ready-replica mismatch
- Pending PVCs
- Node CPU/memory >90%
- CronJob failures
- DaemonSet desired != ready
- Vault sealed
- ClusterSecretStore not Ready
- cert-manager Certificate failures
- Backup freshness regressions
- Any external-reachability failure
- **NotReady nodes** — Could be network, kubelet, or hardware issue; needs SSH investigation
- **DiskPressure / MemoryPressure / PIDPressure** — Root cause must be identified
- **ImagePullBackOff** — Usually a wrong image tag or registry issue; needs config fix
- **Failed deployments** — Could be resource limits, bad config, missing secrets
- **Pending PVCs** — Usually NFS export missing or storage class issue
- **Resource pressure >90%** — Need to identify which pods are consuming resources
- **CronJob failures** — Need to check job logs to understand why it failed
- **DaemonSet issues** — Could be node taints, resource limits, or image issues
## Deep-investigation recipes per failure mode
## Deep Investigation
### Node Issues (checks 1, 3, 17, 25)
When the health check reports issues, use these commands to investigate further.
### Node Issues
```bash
kubectl describe node <node>
# Describe the problematic node (events, conditions, capacity)
kubectl describe node <node-name>
# Check resource usage across all nodes
kubectl top nodes
kubectl get events --field-selector involvedObject.name=<node> --sort-by='.lastTimestamp'
# SSH to the node
ssh root@10.0.20.10X
# Check recent events on a specific node
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
# SSH to the node for direct inspection
ssh root@<node-ip>
systemctl status kubelet
journalctl -u kubelet --since "30 minutes ago" | tail -100
df -h ; free -h
df -h
free -h
```
Node IPs: `10.0.20.100` master, `.101` node1 (GPU), `.102` node2,
`.103` node3, `.104` node4.
### Pod Issues (checks 4, 5, 11, 19)
### Pod Issues
```bash
kubectl describe pod -n <ns> <pod>
kubectl logs -n <ns> <pod> --tail=200
kubectl logs -n <ns> <pod> --previous --tail=200
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
# Describe the pod (events, conditions, container statuses)
kubectl describe pod -n <namespace> <pod-name>
# Check current logs
kubectl logs -n <namespace> <pod-name> --tail=100
# Check logs from the previous crashed container
kubectl logs -n <namespace> <pod-name> --previous --tail=100
# Check events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Check all pods in a namespace
kubectl get pods -n <namespace> -o wide
```
Common failure causes: OOMKilled (raise mem limit in Terraform), bad
config / missing env var, DB connection failure (check `dbaas` pods),
NFS mount failure (`showmount -e 192.168.1.127`), stale
imagePullSecret.
### Deployment / StatefulSet / DaemonSet (checks 6, 7, 16)
### Deployment Issues
```bash
kubectl describe deployment -n <ns> <name>
kubectl rollout status deployment -n <ns> <name>
kubectl rollout history deployment -n <ns> <name>
kubectl get rs -n <ns> -l app=<app>
# Describe the deployment (strategy, conditions, events)
kubectl describe deployment -n <namespace> <deployment-name>
# Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>
# Check rollout history
kubectl rollout history deployment -n <namespace> <deployment-name>
# Check the replicaset
kubectl get rs -n <namespace> -l app=<app-label>
```
### PVC (check 8)
### PVC Issues
```bash
kubectl describe pvc -n <ns> <pvc>
kubectl get events -n <ns> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
kubectl get pv | grep <pvc>
showmount -e 192.168.1.127
# Describe the PVC (events, status, storage class)
kubectl describe pvc -n <namespace> <pvc-name>
# Check PVs
kubectl get pv
# Check events related to PVCs
kubectl get events -n <namespace> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
# Verify NFS export exists
showmount -e 10.0.10.15 | grep <service-name>
```
### cert-manager (checks 31, 32, 33)
### Resource Pressure
```bash
kubectl get certificate -A
kubectl describe certificate -n <ns> <name>
kubectl get certificaterequest -A
kubectl describe certificaterequest -n <ns> <name>
kubectl logs -n cert-manager deploy/cert-manager | tail -50
# Top nodes (CPU and memory usage)
kubectl top nodes
# Top pods sorted by memory (cluster-wide)
kubectl top pods -A --sort-by=memory | head -20
# Top pods sorted by CPU (cluster-wide)
kubectl top pods -A --sort-by=cpu | head -20
# Check resource requests/limits in a namespace
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
```
Common causes: ACME HTTP-01 challenge blocked, ClusterIssuer missing
DNS provider secret, rate-limit from Let's Encrypt.
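When the ACME path is the suspect, the Order and Challenge CRs usually carry the exact error string. Illustrative commands (resource names vary per issuance):
```bash
# Pending ACME orders and challenges across the cluster
kubectl get orders.acme.cert-manager.io,challenges.acme.cert-manager.io -A
# The status block of a stuck challenge shows the ACME error (blocked HTTP-01, DNS failure, rate limit, ...)
kubectl describe challenge -n <ns> <challenge-name>
```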
## Common Remediation
### Backups (checks 34, 35, 36)
### Persistent CrashLoopBackOff
```bash
# Per-DB dumps (inside the DB pod)
kubectl exec -n dbaas mysql-standalone-0 -- ls -lah /backup/per-db/
kubectl exec -n dbaas pg-cluster-0 -- ls -lah /backup/per-db/
A pod keeps crashing even after the auto-fix deletes it.
# Pushgateway metrics
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- http://prometheus-prometheus-pushgateway:9091/metrics | \
grep backup_last_success_timestamp
1. **Check logs from the crashed container**:
```bash
kubectl logs -n <namespace> <pod-name> --previous --tail=200
```
# LVM snapshots on PVE host
ssh -o BatchMode=yes root@192.168.1.127 \
'lvs -o lv_name,lv_time,lv_size --noheadings | grep snap'
```
2. **Check the pod description for clues**:
```bash
kubectl describe pod -n <namespace> <pod-name>
```
Look for:
- `OOMKilled` in Last State — the container ran out of memory
- `Error` with exit code 1 — application error (bad config, missing env var, DB connection failure)
- `Error` with exit code 137 — killed by OOM killer or liveness probe
- `Error` with exit code 143 — SIGTERM (graceful shutdown failure)
If offsite sync is stale, the most common cause is the
`offsite-sync-backup.service` systemd unit failing on the PVE host; check with
`ssh root@192.168.1.127 'systemctl status offsite-sync-backup'`.
3. **Common causes**:
- **OOMKilled**: Increase memory limits in Terraform (see below)
- **Bad config**: Check environment variables, secrets, config maps
- **DB connection failure**: Verify the database pod is running (`kubectl get pods -n dbaas`)
- **NFS mount failure**: Verify NFS export exists (`showmount -e 10.0.10.15`)
- **Missing secret**: Check if TLS secret or other secrets exist in the namespace
### Monitoring stack (checks 37, 38, 39)
### OOMKilled
```bash
# Prometheus
kubectl exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/-/ready
kubectl logs -n monitoring deploy/prometheus-server --tail=100
The container was killed because it exceeded its memory limit.
# Alertmanager
kubectl get pods -n monitoring | grep alertmanager
kubectl logs -n monitoring -l app=prometheus-alertmanager --tail=100
1. **Check current limits**:
```bash
kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Limits"
```
# Vault
kubectl exec -n vault vault-0 -- sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
# If sealed: check raft peers with `vault operator raft list-peers` and unseal.
2. **Fix in Terraform** — Edit `modules/kubernetes/<service>/main.tf` and increase the memory limit:
```hcl
resources {
limits = {
memory = "2Gi" # Increase from current value
}
}
```
# ClusterSecretStore
kubectl get clustersecretstore
kubectl describe clustersecretstore vault-kv vault-database
kubectl logs -n external-secrets deploy/external-secrets --tail=100
```
3. **Apply the change**:
```bash
cd /workspace/infra
terraform apply -target=module.kubernetes_cluster.module.<service> -auto-approve
```
### External reachability (checks 40, 41, 42)
### ImagePullBackOff
```bash
# Cloudflared
kubectl get pods -n cloudflared
kubectl logs -n cloudflared -l app=cloudflared --tail=100
The container image cannot be pulled.
# Authentik
kubectl get pods -n authentik -l app=authentik-server
kubectl logs -n authentik -l app=authentik-server --tail=100
1. **Check the exact error**:
```bash
kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Events"
```
# ExternalAccessDivergence alert
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
python3 -m json.tool | grep -A 5 ExternalAccessDivergence
2. **Common causes**:
- **Wrong image tag**: Verify the tag exists on the registry (Docker Hub, ghcr.io, etc.)
- **Private registry without credentials**: Check if imagePullSecrets are configured
- **Pull-through cache issue**: The registry cache at `10.0.20.10` may have a stale entry
```bash
# Check pull-through cache ports:
# 5000 = docker.io, 5010 = ghcr.io, 5020 = quay.io, 5030 = registry.k8s.io
curl -s http://10.0.20.10:5000/v2/_catalog | python3 -m json.tool
```
- **Registry rate limit**: Docker Hub free tier has pull limits; pull-through cache helps avoid this
# Traefik 5xx — find the hot service
kubectl exec -n monitoring deploy/prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' \
| python3 -m json.tool
```
3. **Fix**: Update the image tag in the service's Terraform module and re-apply.
### OOMKilled remediation
### Node NotReady
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Limits`
2. Edit `infra/modules/kubernetes/<service>/main.tf` and raise
`resources.limits.memory`.
3. `cd /home/wizard/code/infra && scripts/tg apply` (Tier 1) or
`terraform apply -target=module.<service>` as appropriate.
A node has gone NotReady.
### ImagePullBackOff remediation
1. **Check node conditions**:
```bash
kubectl describe node <node-name> | grep -A 20 "Conditions"
```
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Events`
2. Verify the tag exists on the source registry (see the registry-API sketch after this list).
3. Check pull-through cache at `10.0.20.10:{5000,5010,5020,5030}`.
4. Update the image tag in Terraform + re-apply.
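For step 2, the registry v2 API can confirm a tag server-side, either against the pull-through cache or the upstream registry. A sketch (`library/nginx` is only a placeholder repository):
```bash
# Tags the docker.io pull-through cache knows for a repository
curl -s http://10.0.20.10:5000/v2/library/nginx/tags/list | python3 -m json.tool
# Same check against Docker Hub directly (anonymous pull token)
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/nginx:pull" | jq -r .token)
curl -s -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/library/nginx/tags/list | jq .
```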
2. **SSH to the node and check kubelet**:
```bash
ssh root@<node-ip>
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" | tail -50
```
### Persistent CrashLoopBackOff after auto-fix
3. **Check resources**:
```bash
# On the node
df -h # Disk space
free -h # Memory
top -bn1 # CPU/processes
```
1. `kubectl logs -n <ns> <pod> --previous --tail=200`
2. `kubectl describe pod -n <ns> <pod>` and check Last State:
- `OOMKilled` → raise memory limit
- Exit code 137 → OOM or probe killed
- Exit code 143 → SIGTERM / graceful shutdown failed
3. Cross-check dbaas + NFS + secrets are healthy.
4. **Node IPs** (for SSH):
- `10.0.20.100` — k8s-master
- `10.0.20.101` — k8s-node1 (GPU)
- `10.0.20.102` — k8s-node2
- `10.0.20.103` — k8s-node3
- `10.0.20.104` — k8s-node4
## Notes on the canonical / hardlink setup
## Slack Webhook
The authoritative copy of this SKILL.md lives at
`/home/wizard/code/.claude/skills/cluster-health/SKILL.md`. A hardlink
at `/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md`
points to the same inode so infra-rooted sessions also discover the
skill.
The script posts results to the Slack incoming webhook URL in `$SLACK_WEBHOOK_URL`. The message format uses Slack mrkdwn:
- All clear: green checkmark with node/pod count
- Warnings only: warning icon with details
- Issues found: red alert icon with auto-fixes applied and remaining issues
To verify the hardlink is intact:
The webhook URL is passed as an environment variable from `openclaw_skill_secrets` in `terraform.tfvars`.
```bash
stat -c '%i %n' \
/home/wizard/code/.claude/skills/cluster-health/SKILL.md \
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
```
## Infrastructure
Both should print the same inode number. If they diverge (e.g. `git
checkout` replaced the file rather than updating it), re-link:
| Component | Path / Location |
|-----------|----------------|
| Health check script | `/workspace/infra/.claude/cluster-health.sh` (in-pod) or `.claude/cluster-health.sh` (repo) |
| Terraform module | `modules/kubernetes/openclaw/main.tf` |
| CronJob definition | Defined in the OpenClaw Terraform module |
| Existing full healthcheck | `scripts/cluster_healthcheck.sh` (local-only, 24 checks with color output) |
| Infra repo (in pod) | `/workspace/infra` |
| kubectl (in pod) | `/tools/kubectl` |
| terraform (in pod) | `/tools/terraform` |
```bash
ln -f /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
```
## Auto-File Incidents for SEV1/SEV2
After running health checks, if **SEV1 or SEV2 issues** are found (node down, multiple services affected, core service outage, or single important service down), auto-file a GitHub Issue:
### Severity Classification
- **SEV1**: Node NotReady, multiple services down, data at risk, core service outage (DNS, auth, ingress, databases)
- **SEV2**: Single non-core service down, degraded performance, persistent CrashLoopBackOff
- **SEV3**: Warnings only, resource pressure <90%, or cosmetic issues; do NOT auto-file
### Workflow
1. **Dedup check**: Before filing, query open incidents:
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues?labels=incident&state=open&per_page=50"
```
If an open issue already covers the same service/namespace, **skip filing**.
2. **File the issue** with labels `incident`, `sev1` or `sev2`, `postmortem-required` (see the example call after this list):
- Title: `[AUTO] <Service/Namespace> — <brief symptom>`
- Body: full diagnostic dump (pod status, events, alerts, node state)
- The issue-automation GHA workflow will trigger the post-mortem pipeline automatically
3. **Auto-close recovered services**: If a service that previously had an auto-filed incident is now healthy:
```bash
# Comment and close
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
-d '{"body": "**Resolved** — Service recovered. Auto-closed by cluster health check."}'
curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" \
-d '{"state": "closed"}'
```
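An example of the filing call from step 2; the title and body below are placeholders, and the token comes from the same Vault path used in the dedup check:
```bash
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues" \
  -d '{
    "title": "[AUTO] dbaas — mysql-standalone-0 CrashLoopBackOff",
    "body": "Filed automatically by the cluster health check.\n\n<diagnostic dump>",
    "labels": ["incident", "sev2", "postmortem-required"]
  }'
```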
## Post-Mortem Auto-Suggest
After running a healthcheck, if the cluster has **recovered from an unhealthy state** (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:
> The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run `/post-mortem` to generate one.
This ensures incidents are documented while context is fresh.
## Notes
1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
2. The full `scripts/cluster_healthcheck.sh` script runs 24 checks and is meant for local interactive use; this skill's script runs 8 core checks optimized for automated CronJob execution
3. When investigating issues interactively, prefer running commands directly rather than re-running the script
4. All Terraform changes must go through the `.tf` files — never use `kubectl apply/edit/patch` for persistent changes

View file

@ -155,19 +155,3 @@ Common port is 80. Exceptions:
3. Add `time.sleep(0.3)` between bulk operations to avoid overloading
4. Homepage dashboard widget slug: `cluster-internal`
5. Cloudflare-proxied at `uptime.viktorbarzin.me`
## Terraform-Managed Monitors
There is NO `louislam/uptime-kuma` Terraform provider. Two patterns exist for
declarative monitor management in this stack:
- **External HTTPS monitors** — auto-discovered from ingress annotations by the
`external-monitor-sync` CronJob (`*/10 * * * *`). Opt-out via
`uptime.viktorbarzin.me/external-monitor: "false"` on the ingress.
- **Internal monitors (DBs, non-HTTP)** — declared in the
`local.internal_monitors` list in `stacks/uptime-kuma/modules/uptime-kuma/main.tf`
and synced by the `internal-monitor-sync` CronJob. To add one, append to the
list (provide `name`, `type`, `database_connection_string`,
`database_password_vault_key`, `interval`, `retry_interval`, `max_retries`)
and `scripts/tg apply`. The sync is idempotent — looks up by name, creates
if missing, patches if drifted. Existing monitors keep their id and history.

5
.gitignore vendored
View file

@ -65,11 +65,6 @@ state/infra/
backend.tf
providers.tf
.terraform.lock.hcl
cloudflare_provider.tf
tiers.tf
stacks/*/cloudflare_provider.tf
stacks/*/tiers.tf
stacks/*/terragrunt_rendered.json
# Kubernetes config (sensitive)
config

View file

@ -1,85 +1,38 @@
# Build the CI tools Docker image used by all infra pipelines.
# Triggers on push that touches ci/Dockerfile, or manual (API/UI) so
# rebuilds after a registry incident don't need a cosmetic Dockerfile edit.
# Triggers on changes to ci/Dockerfile only (push to master).
when:
- event: push
branch: master
path:
include:
- 'ci/Dockerfile'
- event: manual
event: push
branch: master
path:
include:
- 'ci/Dockerfile'
steps:
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
# Phase 4 of forgejo-registry-consolidation 2026-05-07 —
# registry.viktorbarzin.me dropped, Forgejo is the only target.
repo:
- forgejo.viktorbarzin.me/viktor/infra-ci
repo: registry.viktorbarzin.me:5050/infra-ci
dockerfile: ci/Dockerfile
context: ci/
tags:
- latest
- "${CI_COMMIT_SHA:0:8}"
platforms: linux/amd64
registry: registry.viktorbarzin.me:5050
logins:
- registry: forgejo.viktorbarzin.me
- registry: registry.viktorbarzin.me:5050
username:
from_secret: forgejo_user
from_secret: registry_user
password:
from_secret: forgejo_push_token
# Post-push integrity check is now redundant with the every-15min
# forgejo-integrity-probe in stacks/monitoring/, which walks
# /v2/_catalog + HEADs every blob across the entire Forgejo registry.
# If a corruption pattern emerges that the periodic probe misses,
# restore a verify step similar to the pre-Phase-4 version (see
# commit 49f4956f) but pointed at forgejo.viktorbarzin.me.
# Break-glass tarball: save the just-pushed infra-ci image to disk on the
# registry VM (10.0.20.10) so we can `docker load` it back into a node
# when Forgejo is unreachable. Pulls from Forgejo (the only registry now).
# Best-effort — failure here doesn't fail the pipeline.
# Recovery procedure: docs/runbooks/forgejo-registry-breakglass.md.
- name: breakglass-tarball
image: alpine:3.20
failure: ignore
environment:
REGISTRY_SSH_KEY:
from_secret: registry_ssh_key
FORGEJO_USER:
from_secret: forgejo_user
FORGEJO_PASS:
from_secret: forgejo_push_token
commands:
- apk add --no-cache openssh-client
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
- SHA=${CI_COMMIT_SHA:0:8}
- |
ssh -n -o BatchMode=yes root@10.0.20.10 "
set -e
mkdir -p /opt/registry/data/private/_breakglass
IMAGE=forgejo.viktorbarzin.me/viktor/infra-ci:$SHA
echo \$FORGEJO_PASS | docker login forgejo.viktorbarzin.me -u \$FORGEJO_USER --password-stdin
docker pull \$IMAGE
docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-$SHA.tar.gz
ln -sfn infra-ci-$SHA.tar.gz /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
ls -t /opt/registry/data/private/_breakglass/infra-ci-*.tar.gz \
| grep -v 'latest' | tail -n +6 | xargs -r rm -v
ls -lh /opt/registry/data/private/_breakglass/
"
from_secret: registry_password
- name: slack
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"CI image built: forgejo.viktorbarzin.me/viktor/infra-ci:${CI_COMMIT_SHA:0:8} (and registry-private mirror)\"}" \
--data "{\"text\":\"CI image built: registry.viktorbarzin.me:5050/infra-ci:${CI_COMMIT_SHA:0:8}\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:

View file

@ -23,14 +23,6 @@ steps:
username: viktorbarzin
password:
from_secret: dockerhub-pat
# Private registry on :5050 requires htpasswd auth since 2026-03-22.
# Without this, buildx pushes the second repo but blob HEAD comes
# back 401 → pipeline fails → CI false-negative (see bd code-12b).
- registry: registry.viktorbarzin.me:5050
username:
from_secret: registry_user
password:
from_secret: registry_password
dockerfile: cli/Dockerfile
context: cli
auto_tag: true

View file

@ -25,7 +25,7 @@ clone:
steps:
- name: apply
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
image: registry.viktorbarzin.me:5050/infra-ci:latest
pull: true
backend_options:
kubernetes:
@ -37,12 +37,6 @@ steps:
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
# Each `- |` command runs in a fresh shell, so we can't rely on an
# `export VAULT_ADDR=...` in the auth command persisting — pin it at
# step level. VAULT_TOKEN is still per-command; we persist it to
# ~/.vault-token (auto-read by `vault` CLI) so downstream commands
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands:
# ── Skip CI commits ──
- |
@ -61,17 +55,9 @@ steps:
# ── Vault auth ──
- |
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
export VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "ERROR: Vault K8s auth failed (role=ci, ns=woodpecker)" >&2
exit 1
fi
# Persist for downstream `- |` blocks (each runs in a fresh shell,
# so exporting VAULT_TOKEN wouldn't help). `vault`, `scripts/tg`,
# and `scripts/state-sync` all fall through to ~/.vault-token when
# the env var is unset.
umask 077; printf '%s' "$VAULT_TOKEN" > "$HOME/.vault-token"
# ── Detect changed stacks ──
- |
@ -128,7 +114,7 @@ steps:
# ── Pre-warm provider cache ──
- |
if [ -s .platform_apply ] || [ -s .app_apply ]; then
FIRST_STACK=$(cat .platform_apply .app_apply 2>/dev/null | head -1)
FIRST_STACK=$(head -1 .platform_apply .app_apply 2>/dev/null | head -1)
if [ -n "$FIRST_STACK" ]; then
echo "Pre-warming provider cache from stacks/$FIRST_STACK..."
cd "stacks/$FIRST_STACK" && terragrunt init --terragrunt-non-interactive -input=false 2>&1 | tail -3 && cd ../..
@ -137,7 +123,6 @@ steps:
# ── Apply platform stacks (serial, with Vault advisory locks) ──
- |
FAILED_PLATFORM_STACKS=""
if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do
@ -150,9 +135,8 @@ steps:
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "$OUTPUT" | tail -5
echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
fi
else
echo "$OUTPUT" | tail -3
@ -160,12 +144,9 @@ steps:
fi
done < .platform_apply
fi
# Deferred until after app stacks so both lists get a chance to run.
echo "$FAILED_PLATFORM_STACKS" > .platform_failed
# ── Apply app stacks (serial, with Vault advisory locks) ──
- |
FAILED_APP_STACKS=""
if [ -s .app_apply ]; then
echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do
@ -178,9 +159,8 @@ steps:
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "$OUTPUT" | tail -5
echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
fi
else
echo "$OUTPUT" | tail -3
@ -188,15 +168,6 @@ steps:
fi
done < .app_apply
fi
# Fail the step loudly so the pipeline `default` workflow state
# reflects reality — the service-upgrade agent and CI alert cascade
# both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
# counted as failures.
FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | tr -d ' ')
if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
exit 1
fi
# ── Commit and push state changes ──
- |

View file

@ -14,7 +14,7 @@ clone:
steps:
- name: detect-drift
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
image: registry.viktorbarzin.me:5050/infra-ci:latest
pull: true
backend_options:
kubernetes:
@ -42,15 +42,10 @@ steps:
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
# ── Run terraform plan on all stacks ──
# Emits two timestamps per drifted stack so the Pushgateway/Prometheus
# side can compute drift-age-hours via `time() - drift_stack_first_seen`.
- |
DRIFTED=""
CLEAN=0
ERRORS=""
NOW=$(date +%s)
# Metrics accumulator — written once per stack, then pushed as a batch.
METRICS=""
for stack_dir in stacks/*/; do
stack=$(basename "$stack_dir")
@ -61,50 +56,12 @@ steps:
EXIT=$?
case $EXIT in
0)
echo "OK (no changes)"
CLEAN=$((CLEAN + 1))
# drift_stack_state=0 means clean; age-hours is irrelevant, but we
# still push 0 so the per-stack gauges don't go stale.
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 0\n"
METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} 0\n"
;;
1)
echo "ERROR"
ERRORS="$ERRORS $stack"
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 2\n"
;;
2)
echo "DRIFT DETECTED"
DRIFTED="$DRIFTED $stack"
# Fetch first-seen timestamp from Pushgateway (preserve across runs).
FIRST_SEEN=$(curl -s "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics" \
| awk -v s="$stack" '$1 == "drift_stack_first_seen{stack=\""s"\"}" {print $2; exit}')
if [ -z "$FIRST_SEEN" ] || [ "$FIRST_SEEN" = "0" ]; then
FIRST_SEEN="$NOW"
fi
AGE_HOURS=$(( (NOW - FIRST_SEEN) / 3600 ))
METRICS="${METRICS}drift_stack_state{stack=\"$stack\"} 1\n"
METRICS="${METRICS}drift_stack_first_seen{stack=\"$stack\"} $FIRST_SEEN\n"
METRICS="${METRICS}drift_stack_age_hours{stack=\"$stack\"} $AGE_HOURS\n"
;;
0) echo "OK (no changes)"; CLEAN=$((CLEAN + 1)) ;;
1) echo "ERROR"; ERRORS="$ERRORS $stack" ;;
2) echo "DRIFT DETECTED"; DRIFTED="$DRIFTED $stack" ;;
esac
done
# Summary counters — single gauge per run.
DRIFT_COUNT=$(echo "$DRIFTED" | wc -w)
ERROR_COUNT=$(echo "$ERRORS" | wc -w)
METRICS="${METRICS}drift_stack_count $DRIFT_COUNT\n"
METRICS="${METRICS}drift_error_count $ERROR_COUNT\n"
METRICS="${METRICS}drift_clean_count $CLEAN\n"
METRICS="${METRICS}drift_detection_last_run_timestamp $NOW\n"
# ── Push to Pushgateway ──
# One batched push keeps the run atomic: either all metrics land or none.
printf "%b" "$METRICS" | curl -s --data-binary @- \
http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/drift-detection \
|| echo "(pushgateway unavailable, metrics lost for this run)"
echo ""
echo "=== Drift Detection Summary ==="
echo "Clean: $CLEAN stacks"

View file

@ -9,70 +9,52 @@ clone:
steps:
- name: run-issue-responder
image: alpine:3.20
image: python:3.12-alpine
commands:
- apk add --no-cache curl jq
- apk add --no-cache openssh-client curl jq
# Authenticate to Vault via K8s SA JWT
- |
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
VAULT_RESP=$(curl -sf -X POST http://vault-active.vault.svc.cluster.local:8200/v1/auth/kubernetes/login \
-d "{\"role\":\"ci\",\"jwt\":\"$$SA_TOKEN\"}")
VAULT_TOKEN=$(echo "$$VAULT_RESP" | jq -r .auth.client_token)
if [ -z "$$VAULT_TOKEN" ] || [ "$$VAULT_TOKEN" = "null" ]; then
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}")
VAULT_TOKEN=$(echo "$VAULT_RESP" | jq -r .auth.client_token)
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "ERROR: Vault authentication failed"
exit 1
fi
echo "Vault authenticated"
# Fetch API token for claude-agent-service
# Fetch DevVM SSH key
- |
AGENT_TOKEN=$(curl -sf -H "X-Vault-Token: $$VAULT_TOKEN" \
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/claude-agent-service | \
jq -r '.data.data.api_bearer_token')
if [ -z "$$AGENT_TOKEN" ] || [ "$$AGENT_TOKEN" = "null" ]; then
echo "ERROR: Failed to fetch agent API token"
curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
chmod 600 /tmp/devvm-key
if [ ! -s /tmp/devvm-key ]; then
echo "ERROR: Failed to fetch DevVM SSH key"
exit 1
fi
echo "Agent token fetched"
# Submit job to claude-agent-service
echo "SSH key fetched"
# SSH to DevVM and run issue-responder agent
- |
ISSUE_NUM="${ISSUE_NUMBER:-}"
ISSUE_TITLE="${ISSUE_TITLE:-}"
ISSUE_LABELS="${ISSUE_LABELS:-}"
ISSUE_URL="${ISSUE_URL:-}"
if [ -z "$$ISSUE_NUM" ]; then
if [ -z "$ISSUE_NUM" ]; then
echo "ERROR: No issue number provided"
exit 1
fi
echo "Processing issue #$$ISSUE_NUM: $$ISSUE_TITLE"
echo "Processing issue #$ISSUE_NUM: $ISSUE_TITLE"
echo "Labels: $ISSUE_LABELS"
PAYLOAD=$(jq -n \
--arg prompt "Process GitHub Issue #$$ISSUE_NUM: $$ISSUE_TITLE. Labels: $$ISSUE_LABELS. URL: $$ISSUE_URL. Read the issue body via GitHub API, investigate, and take appropriate action." \
--arg agent ".claude/agents/issue-responder" \
'{prompt: $prompt, agent: $agent, max_budget_usd: 10, timeout_seconds: 1800}')
RESP=$(curl -sf -X POST \
-H "Authorization: Bearer $$AGENT_TOKEN" \
-H "Content-Type: application/json" \
-d "$$PAYLOAD" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
JOB_ID=$(echo "$$RESP" | jq -r '.job_id')
echo "Job submitted: $$JOB_ID"
# Poll for completion (30min max)
- |
for i in $(seq 1 120); do
sleep 15
RESULT=$(curl -sf \
-H "Authorization: Bearer $$AGENT_TOKEN" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$$JOB_ID)
STATUS=$(echo "$$RESULT" | jq -r '.status')
echo "[$$i/120] Status: $$STATUS"
if [ "$$STATUS" != "running" ]; then
echo "$$RESULT" | jq .
if [ "$$STATUS" = "completed" ]; then exit 0; else exit 1; fi
fi
done
echo "ERROR: Job timed out after 30 minutes"
exit 1
ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
"cd ~/code && git -C infra stash && git -C infra pull --rebase && git -C infra stash pop 2>/dev/null; \
~/.local/bin/claude -p \
--agent infra/.claude/agents/issue-responder \
--dangerously-skip-permissions \
--max-budget-usd 10 \
'Process GitHub Issue #${ISSUE_NUM}: ${ISSUE_TITLE}. Labels: ${ISSUE_LABELS}. URL: ${ISSUE_URL}. Read the issue body via GitHub API, investigate, and take appropriate action.'"
# Cleanup
- rm -f /tmp/devvm-key

View file

@ -17,7 +17,7 @@ steps:
- name: parse-and-implement
image: python:3.12-alpine
commands:
- apk add --no-cache jq curl git
- apk add --no-cache jq curl git openssh-client
- sh scripts/postmortem-pipeline.sh
- name: notify-slack

View file

@ -1,63 +0,0 @@
# Sync infra/scripts/pve-nfs-exports → PVE host /etc/exports on change.
#
# Wave 6b of the state-drift consolidation plan: move the "scp + exportfs -ra"
# deploy step out of runbook-human-hands and into CI so the Proxmox NFS export
# table tracks git.
#
# Trigger: push to master that touches `scripts/pve-nfs-exports`. The file
# header documents the deploy invocation; this pipeline codifies it.
#
# Credentials:
# - pve_ssh_key: Woodpecker repo-secret (ed25519 keypair provisioned
# 2026-04-18 as `woodpecker-pve-nfs-exports-sync`). Public key lives in
# /root/.ssh/authorized_keys on the PVE host. Private key mirrored in
# Vault `secret/woodpecker/pve_ssh_key` for recovery.
when:
- event: push
branch: master
path: scripts/pve-nfs-exports
- event: manual
clone:
git:
image: woodpeckerci/plugin-git
settings:
depth: 1
attempts: 3
steps:
- name: deploy
image: alpine:3.20
environment:
PVE_SSH_KEY:
from_secret: pve_ssh_key
SLACK_WEBHOOK:
from_secret: slack_webhook
commands:
- apk add --no-cache openssh-client curl
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$PVE_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
# Pin host key — CI's ~/.ssh/known_hosts is ephemeral, so accept-new on first pull.
- ssh-keyscan -t ed25519 192.168.1.127 >> ~/.ssh/known_hosts 2>/dev/null
# Diff what we'd ship, so pipeline logs show the intended change.
- echo '---diff---' && ssh -o BatchMode=yes root@192.168.1.127 "cat /etc/exports" > /tmp/remote.exports || true
- diff -u /tmp/remote.exports scripts/pve-nfs-exports || true
- echo '---applying---'
- scp -o BatchMode=yes scripts/pve-nfs-exports root@192.168.1.127:/etc/exports
- ssh -o BatchMode=yes root@192.168.1.127 "exportfs -ra && exportfs -s | head -5"
- echo '---done---'
- name: slack
image: curlimages/curl:8.11.0
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"PVE /etc/exports sync: ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
when:
status: [success, failure]

View file

@ -1,156 +0,0 @@
# Sync modules/docker-registry/* → /opt/registry/ on docker-registry VM
# (10.0.20.10) on change, and bounce containers + nginx when needed.
#
# Replaces the manual "ssh + scp + docker compose up -d" that was required
# after the 2026-04-19 `registry:2 → registry:2.8.3` pin landed. The deploy
# flow is now: edit a file in modules/docker-registry/ → git push → this
# pipeline runs → registry VM picks up the change.
#
# Trigger: push to master that touches any managed file (see `when.path`),
# or a manual run via Woodpecker UI / API.
#
# Credentials:
# - registry_ssh_key: Woodpecker repo-secret (ed25519 keypair provisioned
# 2026-04-19 as `woodpecker-registry-config-sync`). Public key lives in
# /root/.ssh/authorized_keys on 10.0.20.10. Private key mirrored in
# Vault `secret/woodpecker/registry_ssh_key` (subkeys private_key /
# public_key / known_hosts_entry) for recovery.
#
# Why bounce nginx every time: nginx caches upstream DNS at startup, so if
# any registry-* container gets recreated (new IP on the docker bridge),
# nginx keeps forwarding to a stale address. Always restart nginx as the
# last step — see docs/runbooks/registry-vm.md § "Bouncing registry
# containers — the nginx DNS trap".
when:
- event: push
branch: master
path:
include:
- 'modules/docker-registry/docker-compose.yml'
- 'modules/docker-registry/fix-broken-blobs.sh'
- 'modules/docker-registry/cleanup-tags.sh'
- 'modules/docker-registry/nginx_registry.conf'
- 'modules/docker-registry/config-private.yml'
- event: manual
clone:
git:
image: woodpeckerci/plugin-git
settings:
depth: 1
attempts: 3
steps:
- name: deploy
image: alpine:3.20
environment:
REGISTRY_SSH_KEY:
from_secret: registry_ssh_key
commands:
- apk add --no-cache openssh-client rsync
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
# Pin host key — CI's ~/.ssh/known_hosts is ephemeral, so accept-new on first pull.
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
- echo '---detecting changed files---'
- |
# Mirror the remote state of each file so we can diff and decide what bounces.
CHANGED=""
for f in docker-compose.yml fix-broken-blobs.sh cleanup-tags.sh nginx_registry.conf config-private.yml; do
LOCAL="modules/docker-registry/$f"
REMOTE="/opt/registry/$f"
if [ ! -f "$LOCAL" ]; then
echo "skip $f (not in repo)"
continue
fi
# Pull the remote copy into /tmp for a diff. ssh -n avoids stdin-hogging.
REMOTE_CONTENT=$(ssh -n -o BatchMode=yes root@10.0.20.10 "cat $REMOTE 2>/dev/null || true")
LOCAL_CONTENT=$(cat "$LOCAL")
if [ "$LOCAL_CONTENT" = "$REMOTE_CONTENT" ]; then
echo "unchanged: $f"
else
echo "---diff: $f ---"
echo "$REMOTE_CONTENT" > /tmp/remote.txt
diff -u /tmp/remote.txt "$LOCAL" | head -40 || true
CHANGED="$CHANGED $f"
fi
done
echo "CHANGED_FILES=$CHANGED"
printf '%s' "$CHANGED" > /tmp/changed
- echo '---applying---'
- |
CHANGED=$(cat /tmp/changed)
if [ -z "$CHANGED" ]; then
echo "No files changed — exiting cleanly (manual run with no drift)."
exit 0
fi
# Ship every managed file unconditionally — scp is cheap, idempotency is safe.
scp -o BatchMode=yes \
modules/docker-registry/docker-compose.yml \
modules/docker-registry/fix-broken-blobs.sh \
modules/docker-registry/cleanup-tags.sh \
modules/docker-registry/nginx_registry.conf \
modules/docker-registry/config-private.yml \
root@10.0.20.10:/opt/registry/
ssh -n -o BatchMode=yes root@10.0.20.10 '
chmod +x /opt/registry/fix-broken-blobs.sh /opt/registry/cleanup-tags.sh
'
- echo '---bouncing containers + nginx---'
- |
CHANGED=$(cat /tmp/changed)
# Compose-visible files: docker-compose.yml (image tag, mounts) and
# config-private.yml (registry config → needs registry-private reload).
BOUNCE_COMPOSE=0
BOUNCE_NGINX=0
echo "$CHANGED" | grep -q "docker-compose.yml" && BOUNCE_COMPOSE=1
echo "$CHANGED" | grep -q "config-private.yml" && BOUNCE_COMPOSE=1
echo "$CHANGED" | grep -q "nginx_registry.conf" && BOUNCE_NGINX=1
if [ "$BOUNCE_COMPOSE" = "1" ]; then
echo "compose-visible change → pull + up -d"
ssh -n -o BatchMode=yes root@10.0.20.10 '
cd /opt/registry
docker compose pull 2>&1 | tail -5
docker compose up -d 2>&1 | tail -20
'
# Any compose recreate requires nginx DNS refresh too.
BOUNCE_NGINX=1
fi
if [ "$BOUNCE_NGINX" = "1" ]; then
echo "bouncing nginx to flush upstream DNS cache"
ssh -n -o BatchMode=yes root@10.0.20.10 '
docker restart registry-nginx
sleep 3
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" | grep -E "registry-"
'
fi
if [ "$BOUNCE_COMPOSE" = "0" ] && [ "$BOUNCE_NGINX" = "0" ]; then
echo "only script files changed (cron-picks-up semantics) — no bounce needed"
fi
- echo '---verify---'
- |
ssh -n -o BatchMode=yes root@10.0.20.10 '
echo "=== catalog ==="
# Prove auth + routing survived.
curl -sk -o /dev/null -w "catalog (unauth → 401 expected): HTTP %{http_code}\n" \
https://127.0.0.1:5050/v2/
echo "=== integrity scan (dry-run) ==="
python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | tail -5
'
- name: slack
image: curlimages/curl:8.11.0
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"Registry config sync on 10.0.20.10: ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
when:
status: [success, failure]

View file

@ -15,49 +15,6 @@
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan`
## Adopting Existing Resources — Use `import {}` Blocks, Not the CLI
When bringing a live cluster/Vault/Cloudflare resource under Terraform management, use an HCL `import {}` block (Terraform 1.5+). Do **NOT** use `terraform import` on the CLI for anything landing in this repo — the CLI path leaves no audit trail and makes multi-operator adoption fragile.
**Canonical workflow:**
1. Write the `resource` block that matches the live object.
2. In the same stack, add an `import {}` stanza naming the target and the provider-specific ID:
```hcl
import {
to = helm_release.kured
id = "kured/kured" # Helm ID format: <namespace>/<release-name>
}
resource "helm_release" "kured" {
name = "kured"
namespace = "kured"
repository = "https://kubereboot.github.io/charts/"
chart = "kured"
version = "5.7.0"
# ... values matching the live release
}
```
3. `scripts/tg plan` — every change it proposes is real divergence between HCL and live state. Iterate on values until the plan is **0 changes**.
4. `scripts/tg apply` — the import runs alongside whatever zero-change apply you have. If your plan is 0 changes, this commits only the state-ownership transfer.
5. After the apply lands cleanly, **delete the `import {}` block** in a follow-up commit. The resource is now fully TF-owned and the stanza would be a no-op that clutters diffs.
**Why `import {}` and not `terraform import`:**
- Reviewable in PRs before any state mutation. The CLI path is an out-of-band action nobody sees.
- Plan-safe: the `import` plan step shows the exact object being adopted. Mistyped IDs or the wrong resource address are caught before apply, not after.
- Survives state backend changes (Tier 0 SOPS vs Tier 1 PG) transparently — both work identically from the operator's perspective because both use `scripts/tg`.
- Re-runnable: if the apply fails partway through, the `import {}` block is idempotent. The CLI path's state mutation is not.
**Finding the provider-specific ID:** each provider has its own convention.
| Resource | ID format | Example |
|---|---|---|
| `helm_release` | `<namespace>/<release-name>` | `kured/kured` |
| `kubernetes_manifest` | `{"apiVersion":"...","kind":"...","metadata":{"namespace":"...","name":"..."}}` | (pass as HCL object literal) |
| `kubernetes_<kind>_v1` | `<namespace>/<name>` for namespaced, `<name>` for cluster-scoped | `kube-system/coredns` |
| `authentik_provider_proxy` | provider UUID | `0eecac07-97c7-443c-...` |
| `cloudflare_record` | `<zone-id>/<record-id>` | `abc123/def456` |
## Secrets Management (SOPS)
- **`config.tfvars`** — plaintext config (hostnames, IPs, DNS records, public keys)
- **`secrets.sops.json`** — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys)
@ -99,13 +56,13 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- `config.tfvars` — non-secret configuration (plaintext)
- `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
- `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference)
- `scripts/cluster_healthcheck.sh` — 42-check cluster health script (nodes, workloads, monitoring, certs, backups, external reachability)
- `scripts/cluster_healthcheck.sh` — 25-check cluster health script
## Storage
- **NFS** (`nfs-proxmox` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
- **proxmox-lvm-encrypted** (`proxmox-lvm-encrypted` StorageClass): **Default for all sensitive data** — databases, auth, email, passwords, git repos, health data. LUKS2 encryption via Proxmox CSI. Passphrase in Vault, backup key on PVE host.
- **proxmox-lvm** (`proxmox-lvm` StorageClass): For non-sensitive stateful apps (configs, caches, tools). Proxmox CSI driver.
- **NFS server**: Proxmox host at 192.168.1.127 (sole NFS). HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (VM 9000, 10.0.10.15) decommissioned 2026-04-13. Legacy `nfs-truenas` StorageClass name retained (48 PVs bind it; SC names are immutable on PVs) but now points to the Proxmox host, identical to `nfs-proxmox`.
- **NFS server**: Proxmox host at 192.168.1.127. HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (10.0.10.15) decommissioned.
- **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases.
- **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
- **NFS export directory must exist** on the Proxmox host before Terraform can create the PV.
@ -113,47 +70,11 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify).
- **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`.
- **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds.
- **Synology layout** (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`).
- **Synology layout** (`/volume1/Backup/Viki/`): `pve-backup/` (from sda), `nfs/` (from `/srv/nfs`), `nfs-ssd/` (from `/srv/nfs-ssd`). `truenas/` renamed to `nfs/`, `pve-backup/nfs-mirror/` removed.
## Shared Variables (never hardcode)
`var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
## Redis Service Naming (read before wiring a new consumer)
The Redis stack (`stacks/redis/`) exposes three distinct entry points. Pick the one that matches the client's connection pattern — the wrong one causes READONLY errors or silent connection drops.
| Endpoint | Port(s) | Use for | Backed by |
|----------|---------|---------|-----------|
| `redis-master.redis.svc.cluster.local` | 6379 (redis), 26379 (sentinel) | **Default for new services.** Write-safe — HAProxy health-checks nodes and routes only to the current master. Matches `var.redis_host`. | `kubernetes_service.redis_master` → HAProxy → Bitnami StatefulSet |
| `redis-node-{0,1,2}.redis-headless.redis.svc.cluster.local` | 26379 | **Long-lived connections (PUBSUB, BLPOP, MONITOR, Sidekiq).** Use a sentinel-aware client with master name `mymaster`. Example: `stacks/nextcloud/chart_values.yaml:32-54`. | Bitnami-created headless service → pod DNS |
| `redis.redis.svc.cluster.local` | 6379 | **Do NOT use.** Helm chart's default service — selector patched by `null_resource.patch_redis_service` to match `redis-haproxy`, so today it behaves like `redis-master`. This patch is load-bearing but temporary; consumers hard-coded on this name are tracked in a beads follow-up (T0). | Bitnami chart (patched) |
**HAProxy's `timeout client 30s` closes idle raw Redis connections** — any client that holds a connection open for pub/sub, blocking commands, or replication streams MUST use the sentinel path. Uptime Kuma's Redis monitor hit this limit and had to be re-pointed at the sentinel endpoint (see memory id=748).
**When onboarding a new service:** start from `redis-master.redis.svc.cluster.local:6379` via `var.redis_host`. Only reach for sentinel discovery if the client library supports it natively (ioredis, redis-py Sentinel, go-redis FailoverClient, Sidekiq `sentinels` array) AND the workload uses long-lived connections.
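A quick connectivity smoke test when wiring a new consumer (throwaway pod; the image tag is illustrative, and add `-a "$REDIS_PASSWORD"` if the deployment enforces auth):
```bash
# One-shot pod asking the write-safe endpoint for a PONG (HAProxy routes it to the current master)
kubectl run redis-smoke --rm -it --restart=Never --image=redis:7-alpine -- \
  redis-cli -h redis-master.redis.svc.cluster.local -p 6379 ping
```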
## Kyverno Drift Suppression (`# KYVERNO_LIFECYCLE_V1`)
Kyverno's admission webhook mutates every pod with a `dns_config { option { name = "ndots"; value = "2" } }` block (fixes NxDomain search-domain floods — see `k8s-ndots-search-domain-nxdomain-flood` skill). Terraform does not manage that field, so without suppression every pod-owning resource shows perpetual `spec[0].template[0].spec[0].dns_config` drift.
**Rule**: every `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, and `kubernetes_cron_job_v1` MUST include the following `lifecycle` block, tagged with the `# KYVERNO_LIFECYCLE_V1` marker so every site is greppable:
```hcl
# kubernetes_deployment / kubernetes_stateful_set / kubernetes_daemon_set
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
# kubernetes_cron_job_v1 (extra job_template nesting)
lifecycle {
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
**Why not a shared module?** Terraform's `ignore_changes` meta-argument only accepts static attribute paths. It rejects module outputs, locals, variables, and any expression. A DRY module is therefore impossible — the canonical pattern IS the snippet + marker. When `kubernetes_manifest` resources get Kyverno `generate.kyverno.io/*` annotations mutated, a sibling convention `# KYVERNO_MANIFEST_V1` will be introduced (Phase B).
**Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
## Tier System
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
@ -163,10 +84,10 @@ lifecycle {
## Infrastructure
- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
- **GPU**: `node_selector = { "nvidia.com/gpu.present" : "true" }` + toleration `nvidia.com/gpu`. The label is auto-applied by NFD/gpu-feature-discovery on any node with an NVIDIA PCI device — nothing is hostname-pinned, so the GPU card can move between nodes without Terraform edits.
- **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu`
- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
- **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding)
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes any GPU node (`nvidia.com/gpu.present=true`) so MySQL moves off the GPU host automatically if the card is relocated
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes k8s-node1 (GPU node)
- **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch)
## Contributor Onboarding
@ -184,7 +105,7 @@ lifecycle {
- **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`.
## Automated Service Upgrades
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → HTTP POST → `claude-agent-service` (K8s) → `claude -p` (upgrade agent)
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → SSH → `claude -p` (upgrade agent)
- **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
- **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
- **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow)

View file

@ -5,7 +5,6 @@ ARG TERRAFORM_VERSION=1.5.7
ARG TERRAGRUNT_VERSION=0.99.4
ARG SOPS_VERSION=3.9.4
ARG KUBECTL_VERSION=1.34.0
ARG VAULT_VERSION=1.18.1
# Install system packages (single layer)
RUN apk add --no-cache \
@ -35,16 +34,6 @@ RUN curl -fsSL "https://dl.k8s.io/release/v${KUBECTL_VERSION}/bin/linux/amd64/ku
-o /usr/local/bin/kubectl \
&& chmod +x /usr/local/bin/kubectl
# Vault CLI — required by scripts/tg for Tier 1 stack PG credential reads
# and Tier 0 advisory locks. Pinned to server version (1.18.1). Without this
# the CI pipeline surfaces the misleading "Cannot read PG credentials" error
# because scripts/tg swallows stderr ("vault: not found").
RUN curl -fsSL "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" \
-o /tmp/vault.zip \
&& unzip /tmp/vault.zip -d /usr/local/bin/ \
&& rm /tmp/vault.zip \
&& vault version
# Provider cache directory (shared across stacks)
ENV TF_PLUGIN_CACHE_DIR=/tmp/terraform-plugin-cache
ENV TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1

Binary file not shown.

View file

@ -80,6 +80,8 @@ def sofia():
pfsense >> k8s_switch
with Cluster('Management Network'):
mgt_switch = Switch()
# Truenas
truenas = Storage("Truenas")
# pxe server
pxe_server = Rack("PXE Server")
# HA
@ -89,6 +91,7 @@ def sofia():
devvm_vpn_client = VPN("Tailscale Client")
vpn_clients["devvm"] = devvm_vpn_client
mgt_switch >> truenas
mgt_switch >> pxe_server
mgt_switch >> home_assistant
mgt_switch >> devvm

View file

@ -20,7 +20,7 @@ This repository contains the configuration and documentation for a homelab Kuber
| [Overview](architecture/overview.md) | Infrastructure overview, hardware specs, VM inventory, and service catalog |
| [Networking](architecture/networking.md) | Network topology, VLANs, routing, and firewall rules |
| [VPN](architecture/vpn.md) | Headscale mesh VPN and Cloudflare Tunnel configuration |
| [Storage](architecture/storage.md) | Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management |
| [Storage](architecture/storage.md) | TrueNAS NFS, democratic-csi, and persistent volume management |
| [Authentication](architecture/authentication.md) | Authentik SSO, OIDC flows, and service integration |
| [Security](architecture/security.md) | CrowdSec IPS, Kyverno policies, and security controls |
| [Monitoring](architecture/monitoring.md) | Prometheus, Grafana, Loki, and observability stack |

View file

@ -16,10 +16,10 @@ n8n Webhook (POST /webhook/<uuid>)
│ rate limit: max 5 upgrades per 6h window
HTTP POST → claude-agent-service (K8s)
SSH → Dev VM (10.0.10.10)
claude -p "upgrade agent prompt" (in-cluster)
claude -p "upgrade agent prompt"
Service Upgrade Agent
@ -54,7 +54,7 @@ Service Upgrade Agent
- Only `status=update` (skip `new`, `unchanged`)
- Skip databases, custom images, infra images, `:latest`
- **Rate limiting**: Max 5 upgrades per 6-hour window using `$getWorkflowStaticData('global')`
- **Action**: HTTP POST to `claude-agent-service.claude-agent.svc:8080/execute` with the upgrade agent prompt (a curl equivalent is sketched below)
- **Action**: SSH to dev VM, runs `claude -p` with the upgrade agent prompt
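For hand-testing outside n8n, the same call can be replayed with curl. The endpoint and bearer-token env var are the ones documented here; the JSON body shape is an assumption, not the workflow's exact payload:

```sh
# Manual replay of the "Run Upgrade Agent" step; expect a non-401 response if the token is valid.
curl -sS -X POST "http://claude-agent-service.claude-agent.svc:8080/execute" \
  -H "Authorization: Bearer ${CLAUDE_AGENT_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "upgrade agent prompt"}'
```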
### Upgrade Agent
- **Prompt**: `.claude/agents/service-upgrade.md`
@ -173,35 +173,7 @@ Key behaviors observed:
| Secret | Vault Path | Purpose |
|--------|-----------|---------|
| n8n webhook URL | `secret/diun` `n8n_webhook_url` | DIUN → n8n trigger |
| Agent API bearer token | `secret/claude-agent-service` `api_bearer_token` | n8n → claude-agent-service `/execute` auth. Synced into both `claude-agent` ns (consumer) and `n8n` ns (caller) via ESO. n8n exposes it to the container as `CLAUDE_AGENT_API_TOKEN` env var. |
| Claude OAuth (primary) | `secret/claude-agent-service` `claude_oauth_token` | Long-lived 1-year token from `claude setup-token`. Consumed by the CLI via `CLAUDE_CODE_OAUTH_TOKEN` env var (set on the container via `envFrom`). Preferred over the short-lived `.credentials.json` — CLI skips the refresh dance entirely. Rotate yearly; alert fires 30d out. |
| Claude OAuth (spares) | `secret/claude-agent-service-spare-{1,2}` `claude_oauth_token` | Failover tokens. Minted alongside primary (verified Anthropic does NOT revoke earlier sessions on new mint). Swap into primary if revocation or compromise. |
| GitHub PAT | `secret/viktor` `github_pat` | Changelog fetch (5000 req/hr) |
| Slack webhook | `secret/platform` `alertmanager_slack_api_url` | Upgrade notifications |
| Woodpecker token | `secret/viktor` `woodpecker_token` | CI pipeline polling |
## OAuth token lifecycle
The CLI supports two auth modes. We use the second — long-lived.
| Mode | How minted | TTL | Needs refresh? | When to use |
|------|-----------|-----|----------------|-------------|
| `claude login` → `.credentials.json` | Interactive browser OAuth | Access ~6h + refresh token | Yes — CLI auto-refreshes on startup if refresh token valid | Human dev machines |
| `claude setup-token` → opaque `sk-ant-oat01-*` | Interactive browser OAuth | **1 year** | No — expires hard | **Headless / service accounts (us)** |
When both are present on disk, `CLAUDE_CODE_OAUTH_TOKEN` env var wins.
**Harvesting headless**: `setup-token` uses Ink (React for terminals) and needs a real PTY with **≥300-column width**. At 80-col, Ink wraps and DROPS one character at the wrap boundary (107-char invalid instead of 108-char valid). Python wrapper pattern documented in memory; we harvested 2 spare tokens into Vault on 2026-04-18 using a temporary harvester pod.
**Monitoring**: CronJob `claude-oauth-expiry-monitor` (claude-agent ns, every 6h) pushes `claude_oauth_token_expiry_timestamp{path="..."}` to Pushgateway. Alerts: `ClaudeOAuthTokenExpiringSoon` (30d, warn), `ClaudeOAuthTokenCritical` (7d, crit), `ClaudeOAuthTokenMonitorStale` (48h no push, warn), `ClaudeOAuthTokenMonitorNeverRun` (metric absent, warn).
**Rotation**: on alert, harvest a new token, `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`, update the `claude_oauth_token_mint_epochs` local in `stacks/claude-agent-service/main.tf`, `scripts/tg apply` → alert clears on next cron tick.
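The same rotation, written out as a sketch; the Vault path, key, and local name come from this document, while the token value is a placeholder:

```sh
# Harvest a fresh token (interactive; needs a real PTY >= 300 columns wide, per the harvesting note above).
claude setup-token
# Patch only the token key, leaving the rest of the Vault secret untouched.
vault kv patch secret/claude-agent-service claude_oauth_token='sk-ant-oat01-...'
# Bump the matching entry in the claude_oauth_token_mint_epochs local, then re-apply the stack.
cd stacks/claude-agent-service && scripts/tg apply
```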
## n8n workflow gotchas
The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **not** Terraform-managed. The JSON at `stacks/n8n/workflows/diun-upgrade.json` is a backup; the live state lives in `workflow_entity.nodes`. Drift between the two is possible.
- **HTTP Request node header expressions must use template-literal form**: `=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}` works; `='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN` does NOT evaluate and sends an empty/bogus header → 401 from claude-agent-service.
- **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
- **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
- **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
| Dev VM SSH key | n8n credentials store → `devvm-ssh` | n8n → dev VM SSH |

View file

@ -209,7 +209,7 @@ graph LR
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot |
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy |
| Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric |
| ~~TrueNAS Cloud Sync~~ | **DECOMMISSIONED 2026-04-13** | Was TrueNAS Cloud Sync Task 1 | Replaced by offsite-sync-backup + inotify change tracking on Proxmox host NFS |
| ~~TrueNAS Cloud Sync~~ | **DECOMMISSIONED** | Was TrueNAS Cloud Sync Task 1 | Replaced by offsite-sync-backup |
## How It Works
@ -217,7 +217,7 @@ graph LR
Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 Proxmox CSI PVCs. These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.
**Script**: `/usr/local/bin/lvm-pvc-snapshot` on PVE host (source: `infra/scripts/lvm-pvc-snapshot.sh`). Deploy: `scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot`
**Script**: `/usr/local/bin/lvm-pvc-snapshot` on PVE host (source: `infra/scripts/lvm-pvc-snapshot`)
**Schedule**: Daily 03:00 via systemd timer, 7-day retention
**Discovery**: Auto-discovers PVC LVs matching `vm-*-pvc-*` pattern in VG `pve` thin pool `data`
@ -226,7 +226,7 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
- They already have app-level dumps (Layer 2)
- Including them causes ~36% write amplification; excluding them reduces overhead to ~0%
**Monitoring**: Pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>30h since last run + 30m `for:`), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free).
**Monitoring**: Pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>24h), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free).
**Restore**: `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`.
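A hypothetical restore run; the LV names are illustrative (real ones follow the `vm-*-pvc-*` pattern the script discovers, and the snapshot name is whatever the daily timer produced), so treat this as the shape of the command rather than copy-paste:

```sh
# Runs on the PVE host: discovers the owning workload, scales it down, swaps the LVs, scales back up.
ssh root@192.168.1.127 \
  "lvm-pvc-snapshot restore vm-203-pvc-1a2b3c vm-203-pvc-1a2b3c-snap-20260422"
```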
@ -234,7 +234,7 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
**Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage.
**Script**: `/usr/local/bin/daily-backup` on PVE host (source: `infra/scripts/daily-backup.sh`)
**Script**: `/usr/local/bin/daily-backup` on PVE host (source: `infra/scripts/daily-backup`)
**Schedule**: Daily 05:00 via systemd timer
**Retention**: 4 weekly versions (weeks 0-3 via `--link-dest` hardlink dedup)
@ -334,14 +334,14 @@ Two-step offsite sync:
**Monthly full sync**: On 1st Sunday of month, runs `rsync --delete` for cleanup (removes orphaned files on Synology).
**Destination**:
- `Synology/Backup/Viki/nfs/` — mirrors `/srv/nfs`
- `Synology/Backup/Viki/nfs/` — mirrors `/srv/nfs` (renamed from `truenas/`)
- `Synology/Backup/Viki/nfs-ssd/` — mirrors `/srv/nfs-ssd`
**Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`.
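The monthly full sync has roughly the following shape, per the destinations above; the Synology SSH alias and remote base path are assumptions for illustration:

```sh
# --delete makes the Synology side mirror the source, removing orphaned files.
rsync -a --delete /srv/nfs/     synology:/Backup/Viki/nfs/
rsync -a --delete /srv/nfs-ssd/ synology:/Backup/Viki/nfs-ssd/
```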
#### ~~TrueNAS Cloud Sync~~ — DECOMMISSIONED 2026-04-13
#### ~~TrueNAS Cloud Sync~~ — DECOMMISSIONED
> TrueNAS Cloud Sync was decommissioned along with TrueNAS (2026-04-13). The current offsite path is inotify-change-tracked rsync from the Proxmox host NFS (`/srv/nfs`, `/srv/nfs-ssd`) to Synology.
> TrueNAS Cloud Sync was decommissioned along with TrueNAS (2026-04). The `Synology/Backup/Viki/truenas/` directory was renamed to `nfs/` to reflect the new consolidated layout.
## Configuration
@ -673,7 +673,7 @@ module "nfs_backup" {
~~CloudSyncNeverRun~~ REMOVED (TrueNAS decommissioned) │
~~CloudSyncFailing~~ REMOVED (TrueNAS decommissioned) │
│ VaultwardenIntegrityFail integrity_ok == 0 │
│ LVMSnapshotStale > 30h since last snapshot │
│ LVMSnapshotStale > 24h since last snapshot │
│ LVMSnapshotFailing snapshot creation failed │
│ LVMThinPoolLow < 15% free space in thin pool
│ WeeklyBackupStale > 8d since last success │
@ -692,16 +692,6 @@ module "nfs_backup" {
- ~~CloudSync monitor~~: Removed (TrueNAS decommissioned)
- Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly
**Pushgateway persistence**: The Pushgateway is configured with
`--persistence.file=/data/pushgateway.bin --persistence.interval=1m`
on a 2Gi `proxmox-lvm-encrypted` PVC (helm values:
`prometheus-pushgateway.persistentVolume`). Without this, every pod
restart drops in-memory metrics. Once-per-day pushers (offsite-sync,
weekly backup) are otherwise invisible for up to 24h if the
Pushgateway restarts between pushes — which is exactly what triggered
the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
11:42 UTC terminated the Pushgateway 8h after the 03:12 UTC push).
**Alert routing**:
- All backup alerts → Slack `#infra-alerts`
- Vaultwarden integrity fail → Slack `#infra-critical` (immediate action required)

View file

@ -1,136 +0,0 @@
# chrome-service — In-cluster headed Chromium pool
## Overview
`chrome-service` is a single-replica, persistent-profile, bearer-token-gated
Playwright **launch-server** that exposes a headed Chromium browser over a
WebSocket. Sibling services connect to it instead of running their own
in-process Chromium when the upstream's anti-bot tooling
(`disable-devtool.js` redirect-to-google trap, console-clear timing tricks,
`navigator.webdriver` checks) defeats a headless browser.
Initial caller: `f1-stream`'s `playback_verifier`. Future callers attach
via the WS+token contract documented in `stacks/chrome-service/README.md`.
## Why a separate stack
In-process Chromium inside `f1-stream`:
- Runs **headless** by default (no `Xvfb`/`DISPLAY`).
- Has the `HeadlessChromium/...` UA suffix and `navigator.webdriver === true`.
- Trips `disable-devtool.js`'s **Performance** detector — Playwright's CDP
adds latency to `console.log(largeArray)` vs `console.table(largeArray)`,
which the lib reads as "DevTools is open" and redirects to
`https://www.google.com/`.
`chrome-service` solves this by:
1. Running **headed** under `Xvfb :99` (via `playwright launch-server` with
a JSON config that pins `headless: false`).
2. Living in a long-lived pod so JIT browser launch latency disappears.
3. Allowing a per-context init script
(`stacks/chrome-service/files/stealth.js` ~ 40 lines, vendored from
`puppeteer-extra-plugin-stealth`) to spoof `webdriver`, `chrome.runtime`,
`plugins`, `languages`, `Permissions.query`, WebGL renderer strings, and
to hide the `disable-devtool-auto` script-tag attribute so the lib's
IIFE exits early.
## Wire protocol
```text
ws://chrome-service.chrome-service.svc.cluster.local:3000/<TOKEN>
┌───────────────────────────────┼───────────────────────────────┐
│ caller pod │ chrome-service pod
│ (e.g. f1-stream) │ (single replica)
│ │
│ CHROME_WS_URL ──────────────┘
│ CHROME_WS_TOKEN ─── from `secret/chrome-service.api_bearer_token` (ESO)
│ await chromium.connect(f"{ws}/{token}")
│ await ctx.add_init_script(STEALTH_JS)
│ page.goto("https://upstream.com/embed/...")
└─── ←── pages render under Xvfb, headed Chromium ──── ─────────┘
```
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`stacks/chrome-service/main.tf`) and the Python client
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
minor-versions**. Bump in lockstep — Playwright protocol changes between
minors and the client cannot connect to a mismatched server.
The Microsoft image ships only the browser binaries, not the `playwright`
npm SDK; the start command runs `npx -y playwright@1.48.0 launch-server`
which downloads the SDK on first start (cached under `$HOME/.npm` via the
PVC) and reuses it on subsequent restarts.
## Storage
- **`chrome-service-profile-encrypted`** (PVC, 2Gi → 10Gi autoresize,
`proxmox-lvm-encrypted`) — Chromium user-data dir + npm cache.
Encrypted because cookies/localStorage may include third-party auth tokens
for sites callers drive. `HOME=/profile` so npx caches there.
- **`chrome-service-backup-host`** (NFS, RWX) — destination for a 6-hourly
CronJob that `tar -czf /backup/<YYYY_MM_DD_HH>.tar.gz -C /profile .`,
retention 30 days.
## Auth + secrets
- Vault KV `secret/chrome-service.api_bearer_token` — 32-byte URL-safe
random, rotated by hand:
`vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
- ESO syncs into namespace-local Secret `chrome-service-secrets`
(server pod) and `chrome-service-client-secrets` (each caller pod).
- Reloader (`reloader.stakater.com/auto = "true"`) cascades token rotation
to both server and any annotated caller — no manual rollout.
## Network controls
- **`kubernetes_network_policy_v1.ws_ingress`** — two separate ingress
rules on the same policy:
- **TCP/3000** (Playwright WS): only namespaces labelled
`chrome-service.viktorbarzin.me/client = "true"` (plus an explicit
fallback for `f1-stream` by `kubernetes.io/metadata.name`).
- **TCP/6080** (noVNC HTTP+WS): only the `traefik` namespace, since
the public-facing path is `chrome.viktorbarzin.me` ingress →
Traefik → sidecar. Authentik forward-auth still gates external
access at the Traefik layer.
- **WS port 3000** is internal-only (no ingress, no Cloudflare DNS).
- **noVNC sidecar** (`forgejo.viktorbarzin.me/viktor/chrome-service-novnc`)
exposes a live HTML5 view of the headed Chromium session via
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated. Both static page and WebSocket upgrade share the
same path — Cloudflare proxy, Cloudflared tunnel, Traefik, and
Authentik forward-auth all preserve `Upgrade: websocket`.
## Adding a new caller
See `stacks/chrome-service/README.md` for the four-step recipe:
1. Label the caller's namespace.
2. Add an `ExternalSecret` pulling `secret/chrome-service`.
3. Inject `CHROME_WS_URL` + `CHROME_WS_TOKEN` env vars.
4. Vendor `stealth.js` and apply via `await context.add_init_script(...)`
after every `new_context()`.
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
license check, device-fingerprint mismatch, hotlink protection that
whitelists specific parent domains), the verifier returns
`is_playable=False` and the extractor moves on. No user-visible
breakage, just empty stream lists for that source.
- **JWPlayer DRM error 102630** — observed with several hmembeds embeds
even from the headed chrome-service. The license check bails because
the request origin isn't on the embed's allowlist; this is upstream
policy, not an infra defect.
- **Single replica + RWO PVC** — the deployment uses `Recreate` strategy.
Brief outage on rollout, ~30s for browser warmup.
- **No `/metrics` endpoint** — the cluster's generic
`KubePodCrashLooping` rule covers basic alerting. A Prometheus scrape
exporter is day-2 work.

View file

@ -19,7 +19,7 @@ graph LR
I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
K[Pull-Through Cache<br/>10.0.20.10] -.-> J
L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
L[registry.viktorbarzin.me<br/>Private Registry] -.-> J
style B fill:#2088ff
style F fill:#4c9e47
@ -33,7 +33,7 @@ graph LR
| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
| Private Registry | Custom | `registry.viktorbarzin.me` | Private images, htpasswd auth |
| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
@ -102,8 +102,7 @@ Woodpecker API uses numeric IDs (not owner/name):
1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem). A manual curl spot-check is sketched after this list.
4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault
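The probe's walk can be reproduced by hand with curl. The registry host, image name, and credential variable below are placeholders (adjust to the probe's actual target); the paths are the standard Docker Registry HTTP API v2 endpoints named above:

```sh
# List repositories, then tags, then HEAD a manifest by tag — the same three hops the probe makes.
REGISTRY=https://forgejo.viktorbarzin.me   # placeholder host
AUTH="viktor:${PULL_TOKEN}"                # placeholder credential
curl -s  -u "$AUTH" "$REGISTRY/v2/_catalog"
curl -s  -u "$AUTH" "$REGISTRY/v2/viktor/infra-ci/tags/list"
curl -sI -u "$AUTH" -H "Accept: application/vnd.oci.image.index.v1+json" \
  "$REGISTRY/v2/viktor/infra-ci/manifests/latest"
```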
### Infra Pipelines (Woodpecker-only)
@ -112,14 +111,7 @@ Woodpecker API uses numeric IDs (not owner/name):
| default | `.woodpecker/default.yml` | Terragrunt apply on push |
| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
## Configuration

View file

@ -18,7 +18,7 @@ graph TB
subgraph Proxmox["Proxmox VE"]
direction TB
MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
@ -72,7 +72,7 @@ graph TB
| VM | VMID | vCPUs | RAM | Network | Role | Taints |
|----|------|-------|-----|---------|------|--------|
| k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
@ -85,9 +85,9 @@ graph TB
|-----------|-------|
| Device | NVIDIA Tesla T4 (16GB GDDR6) |
| PCIe Address | 0000:06:00.0 |
| Assigned VM | VMID 201 (k8s-node1) — physical location only, no Terraform pin |
| Node Label | `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD) |
| Node Taint | `nvidia.com/gpu=true:PreferNoSchedule` (applied by `null_resource.gpu_node_config` to every NFD-tagged GPU node) |
| Assigned VM | VMID 201 (k8s-node1) |
| Node Label | `gpu=true` |
| Node Taint | `nvidia.com/gpu=true:NoSchedule` |
| Driver | NVIDIA GPU Operator |
| Resource Name | `nvidia.com/gpu` |
@ -273,8 +273,8 @@ resources {
### GPU Resource Management
**Node Selection**: GPU pods must:
1. Tolerate `nvidia.com/gpu=true:PreferNoSchedule` taint
2. Select `nvidia.com/gpu.present=true` label (auto-applied by gpu-feature-discovery wherever the card is)
1. Tolerate `nvidia.com/gpu=true:NoSchedule` taint
2. Select `gpu=true` label
3. Request `nvidia.com/gpu: 1` resource
**Example**:
@ -286,7 +286,7 @@ spec:
value: "true"
effect: NoSchedule
nodeSelector:
nvidia.com/gpu.present: "true"
gpu: "true"
containers:
- name: app
resources:
@ -294,14 +294,6 @@ spec:
nvidia.com/gpu: 1
```
**Portability**: No Terraform code references a specific hostname for
GPU scheduling. If the GPU card is physically moved to a different
node, gpu-feature-discovery moves the `nvidia.com/gpu.present=true`
label with it, and `null_resource.gpu_node_config` re-applies the
`nvidia.com/gpu=true:PreferNoSchedule` taint to the new host on the
next apply (discovery keyed on
`feature.node.kubernetes.io/pci-10de.present=true`).
**GPU Workloads**:
- Ollama (LLM inference)
- ComfyUI (Stable Diffusion workflows)
@ -537,7 +529,7 @@ kubectl describe pod <pod-name> -n <namespace>
```
0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
```
**Fix**: Verify the GPU-carrying node is Ready and has the `nvidia.com/gpu.present=true` label. Check `kubectl get nodes -l nvidia.com/gpu.present=true` — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).
**Fix**: Verify GPU node (201) is Ready and labeled `gpu=true`.
### Pods OOMKilled repeatedly
@ -622,7 +614,7 @@ spec:
value: "true"
effect: NoSchedule
nodeSelector:
nvidia.com/gpu.present: "true"
gpu: "true"
containers:
- name: app
resources:

View file

@ -120,29 +120,9 @@ graph TB
### Redis
Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Traefik, etc.). HAProxy (3 replicas, PDB minAvailable=2) is the sole client-facing path — clients talk only to `redis-master.redis.svc.cluster.local:6379` and HAProxy health-checks backends via `INFO replication`, routing only to `role:master`.
**Architecture**:
3 pods in StatefulSet `redis-v2`, each co-locating redis + sentinel + redis_exporter, using `docker.io/library/redis:8-alpine` (8.6.2). HAProxy (3 replicas, PDB minAvailable=2) routes clients to the current master via 1s `INFO replication` tcp-checks. Full context behind the April 2026 rework in beads `code-v2b`.
- 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain.
- **Pod anti-affinity is `required` (hard)** — each redis pod must land on a distinct node. Soft anti-affinity previously let the scheduler co-locate 2/3 pods on the same node; when that node (`k8s-node3`) went `NotReady→Ready` at 11:42 UTC on 2026-04-22 it took 2 redis pods with it and the cluster lost quorum. Cluster-wide PV `nodeAffinity` matches one zone (`topology.kubernetes.io/region=pve, zone=pve`), so PVCs rebind freely on reschedule.
- `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master (priority: sentinel vote → peer role:master with slaves → deterministic pod-0 fallback). No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident).
- redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof <master> 6379` (replicas), so pods come up already in the right role — no bootstrap race.
- **Sentinel hostname persistence**: `sentinel resolve-hostnames yes` + `sentinel announce-hostnames yes` in the init-generated sentinel.conf are mandatory — without them, sentinel stores resolved IPs in its rewritten config, and pod-IP churn on restart breaks failover. The MONITOR command itself must be issued with a hostname and the flags must be active before MONITOR, otherwise sentinel stores an IP that goes stale the next time the pod is deleted.
- **Failover timing (tuned 2026-04-22)**: `sentinel down-after-milliseconds=15000` + `sentinel failover-timeout=60000`. Redis liveness probe `timeout_seconds=10, failure_threshold=5`; sentinel liveness probe same. LUKS-encrypted LVM + BGSAVE fork can briefly stall master I/O >5s, which under the old 5s/30s sentinel timings + 3s/3 probes induced spurious `+sdown` → `+odown` → `+switch-master` cycles every 1-2 minutes. The new values absorb normal BGSAVE pauses without triggering failover.
- **HAProxy check smoothing (tuned 2026-04-22)**: `check inter 2s fall 3 rise 2` (was `1s / 2 / 2`) + `timeout check 5s` (was `3s`). The aggressive 1s polling used to race sentinel failovers — during a legitimate promote, HAProxy could catch the old master serving `role:slave` in the 1-3s window before re-probing the new master, leaving the backend empty and clients receiving `ReadOnlyError`.
- **Headless service `publish_not_ready_addresses=false`** (flipped 2026-04-22). Previously `true` meant HAProxy's DNS resolver saw not-yet-ready pods during rollouts, compounding the check-race above. Sentinel peer discovery is unaffected because sentinels announce to each other explicitly via `sentinel announce-hostnames yes`.
- Memory: master + replicas `requests=limits=768Mi`. Concurrent BGSAVE + AOF-rewrite fork can double RSS via COW, so headroom must cover it. `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency.
- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. Disk-wear analysis on 2026-04-19 (sdb Samsung 850 EVO 1TB, 150 TBW): Redis contributes <1 GB/day cluster-wide, a 40+ year runway at the 20% TBW budget.
- `maxmemory=640mb` (83% of 768Mi limit), `maxmemory-policy=allkeys-lru`.
- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, pushes Pushgateway metrics).
- Auth disabled this phase — NetworkPolicy is the isolation layer. Enabling `requirepass` + rolling creds to all 17 clients is a planned follow-up.
**Observability** (redis-v2 only): `oliver006/redis_exporter:v1.62.0` sidecar per pod on port 9121, auto-scraped via Prometheus pod annotation. Alerts: `RedisDown`, `RedisMemoryPressure`, `RedisEvictions`, `RedisReplicationLagHigh`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisReplicasMissing`, `RedisBackupStale`, `RedisBackupNeverSucceeded`.
**Why this design** — four incidents in April 2026 drove the rework: (a) 2026-04-04 service selector routed reads+writes to master+replica causing `READONLY` errors; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC with the 256Mi limit too tight for a 204 MB working set under COW amplification; (c) 2026-04-19 PM sentinel runtime state drifted (only 2 sentinels, no majority) and routed writes to a slave; (d) 2026-04-22 five-factor flap cascade — soft anti-affinity let 2/3 pods co-locate on `k8s-node3`, node bounced NotReady→Ready and took quorum with it; aggressive sentinel/probe timing (5s/30s + 3s/3) amplified disk-I/O stalls under LUKS-encrypted LVM into spurious `+switch-master` loops; HAProxy's 1s polling raced sentinel failovers and routed writes to demoted masters; `publish_not_ready_addresses=true` fed not-yet-ready pods into HAProxy DNS; downstream `realestate-crawler-celery` CrashLoopBackOff closed the feedback loop. See beads epic `code-v2b` for the full plan and linked challenger analyses.
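A quick check that the client-facing path still lands on the master; the service endpoint and image tag are from this section, the throwaway pod name is arbitrary, and no auth is needed in the current phase:

```sh
# One-off pod asking the HAProxy-fronted endpoint for its replication role; expect "role:master".
kubectl -n redis run redis-check -i --rm --restart=Never --image=redis:8-alpine -- \
  redis-cli -h redis-master.redis.svc.cluster.local -p 6379 INFO replication
```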
- Shared instance at `redis.redis.svc.cluster.local`
- Used for caching and session storage
- No persistence (ephemeral)
### SQLite (Per-App)

View file

@ -1,10 +1,10 @@
# DNS Architecture
Last updated: 2026-04-19 (WS C — NodeLocal DNSCache deployed; WS D — pfSense Unbound replaces dnsmasq; WS E — Kea multi-IP DHCP option 6 + TSIG-signed DDNS)
Last updated: 2026-04-15
## Overview
DNS is served by a split architecture: **Technitium DNS** handles internal resolution (`.viktorbarzin.lan`) and recursive lookups, while **Cloudflare DNS** manages all public domains (`.viktorbarzin.me`). Kubernetes pods use **CoreDNS** which forwards to Technitium for internal zones. All three Technitium instances run on encrypted block storage with zone replication via AXFR every 30 minutes. A **NodeLocal DNSCache** DaemonSet runs on every node and transparently intercepts pod DNS traffic, caching responses locally so pods keep resolving even during CoreDNS, Technitium, or pfSense disruptions.
DNS is served by a split architecture: **Technitium DNS** handles internal resolution (`.viktorbarzin.lan`) and recursive lookups, while **Cloudflare DNS** manages all public domains (`.viktorbarzin.me`). Kubernetes pods use **CoreDNS** which forwards to Technitium for internal zones. All three Technitium instances run on encrypted block storage with zone replication via AXFR every 30 minutes.
## Architecture Diagram
@ -22,15 +22,14 @@ graph TB
end
subgraph "pfSense (10.0.20.1)"
pf_unbound[Unbound<br/>Resolver<br/>auth-zone AXFR]
pf_dnsmasq[dnsmasq<br/>Forwarder]
pf_kea[Kea DHCP4<br/>3 subnets, 53 reservations]
pf_ddns[Kea DHCP-DDNS<br/>RFC 2136]
pf_nat[NAT rdr<br/>UDP 53 → Technitium]
end
subgraph "Kubernetes Cluster"
NodeLocalDNS[NodeLocal DNSCache<br/>DaemonSet, 5 nodes<br/>169.254.20.10 + 10.96.0.10]
CoreDNS[CoreDNS<br/>kube-system<br/>.:53 + viktorbarzin.lan:53]
KubeDNSUpstream[kube-dns-upstream<br/>ClusterIP, selects CoreDNS pods]
subgraph "Technitium HA (namespace: technitium)"
Primary[Primary<br/>technitium]
@ -52,17 +51,16 @@ graph TB
Internet -->|DNS query| CF
CF -->|CNAME to tunnel| CFTunnel
LAN -->|DNS query UDP 53| pf_unbound
LAN -->|DNS query UDP 53| pf_nat
pf_nat -->|forward| LB_DNS
pf_kea -->|lease event| pf_ddns
pf_ddns -->|A + PTR| LB_DNS
pf_unbound -->|AXFR viktorbarzin.lan| LB_DNS
pf_unbound -->|public queries DoT :853| CF
pf_dnsmasq -->|.viktorbarzin.lan| LB_DNS
pf_dnsmasq -->|public queries| CF
NodeLocalDNS -->|cache miss| KubeDNSUpstream
KubeDNSUpstream --> CoreDNS
CoreDNS -->|.viktorbarzin.lan| ClusterIP
CoreDNS -->|public queries| pf_unbound
CoreDNS -->|public queries| pf_dnsmasq
LB_DNS --> Primary
LB_DNS --> Secondary
@ -82,9 +80,8 @@ graph TB
|-----------|----------|---------|---------|
| Technitium DNS | K8s namespace `technitium` | 14.3.0 | Primary internal DNS + recursive resolver |
| CoreDNS | K8s `kube-system` | Cluster default | K8s service discovery + forwarding to Technitium |
| NodeLocal DNSCache | K8s `kube-system` (DaemonSet) | `k8s-dns-node-cache:1.23.1` | Per-node DNS cache, transparent interception on 10.96.0.10 + 169.254.20.10. Insulates pods from CoreDNS/Technitium/pfSense disruption. |
| Cloudflare DNS | SaaS | N/A | Public domain management (~50 domains) |
| pfSense Unbound | 10.0.20.1 | pfSense 2.7.2 (Unbound 1.19) | DNS resolver on LAN/OPT1/WAN; AXFR-slaves `viktorbarzin.lan` from Technitium; DoT upstream to Cloudflare |
| pfSense dnsmasq | 10.0.20.1 | pfSense 2.7.x | DNS forwarder for management VLAN |
| Kea DHCP-DDNS | 10.0.20.1 | pfSense 2.7.x | Automatic DNS registration on DHCP lease |
| phpIPAM | K8s namespace `phpipam` | v1.7.0 | IPAM ↔ DNS bidirectional sync |
@ -93,22 +90,19 @@ graph TB
| Stack | Path | DNS Resources |
|-------|------|---------------|
| Technitium | `stacks/technitium/` | 3 deployments, services, PVCs, 4 CronJobs, CoreDNS ConfigMap |
| NodeLocal DNSCache | `stacks/nodelocal-dns/` | DaemonSet (5 pods), ConfigMap, kube-dns-upstream Service, headless metrics Service |
| Cloudflared | `stacks/cloudflared/` | Cloudflare DNS records (A, AAAA, CNAME, MX, TXT), tunnel config |
| phpIPAM | `stacks/phpipam/` | dns-sync CronJob, pfsense-import CronJob |
| pfSense | `stacks/pfsense/` | VM config only (Unbound config is managed out-of-band via pfSense web UI / direct config.xml edits; see `docs/runbooks/pfsense-unbound.md`) |
| pfSense | `stacks/pfsense/` | VM config (DNS config is via pfSense web UI) |
## DNS Resolution Paths
### K8s Pod → Internal Domain (.viktorbarzin.lan)
```
Pod → NodeLocal DNSCache (intercepts on kube-dns:10.96.0.10)
→ cache hit: serve locally (TTL 30s / stale up to 86400s via CoreDNS upstream)
→ cache miss: forward to kube-dns-upstream (selects CoreDNS pods directly)
→ CoreDNS: template matches 2+ labels before .viktorbarzin.lan → NXDOMAIN
→ CoreDNS: forward to Technitium ClusterIP (10.96.0.53)
→ Technitium resolves from viktorbarzin.lan zone
Pod → CoreDNS (kube-dns:53)
→ template: if 2+ labels before .viktorbarzin.lan → NXDOMAIN (ndots:5 junk filter)
→ forward to Technitium ClusterIP (10.96.0.53)
→ Technitium resolves from viktorbarzin.lan zone
```
The ndots:5 template in CoreDNS short-circuits queries like `www.cloudflare.com.viktorbarzin.lan` (caused by K8s search domain expansion) by returning NXDOMAIN for any query with 2+ labels before `.viktorbarzin.lan`. Only single-label queries (e.g., `idrac.viktorbarzin.lan`) reach Technitium.
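The filter is easy to verify from any pod that has `dig`; the kube-dns ClusterIP is the one given in this document:

```sh
# Multi-label junk name: expect status NXDOMAIN from the CoreDNS template.
dig @10.96.0.10 www.cloudflare.com.viktorbarzin.lan
# Single-label internal name: expect an A record answered via Technitium.
dig @10.96.0.10 idrac.viktorbarzin.lan
```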
@ -116,54 +110,41 @@ The ndots:5 template in CoreDNS short-circuits queries like `www.cloudflare.com.
### K8s Pod → Public Domain
```
Pod → NodeLocal DNSCache (intercepts on kube-dns:10.96.0.10)
→ cache hit: serve locally
→ cache miss: forward to kube-dns-upstream (selects CoreDNS pods directly)
→ CoreDNS: forward to pfSense (10.0.20.1), fallback 8.8.8.8, 1.1.1.1
→ pfSense Unbound:
- .viktorbarzin.lan → local auth-zone (AXFR-cached from Technitium)
- public → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
Pod → CoreDNS (kube-dns:53)
→ forward to pfSense (10.0.20.1), fallback 8.8.8.8, 1.1.1.1
→ pfSense dnsmasq → Cloudflare (1.1.1.1)
```
### LAN Client (192.168.1.x) → Any Domain
```
Client gets DNS=192.168.1.2 (pfSense WAN) from DHCP
→ pfSense Unbound listens on 192.168.1.2:53 directly (no NAT rdr)
- .viktorbarzin.lan → auth-zone (AXFR-cached from Technitium 10.0.20.201)
Survives full Technitium/K8s outage — auth-zone keeps serving from
/var/unbound/viktorbarzin.lan.zone with `fallback-enabled: yes`.
- .viktorbarzin.me (non-proxied) and other public → DoT to Cloudflare
(1.1.1.1 / 1.0.0.1 on port 853, SNI cloudflare-dns.com)
→ pfSense NAT rdr on WAN interface → Technitium LB (10.0.20.201)
→ Technitium resolves:
- .viktorbarzin.lan → local zone
- .viktorbarzin.me (non-proxied) → recursive, then Split Horizon translates
176.12.22.76 → 10.0.20.200 for 192.168.1.0/24 clients
- other → recursive to Cloudflare DoH (1.1.1.1)
```
**Trade-off vs. prior NAT rdr**: Split Horizon hairpin translation
(`176.12.22.76 → 10.0.20.200` for 192.168.1.x clients) was only applied
when queries reached Technitium via the NAT rdr. With Unbound answering
on 192.168.1.2:53 directly, non-proxied `*.viktorbarzin.me` queries on the
192.168.1.x LAN return the public IP, which the TP-Link AP can't hairpin.
If hairpin is broken on LAN for a given non-proxied service, the fix is
either (a) switch the service to proxied (via `dns_type = "proxied"`)
or (b) add a local-data override on pfSense Unbound. The pre-Unbound
state is documented in the `docs/runbooks/pfsense-unbound.md` rollback
section.
Client source IPs are preserved (no SNAT on 192.168.1.x → 10.0.20.x path) — Technitium logs show real per-device IPs.
### Management VLAN (10.0.10.x) → Any Domain
```
Client gets DNS from Kea DHCP → pfSense (10.0.10.1)
→ pfSense Unbound:
- .viktorbarzin.lan → auth-zone (local)
- other → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
→ pfSense dnsmasq:
- .viktorbarzin.lan → forward to Technitium (10.0.20.201)
- other → forward to Cloudflare (1.1.1.1)
```
### K8s VLAN (10.0.20.x) → Any Domain
```
Client gets DNS from Kea DHCP → pfSense (10.0.20.1)
→ pfSense Unbound:
- .viktorbarzin.lan → auth-zone (local)
- other → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
→ pfSense dnsmasq:
- .viktorbarzin.lan → forward to Technitium (10.0.20.201)
- other → forward to Cloudflare (1.1.1.1)
```
## Technitium DNS — Internal DNS Server
@ -212,7 +193,7 @@ All three pods share the `dns-server=true` label, so the DNS LoadBalancer (10.0.
| `0.168.192.in-addr.arpa` | Primary | PTR | Reverse DNS for Valchedrym site |
| `emrsn.org` | Primary (stub) | — | Returns NXDOMAIN locally (avoids 27K+ daily corporate query floods) |
**Dynamic updates**: Enabled via `UseSpecifiedNetworkACL` from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2) **AND require a valid TSIG signature** on `viktorbarzin.lan`, `10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`, `1.168.192.in-addr.arpa`. Policy: `updateSecurityPolicies = [{tsigKeyName: "kea-ddns", domain: "*.<zone>", allowedTypes: ["ANY"]}]`. Unsigned updates from the allowlisted pfSense source IPs are refused ("Dynamic Updates Security Policy"). TSIG key `kea-ddns` (HMAC-SHA256) present on primary/secondary/tertiary; secret in Vault `secret/viktor/kea_ddns_tsig_secret`. Applied 2026-04-19 (WS E, bd `code-o6j`).
**Dynamic updates**: Enabled via `UseSpecifiedNetworkACL` from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2) for Kea DDNS RFC 2136 updates.
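For debugging, an equivalent signed update can be sent by hand with `nsupdate`. The test hostname is illustrative, and reading the key as field `kea_ddns_tsig_secret` under `secret/viktor` is an assumption based on the Vault notation above:

```sh
# TSIG-signed RFC 2136 update against the Technitium LB, mirroring what Kea DHCP-DDNS sends.
# Unsigned updates from the same source IPs are refused by the security policy.
nsupdate -y "hmac-sha256:kea-ddns:$(vault kv get -field=kea_ddns_tsig_secret secret/viktor)" <<'EOF'
server 10.0.20.201
zone viktorbarzin.lan
update add ddns-test.viktorbarzin.lan 300 A 10.0.10.250
send
EOF
```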
### Resolver Settings
@ -271,61 +252,29 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
## NodeLocal DNSCache
A DaemonSet in `kube-system` (`node-local-dns`, image `registry.k8s.io/dns/k8s-dns-node-cache:1.23.1`) runs on every node including the control plane. Each pod uses `hostNetwork: true` + `NET_ADMIN` and installs iptables NOTRACK rules so it transparently serves DNS on both:
- **169.254.20.10** — the canonical link-local IP from the upstream docs
- **10.96.0.10** — the `kube-dns` ClusterIP, so existing pods (which already use this as their nameserver) hit the on-node cache with no kubelet change
Cache misses go to a separate `kube-dns-upstream` ClusterIP service (not `kube-dns`, to avoid looping back to ourselves) that selects the CoreDNS pods directly via `k8s-app=kube-dns`.
Priority class is `system-node-critical`; tolerations are permissive (`operator: Exists`) so the DaemonSet runs on tainted master and other reserved nodes. Kyverno `dns_config` drift is suppressed via `ignore_changes` on the DaemonSet.
**Caching**: `cluster.local:53` caches 9984 success / 9984 denial entries with 30s/5s TTLs. Other zones cache 30s. If CoreDNS is killed, nodes keep answering cached names — verified on 2026-04-19 by deleting all three CoreDNS pods and running `dig @169.254.20.10 idrac.viktorbarzin.lan` + `dig @169.254.20.10 github.com` from a pod (both returned answers).
**Kubelet clusterDNS**: **Unchanged** — still `10.96.0.10`. NodeLocal DNSCache co-listens on that IP so traffic interception is transparent; switching kubelet to `169.254.20.10` would require a rolling reconfigure of every node and provides no additional cache benefit over transparent mode.
**Metrics**: A headless Service `node-local-dns` (ClusterIP `None`) exposes each pod on port `9253` for Prometheus scraping (annotated `prometheus.io/scrape=true`).
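The resilience claim above can be re-checked at any time from a pod that has `dig`; both queries should keep answering from the node-local cache even while the CoreDNS pods are down, as long as the entries are cached:

```sh
# Internal and public names, served from 169.254.20.10 (or the intercepted 10.96.0.10).
dig @169.254.20.10 idrac.viktorbarzin.lan
dig @169.254.20.10 github.com
```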
## CoreDNS Configuration
CoreDNS is managed via Terraform in `stacks/technitium/modules/technitium/` — the Corefile ConfigMap lives in `main.tf`, and scaling/PDB are in `coredns.tf` (a `kubernetes_deployment_v1_patch` against the kubeadm-managed Deployment).
CoreDNS is managed via a Terraform `kubernetes_config_map` resource in `stacks/technitium/modules/technitium/main.tf`.
```
.:53 {
errors / health / ready
kubernetes cluster.local in-addr.arpa ip6.arpa # K8s service discovery
prometheus :9153 # Metrics
forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
policy sequential # try upstreams in order
health_check 5s # mark unhealthy in 5s
max_fails 2
}
cache {
success 10000 300 6
denial 10000 300 60
serve_stale 86400s # resilience during upstream outage
}
forward . 10.0.20.1 8.8.8.8 1.1.1.1 # pfSense → Google → Cloudflare
cache (success 10000 300, denial 10000 300)
loop / reload / loadbalance
}
viktorbarzin.lan:53 {
template: .*\..*\.viktorbarzin\.lan\.$ → NXDOMAIN # ndots:5 junk filter
forward . 10.96.0.53 { # Technitium ClusterIP
health_check 5s
max_fails 2
}
cache (success 10000 300, denial 10000 300, serve_stale 86400s)
forward . 10.96.0.53 # Technitium ClusterIP
cache (success 10000 300, denial 10000 300)
}
```
**Scaling**: 3 replicas, `required` anti-affinity on `kubernetes.io/hostname` (spread across 3 distinct nodes). PodDisruptionBudget `coredns` with `minAvailable=2`.
**Kyverno ndots injection**: A Kyverno policy injects `ndots:2` on all pods cluster-wide to reduce search domain expansion noise. The template regex is a second layer of defense for any queries that still get expanded.
**Failover behaviour**: With `policy sequential` on the root forward block, CoreDNS tries pfSense first; if `health_check 5s` detects pfSense as down, it fails over to 8.8.8.8 then 1.1.1.1 within ~5s rather than timing out per-query. Combined with `serve_stale`, pods keep resolving cached names for up to 24h even with full upstream failure.
## Cloudflare DNS — External Domains
All public domains are under the `viktorbarzin.me` zone. DNS records are **auto-created per service** via the `ingress_factory` module's `dns_type` parameter. A small number of records (Helm-managed ingresses, special cases) remain centrally managed in `config.tfvars`.
@ -375,9 +324,9 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
Devices get automatic DNS registration without manual intervention. See [networking.md § IPAM & DNS Auto-Registration](networking.md#ipam--dns-auto-registration) for the full data flow diagram.
Summary:
1. **Kea DHCP** on pfSense assigns IP (53 reservations across 3 subnets). DHCP option 6 (DNS servers) is pushed with two IPs per internal subnet: internal resolver + AdGuard public fallback (`94.140.14.14`) — clients survive an internal DNS outage.
2. **Kea DDNS** sends **TSIG-signed** RFC 2136 dynamic update to Technitium (A + PTR records) — immediate. Key `kea-ddns` (HMAC-SHA256); Technitium enforces both source-IP ACL and TSIG signature on `viktorbarzin.lan` + reverse zones.
3. **phpipam-pfsense-import** CronJob (hourly) pulls Kea leases + ARP table into phpIPAM
1. **Kea DHCP** on pfSense assigns IP (53 reservations across 3 subnets)
2. **Kea DDNS** sends RFC 2136 dynamic update to Technitium (A + PTR records) — immediate
3. **phpipam-pfsense-import** CronJob (5min) pulls Kea leases + ARP table into phpIPAM
4. **phpipam-dns-sync** CronJob (15min) pushes named phpIPAM hosts → Technitium A + PTR, pulls Technitium PTR → phpIPAM hostnames
## Automation CronJobs
@ -389,7 +338,7 @@ Summary:
| `technitium-split-horizon-sync` | `15 */6 * * *` | technitium | Split Horizon + DNS Rebinding Protection on all 3 instances |
| `technitium-dns-optimization` | `30 */6 * * *` | technitium | Min cache TTL 60s, emrsn.org stub zone |
| `phpipam-dns-sync` | `*/15 * * * *` | phpipam | Bidirectional phpIPAM ↔ Technitium DNS sync |
| `phpipam-pfsense-import` | `0 * * * *` | phpipam | Import Kea DHCP leases + ARP from pfSense |
| `phpipam-pfsense-import` | `*/5 * * * *` | phpipam | Import Kea DHCP leases + ARP from pfSense |
### Password Rotation Flow
@ -411,62 +360,28 @@ Vault DB engine rotates password
| Metric Source | Dashboard | Alerts |
|---------------|-----------|--------|
| Technitium query logs (PostgreSQL) | Grafana `technitium-dns.json` | — |
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | `CoreDNSErrors`, `CoreDNSForwardFailureRate` |
| Technitium zone-sync CronJob (Pushgateway) | — | `TechnitiumZoneSyncFailed`, `TechnitiumZoneSyncStale`, `TechnitiumZoneCountMismatch` |
| Technitium DNS pod availability | — | `TechnitiumDNSDown` |
| `dns-anomaly-monitor` CronJob (Pushgateway) | — | `DNSQuerySpike`, `DNSQueryRateDropped`, `DNSHighErrorRate` |
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | — |
| Uptime Kuma | External monitors for all proxied domains | ExternalAccessDivergence (15min) |
### Metrics pushed by `technitium-zone-sync`
The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus Pushgateway under `job=technitium-zone-sync`:
| Metric | Labels | Meaning |
|--------|--------|---------|
| `technitium_zone_sync_status` | — | 0 = last run succeeded, 1 = at least one zone failed to create |
| `technitium_zone_sync_failures` | — | Number of zones that failed to create this run |
| `technitium_zone_sync_last_run` | — | Unix timestamp of last run (used by `TechnitiumZoneSyncStale`) |
| `technitium_zone_count` | `instance=primary\|<replica-host>` | Zone count on each Technitium instance (drives `TechnitiumZoneCountMismatch`) |
### DNS alert rewrites
- `DNSQuerySpike` was previously broken: it compared current queries against `dns_anomaly_avg_queries`, which was computed from a per-pod `/tmp/dns_avg` file. Each CronJob run started with a fresh `/tmp`, so `NEW_AVG == TOTAL_QUERIES` every time and the spike condition could never fire. Rewritten to use `avg_over_time(dns_anomaly_total_queries[1h] offset 15m)` which compares against the actual 1h Prometheus history.
- `DNSQueryRateDropped` (new): fires when query rate drops below 50% of 1h average — upstream clients may be failing to reach Technitium.
## Troubleshooting
### DNS Not Resolving Internal Domains
1. Check NodeLocal DNSCache pods first — pod queries go through these: `kubectl -n kube-system get pod -l k8s-app=node-local-dns -o wide`
2. Check Technitium pods: `kubectl get pod -n technitium`
3. Check all 3 are healthy: `kubectl get pod -n technitium -l dns-server=true`
4. Test via NodeLocal DNSCache from a pod: `kubectl exec -it <pod> -- dig @169.254.20.10 idrac.viktorbarzin.lan`
5. Bypass NodeLocal DNSCache (test CoreDNS directly): `kubectl exec -it <pod> -- dig @<kube-dns-upstream-ClusterIP> idrac.viktorbarzin.lan` (`kubectl get svc -n kube-system kube-dns-upstream`)
6. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns`
7. Verify ClusterIP service: `kubectl get svc -n technitium technitium-dns-internal`
1. Check Technitium pods: `kubectl get pod -n technitium`
2. Check all 3 are healthy: `kubectl get pod -n technitium -l dns-server=true`
3. Test from a pod: `kubectl exec -it <pod> -- nslookup idrac.viktorbarzin.lan 10.96.0.53`
4. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns`
5. Verify ClusterIP service: `kubectl get svc -n technitium technitium-dns-internal`
### LAN Clients Can't Resolve
1. Verify pfSense Unbound is running: `ssh admin@10.0.20.1 "sockstat -l -4 -p 53 | grep unbound"` — expect listeners on `192.168.1.2:53`, `10.0.10.1:53`, `10.0.20.1:53`, `127.0.0.1:53`
2. Verify the auth-zone is loaded: `ssh admin@10.0.20.1 "unbound-control -c /var/unbound/unbound.conf list_auth_zones"` — expect `viktorbarzin.lan. serial N`
3. Test from LAN: `dig @192.168.1.2 idrac.viktorbarzin.lan` (should return with `aa` flag)
4. Test public upstream: `dig @192.168.1.2 example.com +dnssec` (should have `ad` flag — DoT via Cloudflare working)
5. If auth-zone can't AXFR: check Technitium `viktorbarzin.lan` zone options → `zoneTransferNetworkACL` contains `10.0.20.1, 10.0.10.1, 192.168.1.2`
6. See `docs/runbooks/pfsense-unbound.md` for full Unbound runbook and rollback instructions
1. Verify pfSense NAT rule redirects UDP 53 on WAN to 10.0.20.201
2. Check Technitium LB service: `kubectl get svc -n technitium technitium-dns`
3. Test from LAN: `dig @192.168.1.2 idrac.viktorbarzin.lan`
4. Check `externalTrafficPolicy: Local` — if no Technitium pod runs on the node receiving traffic, it drops
### Hairpin NAT Not Working (LAN → *.viktorbarzin.me Fails)
Since 2026-04-19 (Workstream D), pfSense Unbound answers LAN DNS queries
directly instead of forwarding to Technitium, so the Technitium Split Horizon
post-processing does NOT run for 192.168.1.x clients anymore. Non-proxied
services break hairpin on LAN clients again. Options:
1. **Switch service to proxied Cloudflare** (preferred) — set `dns_type = "proxied"` in the `ingress_factory` module call; DNS now resolves to Cloudflare edge, hairpin-independent.
2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.200` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver.
3. **Revert to prior NAT rdr + Technitium Split Horizon** — documented in `docs/runbooks/pfsense-unbound.md` rollback section.
K8s-side Split Horizon is still configured and applies when `*.viktorbarzin.me` queries DO reach Technitium (e.g., from pods that query via CoreDNS → Technitium forwarding for `.viktorbarzin.me` via pfSense). Verify Technitium split-horizon app:
1. Verify Split Horizon app is installed on all instances
2. Check CronJob status: `kubectl get cronjob -n technitium technitium-split-horizon-sync`
3. Run the job manually: `kubectl create job --from=cronjob/technitium-split-horizon-sync test-sh -n technitium`
@ -501,8 +416,6 @@ For external `.viktorbarzin.me` records:
## Incident History
- **2026-04-14 (SEV1)**: NFS `fsid=0` caused Technitium primary data loss on restart. Fixed by migrating all 3 instances to `proxmox-lvm-encrypted`, adding zone-sync CronJob (30min AXFR). See [post-mortem](../post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md).
- **2026-04-19 (hardening, not outage)**: Workstream D — pfSense Unbound replaces dnsmasq as the pfSense DNS service. Unbound AXFR-slaves `viktorbarzin.lan` from Technitium so LAN-side resolution survives a full K8s outage. WAN NAT rdr `192.168.1.2:53 → 10.0.20.201` removed (Unbound listens on WAN directly). DoT upstream via Cloudflare. See `docs/runbooks/pfsense-unbound.md` and bd `code-k0d`.
- **2026-04-19 (hardening, not outage)**: Workstream E — Kea DHCP now pushes TWO DNS IPs (internal + AdGuard public fallback `94.140.14.14`) via option 6 to the internal subnets (10.0.10/24, 10.0.20/24); 192.168.1/24 was already dual-IP (served by TP-Link). Kea DHCP-DDNS now TSIG-signs its RFC 2136 updates (key `kea-ddns`, HMAC-SHA256) and the Technitium zones require both source-IP ACL AND TSIG signature. See `docs/runbooks/pfsense-unbound.md` § "Kea DHCP-DDNS TSIG" and bd `code-o6j`.
## Related

View file

@ -178,11 +178,11 @@ flowchart LR
subgraph "Kubernetes Cluster"
C -->|Yes| D[Woodpecker Pipeline]
D --> E[Vault Auth<br/>K8s SA JWT]
E --> F[Fetch API Token]
E --> F[Fetch SSH Key]
end
subgraph "claude-agent-service (K8s)"
F --> G[HTTP POST /execute]
subgraph "DevVM (10.0.10.10)"
F --> G[SSH + Claude Code]
G --> H[issue-responder agent]
H --> I[Investigate / Implement]
I --> J[Comment on Issue]

View file

@ -1,147 +1,72 @@
# Mail Server Architecture
Last updated: 2026-04-19 (code-yiu Phase 6: MetalLB LB retired; traffic now enters via pfSense HAProxy with PROXY v2)
Last updated: 2026-04-12 (Brevo relay migration)
## Overview
Self-hosted email for `viktorbarzin.me` using docker-mailserver 15.0.0 on Kubernetes. Inbound mail arrives directly via MX record to the home IP on port 25. Outbound mail relays through Brevo EU (`smtp-relay.brevo.com:587` — migrated from Mailgun on 2026-04-12; SPF record cut over on 2026-04-18). Roundcubemail provides webmail access. CrowdSec protects SMTP/IMAP from brute-force attacks using real client IPs: pfSense HAProxy injects the PROXY v2 header on each backend connection so the mailserver pod sees the true source IP despite kube-proxy SNAT. See [`runbooks/mailserver-pfsense-haproxy.md`](../runbooks/mailserver-pfsense-haproxy.md) for ops details.
Self-hosted email for `viktorbarzin.me` using docker-mailserver 15.0.0 on Kubernetes. Inbound mail arrives directly via MX record to the home IP on port 25. Outbound mail relays through Mailgun EU. Roundcubemail provides webmail access. CrowdSec protects SMTP/IMAP from brute-force attacks using real client IPs via `externalTrafficPolicy: Local` on a dedicated MetalLB IP.
## Architecture Diagram
Two independent paths into the mailserver pod:
- **External** (MX traffic, webmail clients over WAN): Internet → pfSense → HAProxy → NodePort → **alt container ports** (2525/4465/5587/10993) that **require** PROXY v2 framing.
- **Intra-cluster** (Roundcube, E2E probe): same pod, **stock container ports** (25/465/587/993), **no** PROXY framing.
One Deployment, one pod, two sets of Postfix `master.cf` services + Dovecot `inet_listener` blocks, two Kubernetes Services (`mailserver` ClusterIP + `mailserver-proxy` NodePort).
```mermaid
flowchart TB
%% External ingress path
SENDER[Sending MTA<br/>arbitrary public IP] -->|MX lookup + SMTP<br/>:25| MX[mail.viktorbarzin.me<br/>A 176.12.22.76]
MX --> PF[pfSense WAN<br/>vtnet0 192.168.1.2]
PF -->|NAT rdr<br/>WAN:25/465/587/993<br/>→ 10.0.20.1:same| HAP
HAP[pfSense HAProxy<br/>4 TCP frontends on 10.0.20.1<br/>send-proxy-v2 to backends]
HAP -->|round-robin<br/>tcp-check inter 120s| KN{k8s worker<br/>node1..4}
KN -->|NodePort 30125-30128<br/>ETP: Cluster → kube-proxy SNAT| PODEXT
%% Internal ingress path
RC[Roundcubemail pod] -->|SMTP :587 + IMAP :993<br/>no PROXY| SVC[Service mailserver<br/>ClusterIP 10.103.108.x<br/>25/465/587/993]
PROBE[email-roundtrip-monitor<br/>CronJob every 20m] -->|IMAP :993<br/>no PROXY| SVC
SVC -->|kube-proxy routes| PODINT
%% The pod — two listener sets, one process tree
subgraph POD["mailserver pod (docker-mailserver 15.0.0)"]
direction LR
PODEXT[Alt ports<br/>2525 / 4465 / 5587 / 10993<br/><b>PROXY v2 REQUIRED</b><br/>smtpd_upstream_proxy_protocol=haproxy<br/>haproxy = yes]
PODINT[Stock ports<br/>25 / 465 / 587 / 993<br/>PROXY-free]
PODEXT --> POSTFIX
PODINT --> POSTFIX
POSTFIX[Postfix<br/>postscreen + smtpd + cleanup + queue]
POSTFIX --> RSPAMD[Rspamd<br/>spam + DKIM + DMARC]
RSPAMD --> DOVECOT[Dovecot IMAP<br/>LMTP deliver]
DOVECOT --> MAILBOX[(Maildir storage<br/>mailserver-data-encrypted PVC<br/>proxmox-lvm-encrypted LUKS2)]
graph TB
subgraph "Inbound Mail"
SENDER[Sending MTA] -->|MX lookup| MX[mail.viktorbarzin.me:25]
MX -->|176.12.22.76:25| PF[pfSense NAT]
PF -->|10.0.20.202:25| MLB[MetalLB<br/>ETP: Local]
MLB --> POSTFIX[Postfix MTA]
end
%% Outbound
POSTFIX -->|queued mail<br/>SASL + TLS| BREVO[Brevo EU Relay<br/>smtp-relay.brevo.com:587<br/>300/day free tier]
BREVO --> RECIPIENT[External Recipient]
subgraph "Mail Processing"
POSTFIX --> RSPAMD[Rspamd<br/>Spam/DKIM/DMARC]
RSPAMD --> DOVECOT[Dovecot IMAP]
DOVECOT --> MAILBOX[(Mailboxes<br/>proxmox-lvm PVC)]
end
%% Webmail HTTP path
USER[User browser] -->|HTTPS| CF[Cloudflare proxy<br/>mail.viktorbarzin.me]
CF --> TUNNEL[Cloudflared tunnel<br/>pfSense → Traefik]
TUNNEL --> TRAEFIK[Traefik Ingress<br/>Authentik-protected]
TRAEFIK --> RC
subgraph "Outbound Mail"
POSTFIX_OUT[Postfix] -->|SASL + TLS| MAILGUN[Brevo EU Relay<br/>smtp-relay.brevo.com:587]
MAILGUN --> RECIPIENT[Recipient]
end
%% Security
POSTFIX -.->|log stream<br/>real client IPs from PROXY v2| CSAGENT[CrowdSec Agent<br/>postfix + dovecot parsers]
CSAGENT -.-> CSLAPI[CrowdSec LAPI]
CSLAPI -.->|bouncer decisions<br/>ban external IPs| PF
subgraph "Webmail"
USER[User] -->|HTTPS| TRAEFIK[Traefik Ingress]
TRAEFIK --> RC[Roundcubemail]
RC -->|IMAP 993| DOVECOT
RC -->|SMTP 587| POSTFIX_OUT
end
%% Monitoring
PROBE -.->|Brevo HTTP API<br/>triggers external delivery| MX
PROBE -.->|Push on roundtrip success| PUSH[Pushgateway + Uptime Kuma]
subgraph "Security"
MLB -->|Real client IPs| CS_AGENT[CrowdSec Agent<br/>postfix + dovecot parsers]
CS_AGENT --> CS_LAPI[CrowdSec LAPI]
end
classDef extPath fill:#ffedd5,stroke:#ea580c,stroke-width:2px
classDef intPath fill:#dbeafe,stroke:#2563eb,stroke-width:2px
classDef pod fill:#dcfce7,stroke:#15803d
classDef sec fill:#fee2e2,stroke:#dc2626
class SENDER,MX,PF,HAP,KN,PODEXT extPath
class RC,PROBE,SVC,PODINT intPath
class POSTFIX,RSPAMD,DOVECOT,MAILBOX pod
class CSAGENT,CSLAPI sec
subgraph "Monitoring"
PROBE[E2E Roundtrip Probe<br/>CronJob every 20m] -->|Mailgun API| SENDER
PROBE -->|IMAP check| DOVECOT
PROBE --> PUSH[Pushgateway + Uptime Kuma]
DEXP[Dovecot Exporter<br/>:9166] --> PROM[Prometheus]
end
```
### PROXY v2 sequence (external SMTP roundtrip)
The diagram below illustrates the wire-level sequence of a Brevo probe email arriving at our MX; the same sequence applies to any external sender.
```mermaid
sequenceDiagram
autonumber
participant C as External MTA<br/>(e.g. Brevo 77.32.148.26)
participant PF as pfSense WAN<br/>192.168.1.2:25
participant HAP as pfSense HAProxy<br/>10.0.20.1:25
participant N as k8s-node:30125<br/>ETP: Cluster
participant P as Postfix postscreen<br/>pod:2525
C->>PF: TCP SYN dst=192.168.1.2:25
PF->>HAP: NAT rdr rewrites dst → 10.0.20.1:25
HAP->>N: TCP connect (src=10.0.20.1, dst=k8s-node:30125)
Note over HAP,N: HAProxy opens a NEW TCP flow<br/>to the backend k8s node.
HAP->>N: PROXY v2 header<br/>(source=77.32.148.26, dest=10.0.20.1)
N->>P: kube-proxy SNAT src=k8s-node IP<br/>forwards PROXY header + payload to pod
P->>P: Parse PROXY v2 header<br/>smtpd_client_addr := 77.32.148.26<br/>(despite kube-proxy SNAT on the wire)
P-->>C: SMTP banner 220 mail.viktorbarzin.me
C-->>P: EHLO / MAIL FROM / RCPT TO / DATA
Note over P,C: Real client IP logged in maillog,<br/>fed to CrowdSec postfix parser.
P->>P: → smtpd → Rspamd → Dovecot → mailbox
```
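A quick sanity check of the two listener classes from inside the pod — a hedged sketch that assumes `nc` is available in the container (any BusyBox-style netcat works):
```bash
# Stock port 25: greets immediately, no PROXY framing expected.
kubectl -n mailserver exec deploy/mailserver -c docker-mailserver -- \
  sh -c 'printf "QUIT\r\n" | nc -w 5 127.0.0.1 25'
# → "220 mail.viktorbarzin.me ESMTP ..."

# Alt port 2525: postscreen waits for a PROXY v2 header first, so a bare
# connection never sees a banner and the 5s timeout expires silently.
kubectl -n mailserver exec deploy/mailserver -c docker-mailserver -- \
  sh -c 'printf "QUIT\r\n" | nc -w 5 127.0.0.1 2525'
```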
## Components
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| docker-mailserver | 15.0.0 | `mailserver` namespace | Postfix MTA + Dovecot IMAP + Rspamd (single container) |
| docker-mailserver | 15.0.0 | `mailserver` namespace | Postfix MTA + Dovecot IMAP + Rspamd |
| Roundcubemail | 1.6.13-apache | `mailserver` namespace | Webmail UI (MySQL-backed) |
| Dovecot Exporter | latest | Sidecar in mailserver pod | Prometheus metrics (port 9166) |
| Rspamd | Built into docker-mailserver | — | Spam filtering, DKIM signing, DMARC verification |
| pfSense HAProxy | 2.9-dev6 (`pfSense-pkg-haproxy-devel`) | pfSense VM | TCP reverse proxy injecting PROXY v2 for external mail |
| Brevo EU (ex-Sendinblue) | SaaS | — | Outbound SMTP relay (300/day free) |
The Dovecot exporter was retired in code-1ik (2026-04-19) — `viktorbarzin/dovecot_exporter` relies on the pre-2.3 `old_stats` FIFO protocol, which docker-mailserver 15.0.0's Dovecot 2.3.19 no longer provides.
## Port mapping
The mailserver pod exposes **8 TCP listeners**: 4 stock + 4 alt. Two Kubernetes Services front them depending on whether the client can inject PROXY v2.
| Mail protocol | Service port | K8s Service | Container port | NodePort | PROXY v2? | Who uses this path |
|---|---|---|---|---|---|---|
| SMTP (plain + STARTTLS) | 25 | `mailserver` ClusterIP | 25 | — | ❌ stock | Intra-cluster only (not used — internal clients send via 587) |
| SMTPS (implicit TLS) | 465 | `mailserver` ClusterIP | 465 | — | ❌ stock | Intra-cluster (Roundcube rarely uses this) |
| Submission (STARTTLS) | 587 | `mailserver` ClusterIP | 587 | — | ❌ stock | **Roundcube pod** → mailserver.svc:587 |
| IMAPS | 993 | `mailserver` ClusterIP | 993 | — | ❌ stock | **Roundcube pod** + E2E probe → mailserver.svc:993 |
| SMTP | 25 | `mailserver-proxy` NodePort | 2525 | 30125 | ✅ required | External MX traffic via pfSense HAProxy |
| SMTPS | 465 | `mailserver-proxy` NodePort | 4465 | 30126 | ✅ required | External SMTPS submission |
| Submission | 587 | `mailserver-proxy` NodePort | 5587 | 30127 | ✅ required | External STARTTLS submission (mail clients over WAN) |
| IMAPS | 993 | `mailserver-proxy` NodePort | 10993 | 30128 | ✅ required | External IMAPS (mail clients over WAN) |
The alt listeners are set up by:
- **Postfix**: `user-patches.sh` (shipped via ConfigMap `mailserver-user-patches`) appends 3 entries to `master.cf` with `-o postscreen_upstream_proxy_protocol=haproxy` (for 2525) or `-o smtpd_upstream_proxy_protocol=haproxy` (for 4465/5587).
- **Dovecot**: `dovecot.cf` ConfigMap adds a second `inet_listener` inside `service imap-login` with `haproxy = yes`, plus `haproxy_trusted_networks = 10.0.20.0/24` to accept PROXY headers from the k8s node subnet (after kube-proxy SNAT, the source IP is always a node IP).
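A minimal sketch of the shape such a `user-patches.sh` can take — the `master.cf` fields below are illustrative, not a copy of the real ConfigMap; docker-mailserver runs this script at container start before Postfix comes up:
```bash
#!/bin/bash
# user-patches.sh (illustrative sketch): add PROXY-v2-only alt listeners.
cat >>/etc/postfix/master.cf <<'EOF'
2525      inet  n       -       n       -       1       postscreen
    -o postscreen_upstream_proxy_protocol=haproxy
4465      inet  n       -       n       -       -       smtpd
    -o smtpd_tls_wrappermode=yes
    -o smtpd_upstream_proxy_protocol=haproxy
5587      inet  n       -       n       -       -       smtpd
    -o smtpd_upstream_proxy_protocol=haproxy
EOF
```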
## Mail Flow
### Inbound
```
Internet → MX: mail.viktorbarzin.me (priority 1)
→ A record: 176.12.22.76 (non-proxied Cloudflare DNS-only)
→ pfSense NAT rdr: WAN:{25,465,587,993} → 10.0.20.1:{same}
→ pfSense HAProxy (TCP mode, send-proxy-v2 on backend)
→ k8s-node:{30125..30128} NodePort (mailserver-proxy, ETP: Cluster)
→ kube-proxy → pod alt listener (2525/4465/5587/10993)
→ Postfix postscreen / smtpd / Dovecot parses PROXY v2 header
→ Rspamd (spam + DKIM + DMARC) → Dovecot → mailbox
→ pfSense NAT: port 25 → 10.0.20.202:25
→ MetalLB (dedicated IP, ETP: Local — preserves real client IPs)
→ Postfix → Rspamd (spam + DKIM + DMARC check) → Dovecot → mailbox
```
No backup MX. If the server is down, sender MTAs queue and retry for 4-5 days per SMTP standards (RFC 5321).
@ -170,13 +95,13 @@ All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.t
| MX | `viktorbarzin.me` | `mail.viktorbarzin.me` (pri 1) | Inbound mail routing |
| A | `mail.viktorbarzin.me` | `176.12.22.76` (non-proxied) | Mail server IP |
| AAAA | `mail.viktorbarzin.me` | `2001:470:6e:43d::2` | IPv6 (HE tunnel) |
| TXT (SPF) | `viktorbarzin.me` | `v=spf1 include:spf.brevo.com ~all` | Authorize Brevo for outbound (soft-fail during cutover; was `include:mailgun.org -all` until 2026-04-18 Brevo migration) |
| TXT (DKIM) | `s1._domainkey` | RSA 1024-bit key | Mailgun DKIM (roundtrip probe only — inbound testing still uses Mailgun API) |
| TXT (SPF) | `viktorbarzin.me` | `v=spf1 include:mailgun.org -all` | Authorize Mailgun for outbound |
| TXT (DKIM) | `s1._domainkey` | RSA 1024-bit key | Mailgun DKIM (roundtrip probe) |
| TXT (DKIM) | `mail._domainkey` | RSA 2048-bit key | Rspamd self-hosted DKIM signing |
| CNAME (DKIM) | `brevo1._domainkey` | b1.viktorbarzin-me.dkim.brevo.com | Brevo outbound DKIM (delegated) |
| CNAME (DKIM) | `brevo2._domainkey` | b2.viktorbarzin-me.dkim.brevo.com | Brevo outbound DKIM (delegated) |
| TXT | `viktorbarzin.me` | `brevo-code:a6ef1dd9...` | Brevo domain verification |
| TXT (DMARC) | `_dmarc` | `p=quarantine; pct=100; rua=mailto:dmarc@viktorbarzin.me` | DMARC enforcement; aggregate reports land in-domain at `dmarc@viktorbarzin.me` (tracked under code-569; current live record still points at `e21c0ff8@dmarc.mailgun.org` pending cutover) |
| TXT (DMARC) | `_dmarc` | `p=quarantine; pct=100` | DMARC enforcement, reports to Mailgun + ondmarc |
| TXT (MTA-STS) | `_mta-sts` | `v=STSv1; id=20260412` | TLS enforcement for inbound |
| TXT (TLSRPT) | `_smtp._tls` | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS failure reporting |
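The records can be spot-checked from any host with `dig`; the table above remains the source of truth for the expected values:
```bash
dig +short MX viktorbarzin.me                    # mail.viktorbarzin.me (priority 1)
dig +short A mail.viktorbarzin.me                # 176.12.22.76
dig +short TXT viktorbarzin.me | grep spf1       # SPF policy
dig +short TXT mail._domainkey.viktorbarzin.me   # self-hosted DKIM (Rspamd) key
dig +short TXT _dmarc.viktorbarzin.me            # DMARC policy
dig +short TXT _mta-sts.viktorbarzin.me          # MTA-STS enablement record
```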
@ -189,13 +114,9 @@ Reverse DNS for `176.12.22.76` returns `176-12-22-76.pon.spectrumnet.bg.` (ISP-a
### CrowdSec Integration
- **Collections**: `crowdsecurity/postfix` + `crowdsecurity/dovecot` (installed)
- **Log acquisition**: CrowdSec agents parse mailserver pod logs for brute-force patterns
- **Real client IPs**: pfSense HAProxy injects PROXY v2 header on each backend connection; Postfix (`postscreen_upstream_proxy_protocol=haproxy` / `smtpd_upstream_proxy_protocol=haproxy` on alt ports) + Dovecot (`haproxy = yes` on alt IMAPS listener) parse it to recover the true source IP despite kube-proxy SNAT. Replaces the pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme (see code-yiu)
- **Real client IPs**: `externalTrafficPolicy: Local` on dedicated MetalLB IP `10.0.20.202` preserves original client IPs (not SNATed to node IPs)
- **Decisions**: CrowdSec bans/challenges attackers via firewall bouncer rules
### Fail2ban Disabled (CrowdSec is the Policy)
docker-mailserver ships Fail2ban, but it is explicitly disabled here: `ENABLE_FAIL2BAN = "0"` at [`stacks/mailserver/modules/mailserver/main.tf:68`](../../stacks/mailserver/modules/mailserver/main.tf). CrowdSec is the cluster-wide bouncer for SSH, HTTP, and SMTP/IMAP brute-force defence — it already parses the `postfix` and `dovecot` log streams via the collections listed above and applies decisions at the LB/firewall layer. Enabling Fail2ban in-pod would create a duplicate response path (two systems racing to ban the same IP from different enforcement points), add iptables churn inside the container, and fragment the audit trail across two decision stores. Decision (2026-04-18): keep it disabled; CrowdSec owns this policy.
### Rspamd
- Spam filtering with phishing detection and Oletools
- DKIM signing (selector `mail`, 2048-bit RSA)
@ -218,30 +139,28 @@ anvil_rate_time_unit = 60s
## Monitoring
### E2E Roundtrip Probe
CronJob `email-roundtrip-monitor` (every 20 min, `*/20 * * * *`):
1. Sends test email via **Brevo HTTP API** to `smoke-test@viktorbarzin.me` (Brevo delivers it to our MX over the public internet, exercising the full external-ingress path).
2. Email hits WAN → pfSense HAProxy → k8s-node:30125 → pod :2525 postscreen (PROXY v2) → Postfix → catch-all delivers to `spam@` mailbox.
3. Verifies delivery via IMAP — connects to `mailserver.mailserver.svc.cluster.local:993` (intra-cluster path, no PROXY), searches by UUID marker.
4. Deletes test email, pushes metrics to Pushgateway + Uptime Kuma.
The probe's secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from ExternalSecret `mailserver-probe-secrets` (synced from Vault `secret/viktor` + `secret/platform.mailserver_accounts`) — see code-39v.
CronJob `email-roundtrip-monitor` (every 10 min):
1. Sends test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
2. Email hits MX → Postfix → catch-all delivers to `spam@` mailbox
3. Verifies delivery via IMAP (searches by UUID marker)
4. Deletes test email, pushes metrics to Pushgateway + Uptime Kuma
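Condensed into shell, the probe does roughly the following — a hedged sketch only (the Brevo endpoint and curl's IMAP syntax are assumptions for illustration; the real probe is the Python CronJob under `stacks/mailserver/modules/mailserver/`):
```bash
MARKER=$(uuidgen)

# 1. Send an external test mail through Brevo's HTTP API (assumed v3 endpoint),
#    which delivers it back to our MX over the public internet.
curl -sf https://api.brevo.com/v3/smtp/email \
  -H "api-key: ${BREVO_API_KEY}" -H "content-type: application/json" \
  -d "{\"sender\":{\"email\":\"probe@viktorbarzin.me\"},\"to\":[{\"email\":\"smoke-test@viktorbarzin.me\"}],\"subject\":\"roundtrip ${MARKER}\",\"textContent\":\"probe\"}"

# 2. Wait, then search the catch-all mailbox over the intra-cluster IMAP path.
sleep 120
curl -s --url "imaps://mailserver.mailserver.svc.cluster.local:993/INBOX" \
  --user "spam@viktorbarzin.me:${EMAIL_MONITOR_IMAP_PASSWORD}" \
  --request "SEARCH SUBJECT ${MARKER}"
# Non-empty "* SEARCH ..." result ⇒ roundtrip succeeded; push to Pushgateway/Kuma.
```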
### Prometheus Alerts
| Alert | Threshold | Severity |
|-------|-----------|----------|
| MailServerDown | No replicas for 5m | warning |
| EmailRoundtripFailing | Probe failing for 30m | warning |
| EmailRoundtripStale | No success in >80m (60m threshold + for:20m) | warning |
| EmailRoundtripStale | No success in >40m | warning |
| EmailRoundtripNeverRun | Metric absent for 40m | warning |
### Uptime Kuma Monitors
- TCP SMTP on `176.12.22.76:25` — full external path (DNS → WAN → pfSense HAProxy → mailserver)
- TCP `mailserver.svc:{587,993}` — intra-cluster ClusterIP path
- TCP `10.0.20.1:{25,993}` — pfSense HAProxy health (post code-yiu Phase 6)
- E2E Push monitor (receives push from `email-roundtrip-monitor` probe)
- TCP SMTP on `176.12.22.76:25` (external, 60s interval)
- TCP IMAP on `10.0.20.202:993` (internal)
- E2E Push monitor (receives push from roundtrip probe)
### Dovecot exporter — retired
`viktorbarzin/dovecot_exporter` was removed in code-1ik (2026-04-19). It spoke the pre-2.3 `old_stats` FIFO protocol; Dovecot 2.3.19 (docker-mailserver 15.0.0) no longer emits that, so the scrape only ever returned `dovecot_up{scope="user"} 0`. If Dovecot metrics become valuable, reach for a 2.3+ compatible exporter (e.g. `jtackaberry/dovecot_exporter`) and re-add the scrape + alerts. The previously-created `mailserver-metrics` ClusterIP Service was also removed.
### Dovecot Exporter
- Sidecar container in mailserver pod, port 9166
- Scraped by Prometheus for IMAP connection metrics
## Terraform
@ -258,21 +177,17 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External
| `secret/platform` | `mailserver_accounts` | User credentials (JSON) |
| `secret/platform` | `mailserver_aliases` | Postfix virtual aliases |
| `secret/platform` | `mailserver_opendkim_key` | DKIM private key |
| `secret/platform` | `mailserver_sasl_passwd` | Brevo relay credentials (`[smtp-relay.brevo.com]:587 <login>:<key>`) |
| `secret/viktor` | `brevo_api_key` | Brevo API key — used by BOTH outbound SMTP SASL (postfix) AND the E2E roundtrip probe (sends external test mail via Brevo HTTP) |
| `secret/viktor` | `mailgun_api_key` | Historical; no longer used by the probe post code-n5l/Phase-5 work. Kept for reference. |
| `secret/platform` | `mailserver_sasl_passwd` | Mailgun relay credentials |
| `secret/viktor` | `mailgun_api_key` | Mailgun API for E2E probe (inbound testing) |
| `secret/viktor` | `brevo_api_key` | Brevo API key (stored for reference) |
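Before an apply (or when the ExternalSecrets drift), the underlying Vault keys can be checked directly — a hedged sketch assuming the standard `vault` CLI against the KV mounts above and an illustrative `VAULT_ADDR`:
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me    # illustrative endpoint

# List the mailserver_* keys stored under secret/platform.
vault kv get -format=json secret/platform \
  | jq -r '.data.data | keys[]' | grep '^mailserver_'

# Confirm the Brevo key used by both the SASL relay and the probe exists.
vault kv get -field=brevo_api_key secret/viktor >/dev/null && echo "brevo_api_key present"
```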
## Storage
| PVC | Size | Storage Class | Purpose |
|-----|------|---------------|---------|
| `mailserver-data-encrypted` | 2Gi (auto-resize 5Gi) | `proxmox-lvm-encrypted` (LUKS2) | Maildir + Postfix queue + state + logs |
| `roundcubemail-html-encrypted` | 1Gi | `proxmox-lvm-encrypted` | Roundcube PHP code + user session data |
| `roundcubemail-enigma-encrypted` | 1Gi | `proxmox-lvm-encrypted` | Roundcube Enigma (PGP) user keys |
| `mailserver-backup-host` (RWX) | 10Gi | `nfs-truenas` (historical SC name, Proxmox host NFS) | `mailserver-backup` CronJob destination (`/srv/nfs/mailserver-backup/<YYYY-WW>/`) |
| `roundcube-backup-host` (RWX) | 10Gi | `nfs-truenas` (historical SC name, Proxmox host NFS) | `roundcube-backup` CronJob destination |
**Backup**: daily `mailserver-backup` + `roundcube-backup` CronJobs rsync data PVCs to NFS. NFS directory is picked up by the PVE host's inotify-driven `/usr/local/bin/offsite-sync-backup` which pushes to Synology (weekly). See [Storage & Backup Architecture](storage.md) for the 3-2-1 flow.
| `mailserver-data-proxmox` | 2Gi (auto-resize 5Gi) | proxmox-lvm | Mail data, state, logs |
| `roundcubemail-html-proxmox` | 1Gi | proxmox-lvm | Roundcube web files |
| `roundcubemail-enigma-proxmox` | 1Gi | proxmox-lvm | Roundcube encryption |
## Decisions & Rationale
@ -291,23 +206,19 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External
- **Decision**: Rspamd replaces both SpamAssassin and OpenDKIM in a single component
- **Tradeoff**: Higher memory usage (~150-200MB) but simpler stack
### Client-IP Preservation (pfSense HAProxy + PROXY v2)
- **Current (2026-04-19, bd code-yiu)**: pfSense HAProxy listens on `10.0.20.1:{25,465,587,993}`, forwards to k8s NodePort 30125-30128 with `send-proxy-v2` on each backend connection. The mailserver pod exposes parallel listeners (2525/4465/5587/10993) that REQUIRE the PROXY v2 header, while the stock ports 25/465/587/993 stay PROXY-free for intra-cluster traffic (Roundcube, probe). The mailserver Service is ClusterIP-only; ETP is no longer a concern for external traffic.
- **Historical (2026-04-12 → 2026-04-19)**: Dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` — required pod/speaker colocation; kube-proxy preserved client IP only when pod was on the same node as the advertising speaker.
- **Why switched**: ETP:Local made the mailserver's single replica drop inbound mail silently during pod reschedule (30-60s GARP flip). HAProxy with `send-proxy-v2` lets the pod reschedule to any node and recover IP-preservation through the header.
- **Tradeoff**: pfSense now runs HAProxy (one more service in the firewall's responsibility); alt container ports + extra Service are ~80 lines of Terraform. The win is HA without IP-preservation compromise.
- **Runbook**: [`runbooks/mailserver-pfsense-haproxy.md`](../runbooks/mailserver-pfsense-haproxy.md).
### Dedicated MetalLB IP for CrowdSec
- **Decision**: Mailserver gets `10.0.20.202` (separate from shared `10.0.20.200`) with `externalTrafficPolicy: Local`
- **Why**: Shared IP with ETP: Cluster SNATs away real client IPs, making CrowdSec detections and Postfix rate limiting useless
- **Tradeoff**: Uses one extra IP from the MetalLB pool. Requires separate pfSense NAT rule.
## Troubleshooting
### Inbound mail not arriving
1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside
3. **pfSense NAT**: verify WAN:{25,465,587,993} rdr to `10.0.20.1` (HAProxy VIP). `ssh admin@10.0.20.1 'pfctl -sn' | grep '10.0.20.1'`
4. **HAProxy health**: `ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"` — at least one backend in `srv_op_state=2` (UP) per pool
5. **Container listener**: `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'` — 8 lines expected
6. **Postfix queue + delivery**: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep -E 'from=|reject|smtpd-proxy'`
7. **CrowdSec decisions**: `kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli decisions list`
1. Check MX: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
2. Check port 25: `nc -zw5 mail.viktorbarzin.me 25`
3. Check pfSense NAT rule: port 25 → `10.0.20.202:25`
4. Check Postfix logs: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep -E 'from=|reject'`
5. Check if CrowdSec is blocking the sender: `kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli decisions list`
### Outbound mail failing
1. Check Brevo relay: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep relay` — should show `relay=smtp-relay.brevo.com`
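To rule Postfix out entirely, the relay credentials can be tested from a workstation with `swaks` (illustrative sketch — substitute the real Brevo SMTP login and an address you control as the recipient):
```bash
swaks --server smtp-relay.brevo.com:587 --tls \
      --auth LOGIN --auth-user "$BREVO_SMTP_LOGIN" --auth-password "$BREVO_SMTP_KEY" \
      --from probe@viktorbarzin.me --to you@example.com
# A 250 after DATA ⇒ the relay accepts our credentials; persistent 4xx/5xx here
# points at Brevo (quota, credentials), not at the in-cluster Postfix.
```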

View file

@ -63,7 +63,6 @@ graph TB
| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
| dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection |
| Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
| Forgejo Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `forgejo.viktorbarzin.me` (HTTP via in-cluster service), HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Replaces the legacy `registry-integrity-probe` against `registry.viktorbarzin.me:5050` decommissioned in Phase 4 of forgejo-registry-consolidation 2026-05-07. |
## How It Works
@ -76,9 +75,7 @@ Prometheus scrapes metrics from all cluster components and applications using Se
### External Monitoring
The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] <service>` monitors for externally-reachable ingresses. Discovery is **opt-OUT**: the script lists every ingress via the K8s API and creates a monitor for any host ending in `.viktorbarzin.me`, skipping only those annotated `uptime.viktorbarzin.me/external-monitor: "false"`. Both `ingress_factory` and the `reverse-proxy` factory emit that annotation when the caller sets `external_monitor = false`; leaving it null keeps the opt-in default (important for helm-provisioned ingresses that don't go through our factories). The legacy `cloudflare_proxied_names` ConfigMap is a fallback if the K8s API discovery fails.
These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes a `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. Alert `ExternalAccessDivergence` fires after 15min of divergence.
The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] <service>` monitors for every service in `cloudflare_proxied_names`. These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes a `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. Alert `ExternalAccessDivergence` fires after 15min of divergence.
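For an ingress provisioned outside the factories (e.g. by a Helm chart), the same opt-out can be applied by hand — a hedged example with a hypothetical ingress name and namespace:
```bash
kubectl -n some-namespace annotate ingress some-chart-ingress \
  uptime.viktorbarzin.me/external-monitor=false --overwrite
# The next external-monitor-sync run (≤10 min) skips this host.
```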
Data flows from targets through Prometheus storage to Grafana dashboards. Applications emit logs to stdout/stderr which are aggregated by Loki and queryable through Grafana's log viewer.
@ -158,14 +155,9 @@ spec:
#### Email Monitoring Alerts
- **EmailRoundtripFailing**: E2E email probe returning failure for >30m
- **EmailRoundtripStale**: No successful email round-trip in >80m (60m threshold + for:20m)
- **EmailRoundtripStale**: No successful email round-trip in >40m
- **EmailRoundtripNeverRun**: Email probe has never reported (40m)
#### Registry Integrity Alerts
- **RegistryManifestIntegrityFailure**: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of `registry_manifest_integrity_failures > 0`. Remediation: rebuild affected image per `docs/runbooks/registry-rebuild-image.md`.
- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
2. Email lands in the `spam@` catch-all mailbox via MX delivery

View file

@ -1,6 +1,6 @@
# Networking Architecture
Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed)
Last updated: 2026-04-12
## Overview
@ -28,6 +28,7 @@ graph TB
subgraph "VLAN 10 - Management<br/>10.0.10.0/24"
Proxmox[Proxmox Host<br/>10.0.10.1]
TrueNAS[TrueNAS<br/>10.0.10.15]
DevVM[DevVM<br/>10.0.10.10]
Registry[Registry VM<br/>10.0.20.10]
end
@ -63,6 +64,7 @@ graph TB
vmbr0 -.physical link.- eno1
vmbr0 --> vmbr1
vmbr1 -.VLAN 10.- Proxmox
vmbr1 -.VLAN 10.- TrueNAS
vmbr1 -.VLAN 10.- DevVM
vmbr1 -.VLAN 20.- pfSense
vmbr1 -.VLAN 20.- Tech
@ -104,7 +106,7 @@ flowchart LR
end
subgraph K8s["Kubernetes"]
Import[CronJob<br/>pfsense-import<br/>hourly]
Import[CronJob<br/>pfsense-import<br/>every 5min]
Sync[CronJob<br/>dns-sync<br/>every 15min]
IPAM[phpIPAM<br/>Web UI + API]
MySQL[(MySQL<br/>InnoDB)]
@ -142,14 +144,14 @@ flowchart LR
### DHCP Coverage
| Subnet | DHCP Server | DNS option 6 | Reservations | DDNS | Notes |
|--------|------------|--------------|--------------|------|-------|
| 10.0.10.0/24 (Mgmt) | Kea on pfSense | `10.0.10.1, 94.140.14.14` | 3 (devvm, pxe, ha) | Yes (TSIG) | VMs with static MACs |
| 10.0.20.0/24 (K8s) | Kea on pfSense | `10.0.20.1, 94.140.14.14` | 7 (master, nodes 1-5, registry) | Yes (TSIG) | K8s cluster nodes |
| 192.168.1.0/24 (LAN) | **TP-Link AP** | `192.168.1.2, 94.140.14.14` | 42 (all home devices) | Yes | pfSense Kea WAN is disabled |
| 10.3.2.0/24 (VPN) | Static | — | — | No | WireGuard peers |
| 192.168.0.0/24 (Valchedrym) | OpenWRT | — | — | No | Remote site |
| 192.168.8.0/24 (London) | GL-iNet | — | — | No | Remote site |
| Subnet | DHCP Server | Reservations | DDNS | Notes |
|--------|------------|--------------|------|-------|
| 10.0.10.0/24 (Mgmt) | Kea on pfSense | 4 (devvm, truenas, pxe, ha) | Yes | VMs with static MACs |
| 10.0.20.0/24 (K8s) | Kea on pfSense | 7 (master, nodes 1-5, registry) | Yes | K8s cluster nodes |
| 192.168.1.0/24 (LAN) | Kea on pfSense | 42 (all home devices) | Yes | TP-Link is dumb AP only |
| 10.3.2.0/24 (VPN) | Static | — | No | WireGuard peers |
| 192.168.0.0/24 (Valchedrym) | OpenWRT | — | No | Remote site |
| 192.168.8.0/24 (London) | GL-iNet | — | No | Remote site |
## How It Works
@ -158,7 +160,7 @@ flowchart LR
The Proxmox host uses a dual-bridge architecture:
- **vmbr0**: Physical bridge on interface `eno1`, connected to upstream LAN (192.168.1.0/24). Proxmox management IP is 192.168.1.127.
- **vmbr1**: Internal VLAN-aware bridge, acts as a trunk carrying:
- **VLAN 10 (Management)**: 10.0.10.0/24 — Proxmox, DevVM
- **VLAN 10 (Management)**: 10.0.10.0/24 — Proxmox, TrueNAS, DevVM
- **VLAN 20 (Kubernetes)**: 10.0.20.0/24 — All K8s nodes, services, MetalLB IPs
VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the upstream LAN via NAT.
@ -312,14 +314,9 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**pfSense**:
- Config: Not Terraform-managed (pfSense web UI / config.xml)
- DHCP: Kea DHCP4 on the two internal VLANs (VLAN 10 = 10.0.10.0/24, VLAN 20 = 10.0.20.0/24). WAN/192.168.1.0/24 is served by the TP-Link dumb AP — pfSense's Kea WAN subnet is disabled.
- **DNS option 6** (per-subnet, WS E 2026-04-19):
- 10.0.10.0/24 → `10.0.10.1, 94.140.14.14` (internal Unbound + AdGuard Home public fallback)
- 10.0.20.0/24 → `10.0.20.1, 94.140.14.14`
- 192.168.1.0/24 → `192.168.1.2, 94.140.14.14` (served by TP-Link, unchanged by WS E)
- Rationale: clients survive an internal resolver outage by falling through to AdGuard (`94.140.14.14`) — confirmed via null-route drill on 2026-04-19.
- DHCP: Kea DHCP4 on all 3 subnets (VLAN 10, VLAN 20, WAN/LAN 192.168.1.0/24)
- 42 MAC→IP reservations for 192.168.1.0/24 (all known home devices)
- DHCP DDNS: Kea DHCP-DDNS sends **TSIG-signed** RFC 2136 updates to Technitium (key `kea-ddns`, HMAC-SHA256; secret in Vault `secret/viktor/kea_ddns_tsig_secret`). Zone `viktorbarzin.lan` + reverse zones require both a pfSense-source IP AND a valid TSIG signature. Config: `/usr/local/etc/kea/kea-dhcp-ddns.conf` (hand-managed on pfSense; pre-WS-E backup at `kea-dhcp-ddns.conf.2026-04-19-pre-tsig`).
- DHCP DDNS: Kea DHCP-DDNS sends RFC 2136 updates to Technitium on every lease grant (forward A + reverse PTR)
- Firewall rules: Allow K8s egress, block inter-VLAN by default
**Technitium**:
@ -338,7 +335,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
- Stack: `stacks/phpipam/`
- Web UI: `phpipam.viktorbarzin.me` (Authentik-protected)
- Database: MySQL InnoDB cluster (`mysql.dbaas.svc.cluster.local`)
- Device import: CronJob `phpipam-pfsense-import` hourly — queries Kea DHCP leases + pfSense ARP table via SSH (no active scanning)
- Device import: CronJob `phpipam-pfsense-import` every 5min — queries Kea DHCP leases + pfSense ARP table via SSH (no active scanning)
- DNS sync: CronJob `phpipam-dns-sync` every 15min — bidirectional sync between phpIPAM and Technitium DNS (push named hosts → A+PTR, pull DNS hostnames → unnamed phpIPAM entries)
- Subnets tracked: 10.0.10.0/24, 10.0.20.0/24, 192.168.1.0/24, 10.3.2.0/24, 192.168.8.0/24, 192.168.0.0/24
- API: REST API enabled (app `claude`, ssl_token auth), MCP server available for agent access
@ -367,7 +364,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
1. **Single flat network**: Simpler, but no isolation between management and workload traffic.
2. **Routed network with physical VLANs**: Requires switch with VLAN support.
**Decision**: vmbr0 (physical) + vmbr1 (VLAN trunk) gives isolation without requiring managed switches. Management traffic (Proxmox, DevVM) stays on VLAN 10, K8s workloads stay on VLAN 20. Failures in K8s don't affect access to Proxmox or storage.
**Decision**: vmbr0 (physical) + vmbr1 (VLAN trunk) gives isolation without requiring managed switches. Management traffic (Proxmox, TrueNAS) stays on VLAN 10, K8s workloads stay on VLAN 20. Failures in K8s don't affect access to Proxmox or storage.
### Why Cloudflared Tunnel Instead of Port Forwarding?

View file

@ -92,7 +92,7 @@ graph TB
| 203 | k8s-node3 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
| 204 | k8s-node4 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
| 220 | docker-registry | 4 | 4GB | vmbr1:vlan20 | 10.0.20.10 | Private Docker registry |
| ~~9000~~ | ~~truenas~~ | — | — | — | ~~10.0.10.15~~ | **DECOMMISSIONED 2026-04-13** — NFS now served by Proxmox host (192.168.1.127). VM still exists in stopped state on PVE pending user decision on deletion. |
| ~~9000~~ | ~~truenas~~ | — | — | — | ~~10.0.10.15~~ | **DECOMMISSIONED** — NFS now served by Proxmox host (192.168.1.127) |
### Kubernetes Cluster
@ -139,7 +139,7 @@ The Kubernetes cluster consists of 5 nodes:
- **k8s-node1 (201)**: 16c/32GB GPU node with Tesla T4 passthrough, tainted for GPU workloads only
- **k8s-node2-4 (202-204)**: 8c/32GB workers running general-purpose workloads
GPU passthrough on node1 uses PCIe device 0000:06:00.0. The NVIDIA GPU Operator's gpu-feature-discovery auto-labels whichever node carries the card with `nvidia.com/gpu.present=true`; `null_resource.gpu_node_config` taints the same set of nodes with `nvidia.com/gpu=true:PreferNoSchedule`. No hostname is hardcoded — moving the card to a different node requires no Terraform edits.
GPU passthrough on node1 uses PCIe device 0000:06:00.0, with Kubernetes taint `nvidia.com/gpu=true:NoSchedule` and label `gpu=true` to ensure only GPU-requesting pods schedule there.
### Service Organization
@ -213,7 +213,7 @@ Secrets are stored in HashiCorp Vault under `secret/`:
**Rationale**:
- **Flexibility**: Easy to snapshot, clone, and roll back VMs during upgrades
- **Isolation**: Management network (devvm) separated from Kubernetes
- **Isolation**: Management network (TrueNAS, devvm) separated from Kubernetes
- **GPU passthrough**: Can dedicate GPU to a single node without tainting the entire host
- **Multi-purpose**: Same physical host can run non-K8s VMs (pfSense, Home Assistant)

View file

@ -2,7 +2,7 @@
## Overview
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 5-layer anti-AI scraping defense. All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
## Architecture Diagram
@ -59,7 +59,7 @@ Every incoming request passes through 6 security layers:
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
4. **Anti-AI Scraping** - 5-layer bot defense (optional per service)
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
7. **Retry Middleware** - Auto-retry on transient errors (2 attempts, 100ms delay)
@ -131,12 +131,10 @@ This prevents resource exhaustion and enforces governance without manual quota m
| `sync-tier-label` | Propagate tier label to child resources | Enforce |
| `goldilocks-vpa-auto-mode` | Disable VPA globally (VPA off) | Enforce |
### Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
### Anti-AI Scraping (5-Layer Defense)
Enabled by default via `ingress_factory` module. Disable per-service with `anti_ai_scraping = false`.
Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Robots-Tag). The `strip-accept-encoding` and `anti-ai-trap-links` middlewares were removed in April 2026 due to Traefik v3.6.12 Yaegi plugin incompatibility with the rewrite-body plugin.
#### Layer 1: Bot Blocking (ForwardAuth)
- Middleware calls `poison-fountain` service before backend
@ -150,16 +148,25 @@ Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Rob
- Instructs compliant bots to skip content
- Lightweight, no performance impact
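A quick way to confirm the header layer is active on a given service (hostname illustrative — any factory-managed public ingress works):
```bash
curl -sI https://nextcloud.viktorbarzin.me | grep -i '^x-robots-tag'
# expected: X-Robots-Tag: noai, noimageai
```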
#### ~~Layer 3: Trap Links~~ (REMOVED)
#### Layer 3: Trap Links
Removed April 2026. The rewrite-body Traefik plugin, which injected the hidden trap links, broke on Traefik v3.6.12 due to Yaegi runtime bugs. The companion `strip-accept-encoding` middleware was also removed.
- JavaScript injects invisible links before `</body>`
- Links point to honeypot endpoints
- Legitimate browsers don't click, bots follow
- Triggered bots get added to ban list
#### Layer 3 (formerly 4): Tarpit / Poison Content
#### Layer 4: Tarpit
- Serves AI bots extremely slowly (~100 bytes/sec)
- Wastes bot resources, makes scraping uneconomical
- Humans see normal speed (only applies to detected bots)
#### Layer 5: Poison Content
- `poison-fountain` service still exists as a standalone service at `poison.viktorbarzin.me`
- Serves AI bots extremely slowly (~100 bytes/sec tarpit)
- CronJob every 6 hours generates fake content
- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly still get tarpitted and poisoned
- Injects misleading/nonsense data into pages shown to bots
- Degrades AI training data quality
- **Requires `--http1.1` flag** to work with current HTTP/2 setup
**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
@ -279,13 +286,13 @@ spec:
- **Better observability**: Collect violation metrics before enforcing
- **Selective enforcement**: Move to enforce mode per-policy after validation
### Why Multi-Layer Anti-AI Defense? (Updated 2026-04-17)
### Why 5-Layer Anti-AI Defense?
- **Defense in depth**: Each layer catches different bot types
- **Compliant bots**: Layer 2 (X-Robots-Tag) handles respectful crawlers
- **Persistent bots**: Tarpit makes scraping uneconomical
- **Poison content**: Degrades training data for bots that reach poison-fountain
- Layer 3 (trap links via rewrite-body) was removed due to Traefik v3 plugin incompatibility
- **Dumb bots**: Layer 3 (trap links) catches simple scrapers
- **Persistent bots**: Layer 4 (tarpit) makes scraping uneconomical
- **Sophisticated bots**: Layer 5 (poison content) degrades training data
### Why Fail-Open Mode?
@ -375,16 +382,15 @@ spec:
2. Verify backend isn't returning transient errors: Check for 5xx responses
3. Disable retry for specific service: Remove retry middleware from `ingress_factory`
### Poison Content Not Serving (Updated 2026-04-17)
### Poison Content Not Injecting
**Problem**: Bots not receiving poisoned content on `poison.viktorbarzin.me`.
**Note**: Poison content is no longer injected into real pages (rewrite-body removed). It is only served directly via the `poison.viktorbarzin.me` subdomain.
**Problem**: Bots not receiving poisoned content.
**Fix**:
1. Verify CronJob running: `kubectl get cronjob -n poison-fountain`
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-fountain`
3. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison`
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-content-injector`
3. Ensure `--http1.1` flag set (required for HTTP/2 backends)
4. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison`
## Related

View file

@ -16,13 +16,13 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
Both `StorageClass: nfs-truenas` (name kept for compatibility) and `StorageClass: nfs-proxmox` (identical) point to the Proxmox host. Migrated from TrueNAS (10.0.10.15) which has been fully decommissioned.
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).
**History (2026-04-02)**: iSCSI block volumes migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver removed.
**Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver has been removed.
**History (2026-04-13)**: TrueNAS (VM 9000, 10.0.10.15) fully decommissioned. NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under `/mnt/main/` and `/mnt/ssd/` moved to ext4 LVs at `/srv/nfs/` and `/srv/nfs-ssd/`. Legacy PVs referencing `/mnt/main/` paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use `/srv/nfs/` and `/srv/nfs-ssd/`. TrueNAS VM still exists in stopped state on PVE pending user decision on deletion.
**Migration (2026-04)**: TrueNAS (10.0.10.15) fully decommissioned. All NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under `/mnt/main/` and `/mnt/ssd/` moved to ext4 LVs at `/srv/nfs/` and `/srv/nfs-ssd/`. Legacy PVs referencing `/mnt/main/` paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use `/srv/nfs/` and `/srv/nfs-ssd/`.
## Architecture Diagram
@ -39,7 +39,7 @@ graph TB
end
subgraph K8s["Kubernetes Cluster"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas / nfs-proxmox<br/>soft,timeo=30,retrans=3"]
CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]
NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
@ -77,10 +77,10 @@ graph TB
| Proxmox NFS (HDD) | LV `pve/nfs-data`, 2TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | NFS storage (name kept for compatibility, points to Proxmox) |
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage (identical to nfs-truenas) |
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
| ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
| ~~TrueNAS VM~~ | **DECOMMISSIONED** | Was VMID 9000 at 10.0.10.15 | Replaced by Proxmox NFS (2026-04) |
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
| ~~StorageClass `iscsi-truenas`~~ | **REMOVED** | Was cluster-wide | Replaced by `proxmox-lvm` |
@ -105,7 +105,7 @@ graph TB
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` or `nfs-proxmox` StorageClass via PVCs.
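An illustrative sketch of the right shape — a PVC bound to the `nfs-proxmox` class (names hypothetical; in this repo the `nfs_volume` module normally generates the PV/PVC pair):
```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data            # hypothetical
  namespace: example            # hypothetical
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs-proxmox # inherits soft,timeo=30,retrans=3 from the SC
  resources:
    requests:
      storage: 5Gi
EOF
```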
### Block Storage Flow (Proxmox CSI) — NEW
@ -129,9 +129,7 @@ graph TB
5. **Passphrase management**: ExternalSecret syncs passphrase from Vault KV (`secret/viktor/proxmox_csi_encryption_passphrase`) → K8s Secret. Backup key at `/root/.luks-backup-key` on PVE host.
**Services on encrypted storage (2026-04-15 migration):**
vaultwarden, dbaas (mysql+pg+pgadmin), mailserver, nextcloud, forgejo, matrix, n8n, affine, health, hackmd, redis, headscale, frigate, meshcentral, technitium, actualbudget, grampsweb, owntracks, wealthfolio, monitoring (alertmanager)
**Services migrated later** (post-audit catch-up): paperless-ngx (2026-04-25 — sensitive document scans had been left on plain `proxmox-lvm` by an abandoned earlier attempt; an rsync swap cleaned up the orphan and redid the move via Terraform). Vault raft cluster (2026-04-25 — all 3 voters migrated from `nfs-proxmox` to `proxmox-lvm-encrypted` after the 2026-04-22 raft-leader-deadlock post-mortem found NFS fsync semantics incompatible with the raft consensus log; rolled non-leader-first, force-finalizing the pvc-protection finalizer so pods would not be recreated on the old PVCs).
vaultwarden, dbaas (mysql+pg+pgadmin), mailserver, nextcloud, forgejo, matrix, n8n, affine, health, hackmd, redis, headscale, frigate, meshcentral, technitium, actualbudget, grampsweb, owntracks, paperless-ngx, wealthfolio, monitoring (alertmanager)
**CSI node plugin memory**: Requires 1280Mi limit for LUKS2 Argon2id key derivation (~1GiB). Set via `node.plugin.resources` in Helm values (not `node.resources`).
@ -166,7 +164,7 @@ SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantic
|------|---------|
| `/etc/exports` (on Proxmox host) | NFS export configuration for all service shares |
| `stacks/proxmox-csi/` | Terraform stack for Proxmox CSI plugin + StorageClass |
| `stacks/nfs-csi/` | NFS CSI driver + StorageClasses (`nfs-proxmox` + legacy `nfs-truenas`) |
| `stacks/nfs-csi/` | NFS CSI driver + StorageClasses (`nfs-truenas`, `nfs-proxmox`) |
| `modules/kubernetes/nfs_volume/` | Reusable module for static NFS PV/PVC creation |
| `config.tfvars` | Variable `nfs_server = "192.168.1.127"` shared by all stacks |
@ -175,10 +173,8 @@ SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantic
| Path | Contents |
|------|----------|
| `secret/viktor/proxmox_csi_encryption_passphrase` | LUKS2 encryption passphrase for `proxmox-lvm-encrypted` StorageClass |
| ~~`secret/viktor/truenas_ssh_key`~~ | **REMOVED** — was SSH key for democratic-csi SSH driver (TrueNAS decommissioned 2026-04-13) |
| ~~`secret/viktor/truenas_root_password`~~ | **REMOVED** — was TrueNAS root password (TrueNAS decommissioned 2026-04-13) |
| ~~`secret/viktor/truenas_api_key`~~ | **REMOVED** — was TrueNAS API key (TrueNAS decommissioned 2026-04-13) |
| ~~`secret/viktor/truenas_ssh_private_key`~~ | **REMOVED** — was TrueNAS SSH private key (TrueNAS decommissioned 2026-04-13) |
| ~~`secret/viktor/truenas_ssh_key`~~ | **LEGACY** — was SSH key for democratic-csi SSH driver (TrueNAS decommissioned) |
| ~~`secret/viktor/truenas_root_password`~~ | **LEGACY** — was TrueNAS root password (TrueNAS decommissioned) |
### Terraform Stacks

View file

@ -63,10 +63,10 @@ sequenceDiagram
Cloudflare-->>AdGuard: A record (Cloudflare IP)
AdGuard-->>Client: Response
Note over Client: Query: nextcloud.viktorbarzin.lan
Note over Client: Query: truenas.viktorbarzin.lan
Client->>AdGuard: DNS query
AdGuard->>Technitium: Forward (.lan domain)
Technitium-->>AdGuard: A record (10.0.20.200)
Technitium-->>AdGuard: A record (10.0.10.15)
AdGuard-->>Client: Response
Note over Client,Technitium: If Cloudflared tunnel is down:
@ -370,14 +370,14 @@ dns_config:
### Can't Resolve .lan Domains from VPN
**Symptoms**: `nslookup nextcloud.viktorbarzin.lan` returns `NXDOMAIN`.
**Symptoms**: `nslookup truenas.viktorbarzin.lan` returns `NXDOMAIN`.
**Diagnosis**: Check DNS chain: Client → AdGuard → Technitium.
**Steps**:
1. Verify AdGuard is running: `kubectl get pod -n adguard`
2. Check AdGuard conditional forwarding: Query AdGuard directly: `nslookup nextcloud.viktorbarzin.lan <adguard-ip>`
3. Check Technitium: `nslookup nextcloud.viktorbarzin.lan 10.0.20.101`
2. Check AdGuard conditional forwarding: Query AdGuard directly: `nslookup truenas.viktorbarzin.lan <adguard-ip>`
3. Check Technitium: `nslookup truenas.viktorbarzin.lan 10.0.20.101`
**Common causes**:
1. **AdGuard not forwarding .lan**: Conditional forwarding rule missing or misconfigured.

View file

@ -1,7 +1,5 @@
# Anti-AI Scraping System Design
> **Status (Updated 2026-04-17):** Partially superseded. Layer 3 (trap links via rewrite-body plugin) removed due to Traefik v3.6.12 Yaegi plugin incompatibility. The `strip-accept-encoding` and `anti-ai-trap-links` middlewares have been deleted. Rybbit analytics injection moved from Traefik rewrite-body to a Cloudflare Worker (`infra/stacks/rybbit/worker/`). Active layers: 1 (bot-block), 2 (headers), 4 (tarpit), 5 (poison content).
## Problem
AI scrapers crawl public web services to harvest training data. We want to:
@ -11,7 +9,7 @@ AI scrapers crawl public web services to harvest training data. We want to:
## Architecture
Four active defense layers applied to all public services via Traefik (Layer 3 removed April 2026):
Five defense layers applied to all public services via Traefik:
```
Internet -> Cloudflare -> Traefik
@ -20,7 +18,7 @@ Internet -> Cloudflare -> Traefik
|
+-- Layer 2: Headers -> X-Robots-Tag: noai, noimageai
|
+-- [REMOVED] Layer 3: Rewrite-body trap links (April 2026 — Yaegi bugs in Traefik v3.6.12)
+-- Layer 3: Rewrite-body -> inject hidden trap links into HTML
|
+-- Layer 4: Poison service -> serve cached Poison Fountain data
|
@ -70,10 +68,13 @@ All defined in `stacks/platform/modules/traefik/middleware.tf`:
- Sets `X-Robots-Tag: noai, noimageai` on all responses
- Added to all public services via ingress_factory
**`anti-ai-trap-links` (rewrite-body plugin)** — REMOVED (Updated 2026-04-17):
- Removed due to Traefik v3.6.12 Yaegi runtime bugs making the rewrite-body plugin unreliable
- The companion `strip-accept-encoding` middleware was also removed (only existed for rewrite-body)
- Trap link injection is no longer active; poison-fountain still serves tarpit content standalone
**`anti-ai-trap-links` (rewrite-body plugin)**:
- Regex: `</body>` -> injects hidden div with trap links + `</body>`
- Links point to `https://poison.viktorbarzin.me/article/<slug>`
- CSS: invisible to humans (`position:absolute;left:-9999px;height:0;overflow:hidden;aria-hidden=true`)
- Only processes `text/html` responses
- Requires strip-accept-encoding companion middleware (already exists)
- Applied globally via ingress_factory
### 4. Trap subdomain: poison.viktorbarzin.me
@ -87,7 +88,7 @@ All defined in `stacks/platform/modules/traefik/middleware.tf`:
New variables:
- `anti_ai_scraping` (bool, default: true) - enable all anti-AI layers
- When true, adds to middleware chain: `ai-bot-block`, `anti-ai-headers`
- When true, adds to middleware chain: `ai-bot-block`, `anti-ai-headers`, `strip-accept-encoding`, `anti-ai-trap-links`
- Services can opt out with `anti_ai_scraping = false`
## Human User Protection
@ -96,7 +97,7 @@ New variables:
|---------|-----------|
| Hidden links visible | CSS `position:absolute;left:-9999px;height:0;overflow:hidden` + `aria-hidden="true"` |
| False positive blocking | Only blocks specific AI bot User-Agent strings; no browser matches these |
| Performance overhead | ForwardAuth is a string match (<1ms). Rybbit injected via Cloudflare Worker (not Traefik). |
| Performance overhead | ForwardAuth is a string match (<1ms). Rewrite-body already proven with Rybbit. |
| Poison content leakage | Only served on poison.viktorbarzin.me, not linked from any navigation |
| Slow responses | Tarpit only applies to poison.viktorbarzin.me, not to real services |

View file

@ -1,142 +0,0 @@
# NFS-Hostile Workload Migration — Design
**Date**: 2026-04-25
**Author**: Viktor (with Claude)
**Status**: Phase 1 done, Phase 2 in progress
**Beads**: code-gy7h (Vault), code-ahr7 (Immich PG)
## Problem
The 2026-04-22 Vault Raft leader deadlock (post-mortem
`2026-04-22-vault-raft-leader-deadlock.md`) traced to NFS client
writeback stalls poisoning kernel state. Recovery took 2h43m and
required hard-resetting 3 of 4 cluster VMs. Two workload classes on
NFS are NFS-hostile per the criteria in
`infra/.claude/CLAUDE.md` ("Critical services MUST NOT use NFS"):
1. **Postgres with WAL fsync per commit** — Immich primary
2. **Vault Raft consensus log** — fsync per append-entry, 3 replicas
Everything else on NFS (47 PVCs, ~455 GiB) is correctly placed:
RWX media libraries, append-only backups, ML caches.
## Decision
Migrate exactly those two workload classes to
`proxmox-lvm-encrypted` (LUKS2 LVM-thin via Proxmox CSI). No iSCSI,
no RWX media migration, no backup-target migration.
## Rationale
- Block storage decouples PG / Raft fsync from NFS client kernel
state. Failure mode that triggered the post-mortem cannot recur for
these workloads.
- `proxmox-lvm-encrypted` is the documented default for sensitive data
(`infra/.claude/CLAUDE.md` storage decision rule). It already backs
~28 PVCs across the cluster — pattern is proven.
- Existing nightly `lvm-pvc-snapshot` PVE host script (03:00, 7-day
retention) auto-picks-up new PVCs via thin snapshots — no extra
backup wiring needed for the live data side.
- LUKS2 satisfies "encrypted at rest for sensitive data" requirement.
## Out of scope
- iSCSI evaluation (already retired 2026-04-13).
- RWX media (Immich library, music, ebooks) — correct placement.
- Backup target PVCs (`*-backup` on NFS) — append-only, NFS-tolerant.
- Prometheus 200 GiB — already on `proxmox-lvm`.
## Pattern per workload
### Immich PG (single replica, Deployment, Recreate strategy)
- Add new RWO PVC on `proxmox-lvm-encrypted`.
- Quiesce app pods (server + ML + frame).
- `pg_dumpall` from running NFS pod → local file.
- Swap deployment `claim_name` → encrypted PVC.
- PG bootstraps fresh on empty PVC; restore dump.
- REINDEX vector indexes (`clip_index`, `face_index`).
- Backup CronJob keeps writing to NFS module (correct: append-only).
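Expressed as commands, the steps above look roughly like this (hedged sketch — namespace, deployment, and database names are assumptions for illustration; the authoritative record is the Phase 1 table in the companion plan):
```bash
# Quiesce everything that writes to Postgres.
kubectl -n immich scale deploy immich-server immich-machine-learning immich-frame --replicas=0

# Dump from the still-running NFS-backed PG pod.
kubectl -n immich exec deploy/immich-postgresql -- pg_dumpall -U postgres \
  > /tmp/immich-pre-migration.sql

# (Swap claim_name to the encrypted PVC in stacks/immich/main.tf and apply here.)

# Restore into the freshly initdb'd encrypted PG, then rebuild the vector indexes.
kubectl -n immich exec -i deploy/immich-postgresql -- psql -U postgres \
  < /tmp/immich-pre-migration.sql
kubectl -n immich exec deploy/immich-postgresql -- psql -U postgres -d immich \
  -c 'REINDEX INDEX clip_index; REINDEX INDEX face_index;'

# Scale the apps back up once \dx and row counts match the pre-migration snapshot.
kubectl -n immich scale deploy immich-server immich-machine-learning immich-frame --replicas=1
```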
### Vault Raft (3 replicas, StatefulSet, helm-managed)
- Change `dataStorage.storageClass` and `auditStorage.storageClass`
from `nfs-proxmox``proxmox-lvm-encrypted`.
- StatefulSet `volumeClaimTemplates` is immutable → use
`kubectl delete sts vault --cascade=orphan` then re-apply (memory
pattern for VCT swaps).
- Per-pod rolling: delete pod + PVCs, controller recreates with new
template. Auto-unseal sidecar handles unseal; raft `retry_join`
rejoins cluster.
- 24h validation window between pods. Migrate non-leader pods first;
step-down current leader before migrating it last.
- Backup target (`vault-backup-host` on NFS) stays on NFS.
## Risks and rollbacks
### Immich PG
- pg_dumpall captures schema + data, not file-level state. Vector
index versions matter (vchord 0.3.0 unchanged; vector 0.8.0 →
0.8.1 is a minor automatic bump on `CREATE EXTENSION` — confirmed
benign). Rollback: revert `claim_name`, scale apps; old NFS PVC
retained for 7 days post-migration.
### Vault Raft
- Cluster keeps quorum from 2 standby replicas while one pod is
swapped. Migrating the leader last avoids quorum churn.
- Recovery anchor: pre-migration `vault operator raft snapshot save`
+ nightly `vault-raft-backup` CronJob. RTO < 1h via snapshot
restore.
## Helm `securityContext.pod` replace-not-merge (Vault, discovered during execution)
The Vault helm chart sets pod-level securityContext defaults
(`fsGroup=1000, runAsGroup=1000, runAsUser=100, runAsNonRoot=true`)
from chart templates, not from values.yaml. When `main.tf` provided
its own `server.statefulSet.securityContext.pod = {fsGroupChangePolicy
= "OnRootMismatch"}` the helm rendering REPLACED the chart defaults
rather than merging into them. On NFS this was harmless (`async,
insecure` exports made the volume world-writable enough for any UID),
but on a fresh ext4 LV via Proxmox CSI the volume root is `root:root`
and the vault user (UID 100) cannot open `/vault/data/vault.db`.
vault-1 and vault-2 happened to be Running with the correct
securityContext because their pod specs were written into etcd
**before** the customization landed; helm chart upgrades don't
restart pods, so the broken values lay dormant until vault-0 was
recreated by the orphan-deleted STS during this migration.
Resolution: provide all five fields (`fsGroup`, `fsGroupChangePolicy`,
`runAsGroup`, `runAsUser`, `runAsNonRoot`) explicitly in main.tf so
`runAsGroup=1000` etc. survive future chart bumps. Idempotent on
both fresh PVCs and existing pods.
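A quick read-back confirms the rendered StatefulSet now carries all five fields (and therefore any future pod recreation inherits them):
```bash
kubectl -n vault get statefulset vault \
  -o jsonpath='{.spec.template.spec.securityContext}{"\n"}'
# expect something like:
# {"fsGroup":1000,"fsGroupChangePolicy":"OnRootMismatch","runAsGroup":1000,"runAsNonRoot":true,"runAsUser":100}
```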
## Init container chicken-and-egg (Immich PG, discovered during execution)
The pre-existing `write-pg-override-conf` init container on the
Immich PG deployment writes `postgresql.override.conf` directly to
`PGDATA`. On a populated NFS PVC this was a no-op (init was already
run). On the fresh encrypted PVC, the file made `initdb` refuse the
non-empty directory and the pod CrashLoopBackOff'd.
Resolution: gate the init container on `PG_VERSION` presence — first
boot skips the override write, PG `initdb`s cleanly; force a pod
restart and the second boot writes the override and PG loads
`vchord` / `vectors` / `pg_prewarm` before the dump restore. Change
is permanent and idempotent (correct on both fresh and initialised
PVCs). One restart pre-migration only.
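A sketch of the gate (illustrative — the source path of the override file is an assumption; the real init container is defined in `infra/stacks/immich/main.tf`):
```bash
# write-pg-override-conf init container, gated on an initialised PGDATA.
if [ -f "${PGDATA}/PG_VERSION" ]; then
  # Already initialised: safe to drop the override next to the data files.
  cp /config/postgresql.override.conf "${PGDATA}/postgresql.override.conf"
else
  # Fresh PVC: leave PGDATA untouched so initdb can run; the override is
  # written on the next (forced) restart.
  echo "PGDATA not initialised yet; skipping postgresql.override.conf"
fi
```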
## Verification
End-to-end DONE when:
- `kubectl get pvc -A | grep nfs-proxmox` returns only the
`vault-backup-host` PVC (or zero, if backup PVC moves elsewhere).
- `vault operator raft list-peers` shows 3 voters on
`proxmox-lvm-encrypted`, leader elected.
- Immich PG `\dx` matches pre-migration extensions (vector minor
drift OK).
- `lvm-pvc-snapshot` captures new LVs in next 03:00 run.
- 7 consecutive days of clean backup CronJob runs and no new alerts.

View file

@ -1,169 +0,0 @@
# NFS-Hostile Workload Migration — Plan
**Date**: 2026-04-25
**Design**: `2026-04-25-nfs-hostile-migration-design.md`
**Beads**: code-gy7h (Vault, epic), code-ahr7 (Immich PG)
## Phase 1 — Immich PG (DONE 2026-04-25)
| Step | Done |
|---|---|
| Snapshot extensions + row counts to `/tmp/immich-pre-migration-*` | ✓ |
| Quiesce `immich-server` + `immich-machine-learning` + `immich-frame` | ✓ |
| `pg_dumpall``/tmp/immich-pre-migration-<ts>.sql` (1.9 GB) | ✓ |
| Add `kubernetes_persistent_volume_claim.immich_postgresql_encrypted` (10Gi, autoresize 20Gi cap) | ✓ |
| Swap `claim_name` at `infra/stacks/immich/main.tf` deployment | ✓ |
| Patch init container to gate on `PG_VERSION` (chicken-and-egg fix) | ✓ |
| Force pod restart so override.conf gets written | ✓ |
| Restore dump | ✓ |
| `REINDEX clip_index`, `REINDEX face_index` | ✓ |
| Scale apps back up | ✓ |
| Verify: `\dx`, row counts (~111k assets), HTTP 200 internal/external | ✓ |
| LV present on PVE host (`vm-9999-pvc-...`) | ✓ |
### Phase 1 follow-ups (not blocking)
- Old NFS PVC `immich-postgresql-data-host` retained 7 days for
rollback. After 2026-05-02: remove `module.nfs_postgresql_host`
from `infra/stacks/immich/main.tf` and the CronJob's reference.
- Backup CronJob (`postgresql-backup`) still writes to the NFS
module. After cleanup, point it at a dedicated backup PVC or to
the existing `immich-backups` NFS share.
## Phase 2 — Vault Raft (DONE 2026-04-25)
**Phase 2 complete 2026-04-25; all 3 voters on `proxmox-lvm-encrypted`.**
### Pre-flight (T-0) — DONE 2026-04-25 15:50 UTC
- [x] Verify all 3 vault pods sealed=false, raft healthy.
- [x] Take fresh `vault operator raft snapshot save` (anchor saved at
`/tmp/vault-pre-migration-20260425-155029.snap`, 1.5 MB).
- [ ] Optional: scale ESO to 0 — skipped (auto-unseal sidecar is
independent; ESO refresh churn is non-disruptive for one swap).
- [x] Confirmed leader is **vault-2** → migrate vault-0 first
(non-leader), vault-1 next, vault-2 last (with step-down).
Plan originally assumed vault-0 was leader; same intent
(non-leader first).
- [x] Thin pool headroom: 54.63% used, plenty for 6 × 2 GiB LVs.
### Step 0 — Helm values + StatefulSet swap — DONE 2026-04-25 16:08 UTC
- [x] Edit `infra/stacks/vault/main.tf`: change
`dataStorage.storageClass` and `auditStorage.storageClass`
from `nfs-proxmox` → `proxmox-lvm-encrypted`.
- [x] `kubectl -n vault delete sts vault --cascade=orphan` (StatefulSet
`volumeClaimTemplates` is immutable; orphan keeps pods+PVCs
alive while we recreate the controller with the new template).
- [x] `tg apply -target=helm_release.vault` → recreates STS with new
VCT (full-stack `tg plan` blocks on unrelated for_each-with-
apply-time-keys errors at lines 848/865/909/917; targeted
apply on the helm release alone is the right scope here).
Existing pods still on old NFS PVCs.
### Step 1 — Roll vault-0 first (non-leader) — DONE 2026-04-25 16:18 UTC
- [x] `kubectl -n vault delete pod vault-0 --grace-period=30`
- [x] `kubectl -n vault delete pvc data-vault-0 audit-vault-0`
- [x] STS controller recreated pod; new PVCs auto-provisioned on
`proxmox-lvm-encrypted` (LVs `vm-9999-pvc-fb732fd7-...` data
4.12%, `vm-9999-pvc-36451f42-...` audit 3.99%).
- [x] **Hit and fixed**: vault-0 CrashLoopBackOff'd with
`permission denied` on `/vault/data/vault.db`. The helm chart's
`statefulSet.securityContext.pod` block in main.tf only set
`fsGroupChangePolicy`, replacing (not merging) the chart's
defaults `fsGroup=1000, runAsGroup=1000, runAsUser=100,
runAsNonRoot=true`. NFS exports made the missing fsGroup a
no-op; ext4 LV needs it to chown the volume root for the
vault user. Old vault-1/vault-2 pods were created before that
block was added so they still had the chart-default
securityContext from their original spec. Fix: provide all
five fields explicitly in main.tf and re-apply. Same root
cause will affect vault-1 and vault-2 swaps unless this stays
in place.
- [x] Wait Ready; auto-unseal sidecar unsealed; `retry_join` rejoined
raft cluster.
- [x] Verify: `vault operator raft list-peers` shows 3 voters,
vault-0 follower, leader=vault-2. External HTTPS 200.
### Step 2 — 24h soak (SKIPPED per user direction 2026-04-25)
User instructed "continue with all the remaining actions" — soak
gates compressed to per-pod settle windows + raft-state verification
between rollings. No Raft alarms, no Vault errors observed at each
verification gate.
### Step 3 — Roll vault-1 — DONE 2026-04-25
- [x] Force-finalize PVCs to break re-mount race:
`kubectl -n vault patch pvc data-vault-1 audit-vault-1 -p '{"metadata":{"finalizers":null}}' --type=merge`.
(Initial pod-then-PVC delete recreated pod on the OLD NFS PVCs
because pvc-protection finalizer hadn't cleared. Lesson learned
and applied to vault-2 below.)
- [x] Pod recreated on encrypted PVCs; auto-unsealed; rejoined raft.
### Step 4 — Settle window — DONE 2026-04-25
3-check verification over 90s; raft index advancing (2730010→2730012),
all 3 voters healthy.
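A minimal settle-check sketch (assumes an authenticated `vault` CLI inside the pod; exact fields in the `autopilot state` output may differ):

```bash
# Settle-window sketch: three checks, 30s apart; the raft applied index
# should advance and all three peers should stay voters.
for i in 1 2 3; do
  kubectl -n vault exec vault-0 -- vault operator raft list-peers
  kubectl -n vault exec vault-0 -- vault operator raft autopilot state | grep -iE 'healthy|index'
  sleep 30
done
```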
### Step 5 — Roll vault-2 (leader) — DONE 2026-04-25
- [x] `vault operator step-down` on vault-2; vault-0 took leadership.
Confirmed vault-0 active, vault-1+vault-2 standby before delete.
- [x] Snapshot anchor at `/tmp/vault-pre-vault2.snap` (1.5 MB) from new
leader vault-0.
- [x] Force-finalize + delete PVCs + delete pod (lesson from vault-1).
- [x] Pod recreated on encrypted PVCs; auto-unsealed; rejoined raft.
- [x] `vault operator raft list-peers` shows 3 voters all healthy on
encrypted storage; leader vault-0.
### Step 6 — Cleanup — DONE 2026-04-25
- [x] `kubectl get pvc -A` cross-cluster shows zero PVCs on
`nfs-proxmox` SC (only Released PVs remain → Phase 3).
- [x] Removed inline `kubernetes_storage_class.nfs_proxmox` from
`infra/stacks/vault/main.tf` (was lines 29–42).
- [x] All 3 PVC pairs on `proxmox-lvm-encrypted`.
- [x] `vault operator raft autopilot state` healthy=true.
- [x] External `https://vault.viktorbarzin.me/v1/sys/health` = 200.
## Phase 3 — Released-PV cleanup (FOLLOW-UP)
### Step 3.1 — vault Released PVs — DONE 2026-04-25
6 vault NFS PVs (Released, `nfs-proxmox` SC, Retain policy) deleted
along with their NFS subdirectories on PVE host (~1.5 GB reclaimed):
| PV | Claim | Size on disk |
|---|---|---|
| pvc-004a5d3b-… | data-vault-2 | 45M |
| pvc-808a78ec-… | audit-vault-1 | 1.4M |
| pvc-918ee7c1-… | audit-vault-0 | 3.2M |
| pvc-9d2ddcb4-… | data-vault-0 | 46M |
| pvc-a659711d-… | data-vault-1 | 46M |
| pvc-d2e65109-… | audit-vault-2 | 1.4G |
Procedure: `kubectl delete pv <name>` (cluster object only — Retain
policy means CSI never touches NFS) then `rm -rf /srv/nfs/<dir>` on
192.168.1.127.
### Step 3.2 — Cluster-wide Released PV sweep (DEFERRED)
~50 other Released PVs persist across the cluster (~200 GiB on
`proxmox-lvm` and `proxmox-lvm-encrypted`). Out of scope for the
2026-04-25 NFS-hostile session per user direction. To reclaim:
1. List Released PVs, confirm LV exists on PVE.
2. `kubectl delete pv <name>` (CSI removes underlying LV when PV is
orphaned with `Retain` reclaim policy and no PVC reference).
3. If LV survives: manual `lvremove pve/vm-9999-pvc-<uuid>`.
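A sweep-loop sketch for when this is picked up (storage-class and LV naming from this doc; deletes stay manual and per-PV):

```bash
# Sweep sketch: list Released PVs, show their SC and volume handle, and only
# then (manually) delete the PV and, if needed, the leftover LV on PVE.
for pv in $(kubectl get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}'); do
  kubectl get pv "$pv" -o jsonpath='{.metadata.name}{"\t"}{.spec.storageClassName}{"\t"}{.spec.csi.volumeHandle}{"\n"}'
  # after confirming the LV on the PVE host:
  #   kubectl delete pv "$pv"
  #   lvremove -y pve/vm-9999-pvc-<uuid>   # only if the LV survives
done
```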
## Rollback
| Phase | Trigger | Action |
|---|---|---|
| 1 | Immich UI broken / data loss | Revert `claim_name`; restore from `/tmp/immich-pre-migration-*.sql` to old NFS PVC |
| 2 (mid-rolling) | Single pod broken | Delete the encrypted PVC; recreate with NFS SC explicitly; cluster keeps quorum from 2 healthy pods |
| 2 (post-rolling, raft corrupt) | Cluster-wide failure | `vault operator raft snapshot restore <pre-migration.snap>` |
| Catastrophic | All Vault data lost | Restore from latest `/srv/nfs/vault-backup/` snapshot via CronJob output |


@ -1,195 +0,0 @@
# Forgejo Registry Consolidation — Design
**Date**: 2026-05-07
**Status**: Approved
## Problem
`registry-private` (the `registry:2` container on the docker-registry
VM at `10.0.20.10`) has hit `distribution#3324` corruption three
times in three weeks (2026-04-13, 2026-04-19, 2026-05-04). Each
incident required manual blob recovery and another round of
hardening to `cleanup-tags.sh` and the GC procedure. The integrity
probe catches it within 15 minutes now, but every hit still costs
~1h of cleanup, and we keep tightening the same loose screw.
Root cause is a known race in `distribution`: tag deletes that race
with concurrent garbage collection produce orphan OCI-index children.
Upstream has not patched it; our mitigations (probe, blob
fix-up script, idempotent cleanup) reduce blast radius but don't
remove the failure mode.
Forgejo (deployed for OAuth and personal repos at
`forgejo.viktorbarzin.me`) ships a built-in OCI registry as part of
the Packages feature, default-on in v11. Using it removes
`distribution`-the-engine from the path entirely, replaces it with
Forgejo's own implementation backed by Forgejo's DB+blob store, and
gets us source hosting + image hosting in one resource.
The PVE host RAM upgrade from 142GB to 272GB (memory id=569) means
the cluster can absorb the resource bump Forgejo needs for the
registry workload (384Mi → 1Gi).
## Decision
Move every image currently on `registry.viktorbarzin.me:5050` to
Forgejo's OCI registry at `forgejo.viktorbarzin.me`. Decommission
`registry-private` after a 14-day dual-push bake.
Pull-through caches for upstream registries (DockerHub, GHCR, Quay,
k8s.gcr, Kyverno) stay on the registry VM permanently — Forgejo
won't serve as a pull-through, so the chicken-and-egg of "Forgejo
pulling its own image through itself" never arises.
## Design
### Registry hostname
Image references become `forgejo.viktorbarzin.me/viktor/<image>:<tag>`.
The `viktor/` prefix is the Forgejo owner namespace; all current
private images ship under that single owner.
### Auth
Two service-account users:
| User | Scope | Vault key | Used by |
|---|---|---|---|
| `cluster-puller` | `read:package` | `secret/viktor/forgejo_pull_token` | cluster-wide `registry-credentials` Secret, monitoring probe |
| `ci-pusher` | `write:package` | `secret/ci/global/forgejo_push_token` | Woodpecker pipelines (synced via `vault-woodpecker-sync` CronJob) |
A third PAT (`secret/viktor/forgejo_cleanup_token`, also belongs to
`ci-pusher`) drives the retention CronJob — kept separate from the
push PAT so a leaked CI token doesn't immediately enable mass deletes.
PATs have no expiry. Rotation policy: regenerate via Forgejo Web UI
and `vault kv patch` if a leak is suspected; ESO/sync downstream is
automatic.
### Cluster pull path
`registry-credentials` is a single Secret in `kyverno` ns, cloned
into every namespace by the existing
`sync-registry-credentials` ClusterPolicy. We extend its
`dockerconfigjson` `auths` map with a fourth entry for
`forgejo.viktorbarzin.me`. **No new Secret, no new ClusterPolicy,
no `imagePullSecrets =` line edits across stacks.**
Containerd `hosts.toml` redirects `forgejo.viktorbarzin.me` → in-cluster
Traefik LB at `10.0.20.200`, the same pattern used for
`registry.viktorbarzin.me` → `10.0.20.10:5050`. Avoids hairpin NAT
through the WAN gateway for in-cluster pulls.
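A sketch of the per-node entry (host and LB IP from this design; capabilities should mirror the existing `registry.viktorbarzin.me` entry):

```bash
# Per-node mirror entry sketch (run on each k8s node).
mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me
cat > /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml <<'EOF'
server = "https://forgejo.viktorbarzin.me"

[host."https://10.0.20.200"]
  capabilities = ["pull", "resolve"]
EOF
```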
### Push path
Woodpecker pipelines push to BOTH targets during the bake:
```yaml
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
repo:
- registry.viktorbarzin.me/<name>
- forgejo.viktorbarzin.me/viktor/<name>
logins:
- registry: registry.viktorbarzin.me
username:
from_secret: registry_user
password:
from_secret: registry_password
- registry: forgejo.viktorbarzin.me
username:
from_secret: forgejo_user
password:
from_secret: forgejo_push_token
```
The `vault-woodpecker-sync` CronJob (every 6h) propagates
`secret/ci/global` keys to every Woodpecker repo as global secrets.
### Retention
Forgejo's per-package "Cleanup Rules" UI is per-user runtime DB
state, not Terraform-driven. Retention runs as a CronJob in the
`forgejo` namespace, schedule `0 4 * * *`, that:
1. Lists all container packages under the `viktor` owner.
2. Groups by package name.
3. Keeps newest 10 versions + always keeps `latest`.
4. DELETEs the rest via `/api/v1/packages/{owner}/{type}/{name}/{version}`.
First 7 days run with `DRY_RUN=true` — script logs what it would
delete but issues no DELETE calls. After log review, flip the
`forgejo_cleanup_dry_run` local in `cleanup.tf` to false.
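A sketch of the cleanup loop (API paths per this design; response field names and the `CLEANUP_TOKEN` env var are assumptions):

```bash
# Retention sketch: keep newest 10 versions per package plus `latest`,
# DRY_RUN by default. Field names assumed from the Forgejo packages API.
OWNER=viktor
API=https://forgejo.viktorbarzin.me/api/v1
KEEP=10
DRY_RUN=${DRY_RUN:-true}

curl -fsS -H "Authorization: token $CLEANUP_TOKEN" \
    "$API/packages/$OWNER?type=container&limit=1000" \
  | jq -r '.[] | [.name, .version, .created_at] | @tsv' \
  | sort -k1,1 -k3,3r \
  | awk -v keep="$KEEP" '$2=="latest" {next} {if (++n[$1] > keep) print $1, $2}' \
  | while read -r name version; do
      if [ "$DRY_RUN" = "true" ]; then
        echo "would delete $name:$version"
      else
        curl -fsS -X DELETE -H "Authorization: token $CLEANUP_TOKEN" \
          "$API/packages/$OWNER/container/$name/$version"
      fi
    done
```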
### Integrity monitoring
Mirror the existing `registry-integrity-probe` CronJob: walk
`/v2/_catalog`, walk every tag, HEAD every manifest + index child,
push `registry_manifest_integrity_*` metrics. Existing
Prometheus alerts fire on the `instance` label, so they cover both
probes automatically once the alert annotations are made
instance-aware (done in this change).
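A sketch of the probe loop (registry v2 API; the metric name and Pushgateway push are from this design, the `PULL_TOKEN`/`PUSHGATEWAY` env vars are assumptions):

```bash
# Probe sketch: walk the catalog, HEAD every tag's manifest and every
# index child, then push the failure count to Pushgateway.
REG=https://forgejo.viktorbarzin.me
AUTH="-u cluster-puller:$PULL_TOKEN"
fail=0
for repo in $(curl -fsS $AUTH "$REG/v2/_catalog" | jq -r '.repositories[]'); do
  for tag in $(curl -fsS $AUTH "$REG/v2/$repo/tags/list" | jq -r '.tags[]?'); do
    idx=$(curl -fsS $AUTH -H 'Accept: application/vnd.oci.image.index.v1+json' \
      "$REG/v2/$repo/manifests/$tag") || { fail=$((fail+1)); continue; }
    for child in $(echo "$idx" | jq -r '.manifests[]?.digest'); do
      curl -fsS -o /dev/null -I $AUTH "$REG/v2/$repo/manifests/$child" \
        || fail=$((fail+1))
    done
  done
done
echo "registry_manifest_integrity_failures $fail" \
  | curl --data-binary @- "$PUSHGATEWAY/metrics/job/registry-integrity-probe"
```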
### Source migration
Projects currently living as plain dirs in the local-only monorepo
become standalone Forgejo repos. Two GitHub-hosted private repos
(`beadboard`, `claude-memory-mcp`) move to Forgejo and are archived
on GitHub.
CI standardises on Woodpecker for everything in scope. The two
projects that used GHA (build + Woodpecker-deploy via GHA-hosted
DockerHub push) keep DockerHub for legacy compatibility but their
canonical image source becomes Forgejo.
### Break-glass for infra-ci
`infra-ci` is the Docker image used by all infra Woodpecker
pipelines, including `default.yml` (terragrunt apply). If Forgejo is
unreachable at the moment we need to apply, `infra-ci` is
unreachable, and we can't apply our way out.
Mitigation: dual-push step also `docker save | gzip` the built
infra-ci image to:
- `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` on
the registry VM disk (Copy 1)
- `/srv/nfs/forgejo-breakglass/` on the NAS (Copy 2)
A `latest` symlink in each location points at the most recent.
Recovery procedure (`docs/runbooks/forgejo-registry-breakglass.md`):
scp tarball → `docker load` / `ctr -n k8s.io images import` → fix
Forgejo via that node.
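A recovery sketch (paths from this design; tarball name left as the placeholder the dual-push step produces):

```bash
# Break-glass sketch: copy the newest tarball off the registry VM, then
# import on whichever node needs it.
scp root@10.0.20.10:/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz /tmp/
gunzip /tmp/infra-ci-<sha>.tar.gz
# on a k8s node (containerd):
ctr -n k8s.io images import /tmp/infra-ci-<sha>.tar
# or, where docker is available:
docker load -i /tmp/infra-ci-<sha>.tar
```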
### Cutover style
**Dual-push bake**: pipelines push to both registries for ≥14 days.
Pods continue pulling from `registry.viktorbarzin.me`. After bake:
1. Per-project PR: flip `image=` lines in Terraform stacks. Pod
re-pull naturally on next rollout.
2. Phase 4: stop `registry-private` container, remove its
`auths` entry from the cluster Secret, drop containerd hosts.toml
entry.
## Why not alternatives
| Option | Rejected because |
|---|---|
| Stay on `registry-private` | Three corruption incidents in three weeks; mitigation cost rising |
| Run a fresh registry container alongside (no Forgejo) | Same upstream, same `distribution#3324` failure mode |
| GHCR / DockerHub for all private images | Public-by-default model + push rate limits; loses owner-owned blob storage |
| Harbor | Heavier than Forgejo registry, would need its own DB + ingress, no source-hosting integration |
## Risks
See plan doc § "Risk register" for the full table. Top three:
1. **Forgejo registry hits the same corruption pattern.** Mitigated
by 14-day bake + integrity probe within 15 min.
2. **Forgejo down → infra-ci unreachable → can't apply.** Mitigated
by tarball break-glass on VM + NAS.
3. **Pod re-pulls fail after `image=` flip due to containerd cache
poisoning.** Mitigated by hosts.toml deployment + per-project
`kubectl rollout restart` in Phase 3.


@ -1,152 +0,0 @@
# Forgejo Registry Consolidation — Plan
**Date**: 2026-05-07
**Status**: Approved — execution in progress (Phase 0)
**Design**: `2026-05-07-forgejo-registry-consolidation-design.md`
This is the implementation roadmap for migrating off `registry-private`
onto Forgejo's OCI registry. See the design doc for problem
statement and rationale. Execution spans 5 phases over ≥3 weeks.
## Phase 0 — Prepare Forgejo (1 PR, no cutover risk)
| Task | File / artifact |
|---|---|
| Bump Forgejo memory request+limit 384Mi → 1Gi | `infra/stacks/forgejo/main.tf` |
| Add `FORGEJO__packages__ENABLED=true` and `FORGEJO__packages__CHUNKED_UPLOAD_PATH=/data/tmp/package-upload` env vars (defensive — already default in v11) | `infra/stacks/forgejo/main.tf` |
| Bump Forgejo PVC 5Gi → 15Gi, auto-resize cap 20Gi → 50Gi | `infra/stacks/forgejo/main.tf` |
| Bump ingress `max_body_size = "5g"` (wired into ingress_factory as a Buffering middleware) | `infra/stacks/forgejo/main.tf`, `infra/modules/kubernetes/ingress_factory/main.tf` |
| Create `cluster-puller` (read:package), `ci-pusher` (write:package), and a third `cleanup` PAT on `ci-pusher`; store PATs in Vault | runbook: `docs/runbooks/forgejo-registry-setup.md` |
| Extend `registry-credentials` Secret with 4th `auths` entry for `forgejo.viktorbarzin.me` | `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` |
| Add containerd `hosts.toml` entry redirecting `forgejo.viktorbarzin.me` → in-cluster Traefik LB `10.0.20.200` | `infra/stacks/infra/main.tf` cloud-init + new `infra/scripts/setup-forgejo-containerd-mirror.sh` for existing nodes |
| Forgejo retention CronJob (`0 4 * * *`, dry-run for first 7 days) | new `infra/stacks/forgejo/cleanup.tf` + `infra/stacks/forgejo/files/cleanup.sh` |
| Forgejo integrity probe CronJob (`*/15 * * * *`) | `infra/stacks/monitoring/modules/monitoring/main.tf` |
| Make existing alerts instance-aware so they cover both registries | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
**Smoke test (must pass before declaring Phase 0 done):**
- `docker login forgejo.viktorbarzin.me` succeeds.
- Push a hello-world image to `forgejo.viktorbarzin.me/viktor/smoketest:1` succeeds.
- `crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1` from a k8s
node succeeds, using the auto-synced `registry-credentials` Secret.
- A fresh namespace gets the cloned Secret with 4 `auths` entries.
- Delete the smoketest package via API.
- Forgejo integrity probe completes once and pushes metrics.
## Phase 1 — Source migration (parallel-safe, no production impact)
For each project the recipe is identical:
1. `git init` + push to `forgejo.viktorbarzin.me/viktor/<name>` →
   register in Woodpecker via OAuth.
2. Add `.woodpecker.yml` based on `payslip-ingest/.woodpecker.yml`.
Push step uses `woodpeckerci/plugin-docker-buildx` with TWO
`repo:` entries (dual-push).
3. Confirm first build pushes to BOTH registries.
Projects (bake clock starts at "all dual-push"):
| Project | Action |
|---|---|
| `claude-agent-service` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
| `fire-planner` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
| `wealthfolio-sync` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
| `hmrc-sync` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
| `freedify` | Push from monorepo to Forgejo. New `.woodpecker.yml`. (Upstream is gone.) |
| `payslip-ingest` | Already on Forgejo. Add second `repo:` entry to `.woodpecker.yml`. |
| `job-hunter` | Already on Forgejo. Add second `repo:` entry. |
| `beadboard` | Push to Forgejo. New `.woodpecker.yml`. Disable GHA workflow. **Don't archive GitHub yet** (deferred to Phase 3). |
| `claude-memory-mcp` | Push to Forgejo. New `.woodpecker.yml`. |
| `infra-ci` | Edit `.woodpecker/build-ci-image.yml` to dual-push. ALSO `docker save | gzip` to `/opt/registry/data/private/_breakglass/` on VM AND `/srv/nfs/forgejo-breakglass/` on NAS. Pin a `latest` symlink. |
Break-glass runbook (`docs/runbooks/forgejo-registry-breakglass.md`)
documents the recovery path.
## Phase 2 — Bake (≥14 days)
- No `image=` lines change. Pods still pull from
`registry.viktorbarzin.me`.
- **Daily smoke check**: pull a recent image from Forgejo as
`cluster-puller`, verify integrity (HEAD on manifest + each blob).
- **Bake exit criteria**:
- Zero `RegistryManifestIntegrityFailure` alerts on Forgejo.
- Zero `ContainerNearOOM` for the forgejo pod.
- Retention CronJob has run ≥14 times successfully.
- At least one full Sunday GC cycle has elapsed.
- Switch retention CronJob to `DRY_RUN=false` on day 7, observe
until day 14.
## Phase 3 — Cutover (one PR per project, single session)
Order = lowest blast radius first. Each step:
`image=` flip → `kubectl rollout restart` → verify pull from Forgejo.
1. `payslip-ingest` (`infra/stacks/payslip-ingest/main.tf`)
2. `job-hunter` (`infra/stacks/job-hunter/main.tf`)
3. `claude-agent-service` (`infra/stacks/claude-agent-service/main.tf`)
4. `fire-planner` (`infra/stacks/fire-planner/main.tf`)
5. `wealthfolio-sync` (`infra/stacks/wealthfolio/main.tf`)
6. `freedify` (`infra/stacks/freedify/factory/main.tf`)
7. `chrome-service` (`infra/stacks/chrome-service/main.tf`)
8. `beads-server` / `beadboard` (`infra/stacks/beads-server/main.tf`).
Then `gh repo archive ViktorBarzin/beadboard`.
9. `infra-ci` — flip `image:` references in 4 `.woodpecker/*.yml`
files in the infra repo. Verify next push to master applies cleanly.
10. `claude-memory-mcp` — update `CLAUDE.md` install instruction from
`claude plugins install github:ViktorBarzin/claude-memory-mcp` to
`claude plugins install https://forgejo.viktorbarzin.me/viktor/claude-memory-mcp.git`.
`gh repo archive ViktorBarzin/claude-memory-mcp`.
## Phase 4 — Decommission
| Step | File / location |
|---|---|
| Stop `registry-private` container on VM (10.0.20.10): edit `/opt/registry/docker-compose.yml`, comment out service, `docker compose up -d --remove-orphans`. (Manual SSH — cloud-init won't redeploy on TF apply per memory id=1078.) | live VM |
| Update cloud-init template to match the new compose file | `infra/stacks/infra/main.tf:288` |
| Delete `auths` entries for `registry.viktorbarzin.me` / `:5050` / `10.0.20.10:5050` from the dockerconfigjson | `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` |
| Drop `registry.viktorbarzin.me` and `10.0.20.10:5050` `hosts.toml` entries on each node + cloud-init template | `infra/stacks/infra/main.tf` cloud-init + ad-hoc script |
| After 1 week of no incidents, delete `/opt/registry/data/private/` blob storage on the VM (~2.6GB freed) | manual SSH |
## Phase 5 — Docs
In the same commit as the Phase 4 closing:
| Doc | Update |
|---|---|
| `docs/runbooks/registry-vm.md` | Note `registry-private` is gone; pull-through caches and break-glass tarballs only |
| `docs/runbooks/registry-rebuild-image.md` | Replaced by NEW `forgejo-registry-rebuild-image.md` |
| `docs/runbooks/forgejo-registry-rebuild-image.md` (NEW) | Forgejo PVC restore procedure |
| `docs/runbooks/forgejo-registry-breakglass.md` (NEW) | infra-ci tarball recovery |
| `docs/architecture/ci-cd.md` | Image registry section flips to Forgejo |
| `docs/architecture/monitoring.md` | Integrity probe target updated |
| `infra/.claude/CLAUDE.md` | Registry references updated |
| `CLAUDE.md` (monorepo root) | claude-memory-mcp install URL updated |
| `infra/.claude/reference/service-catalog.md` | Cross-reference checked |
## Critical files modified
| File | Phase | What |
|---|---|---|
| `infra/stacks/forgejo/main.tf` | 0 | Memory bump, packages env vars, PVC bump, ingress max_body_size |
| `infra/stacks/forgejo/cleanup.tf` (NEW) | 0 | Retention CronJob |
| `infra/stacks/forgejo/files/cleanup.sh` (NEW) | 0 | Retention script (mounted via ConfigMap) |
| `infra/modules/kubernetes/ingress_factory/main.tf` | 0 | Wire `max_body_size` into a Traefik Buffering middleware |
| `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` | 0 | Add 4th `auths` entry |
| `infra/stacks/infra/main.tf` | 0 + 4 | Containerd hosts.toml block (add Forgejo, later remove registry-private); compose template update |
| `infra/scripts/setup-forgejo-containerd-mirror.sh` (NEW) | 0 | One-shot rollout for existing nodes |
| `infra/stacks/monitoring/modules/monitoring/main.tf` | 0 | Forgejo integrity probe CronJob |
| `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` | 0 | Make alerts instance-aware |
| `infra/stacks/monitoring/main.tf` | 0 | Plumb `forgejo_pull_token` into module |
| `infra/.woodpecker/build-ci-image.yml` | 1 | Dual-push to add Forgejo target + tarball break-glass |
| `<each-project>/.woodpecker.yml` | 1 | Dual-push (NEW for fire-planner, wealthfolio-sync, hmrc-sync, freedify, beadboard, claude-memory-mcp; EDIT for payslip-ingest, job-hunter, claude-agent-service) |
| `infra/.woodpecker/{default,drift-detection,build-cli}.yml` | 3 | Flip `image:` to Forgejo for infra-ci |
| `infra/stacks/{beads-server,chrome-service,claude-agent-service,fire-planner,freedify/factory,job-hunter,payslip-ingest,wealthfolio}/main.tf` | 3 | Flip `image =` to Forgejo |
## Verification
- **Push** (Phase 0/1): `docker push forgejo.viktorbarzin.me/viktor/<name>` visible in Forgejo Web UI under viktor/.
- **Pull** (Phase 0): `crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1` succeeds with auto-synced Secret.
- **Dual-push** (Phase 1): every Woodpecker pipeline run pushes to BOTH endpoints — confirmed via HEAD checks on `<reg>:<sha>` for both.
- **Bake** (Phase 2): existing daily Forgejo `/api/healthz` external monitor stays green; integrity probe stays green; no `ContainerNearOOM` for forgejo pod.
- **Cutover** (Phase 3): `kubectl rollout status deploy/<svc> -n <ns>` succeeds. `kubectl describe pod` shows the image was pulled from `forgejo.viktorbarzin.me`.
- **Decommission** (Phase 4): `docker ps` on registry VM no longer shows `registry-private`. Brand-new namespace gets the Secret with only the Forgejo `auths` entry. Pull still works.


@ -1,150 +0,0 @@
# Post-Mortem: Authentik Embedded Outpost `/dev/shm` Fills — Cluster-Wide Auth Blocked
| Field | Value |
|-------|-------|
| **Date** | 2026-04-18 |
| **Duration** | ~44h for first-affected user (Emil, Apr 16 17:00 → Apr 18 12:40 UTC); ~30min for cluster-wide impact (Apr 18 12:10 → 12:40 UTC) |
| **Severity** | SEV2 — authentication blocked for all users on all Authentik-protected services |
| **Affected Services** | ~30+ Authentik-protected subdomains (every service using the `authentik-forward-auth` Traefik middleware) |
| **Status** | Root cause fixed; permanent mitigation applied; alerting still TODO |
## Summary
The `ak-outpost-authentik-embedded-outpost` pod's `/dev/shm` (default 64 MB tmpfs) filled to 100% with ~44,000 `session_*` files. Once full, every forward-auth request failed to write its session state with `ENOSPC` and the outpost returned HTTP 400 instead of the usual 302 → login redirect. All users on all protected services were unable to log in.
Detection was delayed because the initial user report (Emil) looked like a per-user bug — investigation spent two days chasing hypotheses about non-ASCII headers, user privileges, cookie corruption, and a newly-deployed Cloudflare Worker before the real cause was found in the outpost logs.
## Impact
- **User-facing**: HTTP 400 on initial GET of any Authentik-protected site (`terminal`, `grafana`, `immich`, `proxmox`, `london`, etc.). Existing sessions whose cookies were still cached worked until their cookie rotation attempt, then broke.
- **Blast radius**: Every service using the `authentik-forward-auth` middleware via the "Domain wide catch all" Proxy provider. Public and internal.
- **Duration**: First user (Emil) broken since 2026-04-16 ~17:00 UTC after his last valid session. Cluster-wide block when Viktor's cached session stopped being sufficient — roughly 2026-04-18 12:10 UTC. Fixed 12:40 UTC.
- **Data loss**: None. Session state in tmpfs is ephemeral by design.
- **Monitoring gap**: No Prometheus alert on outpost `/dev/shm` usage. No alert on outpost 400 response rate. Uptime Kuma external monitors hitting protected services returned 400s for 40+ hours without paging.
## Timeline (UTC)
| Time | Event |
|------|-------|
| **Apr 15 ~09:21** | `ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` pod started (normal rolling restart, unrelated to this incident). `/dev/shm` fresh. |
| **Apr 16 16:23:32** | Emil's last successful `authorize_application` event from his iPhone Brave (`85.255.235.23`). After this point, his subsequent requests create session files — his new sessions work briefly, then `/dev/shm` fills and every new session write fails. |
| **Apr 16 ~17:00 (approx)** | `/dev/shm` at ~44,000 files = 100% full. New forward-auth requests start returning 400 across the board. Viktor's browser still has a valid cached cookie so his requests succeed without writing new session files. |
| **Apr 17 10:30 (approx)** | Emil reports "terminal.viktorbarzin.me returns 400" to Viktor. |
| **Apr 18 09:00–12:30** | Deep investigation begins. Multiple hypotheses tested and rejected: non-ASCII bytes in Emil's `name` field, policy denial, cookie corruption, Rybbit Cloudflare Worker (deployed 2026-04-17 — suspicious timing, turned out unrelated), plaintext redirect scheme. |
| **Apr 18 12:20:39** | First direct evidence found: 2 Chrome 400s in Traefik logs from Emil's IP `176.12.22.76` (BG) on `terminal.viktorbarzin.me`, request missing `authentik_proxy_*` cookie. Redirect loop observed on iPhone IPv6 `2620:10d:c092:500::7:8c0d`. |
| **Apr 18 12:34** | Viktor reports he can no longer log in either. |
| **Apr 18 12:38** | `curl` against direct Traefik (`--resolve` bypassing Cloudflare) returns the same 400 with Authentik's CSP header — Cloudflare Worker exonerated. |
| **Apr 18 12:39** | Outpost log grep finds the smoking gun: `failed to save session: write /dev/shm/session_XXX: no space left on device`. |
| **Apr 18 12:40:13** | `kubectl delete pod ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` — tmpfs cleared on pod restart. Replacement pod `-8qscr` Running within 8s. Cluster unblocked. |
| **Apr 18 12:41** | Verified: direct-Traefik and via-CF curls both return `HTTP 302` to Authentik auth flow. Viktor authenticates successfully on `proxmox.viktorbarzin.me`. |
| **Apr 18 12:53** | Permanent fix applied via Authentik API: `PATCH /api/v3/outposts/instances/{uuid}/` setting `config.kubernetes_json_patches` to mount `emptyDir {medium: Memory, sizeLimit: 512Mi}` at `/dev/shm`. |
| **Apr 18 12:54** | Authentik controller reconciled the Deployment within 5s. `kubectl rollout restart` triggered new pod `-k5hv8`. `/dev/shm` now `tmpfs 256M` (4× the previous capacity; K8s clamps the tmpfs size to pod memory policy, but usage is capped at `sizeLimit=512Mi`). Forward-auth verified working. |
## Root Cause Chain
```
[1] goauthentik/proxy outpost uses gorilla/sessions FileStore
└─> each forward-auth request that has no valid session cookie writes
/dev/shm/session_<random> (~1500 bytes/file)
├─> [2] Catch-all Proxy provider's access_token_validity = hours=168 (7 days)
│ └─> each file's MaxAge = 7 days
│ └─> Upstream 5-min GC (PR #15798, shipped in ≥ 2025.10) can only
│ delete files whose MaxAge has EXPIRED, not whose age exceeds any
│ shorter threshold
├─> [3] Measured creation rate: ~18 files/min (Uptime-Kuma monitors +
│ real user traffic)
│ └─> 18/min × 60 × 24 × 7 = 181,440 steady-state files expected
└─> [4] Pod's /dev/shm default: 64 MB tmpfs (Kubernetes default)
└─> 64 MB / 1500 bytes ≈ 44,000 files maximum
└─> Full in approx 44,000 / (18 × 60 files/hour) ≈ 41 hours
└─> Actual observed time: pod started Apr 15 ~09:21,
first ENOSPC ~Apr 16 ~17:00 ≈ 32 hours
(some excess from Uptime-Kuma bursts)
[ENOSPC] -> every new forward-auth request fails -> outpost returns HTTP 400
-> Traefik forwards the 400 to the browser
-> user sees "400 Bad Request" on every protected site
```
## Why Diagnosis Took So Long
The initial report was framed as "Emil can't access terminal" — a per-user symptom. All four pre-registered hypotheses in the triage plan (non-ASCII bytes in header value, oversized cookie, corrupt user attribute, provider policy rejecting groups) were per-user explanations, all of which turned out to be falsified.
Contributing distractions:
1. **Misattribution in initial research** — an `authorize_application` event for Viktor (`vbarzin@gmail.com`) at 2026-04-18 08:09 was initially attributed to Emil. This led to the incorrect conclusion that Emil was authenticating successfully today.
2. **Rybbit analytics Cloudflare Worker deployed 2026-04-17** (see memory #792, commit around 2026-04-17 21:26 UTC) ran on `*.viktorbarzin.me/*`. Suspicious timing — Viktor's first instinct was "this must be the Worker." The Worker WAS adding long cookies to browser state, but not the cause of the 400. Exonerated by direct-Traefik curl returning the same 400.
3. **Viktor's cached session masked the outage** — only unauthenticated requests wrote new session files. Viktor's valid cookie kept working until the outpost needed to rotate state, at which point he also hit 400.
4. **The tell is in the outpost logs, not anywhere else.** `grep 'no space left on device'` on the outpost logs would have found it in seconds, but the investigation scope started with user records, then cookies, then the Worker — outpost logs weren't grepped until hour 3+.
## Contributing Factors
1. **No alert on outpost `/dev/shm` usage.** A simple `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` or equivalent cAdvisor metric would have paged hours before users noticed.
2. **No alert on outpost HTTP 400 rate.** `increase(authentik_outpost_http_requests_total{status="400"}[15m])` went from ~0 to thousands — invisible to our monitoring.
3. **No alert on "Uptime-Kuma external monitors all turning red simultaneously."** Every external monitor for a protected service started failing, but each is individually monitored — correlated failures across dozens of services didn't trigger a higher-level alert.
4. **Default Kubernetes `/dev/shm` is 64 MB.** This is fine for most workloads, but the goauthentik proxy outpost writes one session file per unauthenticated request with a 7-day retention. The default sizing is an accident waiting to happen on any busy deployment.
5. **Upstream issue [#20093](https://github.com/goauthentik/authentik/issues/20093)** ("External Proxy Outpost cannot use persistent session backend") is still OPEN as of 2026-04-18. Known architectural limitation.
6. **Catch-all Proxy provider is UI-managed, not Terraform-managed.** Its `access_token_validity` and the outpost's `kubernetes_json_patches` are configured in Authentik's PostgreSQL database, not in code. This means the fix applied today is invisible to `git log` and vulnerable to drift if someone changes it in the UI.
## Detection Gaps
| Gap | Impact | Fix |
|-----|--------|-----|
| No alert on outpost `/dev/shm` usage | Outage progressed from "Emil only" to "everyone" over 40+ hours silently | Add Prometheus alert: `kubelet_volume_stats_used_bytes{namespace="authentik",persistentvolumeclaim=~"dshm.*"} / kubelet_volume_stats_capacity_bytes > 0.8` (or per-container cAdvisor metric if emptyDir not a PVC) |
| No alert on outpost 400 rate spike | ~thousands of 400s over 40h didn't page | Alert on `increase(traefik_service_requests_total{code="400",service=~".*viktorbarzin-me.*"}[15m]) > N` OR on outpost-specific 400 metric |
| Uptime Kuma external monitors not cross-correlated | Dozens of red monitors didn't trigger a cluster-wide alert | Add meta-alert: "more than N [External] Uptime Kuma monitors down within 10 min" — strong signal of shared-infra failure |
| Outpost logs not searched during initial triage | Investigation went down 4 wrong paths before finding the real error | Runbook addition: for any Authentik forward-auth issue, FIRST command is `kubectl -n authentik logs -l goauthentik.io/outpost-name=authentik-embedded-outpost --since=1h \| grep -iE 'error\|no space'` |
## Prevention Plan
### P0 — Prevent this exact failure
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | Size `/dev/shm` up via `kubernetes_json_patches` on the embedded outpost config | Config | `PATCH /api/v3/outposts/instances/0eecac07-97c7-443c-8925-05f2f4fe3e47/` with `config.kubernetes_json_patches.deployment` adding an `emptyDir {medium: Memory, sizeLimit: 512Mi}` volume at `/dev/shm`. Authentik reconciles the Deployment within 5 minutes. **Applied 2026-04-18 12:53 UTC.** | **DONE** |
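A hedged sketch of the patch call (outpost UUID from this incident; the Authentik hostname, token env var, and the exact `kubernetes_json_patches` schema, assumed here to be RFC 6902 ops keyed by object kind, are not verified against the API):

```bash
# Hedged sketch only: PATCH the embedded-outpost instance so the Deployment
# gains a 512Mi memory-backed emptyDir at /dev/shm.
# NOTE: PATCHing "config" may replace the whole config object; merge the
# existing keys into the payload first.
curl -fsS -X PATCH \
  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
  -H 'Content-Type: application/json' \
  "https://<authentik-host>/api/v3/outposts/instances/0eecac07-97c7-443c-8925-05f2f4fe3e47/" \
  -d '{
    "config": {
      "kubernetes_json_patches": {
        "deployment": [
          {"op": "add", "path": "/spec/template/spec/volumes/-",
           "value": {"name": "dshm", "emptyDir": {"medium": "Memory", "sizeLimit": "512Mi"}}},
          {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
           "value": {"name": "dshm", "mountPath": "/dev/shm"}}
        ]
      }
    }
  }'
```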
### P1 — Detect this next time
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P1 | Prometheus alerts on outpost `/dev/shm` fill (two thresholds) | Alert | Group `Authentik Outpost` added in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`. `AuthentikOutpostMemoryHigh` (warning, working set > 1.5 GiB for 15m) + `AuthentikOutpostMemoryCritical` (critical, > 1.8 GiB for 5m) + `AuthentikOutpostRestarts` (warning, > 2 restarts in 30m). Applied 2026-04-18 13:16 UTC; loaded in Prometheus, state=inactive. | **DONE** |
| P1 | Uptime-Kuma meta-monitor: "N+ external monitors down simultaneously" | Alert | Either a Prometheus rule over `uptime_kuma_monitor_status == 0` counts, or a dedicated external probe. Very strong signal of shared-infra failure. | TODO |
| P1 | Bump tmpfs `sizeLimit` from 512Mi → 2Gi + set explicit container memory limit 2560Mi | Config | Patched outpost `kubernetes_json_patches` via Authentik API. 2026-04-18 13:06 UTC (sizeLimit), 13:22 UTC (container limit). **Gotcha**: `sizeLimit` alone is insufficient — writes to tmpfs count against container cgroup memory, and Kyverno's `tier-defaults` LimitRange sets a default `limits.memory: 256Mi` which OOM-kills the container before tmpfs fills. Fix is to also set `containers[0].resources.limits.memory` to `sizeLimit + working_set_headroom`. Verified 1.5 GB file write succeeds on the configured pod; df reports 2.0 GB tmpfs. Gives ~8× growth headroom at current probe rate. | **DONE** |
### P2 — Codify the fix so it survives drift
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. | TODO |
| P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |
### P3 — Upstream + architectural
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Reduces steady-state session file count from ~181k to ~26k (7× reduction). Trade-off: users re-auth daily. Viktor's call on UX tolerance. | TODO |
| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | The embedded outpost is a single replica Go binary with in-memory session state. An external, multi-replica outpost with Redis-backed sessions is the production-grade deployment. Probably overkill for a home-lab, but worth noting. | TODO (paused) |
## Lessons Learned
1. **When a per-user bug affects a shared infrastructure layer, suspect the shared layer, not the user.** The framing "Emil gets 400" led the first two hours of investigation down four user-specific rabbit holes. A sanity check ("does ANY user's non-cached request to a protected site return 400?") would have cut to the chase in minutes.
2. **Check the outpost logs first, not last.** For any Authentik forward-auth oddity, the first `kubectl logs` should be on the outpost pod, grepping for `error` and `ENOSPC`. The outpost is the component that actually makes the 400/302 decision.
3. **Cache + low-request users mask outages longer than you'd think.** Viktor had a valid cookie and his browser kept using it without writing new session files; he couldn't reproduce the bug Emil saw. The outage felt per-user until his cookie rotation needed to write state. **Any outage that "only affects some users" needs an active check from a fresh, cookie-less context** — `curl` with no cookie jar is the fastest way.
4. **Default tmpfs sizing + per-request file writes = ticking clock.** 64 MB of `/dev/shm` is a Kubernetes default, not a considered choice. Any workload that writes per-request files into tmpfs without aggressive GC will eventually fill, and the time-to-fill scales inversely with request rate. Worth auditing other services that might have the same pattern.
5. **UI-managed Authentik config is invisible to git review.** Our catch-all Proxy provider, embedded outpost config, property mappings, and policy bindings are all in Authentik's PostgreSQL database. The fix applied today (`kubernetes_json_patches`) is durable but not discoverable from `git log`. Drift risk. Codify in Terraform.
6. **Recently-deployed things are prime suspects but not always guilty.** The Rybbit Cloudflare Worker was deployed 2026-04-17 with a wildcard route. Viktor's intuition was "that's the recent change, must be the cause." It was a plausible theory and worth checking — but `curl --resolve` to bypass Cloudflare proved it innocent within 30 seconds. Always have a way to bypass the suspect layer cheaply.
## References
- Memory #836-841: incident details stored in claude-memory MCP (2026-04-18 12:42 UTC).
- Upstream issue: [goauthentik/authentik#20093](https://github.com/goauthentik/authentik/issues/20093) (open).
- Related upstream fix: [PR #15798](https://github.com/goauthentik/authentik/pull/15798) — 5-min session GC shipped in ≥ 2025.10 (our version 2026.2.2 has it, but insufficient alone).
- Beads task: `code-zru` (P1 bug).


@ -1,246 +0,0 @@
# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
| Field | Value |
|-------|-------|
| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
| **Duration** | ~40 min of blocked CI each time; only detected via pipeline failures |
| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest``default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
## Summary
On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
`clone` step with "image can't be pulled". The image in question — the CI
toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
to an OCI image index whose `linux/amd64` platform manifest
(`sha256:98f718c8…`) and its in-toto attestation
(`sha256:27d5ab83…`) returned **HTTP 404** from the private registry.
The index record itself still existed — it's the children that had been
garbage-collected out from under it.
This is the **second identical incident**: the same failure mode occurred
on 2026-04-13 against a different image. Both times the immediate fix was
to rebuild the image from scratch; both times the root cause was left
unaddressed.
## Impact
- **User-facing**: all CI pipelines failed. No automated Terraform applies,
no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
reruns) all failed with the same error.
- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
k8s workloads (those pull via containerd, which goes through the
pull-through proxy on :5000/:5010 — a completely different code path).
- **Duration on 2026-04-19**: from first P366 failure to the hot-fix
commit `c113be4d` — roughly 40 min. Pipelines that had already been
triggered queued up until the rebuild restored `:latest`.
- **Data loss**: none. The registry has the index object; the child
manifests are re-producible by rebuilding the source image.
- **Monitoring gap**: nothing alerted. The only signal was the individual
pipeline failures from Woodpecker. No Prometheus alert fires on "the
registry served a 404 for a tag that exists".
## Timeline (UTC, 2026-04-19)
| Time | Event |
|------|-------|
| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |
## Root Cause Chain
```
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
└─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
│ BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
│ per-repo revision-link files under
│      `<repo>/_manifests/revisions/sha256/<child-digest>/link` for
│ every child referenced by that tag's index. The raw blob data
│ under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
│ is NOT touched — GC owns that, and GC only runs Sunday.
├─> [3] If ANOTHER tag's index still references one of those same
│ children (common — successive rebuilds share layers), the child
│ blob survives. But the revision-link is gone, so the registry
│ API can no longer map `<repo>/manifests/sha256:<child>` back
│ to the blob. HEAD → 404, even though the bytes are on disk.
│ distribution/distribution#3324 is the upstream class of this bug.
└─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
intact on disk, its children's blob data files are intact on
disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
returns 404. The registry has the bytes, but cannot find them
through the API because the per-repo link bridge is gone.
[pull] containerd resolves `infra-ci:latest`
├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
└─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
└─> containerd fails the pull with "manifest unknown"
└─> woodpecker exit 126
```
> **Detection-gotcha** uncovered 2026-04-19 while implementing
> `fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for
> presence is NOT equivalent to "can the registry serve this child?" The
> authoritative check is whether
> `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script
> was rewritten to check the per-repo link file after the HTTP probe
> caught 38 real orphans the filesystem scan had reported clean.
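A sketch of the authoritative check (storage paths as referenced in this chain; the digest stays a placeholder):

```bash
# Sketch: the per-repo revision link is the authoritative "can the API serve
# this child?" check, not the shared blob data file.
BASE=/var/lib/registry/docker/registry/v2
REPO=infra-ci
CHILD=<child-digest-hex>   # hex part of the child digest, e.g. 98f718c8…
if [ -f "$BASE/repositories/$REPO/_manifests/revisions/sha256/$CHILD/link" ]; then
  echo "registry can serve $REPO@sha256:$CHILD"
else
  echo "ORPHAN: blob may exist at $BASE/blobs/sha256/${CHILD:0:2}/$CHILD/data but the API will 404"
fi
```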
## Why Existing Remediation Missed It
1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron
walks `_layers/sha256/` and removes link files whose blob `data` is
missing. It does NOT inspect `_manifests/revisions/sha256/` to see
whether an image-index's referenced children still exist. That's
exactly the class of orphan this incident represents.
2. **`registry:2` image tag was floating.** `docker-compose.yml` pinned
only to `registry:2`. Whatever Docker Inc. last rebuilt as
"v2-current" was running, with no version pin. Any regression in
the upstream walker would silently swap in.
3. **No integrity monitoring.** Prometheus alerted on cache hit rate
and registry-down, but nothing probes "are the manifests the registry
advertises actually fetchable?"
4. **CI pipeline didn't verify its own push.** `buildx --push` returns
success as soon as it uploads. If a child blob upload 0-byted or
the client disconnected mid-push (distinct from the GC mode but the
same on-disk symptom), nothing would notice until the next pull.
## Permanent Fix — Three Phases
### Phase 1 — Detection (ship today)
1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
After `build-and-push`, a new step walks the just-pushed manifest
(and every child of an image index) and HEADs every referenced blob.
Any non-200 fails the pipeline immediately, catching broken pushes at
the source rather than leaking them to consumers.
2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
namespace) walks the private registry's catalog, HEADs every tag's
manifest, follows each image index's children, and pushes
`registry_manifest_integrity_failures` to Pushgateway. Accompanying
alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
3. **Post-mortem** — this document. Linked from
`.claude/reference/service-catalog.md` via the new runbook.
### Phase 2 — Prevention
4. **Pin `registry:2` → `registry:2.8.3`** in
`modules/docker-registry/docker-compose.yml` (all six registry
services). Removes the floating-tag footgun.
5. **Extend `fix-broken-blobs.sh`** to scan every
`_manifests/revisions/sha256/<digest>` that is an image index and
flag children whose blob `data` file is missing. The script prints a
loud WARNING per orphan; it does not auto-delete the index, because
deleting a published image is a conscious decision, not an automated
repair.
### Phase 3 — Recovery tooling
6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
need a cosmetic Dockerfile edit — POST to the Woodpecker API or
click "Run manually" in the UI.
7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact
command sequence for the next time this happens, plus fallback paths.
## Out of Scope
- **Pull-through caches.** The DockerHub / GHCR mirrors on
`:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
orphan problem is private-registry-only. No changes to nginx or
containerd `hosts.toml`.
- **Registry HA / replication.** Single-VM SPOF is a known
architectural choice. Harbor or a replicated registry would solve
more than this incident requires, at multi-day cost. Synology offsite
snapshots already give RPO < 1 day.
- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
necessary; the fix is detection + rebuild, not "stop cleaning up".
## Lessons
- **Repeat incidents deserve root-cause work, not a third hot-fix.** The
2026-04-13 incident was closed when CI turned green. Without a probe
and without a scan for orphan indexes, the next incident was
inevitable — and it happened six days later against a different image.
- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
outage feature.** The registry was serving 404s for 2+ hours before
anyone noticed, because our only signal was "pipeline failures" and
our eyes were elsewhere. The new probe closes that gap.
- **CI pipelines should verify their own output.** The `buildx --push`
"success" exit code is not a guarantee of pulled-back integrity — as
this incident proves. A 30-second post-push HEAD walk is cheap
insurance.
## Related
- **Prior incident (same failure mode, different image)**: memory `709`
/ `710` — 2026-04-13.
- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
- **Upstream bug class**: `distribution/distribution#3324`.
## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)
Same failure class, broader scope. The `registry-integrity-probe`
surfaced 38 broken manifest references persisting after the 04-19
infra-ci fix. `beads-dispatcher` + `beads-reaper` CronJobs were stuck
`ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34
affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child
manifests were absent from blob storage (same orphan pattern).
**Action taken**:
1. Bumped `beads-server/main.tf` var default `claude_agent_service_image_tag`
from `0c24c9b6``2fd7670d` (the canonical tag in
`claude-agent-service/main.tf`), reused — same image already healthy
on the registry. `scripts/tg apply` on `beads-server`. Deleted the
stuck Jobs so new CronJob ticks could fire.
2. Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP
probe using `registry-probe-credentials` K8s Secret. Deleted each
via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 —
claude-agent-service:latest pointed at an already-deleted digest).
3. Ran `docker exec registry-private /bin/registry garbage-collect
/etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob
storage.
4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed
at missing children, so no cached copies would survive pod
reschedule):
- `freedify:latest` / `freedify:c803de02` — built on registry VM
directly (no CI pipeline exists for this image; python FastAPI).
- `beadboard:17a38e43` / `beadboard:latest` — GHA
`workflow_dispatch` failed at registry login (missing
`REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on
registry VM directly as the fallback. GitHub secret gap is a
follow-up — beads `code-8hk` notes it.
- `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0`
— Woodpecker pipeline #8 on repo 81. Pipeline `kubectl set image`'d
the Deployment to `ae1420a0` (drift vs TF `v5`/`v8` defaults, but
that drift is pre-existing, not introduced by this cleanup).
- `wealthfolio-sync:latest` — **not rebuilt**. Monthly CronJob (next
run 2026-05-01), no source tree or CI pipeline available in the
monorepo; deferred for separate follow-up.
**Post-cleanup state**:
- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
- Alert `RegistryManifestIntegrityFailure` cleared (was firing for
5h 32m).
- No `ImagePullBackOff` pods anywhere in the cluster.
- 28 of 34 deleted manifests were **dangling tags not referenced by any
workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe
deletes, no rebuilds needed.
**Permanent fix still in flight**: Phase 2/3 of this post-mortem
(post-push verification in CI, atomic `cleanup-tags.sh`) — not
addressed by this cleanup. The probe continues to be the
authoritative detector.


@ -1,155 +0,0 @@
# Post-Mortem: Vault Raft Leader Deadlock + NFS Kernel Client Corruption Cascade
> **Resolution status (2026-04-25):** Resolved structurally by code-gy7h
> migration. All 3 vault voters now on `proxmox-lvm-encrypted` block
> storage; the NFS fsync incompatibility that triggered the original
> raft hang is no longer reachable. See
> `docs/plans/2026-04-25-nfs-hostile-migration-plan.md` Phase 2.
| Field | Value |
|-------|-------|
| **Date** | 2026-04-22 |
| **Duration** | External endpoint 503 from ~09:00 UTC to ~11:43 UTC (~2h 43m). vault-2 became active leader 11:43:28 UTC. |
| **Severity** | SEV1 (Vault — single source of secrets for 40+ services) |
| **Affected Services** | All ESO-backed services (password rotation paused). CronJobs that read plan-time secrets (14 stacks). Woodpecker CI (blocked pipeline `d39770b3`). Everything with `ExternalSecret` refresh interval ≤ 2h. |
| **Status** | Vault HA operational with vault-0 + vault-2 quorum. vault-1 still stuck ContainerCreating on node2 (third node2 reboot pending; workload can accept 2/3 quorum). Terraform fix committed as `2f1f9107`; apply pending. |
## Summary
A Vault raft leader (`vault-2`) entered a stuck goroutine state where its cluster port (8201) accepted TCP but never completed msgpack RPC. Standbys could not detect leader death because the TCP layer looked healthy, so no re-election fired. The only recovery was to kill the leader. During recovery, abrupt `kubectl delete --force` of the stuck Vault pods left kernel-side NFS client state on k8s-node1/node3/node4 in a corrupted state — **all new NFS mounts from those nodes timed out at 110s**, while existing mounts kept working. This created a cascade: the stuck leader blocked quorum, killing the leader broke NFS on the destination node for the recreated pod, force-killing the stuck pods left zombie `containerd-shim` processes kubelet couldn't clean up, and the resulting volume-manager loops pegged kubelet into 2-minute timeouts. Recovery required a VM hard-reset for node2 and node3 (kubelet was zombie on both). vault-0 remains down pending node4 reboot.
## Impact
- **User-facing**: `vault.viktorbarzin.me` returned HTTP 503 for ~2h. Any service that needed a Vault token during that window was degraded; Woodpecker CI pipeline blocked.
- **Blast radius**: 3/3 Vault pods affected (raft deadlock blocked re-election even with standbys up). Three k8s nodes degraded simultaneously with kernel NFS client stuck state (node1, node3, node4). Two nodes required VM hard-reset to recover kubelet (node2, node3).
- **Duration**: Degraded ~2h; resolution required sequential hard reboots.
- **Data loss**: None. Raft data integrity preserved on NFS. vault-1 came up with index 2475732, caught up to 2476009+ once leader was elected.
- **Observability gap**: No alert fired for the stuck raft leader. Standbys report `HA Mode: standby, Active Node Address: <leader IP>` as if healthy even when leader is hung.
## Timeline (UTC)
| Time | Event |
|------|-------|
| **~09:00** | `vault-2` (original raft leader) enters hung state — port 8201 open but msgpack RPCs hang. Its own logs go silent. Standbys continue heartbeat/appendEntries with `msgpack decode error [pos 0]: i/o timeout`. Neither standby triggers re-election because raft transport does not distinguish "TCP open + silent" from "TCP open + healthy". |
| **~09:15** | External endpoint starts serving 503. Woodpecker CI pipeline `d39770b3` blocks waiting for Vault. |
| **09:59** | Operator force-deletes `vault-2` pod — replacement comes up on node3 and enters candidate loop (term=32), cannot get quorum because DNS for `vault-0` is NXDOMAIN (ContainerCreating) and vault-1 does not respond (its raft goroutine also hung). |
| **10:07** | Operator force-deletes `vault-1` — new `vault-1` gets scheduled to node2. Its raft would be fine, but kubelet on node2 hangs in the pod cleanup path for the old pod's NFS mount. Concurrently, a new `vault-0` pod is attempted on node4, but **NFS mount from node4 times out at 110s** — the host kernel NFS client is in a degraded state that blocks all new mounts (including to completely different NFS paths like `/srv/nfs/ytdlp`). |
| **10:09** | Diagnostic test: from node1 and node4 CSI pods, `mount -t nfs -o nfsvers=4 192.168.1.127:/srv/nfs/ytdlp /tmp/test` times out. From node2 and node3 the same mount succeeds. NFS server is healthy (`showmount -e` works; `rpcinfo` shows all programs registered). The common factor on the broken nodes: they had a force-terminated Vault pod earlier in the session, leaving stuck `mount.nfs` processes in D-state. |
| **10:18** | Manual unmount of stale NFS mount from the force-deleted old vault-0 pod on node4. New mount attempts from CSI still time out — clearing the old mount did not recover kernel NFS client state. |
| **10:22** | Workaround discovered: mounting with `nfsvers=4.0` or `nfsvers=4.1` (instead of default `nfsvers=4` which negotiates to 4.2) succeeds on broken nodes (repro sketch below the timeline). Confirms the stuck state is version-specific (NFSv4.2 session state), not a general NFS issue. Decision: rather than change CSI mount options cluster-wide (risk of remounting existing 48+ PVs), fix the nodes directly. |
| **10:31** | Investigated node2 kubelet state: old `vault-1` container shows `vault` process in **Z (zombie)** state with its `sh` wrapper stuck in `do_wait` in kernel (`zap_pid_ns_processes`). Containerd-shim PID killed manually — `sh` and zombie reparented to init but remained stuck (uninterruptible kernel wait tied to NFS). |
| **10:34** | Attempted `systemctl restart kubelet` on node2 — kubelet itself went into Z (zombie) with 2 tasks still attached. Classic NFS-related kernel deadlock. |
| **10:42** | **Decision: hard-reset node2 VM** (`qm reset 202`). Disruption: 22 pods evicted. |
| **10:43** | node2 back up (Ready). CSI registered. New `vault-1` scheduled to node2. NFS mount succeeded (fresh kernel state). Kubelet began chowning volume — **extremely slow, ~3 files per minute over NFS**. |
| **10:48** | `vault-1` (2/2 Running) unsealed. **Raft leader elected: `vault-2` wins term 32, election tally=2** (vault-1 voted yes once it came up, vault-0 unreachable). However vault-2's vault-layer (HA active/standby) never transitioned to active — raft leader with `active_time: 0001-01-01T00:00:00Z` and `/sys/ha-status` returning 500. |
| **10:50** | Restarted `vault-2` pod to force clean leader transition. New `vault-2` stuck in chown loop on node3 (same pattern as node2 earlier). |
| **10:54** | Patched the Vault `StatefulSet` with `fsGroupChangePolicy: OnRootMismatch` so subsequent recreations skip the recursive chown. |
| **10:57** | Force-deleted `vault-2` and `06fa940b` pod directory on node3. New pod spawned but kubelet again stuck on phantom state from the old pod. |
| **11:01** | **Hard-reset node3 VM** (`qm reset 203`). |
| **11:03** | First 200 response: vault-1 elected leader, vault-2 standby. Premature celebration — vault-1's audit log on node2 NFS starts timing out; `/sys/ha-status` returns 500 even though raft thinks vault-1 is active. |
| **~11:18** | Service regresses. `vault-1` audit writes hanging (`event not processed by enough 'sink' nodes, context deadline exceeded`). Readiness probe fails; pod goes 1/2; `vault-active` endpoint stays pointed at vault-1's IP but backend unresponsive → 503. |
| **11:22** | Force-restart `vault-1` to trigger re-election with new pod. Delete + containerd-shim cleanup leaves yet another zombie on node2. Same pattern: force-delete → zombie. |
| **11:29** | **Hard-reset node4 VM** (`qm reset 204`). Rationale: vault-0 was still blocked there; 74 pods on node4 contribute to NFS server load (load avg 16 on PVE). After reboot, vault-0 mounts its PVCs on fresh kernel state and comes up 2/2 Running 11:31. |
| **11:31** | Increased PVE NFS threads from 16 to 64 (`echo 64 > /proc/fs/nfsd/threads`). Did not help immediate mount failures — the stuck state is per-client kernel, not server capacity. |
| **11:38** | Discover DNS resolution issue: vault-2's Go resolver returns NXDOMAIN for short names `vault-0.vault-internal` even though glibc resolver works. CoreDNS restart issued earlier didn't fix. Restart vault-2 pod to force fresh resolver state. |
| **11:42** | **Second hard-reset of node3 VM** (`qm reset 203`). Kubelet+CSI re-register; vault-2 scheduled, NFS mounts finally succeed on fresh kernel state. |
| **11:43:28** | **vault-2 becomes active leader.** External endpoint returns 200 and stays there. vault-0 follower, catches up to index 2477632+. vault-1 still stuck on node2; left for later recovery. |
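For reference, the 10:09/10:22 diagnosis can be reproduced with a mount probe like the one below, run as root on a suspect node. The server IP and test export follow the timeline entries above; the mountpoint is arbitrary.

```sh
# Compare default NFSv4 negotiation (ends up at 4.2) against pinned 4.1.
# On an affected node the first mount hangs ~110s; the second succeeds.
mkdir -p /tmp/nfs-probe

timeout 130 mount -t nfs -o nfsvers=4 192.168.1.127:/srv/nfs/ytdlp /tmp/nfs-probe \
  && { echo "negotiated v4.x: OK"; umount /tmp/nfs-probe; } \
  || echo "negotiated v4.x: timed out (kernel NFS client state likely stuck)"

timeout 130 mount -t nfs -o nfsvers=4.1 192.168.1.127:/srv/nfs/ytdlp /tmp/nfs-probe \
  && { echo "pinned v4.1: OK"; umount /tmp/nfs-probe; } \
  || echo "pinned v4.1: also failing (problem is broader than 4.2 session state)"
```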
## Root Cause Chain
```
[1] Vault-2 raft goroutine hang (root cause — upstream Vault bug or infra-induced)
└─> Cluster port 8201 accepts TCP but never responds to msgpack RPCs
└─> Standbys' appendEntries calls return `msgpack decode error [pos 0]: i/o timeout`
└─> Raft protocol: no re-election because leader is heartbeating at the TCP level
└─> External endpoint returns 503 because HA layer has no active leader
[2] Recovery complication — abrupt pod termination
└─> `kubectl delete --force --grace-period=0` on vault-0/1/2
└─> containerd-shim fails to kill container cleanly (NFS I/O in D-state)
└─> vault process ends as zombie; sh wrapper stuck in do_wait
└─> Kubelet retries forever, cannot tear down old pod volumes
└─> NFS-CSI unmount requests succeed at the NFS layer but kubelet's
volume state-machine never marks the volume as unmounted
(stale 0000-mode mount directory blocks teardown completion)
[3] Kernel NFS client corruption on node1/node4
└─> Force-terminated Vault pod left stuck `mount.nfs` processes in D-state
└─> Kernel NFS4.2 client session state corrupted (held open mount slot)
└─> All subsequent mount syscalls for nfsvers=4 block 110s+ waiting for
session slot that will never be freed
└─> Manual workaround: nfsvers=4.1 bypasses the corrupted session state
[4] Kubelet starvation
└─> Combination of (2) and (3) means kubelet is stuck in a 2-minute volume-setup
context deadline loop — each iteration times out, new iteration restarts,
infinite loop
└─> Hard VM reset is the only exit
└─> After reset, kubelet starts clean, CSI re-registers, mounts succeed
[5] Slow recursive chown amplifies impact
└─> Default fsGroupChangePolicy: Always (Vault Helm chart 0.29.1 default)
└─> Kubelet walks every file on NFS setting gid=1000
└─> Over a 1GB audit log and a 47MB raft.db on NFS with timeo=30,retrans=3,
each chown syscall takes seconds; kubelet 2-minute deadline runs out
before the walk finishes
└─> Loop never exits even when ownership is already correct
```
## Why This Failed
1. **Raft transport does not detect stuck leaders.** If TCP is open and the process is alive enough to hold the port, standbys assume the leader is healthy. A stuck goroutine that never responds to RPCs appears to raft as "leader with high RTT" and does not trigger re-election. This is an upstream Vault bug (or at least a missing liveness check).
2. **Abrupt pod termination + NFS = kernel-level zombie.** When a Vault pod holding an NFS mount is force-killed before it cleanly closes file handles, the kernel's NFSv4.2 client session state becomes corrupted. This blocks all new mounts from that node — not just to the same NFS path, but to ANY NFS path on the same server. The fix is a kernel reboot; there is no userspace recovery.
3. **Vault data on NFS violates the documented rule.** `infra/.claude/CLAUDE.md` explicitly states: *"Critical services MUST NOT use NFS storage — circular dependency risk."* Vault currently uses `nfs-proxmox` for both `dataStorage` and `auditStorage`. If Vault had been on `proxmox-lvm-encrypted`, none of the NFS corruption cascade would have happened.
4. **fsGroupChangePolicy: Always is the Helm default.** Every pod restart walks every file over NFS. On a 1GB audit log with degraded NFS RTT, this takes longer than kubelet's internal 2-minute deadline, causing infinite restart loops. `OnRootMismatch` makes chown a no-op when the root is already correct (which it always is after first setup).
5. **No alert for this failure mode.** Prometheus alerts exist for `VaultSealed`, `VaultDown` (`up` metric), and backup staleness, but none for "raft leader has been running without advancing commit index" or "standby reports leader but leader's `/sys/ha-status` returns 500".
## Remediation (Applied)
- [x] Hard-reset node2 and node3 VMs to clear kernel NFS state and kubelet zombies.
- [x] Manually patched live `StatefulSet vault/vault` with `fsGroupChangePolicy: OnRootMismatch` to stop the chown loop.
- [x] Lazy-unmounted stale NFS mounts from force-deleted pod directories on node2 and node3.
- [x] Removed stale kubelet pod directories (`/var/lib/kubelet/pods/<UID>`) that had 0000-mode mount subdirectories blocking teardown.
- [x] Updated `stacks/vault/main.tf` with the `fsGroupChangePolicy` setting so the next `scripts/tg apply vault` makes it durable.
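The `fsGroupChangePolicy` patch from the second applied item was roughly the following (a sketch, assuming the chart-rendered StatefulSet is `vault` in the `vault` namespace; note that patching the pod template rolls the pods):

```sh
# Only re-chown the volume when the root's ownership does not already match
# fsGroup, instead of walking every file on every restart.
kubectl -n vault patch statefulset vault --type merge -p \
  '{"spec":{"template":{"spec":{"securityContext":{"fsGroupChangePolicy":"OnRootMismatch"}}}}}'

# Confirm the pod template picked it up.
kubectl -n vault get statefulset vault \
  -o jsonpath='{.spec.template.spec.securityContext.fsGroupChangePolicy}{"\n"}'
```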
## Remediation (Pending)
- [ ] **Hard-reset node4** to recover vault-0 (same NFS kernel corruption pattern).
- [ ] **Run `scripts/tg apply` on the vault stack** to persist the fsGroupChangePolicy change.
- [ ] **Add Prometheus alert `VaultRaftLeaderStuck`** — fire when `vault_raft_last_index_gauge` (or derivation from `vault_runtime_total_gc_runs`) stops advancing for >2 minutes while `vault_core_active` is 1.
- [ ] **Add Prometheus alert `VaultHAStatusUnavailable`** — fire when `vault_core_active{}` reports 0 across all pods but `up{job="vault"}` reports 1 (HA layer broken but pods alive).
- [ ] **Migrate Vault to `proxmox-lvm-encrypted` block storage** — eliminates the entire NFS failure class. This follows the rule already documented in `infra/.claude/CLAUDE.md`. Tracked as beads task (open after Dolt is back up; currently down on node4).
- [ ] **Consider raising kubelet volume-manager deadline** for large-volume chown scenarios, or document the `fsGroupChangePolicy: OnRootMismatch` requirement for all NFS-backed StatefulSets.
- [ ] **Runbook**: `docs/runbooks/vault-raft-leader-deadlock.md` — how to detect stuck leader, safe force-restart procedure that avoids zombie pods, NFS kernel state recovery.
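A rough shape for the two pending alerts, sketch only: the metric names come from the items above and still need validating against what this cluster's Vault telemetry actually exposes, and the `job="vault"` selector is an assumption.

```sh
cat <<'EOF' > vault-raft-alerts.yaml
groups:
  - name: vault-raft
    rules:
      - alert: VaultRaftLeaderStuck
        expr: (vault_core_active == 1) and (delta(vault_raft_last_index_gauge[2m]) == 0)
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Vault reports an active leader but the raft index is not advancing"
      - alert: VaultHAStatusUnavailable
        expr: (sum(vault_core_active) == 0) and (min(up{job="vault"}) == 1)
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "All Vault pods are up but none reports itself active"
EOF
promtool check rules vault-raft-alerts.yaml   # requires promtool on the workstation
```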
## Contributing Factors
1. **NFS mount options use bare `nfsvers=4`**. This negotiates to the highest version the server supports (NFSv4.2). When 4.2 session state corrupts, mounts fail; 4.1 works. Pinning to `nfsvers=4.1` in the `nfs-proxmox` StorageClass would make the failure mode recoverable without node reboot, but would also require recreating 48+ existing PVs (volumeAttributes are immutable). Deferred.
2. **`kubectl delete --force` is the default for stuck pods**. Operators reach for force-delete when a pod won't terminate, but this leaves containerd in an inconsistent state when the underlying storage is hung. Better approach: identify the stuck process (typically `mount.nfs` or a kernel NFS callback) and fix the root cause before force-deleting.
3. **Beads / Dolt server was on node4**, so beads task tracking went offline during this incident and couldn't be used to log progress cross-session.
4. **node1 was cordoned mid-incident** to prevent rescheduling to a node with confirmed NFS issues, but this reduced the scheduling surface for anti-affinity-sensitive StatefulSets.
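To see which minor version existing mounts actually negotiated (factor 1 above), something like this on any node is enough; `nfsstat -m` output varies slightly by distro:

```sh
# Count negotiated NFS versions across current mounts. With bare nfsvers=4
# these typically report vers=4.2 against the Proxmox server.
nfsstat -m | grep -o 'vers=[0-9.]*' | sort | uniq -c

# Same information straight from the kernel's mount table.
grep ' nfs4 ' /proc/mounts | grep -o 'vers=[0-9.]*' | sort | uniq -c
```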
## Learnings
1. **NFS for stateful critical services is structurally unsafe.** When NFS breaks, the recovery involves killing pods → which can break NFS further → until a reboot. The rule exists for a reason; Vault should never have been on NFS.
2. **Raft liveness needs application-layer probing, not TCP.** Every time we've seen a "stuck leader" issue in the homelab, TCP was fine and the app was unresponsive. A lightweight RPC probe with a short timeout and Prometheus alert would catch this in minutes instead of hours.
3. **kubelet volume-manager is fragile against stuck NFS.** Once kubelet enters a chown loop with a context deadline shorter than the chown duration, it cannot make progress — even when the filesystem is otherwise healthy. `OnRootMismatch` is effectively mandatory for any pod with `fsGroup` and a volume >100MB.
4. **VM hard-reset is cheap but disruptive.** The two reboots took ~60 seconds each but evicted 22+44 = 66 pods. Doing this twice in one session is a lot of churn. A post-mortem-driven improvement: pre-prepare "hot-standby" capacity so we can cordon+drain instead of hard-reset when kubelet zombies appear.
5. **Enforcement of a rule is worth more than documenting it.** The CLAUDE.md already says "critical services must not use NFS". The vault stack violates it. A rule without enforcement (validation, linting, CI) gets ignored in the rush to ship.
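A minimal version of the application-layer probe learning 2 argues for: short-timeout HTTP checks against each pod's own API instead of trusting the raft transport. The host names assume the standard `vault-internal` headless service in the `vault` namespace and TLS on 8200 (drop `https`/`-k` if the listener is plain HTTP); `sys/ha-status` may require a token depending on policy.

```sh
# A hung leader typically keeps TCP open but fails or times out on these calls.
# %{http_code} prints 000 when the request never completes.
for i in 0 1 2; do
  host="vault-${i}.vault-internal.vault.svc.cluster.local:8200"
  health=$(curl -sk --max-time 2 -o /dev/null -w '%{http_code}' "https://${host}/v1/sys/health")
  ha=$(curl -sk --max-time 2 -o /dev/null -w '%{http_code}' \
        -H "X-Vault-Token: ${VAULT_TOKEN:-}" "https://${host}/v1/sys/ha-status")
  echo "vault-${i}: sys/health=${health} sys/ha-status=${ha}"
done
```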
## References
- Related: `docs/post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md` — previous Vault+NFS incident (different root cause, similar blast pattern).
- Vault helm chart 0.29.1 default `fsGroupChangePolicy` is unset (behaves as `Always`).
- Upstream Vault HA layer: raft leader → vault-active transition is in `vault/external_tests/raft`. Stuck goroutine pattern not documented as a known issue.


@@ -1,185 +0,0 @@
# Beads Auto-Dispatch Runbook
Users can hand work to the headless `beads-task-runner` agent by assigning a
bead to the sentinel user `agent`. Two CronJobs in the `beads-server`
namespace drive the pipeline:
- **`beads-dispatcher`** — every 2 min: picks up the highest-priority
`assignee=agent`/`status=open` bead with non-empty acceptance criteria,
claims it by flipping to `in_progress`, and POSTs it to BeadBoard's
`/api/agent-dispatch`. BeadBoard forwards to `claude-agent-service` with
the existing bearer-token flow.
- **`beads-reaper`** — every 10 min: flips any `assignee=agent` +
`status=in_progress` bead whose `updated_at` is older than 30 min to
`status=blocked` with an explanatory note. Catches pod crashes mid-run.
The manual BeadBoard Dispatch button continues to work in parallel.
## Flow diagram
```
user: bd assign <id> agent
Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐
│ │
▼ │
CronJob: beads-dispatcher │
1. GET beadboard/api/agent-status (busy?) │
2. bd query 'assignee=agent AND status=open' │
3. bd update -s in_progress (claim) │
4. POST beadboard/api/agent-dispatch │
5. bd note "dispatched: job=…" │
│ │
▼ │
claude-agent-service /execute │
beads-task-runner agent runs; notes/closes bead │
│ │
▼ │
done ──► next tick picks up the next bead ───────────────┘
CronJob: beads-reaper (every 10 min)
for bead (assignee=agent, status=in_progress, updated_at > 30 min):
bd note "reaper: no progress for Nm — blocking"
bd update -s blocked
```
## Usage
### Hand a bead to the agent
```
bd create "Title" \
-d "Full context — files, services, error messages. Any agent with no prior context must be able to execute this." \
--acceptance "Concrete, verifiable criteria" \
-p 2
bd assign <new-id> agent
```
**Acceptance criteria is required.** Beads without it are skipped by the
dispatcher and stay in `open` forever. This is intentional — the
`beads-task-runner` agent expects clear done conditions.
### Take a bead back (unassign)
```
bd assign <id> ""
```
If the bead is already `in_progress`, also reset it:
```
bd update <id> -s open
```
### Pause auto-dispatch
```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
```
This sets `spec.suspend: true` on both CronJobs. Existing running jobs
continue; no new ticks fire. Re-enable by re-applying with
`beads_dispatcher_enabled=true` (the default). Manual BeadBoard Dispatch
remains available while paused.
### Read the logs
```
# Recent dispatcher runs
kubectl -n beads-server get jobs --selector=job-name --sort-by=.metadata.creationTimestamp | grep beads-dispatcher | tail
kubectl -n beads-server logs job/<dispatcher-job-name>
# Tail the underlying agent once a bead dispatches
kubectl -n claude-agent logs -l app=claude-agent-service -f
# Inspect reaper decisions
kubectl -n beads-server get jobs | grep beads-reaper | tail
kubectl -n beads-server logs job/<reaper-job-name>
```
### Inspect a specific bead's dispatch history
```
bd show <id> --json | jq '{status, assignee, notes, updated_at}'
```
Both the dispatcher and reaper write dated notes (`auto-dispatcher claimed
at…`, `dispatched: job=…`, `reaper: no progress for…`) so the audit trail
lives on the bead itself.
## Reaper semantics — when a bead becomes `blocked`
The reaper flips a bead to `blocked` if:
- `assignee = agent`, AND
- `status = in_progress`, AND
- `updated_at` is more than **30 minutes** in the past.
Every `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner`
agent never trips the reaper — it notes progress as it works. A `blocked`
bead is a signal that:
- the agent pod crashed mid-run (`kubectl -n claude-agent delete pod` test),
- the job hit its 15-minute budget timeout inside `claude-agent-service`
without notes (rare — the agent usually notes failure before exiting),
- `claude-agent-service` was restarted during the run (in-memory job state
is lost; see [known risks](#known-risks)).
Recovery: read the reaper note, reopen manually if appropriate:
```
bd update <id> -s open
bd assign <id> agent # re-arm for next dispatcher tick
```
## Design choices
- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches `claude-agent-service`'s single-slot
`asyncio.Lock`. With a 2-min poll cadence and ~5-min average run,
throughput is ~12 beads/hour. Parallelism is a separate plan.
- **Fixed agent (`beads-task-runner`)** — read-only rails, matches BeadBoard's
manual Dispatch button. Broader-privilege agents stay manual.
- **CronJob (not in-service polling, not n8n)** — matches existing infra
pattern (OpenClaw task-processor, certbot-renewal, backups), TF-managed,
easy to pause.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
the image-seeded file. The CronJob's init step copies it into `/tmp/.beads/`
because `bd` may touch the parent directory and ConfigMap mounts are
read-only.
## Known risks
- **In-memory job state in `claude-agent-service`** — if the pod restarts
mid-run, the job record is lost. The reaper catches this after 30 min.
Persistent job store is deferred.
- **Prompt injection via bead fields** — a malicious bead description could
try to steer the agent. The `beads-task-runner` rails + token budget +
timeout are the defense. Identical exposure as the manual Dispatch button.
- **Image tag drift** — `claude_agent_service_image_tag` in
`stacks/beads-server/main.tf` mirrors `local.image_tag` in
`stacks/claude-agent-service/main.tf`. Bump both when the image rebuilds,
or the dispatcher/reaper will run on an older layer. (They only need
`bd`, `curl`, `jq` — stable across rebuilds — so the drift is low-risk.)
- **`bd` JSON schema changes** — the reaper's `jq` reads `.id` and
`.updated_at`. If a future `bd` upgrade renames these, the reaper breaks
silently (no reaping, no alert). `BD_VERSION` is pinned in the image
Dockerfile.
## Verification after change
```
# Both CronJobs exist with the right schedule / SUSPEND state
kubectl -n beads-server get cronjob
# End-to-end smoke test
bd create "auto-dispatch smoke test" \
-d "Read /etc/hostname inside the agent sandbox and close." \
--acceptance "bd note includes 'hostname=' and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '.notes'
# → contains 'auto-dispatcher claimed' + 'dispatched: job=<uuid>'
```


@@ -1,126 +0,0 @@
# Runbook: Forgejo registry break-glass — recovering infra-ci
Last updated: 2026-05-07
## When to use this runbook
When **all** of the following are true:
1. Forgejo (`forgejo.viktorbarzin.me`) is unreachable.
2. `registry-private` is also gone (post-Phase 4 of the consolidation),
so you can't fall back to `registry.viktorbarzin.me:5050/infra-ci`.
3. You need to run an infra Woodpecker pipeline (apply, build-cli,
drift-detection, etc.) — but those pipelines pull `infra-ci` and
crash because the registry is down.
If only Forgejo is down but `registry-private` is still alive, the
pipelines work — `image:` references in `infra/.woodpecker/*.yml`
still hit `registry.viktorbarzin.me:5050/infra-ci` until Phase 3
flips them. Skip this runbook entirely.
## What's available
The `build-ci-image.yml` Woodpecker pipeline saves a tarball after
each successful push:
| Location | Path |
|---|---|
| Registry VM disk (10.0.20.10) | `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` |
| Registry VM disk (latest symlink) | `/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz` |
| Synology NAS (offsite copy via daily-backup sync) | `/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/` |
The registry VM keeps the last 5 tarballs. Synology mirrors them
through the existing offsite-sync-backup job
(`/usr/local/bin/offsite-sync-backup`).
## Recovery procedure
The goal is to get a working `infra-ci` image onto a k8s node so
Woodpecker pods can run it. Then run a Woodpecker pipeline that
restores Forgejo from PVC backup or rebuilds it.
### Step 1 — copy the tarball to a node
From your workstation (the registry VM is reachable but Forgejo is
not — the rest of the cluster might be in a similar partial state):
```bash
ssh wizard@10.0.20.103 # any responsive k8s node
sudo mkdir -p /var/breakglass
sudo scp root@10.0.20.10:/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz \
/var/breakglass/
```
If the registry VM is also down, fall back to Synology:
```bash
sudo scp 192.168.1.13:/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/infra-ci-latest.tar.gz \
/var/breakglass/
```
### Step 2 — load into containerd
`docker load` won't help on a k8s node — it loads into the docker
daemon, which kubelet/containerd doesn't see. Use `ctr`:
```bash
sudo ctr -n k8s.io images import /var/breakglass/infra-ci-latest.tar.gz
sudo ctr -n k8s.io images list | grep infra-ci
```
Confirm the image is tagged with the original repository name
(`registry.viktorbarzin.me:5050/infra-ci:<sha>` — the tarball was
saved with that tag, NOT the Forgejo name).
### Step 3 — pin pods to this node
Add a node selector or taint-toleration to whatever pipeline you
need to run. Simplest: cordon the other nodes briefly so Woodpecker
schedules onto this one.
```bash
for n in $(kubectl get nodes -o name | grep -v $(hostname)); do
kubectl cordon ${n#node/}
done
```
Run the pipeline. After it completes:
```bash
for n in $(kubectl get nodes -o name); do
kubectl uncordon ${n#node/}
done
```
### Step 4 — fix the underlying problem
The pipeline you just ran was meant to restore Forgejo. Common
options:
- **Forgejo PVC corrupt** — `docs/runbooks/forgejo-registry-rebuild-image.md`
walks through PVC restore from LVM snapshot or PVE backup.
- **Forgejo OOM-loop** — bump memory request+limit in
`infra/stacks/forgejo/main.tf` and apply.
- **Forgejo unreachable due to network** — check Traefik, MetalLB,
pfSense.
Once Forgejo is back, run `build-ci-image.yml` manually so the
tarball regenerates with the latest commit.
## Why this exists
The 2026-04-19 post-mortem on the registry-orphan-index incident
showed that a single registry going corrupt could block ALL infra
pipelines (because every pipeline pulls `infra-ci` from that
registry). The dual-push to Forgejo + registry-private removes that
single-point-of-failure during the bake. After Phase 4
decommissions registry-private, the tarball is the last line of
defense.
## Why on the registry VM and not in-cluster
The Forgejo pod and registry-private pod both depend on cluster
networking + storage. The registry VM is an independent
non-clustered VM with local storage. If the cluster is in a bad
state, the VM's disk is still readable from any other host on the
LAN.


@@ -1,128 +0,0 @@
# Runbook: Rebuild an Image on the Forgejo OCI Registry
Last updated: 2026-05-07
## When to use this
Pipelines pulling from `forgejo.viktorbarzin.me/viktor/<image>` fail with:
- `failed to resolve reference … : not found`
- `manifest unknown`
- HEAD on a manifest/blob digest returns 404
- `forgejo-integrity-probe` CronJob in `monitoring` reports
`registry_manifest_integrity_failures > 0` for
`instance="forgejo.viktorbarzin.me"`
This is the Forgejo equivalent of the registry-private orphan-index
failure mode (`docs/post-mortems/2026-04-19-registry-orphan-index.md`).
Cause is usually package-version delete races with an in-flight pull,
or PVC corruption. Fix is to rebuild the image from source and
re-push, so Forgejo receives a complete, fresh upload.
If the symptom is different (Forgejo unreachable, PVC OOM,
authentication failure), use:
- `docs/runbooks/forgejo-registry-setup.md` for auth + token issues
- `docs/runbooks/forgejo-registry-breakglass.md` if Forgejo + the
cluster are both unreachable
- `docs/runbooks/restore-pvc-from-backup.md` for PVC corruption
## Phase 1 — Confirm the diagnosis
From any host:
```sh
REG=forgejo.viktorbarzin.me
USER=cluster-puller
PASS="$(vault kv get -field=forgejo_pull_token secret/viktor)"
IMAGE=viktor/payslip-ingest
TAG=latest
# 1. Confirm the manifest exists at all.
curl -sk -u "$USER:$PASS" \
-H 'Accept: application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json' \
"https://$REG/v2/$IMAGE/manifests/$TAG" | jq '.mediaType, .manifests[].digest // .config.digest'
# 2. HEAD each child / config / layer digest. Any non-200 = confirmed.
for d in $(curl -sk -u "$USER:$PASS" -H 'Accept: application/vnd.oci.image.index.v1+json' \
"https://$REG/v2/$IMAGE/manifests/$TAG" | jq -r '.manifests[].digest // empty'); do
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/$IMAGE/manifests/$d")
echo "$d → $code"
done
```
The probe's last log run is also a fast way to see what's affected:
```sh
kubectl -n monitoring logs \
$(kubectl -n monitoring get pods -l job-name -o name \
| grep forgejo-integrity-probe | head -1)
```
## Phase 2 — Rebuild and re-push
Forgejo lets you delete a specific package version through the API.
Doing this **before** the rebuild ensures the new push doesn't
collide with the half-broken existing entry.
```sh
# Delete the broken version (replace TAG with the actual tag).
curl -X DELETE -H "Authorization: token $(vault kv get -field=forgejo_cleanup_token secret/viktor)" \
"https://$REG/api/v1/packages/viktor/container/$(basename $IMAGE)/$TAG"
```
Rebuild via Woodpecker (manual run if the pipeline isn't triggered
by a code change):
1. Open `https://ci.viktorbarzin.me/repos/<repo>/manual` for the
project.
2. Click **Run pipeline** with `branch=master`.
3. Wait for the build-and-push step to complete.
4. Confirm the new version is visible in Forgejo Web UI under
`viktor/<image>` → Packages → Container.
## Phase 3 — Restart consumers
Pods that already cached the broken digest may continue using it.
Force a fresh pull:
```sh
kubectl rollout restart deploy/<service> -n <ns>
```
If the pod still fails, the new manifest digest may not have
propagated through containerd's cache. Drain + restart containerd on
the affected node:
```sh
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
ssh wizard@<node> sudo systemctl restart containerd
kubectl uncordon <node>
```
## Phase 4 — Verify integrity recovery
The next probe run (every 15 min) will report:
```
registry_manifest_integrity_failures{instance="forgejo.viktorbarzin.me"} 0
```
The `RegistryManifestIntegrityFailure` alert resolves automatically
30 minutes after the metric goes back to 0.
## Why this happens
Forgejo's OCI registry stores blobs in its own DB+filesystem. Unlike
`registry:2` + `distribution`, it doesn't have the
[`distribution#3324`](https://github.com/distribution/distribution/issues/3324)
GC-vs-tag-delete race. But it can still reach a broken state if:
- The retention CronJob deletes a version while a pull is in flight
on the same digest.
- The PVC fills up mid-push (`docs/runbooks/restore-pvc-from-backup.md`).
- A Forgejo upgrade migrates the package schema and a row is dropped.
In all cases the recovery procedure is identical: delete the broken
version through the API, rebuild from source, force consumers to
re-pull.


@@ -1,163 +0,0 @@
# Runbook: Forgejo OCI registry — initial setup
Last updated: 2026-05-07
This runbook covers the **one-time** bootstrap of Forgejo's container
registry, executed during Phase 0 of the registry consolidation plan
(`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md`).
After this runbook is complete, the Forgejo OCI registry at
`forgejo.viktorbarzin.me` accepts pushes from CI and pulls from the
cluster, with retention and integrity monitoring in place.
## Order of operations
The Terraform stacks reference Vault keys that don't exist on a fresh
cluster. Create the keys **before** running `scripts/tg apply`.
1. Apply the resource bumps (memory, PVC, ingress body size,
packages env vars) — these don't depend on the new Vault keys.
2. Create the service-account users + PATs in Forgejo.
3. Push the PATs to Vault.
4. Apply the rest of Phase 0 (registry-credentials extension,
monitoring probe, retention CronJob).
### Step 1 — apply Forgejo deployment bumps
```bash
cd infra/stacks/forgejo
scripts/tg apply
```
Wait for the new pod to come up at the bumped 1Gi memory request and
the resized 15Gi PVC. Verify packages are enabled:
```bash
kubectl exec -n forgejo deploy/forgejo -- forgejo manager flush-queues
kubectl exec -n forgejo deploy/forgejo -- env | grep PACKAGES
```
### Step 2 — create service-account users
`forgejo admin user create` is not idempotent: re-running it on an
existing user errors out. That's fine; skip it on rerun. Always pass
`--must-change-password=false` for these service accounts.
```bash
# cluster-puller — read:package PAT for in-cluster pulls.
kubectl exec -n forgejo deploy/forgejo -- \
forgejo admin user create \
--username cluster-puller \
--email cluster-puller@viktorbarzin.me \
--password "$(openssl rand -base64 24)" \
--must-change-password=false
# ci-pusher — write:package PAT for CI dual-push, also reused as the
# cleanup CronJob credential (write:package includes delete).
kubectl exec -n forgejo deploy/forgejo -- \
forgejo admin user create \
--username ci-pusher \
--email ci-pusher@viktorbarzin.me \
--password "$(openssl rand -base64 24)" \
--must-change-password=false
```
The user passwords are throwaway — we only ever auth via PAT. Forgejo
admin can reset them at any time from the Web UI.
### Step 3 — generate the PATs
PATs **must** be generated through the Web UI logged in as the
respective user (the CLI doesn't expose token creation). To log in
without OAuth (registration is disabled for everyone except `viktor`,
the admin), use the per-user temporary password from step 2.
For each of `cluster-puller` and `ci-pusher`:
1. Sign out of `viktor`.
2. Go to `https://forgejo.viktorbarzin.me/user/login` and sign in
with the throwaway password.
3. Settings → Applications → Generate new token.
4. Name: `cluster-pull` / `ci-push`. **Expiration: never.**
5. Scopes:
- `cluster-puller`: `read:package`
- `ci-pusher`: `write:package` (covers read+write+delete)
6. Save the token shown on the next page — it is **not** displayed again.
For the cleanup CronJob, generate a third PAT on `ci-pusher`:
7. Repeat steps 4-6 with name `cleanup`, scope `write:package`.
### Step 4 — push PATs to Vault
```bash
vault login -method=oidc
# Read-only, used by the cluster-wide registry-credentials Secret and
# by the Forgejo integrity probe.
vault kv patch secret/viktor \
forgejo_pull_token=<paste cluster-puller PAT>
# Write+delete, used by the retention CronJob inside Forgejo's
# namespace.
vault kv patch secret/viktor \
forgejo_cleanup_token=<paste ci-pusher cleanup PAT>
# Write, propagated by vault-woodpecker-sync to all Woodpecker repos.
vault kv patch secret/ci/global \
forgejo_user=ci-pusher \
forgejo_push_token=<paste ci-pusher push PAT>
```
### Step 5 — apply the rest of Phase 0
```bash
# Registry credential Secret (now reads forgejo_pull_token).
cd infra/stacks/kyverno && scripts/tg apply
# Monitoring probe + retention CronJob.
cd infra/stacks/monitoring && scripts/tg apply
cd infra/stacks/forgejo && scripts/tg apply
# Containerd hosts.toml on each existing k8s node — VM cloud-init
# only fires on first boot.
infra/scripts/setup-forgejo-containerd-mirror.sh
```
## Verification
```bash
# Login from a workstation with docker.
echo "<ci-pusher PAT>" | docker login forgejo.viktorbarzin.me -u ci-pusher --password-stdin
# Push a smoketest image.
docker pull alpine:3.20
docker tag alpine:3.20 forgejo.viktorbarzin.me/viktor/smoketest:1
docker push forgejo.viktorbarzin.me/viktor/smoketest:1
# Pull from a k8s node.
ssh wizard@<node> sudo crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1
# Confirm the cluster-wide Secret was synced into a fresh namespace.
kubectl create namespace forgejo-smoketest
kubectl get secret -n forgejo-smoketest registry-credentials \
-o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths | keys'
# Expect: ["10.0.20.10:5050", "forgejo.viktorbarzin.me",
# "registry.viktorbarzin.me", "registry.viktorbarzin.me:5050"]
kubectl delete namespace forgejo-smoketest
# Delete the smoketest package via API.
curl -X DELETE -H "Authorization: token <ci-pusher cleanup PAT>" \
https://forgejo.viktorbarzin.me/api/v1/packages/viktor/container/smoketest/1
```
## When to revisit
- **PAT rotation**: PATs created here have no expiry by design. If a
PAT leaks, regenerate via the Web UI and `vault kv patch` the new
value into the same key — the next `terragrunt apply` will sync it
to all consumers within minutes (Kyverno ClusterPolicy clones the
Secret, vault-woodpecker-sync runs every 6h).
- **New service account**: if a future workload needs different
scopes, add a parallel user/PAT here rather than expanding existing
PAT scope. Principle of least privilege.


@@ -1,222 +0,0 @@
# pfSense HAProxy for Mailserver — Runbook
Last updated: 2026-04-19 (Phase 6 complete)
## What & why
External mail traffic (SMTP/IMAP) requires **real client IP visibility** for
CrowdSec + Postfix rate-limiting. MetalLB cannot inject PROXY-protocol
headers (see [`mailserver-proxy-protocol.md`](./mailserver-proxy-protocol.md)),
so pfSense runs a small HAProxy that:
1. Listens on the pfSense VLAN20 IP (`10.0.20.1`) on all 4 mail ports,
2. Forwards each connection to a k8s node's NodePort with `send-proxy-v2`,
3. Injects PROXY v2 framing so Postfix/Dovecot see the original client IP,
4. TCP-checks every k8s worker via dedicated **non-PROXY healthcheck NodePorts**
(30145/30146/30147 → pod stock 25/465/587 listeners, no PROXY required).
This split path avoids the `smtpd_peer_hostaddr_to_sockaddr` fatal that
used to fire on every PROXY-aware health probe and throttled real client
connections.
Corresponding k8s-side setup (`stacks/mailserver/modules/mailserver/`):
- ConfigMap `mailserver-user-patches` — `user-patches.sh` appends 3 alt
`master.cf` services to Postfix:
- `:2525` postscreen (alt :25) with `postscreen_upstream_proxy_protocol=haproxy`
- `:4465` smtpd (alt :465 SMTPS) with `smtpd_upstream_proxy_protocol=haproxy`
- `:5587` smtpd (alt :587 submission) with `smtpd_upstream_proxy_protocol=haproxy`
- ConfigMap `mailserver.config` adds Dovecot `inet_listener imaps_proxy` on
port 10993 with `haproxy = yes` and `haproxy_trusted_networks = 10.0.20.0/24`.
- Service `mailserver-proxy` (NodePort, ETP:Cluster) — 4 PROXY data ports +
3 non-PROXY healthcheck ports:
- Data (PROXY v2):
- `port 25 → targetPort 2525 → nodePort 30125`
- `port 465 → targetPort 4465 → nodePort 30126`
- `port 587 → targetPort 5587 → nodePort 30127`
- `port 993 → targetPort 10993 → nodePort 30128`
- Healthcheck (no PROXY, stock SMTP/SMTPS/Submission listeners):
- `port 2500 → targetPort 25 → nodePort 30145` (smtp-check)
- `port 4650 → targetPort 465 → nodePort 30146` (smtps-check)
- `port 5870 → targetPort 587 → nodePort 30147` (sub-check)
- Service `mailserver` (ClusterIP) — unchanged stock ports 25/465/587/993
for intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
CronJob, book-search). These listeners are PROXY-free.
bd: `code-yiu`.
## Steady-state architecture
```
External mail (WAN) path — PROXY v2
┌─────────────────────────────────────────────────────────────────────┐
│ Client (real IP) │
│ │ SMTP/SMTPS/Sub/IMAPS │
│ ▼ │
│ pfSense WAN:{25,465,587,993} │
│ │ NAT rdr → 10.0.20.1:{same} │
│ ▼ │
│ pfSense HAProxy (mode tcp, 4 frontends, 4 backend pools) │
│ │ data: send-proxy-v2 → :{30125..30128} (PROXY-aware pod) │
│ │ health: TCP-check → :{30145..30147} (no-PROXY pod) │
│ │ inter 5000 │
│ ▼ │
│ k8s-node<1-4>:{30125..30128} ← any node (ETP:Cluster) │
│ │ kube-proxy SNAT (source IP lost on the wire) │
│ ▼ │
│ mailserver pod :{2525,4465,5587,10993} │
│ │ postscreen / smtpd / Dovecot parse PROXY v2 header │
│ │ → real client IP recovered despite kube-proxy SNAT │
│ ▼ │
│ CrowdSec + Postfix / Dovecot see the true source IP ✓ │
└─────────────────────────────────────────────────────────────────────┘
Intra-cluster path — no PROXY
┌─────────────────────────────────────────────────────────────────────┐
│ Roundcube pod / email-roundtrip-monitor CronJob │
│ │ SMTP/IMAP │
│ ▼ │
│ mailserver.mailserver.svc.cluster.local:{25,465,587,993} │
│ │ ClusterIP — bypasses LoadBalancer/NodePort layer entirely │
│ ▼ │
│ mailserver pod stock :{25,465,587,993} (PROXY-free) │
└─────────────────────────────────────────────────────────────────────┘
```
## Validation
```sh
# All HAProxy frontends listening
ssh admin@10.0.20.1 'sockstat -l | grep haproxy'
# Expect: *:25, *:465, *:587, *:993, *:2525 (test port)
# All backend pools healthy
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
| awk 'NR>1 {print $3, $4, $6}'
# srv_op_state 2 = UP, 0 = DOWN
# Container listens on all 8 ports
kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'
# pf rdr points at pfSense (10.0.20.1), not <mailserver> alias
ssh admin@10.0.20.1 'pfctl -sn' | grep -E 'port = (25|submission|imaps|smtps)'
# E2E probe — Brevo → external MX :25 → IMAP fetch
kubectl create job --from=cronjob/email-roundtrip-monitor probe-test -n mailserver
kubectl wait --for=condition=complete --timeout=90s job/probe-test -n mailserver
kubectl logs job/probe-test -n mailserver | grep SUCCESS
kubectl delete job probe-test -n mailserver
# Real client IP in maillog post-delivery
kubectl logs -c docker-mailserver deployment/mailserver -n mailserver \
| grep 'smtpd-proxy25.*CONNECT from' | tail -5
# Expect external source IPs (e.g., Brevo 77.32.148.x), NOT 10.0.20.x
```
## Bootstrap / restore from scratch
pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is scp'd nightly to
`/mnt/backup/pfsense/config-YYYYMMDD.xml` by `scripts/daily-backup.sh`, then
synced to Synology. To rebuild from source of truth (git):
```sh
scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'
```
The script is idempotent — re-runs reset the mailserver frontends + backends
to the declared state.
Expected output:
```
haproxy_check_and_run rc=OK
```
## Operations
### Change backend k8s node IPs / NodePorts
Edit `infra/scripts/pfsense-haproxy-bootstrap.php` — `$NODES` array + the
`build_pool()` port arguments. Re-run the bootstrap command above. Don't
hand-edit `/var/etc/haproxy/haproxy.cfg` — it is regenerated from XML on
every apply.
### Check health of backends
```sh
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"
```
`srv_op_state=2` means UP, `0` means DOWN.
### View live HAProxy stats (WebUI)
`https://pfsense.viktorbarzin.me` → Services → HAProxy → Stats.
### Reload after config.xml edit
```sh
ssh admin@10.0.20.1 'pfSsh.php playback svc restart haproxy'
```
### Rollback (flip NAT back to MetalLB, post-Phase-6 only partial)
There is no Phase-6 rollback one-liner. Phase 6 removed the MetalLB
LoadBalancer 10.0.20.202 entirely, so un-flipping NAT now would send
traffic to a dead alias. To regress:
1. Re-add `metallb.io/loadBalancerIPs = "10.0.20.202"` + `type = "LoadBalancer"`
+ `external_traffic_policy = "Local"` to `kubernetes_service.mailserver`,
apply.
2. Re-add the `mailserver` host alias in pfSense pointing at 10.0.20.202
(Firewall → Aliases → Hosts).
3. Run `infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php` on pfSense.
For rollback of just the NAT (Phase 4) without touching the Service, only
the third step is needed — but only meaningful BEFORE Phase 6.
### Restore from backup
pfSense config backup is a plain XML file:
```
/mnt/backup/pfsense/config-YYYYMMDD.xml # sda host copy (1.1TB RAID1)
/volume1/Backup/Viki/pve-backup/pfsense/... # Synology offsite
```
Full restore: pfSense WebUI → Diagnostics → Backup & Restore → Upload that
`config.xml`. The `<installedpackages><haproxy>` section is included.
## Phase history (bd code-yiu)
| Phase | Status | Description |
|---|---|---|
| 1a | ✅ commit `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2 | ✅ 2026-04-19 | pfSense HAProxy pkg installed (`pfSense-pkg-haproxy-devel-0.63_2`, HAProxy 2.9-dev6) |
| 3 | ✅ commit `ba697b02` | HAProxy config persisted in pfSense XML (bootstrap script + this runbook) |
| 4+5| ✅ commit `9806d515` | 4-port alt listeners + HAProxy frontends for 25/465/587/993 + NAT flip |
| 6 | ✅ this commit | Mailserver Service downgraded LoadBalancer → ClusterIP; `10.0.20.202` released back to MetalLB pool; orphan `mailserver` pfSense alias removed; monitors retargeted |
## Known warts
- ~~HAProxy TCP health-check with `send-proxy-v2` generates `getpeername:
Transport endpoint not connected` warnings on postscreen every check cycle.~~
**Resolved 2026-05-05**: dedicated non-PROXY healthcheck NodePorts
(30145/30146/30147 → stock pod 25/465/587) added; HAProxy now checks
those, eliminating both the `getpeername` postscreen warnings and the
`smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported` fatals
that were throttling smtpd respawns and causing ~50% client timeouts on
the public 587 path. `inter` dropped 120000 → 5000 (fast failover, no
log-spam concern). `option smtpchk` was tried but flapped against
postscreen (multi-line greet + DNSBL silence + anti-pre-greet detection
trip HAProxy's parser → L7RSP). Plain TCP check on the no-PROXY ports
is sufficient.
- Frontend binds on all pfSense interfaces (`bind :25` instead of
`10.0.20.1:25`). `<extaddr>` is set in XML but pfSense templates it
port-only. Low concern in practice because WAN firewall rules plus the
NAT rdr gate external access; internal VLAN clients SHOULD be able to
reach HAProxy on any pfSense-local IP.
- k8s-node5 doesn't exist — cluster has master + 4 workers. Backend pool
capped at 4 servers.
- Postscreen still logs `improper command pipelining` for legitimate
clients that send `EHLO\r\nQUIT\r\n` as a single TCP write. This is
unchanged pre/post-migration — postscreen's anti-bot heuristic.


@@ -1,181 +0,0 @@
# Mailserver PROXY protocol — research & decision
Last updated: 2026-04-18 (original research). **Outcome implemented 2026-04-19 — see [UPDATE](#update-2026-04-19) below.**
> ## UPDATE (2026-04-19)
>
> This doc describes the research that led to the Phase-6 rollout. **Option C
> (pfSense HAProxy + PROXY v2)** was chosen and is now live. Operational
> state, cutover history, bootstrap, and rollback procedures live in
> [`mailserver-pfsense-haproxy.md`](mailserver-pfsense-haproxy.md).
>
> This file is retained as a decision record — it explains *why* Option A
> (pod-pinning via nodeSelector) was rejected mid-session in favour of
> Option C, and documents the MetalLB upstream limitation (PROXY injection
> is explicitly won't-implement). Future debates of "why don't we just pin
> the pod?" should land here first.
## TL;DR
**MetalLB does not and will not inject PROXY protocol headers.** The original plan
(`/home/wizard/.claude/plans/let-s-work-on-linking-temporal-valiant.md`, task
`code-rtb`) assumed MetalLB could be configured to emit PROXY v1/v2 on behalf of
the `mailserver` LoadBalancer Service. That assumption is wrong at the product
level. MetalLB is a control-plane-only announcer (ARP/NDP for L2 mode, BGP for
L3 mode); it never touches the L4 payload.
As a result, there is no single Terraform change that can flip
`externalTrafficPolicy: Local``Cluster` on the `mailserver` Service while
preserving the real client IP for Postfix/postscreen and Dovecot. Three
alternative paths exist (see below); none is trivial.
## Environment (verified 2026-04-18)
- **MetalLB version**: `quay.io/metallb/controller:v0.15.3` /
`quay.io/metallb/speaker:v0.15.3` (5 speakers).
- **Advertisement type**: L2Advertisement `default` bound to IPAddressPool
`default` (10.0.20.200-10.0.20.220). No BGPAdvertisements.
- **Service**: `mailserver/mailserver` — type `LoadBalancer`, `loadBalancerIPs:
10.0.20.202`, `externalTrafficPolicy: Local`,
`healthCheckNodePort: 30234`, 5 ports (25, 465, 587, 993, 9166/dovecot-metrics).
- **Pod**: single replica today, RWO PVCs prevent horizontal scale without
further work (`mailserver-data-encrypted`, `mailserver-letsencrypt-encrypted`).
## Why the original plan fails
### MetalLB never touches packets
> *"MetalLB is controlplane only, making it part of the dataplane means we
> would be responsible for the performance of the system, so more bugs to
> fight, I personally don't see that happening."*
> — MetalLB maintainer `champtar`, 2021-01-06
> (issue [#797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797))
Issue #797 is closed as "won't implement". Repeat asks in 2022-2023 got the
same answer. The v0.15.3 API surface confirms this: no
`proxyProtocol`/`haproxy`/`protocol: proxy` field exists on `IPAddressPool`,
`L2Advertisement`, `BGPAdvertisement`, or as a Service annotation.
Only managed-cloud LBs (AWS NLB, Azure LB, OCI, DO, OVH, Scaleway, etc.) offer
PROXY protocol as a tick-box. MetalLB's equivalents are:
| MetalLB feature | Does it preserve client IP? | Comment |
|---|---|---|
| `externalTrafficPolicy: Local` (current) | Yes, via iptables DNAT on the speaker node | Forces pod↔speaker colocation on L2 mode. This is the pain we wanted to avoid. |
| `externalTrafficPolicy: Cluster` | No — kube-proxy SNATs to the node IP | The problem we would re-introduce if we flipped without PROXY injection. |
| PROXY protocol injection | N/A — not implemented | Dead end. |
### The `Local` trap is real, but narrower than it seems
Today's `Local` policy means the ARP announcer node must also host the mailserver
pod. MetalLB always picks a single speaker to advertise the VIP (leader
election per IP), so in practice exactly one node matters at any moment. A pod
rescheduled to a different node silently drops inbound SMTP/IMAP until a GARP
flip or node cordon.
The only pods on our cluster that see this same class of risk are Traefik
(3 replicas + PDB `minAvailable=2`, so 2 of 3 nodes always have a pod) and
mailserver (1 replica). Traefik survives because the pods outnumber the nodes
that could be the speaker at once; the mailserver cannot.
## Alternative paths (ranked by effort)
### Option A — Pin the mailserver pod to a specific node (SIMPLEST)
Add `nodeSelector` on the mailserver Deployment pointing at a label that's also
stamped on the MetalLB speaker we want to advertise the VIP from, and use
MetalLB's [node selector](https://metallb.io/configuration/_advanced_l2_configuration/#specify-network-interfaces-that-lb-ip-can-be-announced-from)
on `L2Advertisement.spec.nodeSelectors` to pin the announcer to the same node.
Trade-offs:
- Zero changes to Postfix/Dovecot configs.
- Keeps `externalTrafficPolicy: Local` — real client IP keeps arriving.
- Loses HA (the whole point of the MetalLB layer) but reflects reality — one
replica, one PVC, no HA today anyway.
- Drain of that node requires a planned cutover, but that's no worse than
today's silent failure mode.
Implementation (~10 lines of Terraform):
```hcl
# In stacks/mailserver/modules/mailserver/main.tf, on the Deployment:
node_selector = { "viktorbarzin.me/mailserver-anchor" = "true" }
# In stacks/platform (or wherever the MetalLB CRs live):
resource "kubernetes_manifest" "mailserver_l2ad" {
manifest = {
apiVersion = "metallb.io/v1beta1"
kind = "L2Advertisement"
metadata = { name = "mailserver", namespace = "metallb-system" }
spec = {
ipAddressPools = ["default"]
nodeSelectors = [{ matchLabels = { "viktorbarzin.me/mailserver-anchor" = "true" } }]
}
}
}
```
Plus a node label via `kubectl label node k8s-node3 viktorbarzin.me/mailserver-anchor=true`.
**Recommendation: this is the shortest path to eliminating the silent-drop
failure mode** without taking on a new proxy tier.
### Option B — Put a HAProxy sidecar in front of Postfix/Dovecot
Stand up an in-cluster HAProxy with PROXY v2 enabled on the frontend and
`send-proxy-v2` on the backend to `mailserver:25/465/587/993`. Expose HAProxy
via a new MetalLB Service with `externalTrafficPolicy: Cluster` + kube-proxy
DSR workaround (still loses client IP at that layer), or run HAProxy on the
host-network of the same node (back to Option A's colocation).
Trade-offs:
- Introduces one more network hop and TLS-termination decision for every
SMTP connect.
- HAProxy needs its own cert rotation (or `tls-passthrough`) — adds moving
parts to an already crowded mailserver module.
- Doesn't actually solve the colocation problem on its own — HAProxy itself
needs to receive the client IP, so we are back to externalTrafficPolicy
constraints for HAProxy.
**Recommendation: avoid unless we also get HA for mailserver itself, which
needs RWX storage + DB split-brain work — out of scope.**
### Option C — Replace MetalLB with a different LB for this Service
Candidates: [kube-vip](https://kube-vip.io/) (supports eBPF-based DSR but not
PROXY injection either), [Cilium LB](https://docs.cilium.io/en/stable/network/lb-ipam/)
(preserves client IP via DSR in hybrid mode), or a dedicated HAProxy running on
pfSense and NAT-forwarding 25/465/587/993 with PROXY headers to a
ClusterIP-exposed mailserver. Cilium requires a CNI migration (we run Calico
today); pfSense HAProxy is genuinely feasible but belongs in a different bd
task.
**Recommendation: track as P3 follow-up under a new bd task if Option A proves
insufficient.**
## Decision
Do nothing in this session beyond this runbook + the bd note. The `code-rtb`
task as written is not executable — MetalLB cannot inject PROXY headers, and
the Postfix/Dovecot config changes the plan proposed would not receive the
header they expect; connections would hang waiting for it and then time out
(5s per connection).
Follow-up work filed as bd child tasks (if user wants to pursue):
- **Option A — pin mailserver + L2Advertisement nodeSelectors** (new bd task)
- **Option C — HAProxy on pfSense with PROXY v2 to a ClusterIP** (new bd task)
## References
- [MetalLB issue #797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797) (closed, won't implement)
- [MetalLB PR #796 — Source IP Preservation discussion](https://github.com/metallb/metallb/issues/796)
- Postfix [postscreen_upstream_proxy_protocol](https://www.postfix.org/postconf.5.html#postscreen_upstream_proxy_protocol) — expects the PROXY header *on every incoming connection*; if absent, postscreen drops after `postscreen_upstream_proxy_timeout`.
- Dovecot [haproxy_trusted_networks](https://doc.dovecot.org/settings/core/#core_setting-haproxy_trusted_networks) — treats the header as mandatory for listed source networks.
- Cluster state verified against: `kubectl -n metallb-system get pods`,
`kubectl get ipaddresspools.metallb.io -A`,
`kubectl get l2advertisements.metallb.io -A`,
`kubectl get bgpadvertisements.metallb.io -A`,
`kubectl -n mailserver get svc mailserver -o yaml`.


@@ -1,66 +0,0 @@
# NFS Prerequisites for `modules/kubernetes/nfs_volume`
The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
If the path does not exist, the first pod that tries to mount the resulting
PVC gets stuck in `ContainerCreating` with the kubelet event:
```
MountVolume.SetUp failed for volume "<name>" : mount failed: exit status 32
mount.nfs: mounting 192.168.1.127:/srv/nfs/<path> failed, reason given by
server: No such file or directory
```
## Bootstrap before first apply
Before adding a new `nfs_volume` consumer (backup CronJob, data PV, etc.),
create the export root on the PVE host:
```sh
# Replace <app> with the backup stack name, e.g. mailserver-backup,
# roundcube-backup, immich-backup, etc.
ssh root@192.168.1.127 'mkdir -p /srv/nfs/<app> && chmod 755 /srv/nfs/<app>'
# Confirm exports are live (no change to /etc/exports needed — `/srv/nfs`
# is already exported via the root entry in pve-nfs-exports).
ssh root@192.168.1.127 exportfs -v | grep '/srv/nfs\b'
```
`/srv/nfs` is exported with the root entry. Subdirectories inherit the
export automatically; they just have to exist on disk.
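Optionally, a quick smoke test from any worker node confirms the new path mounts before the first apply (`<node>` and `<app>` are placeholders; `wizard` is the usual node login in this infra):

```sh
ssh wizard@<node> 'sudo mkdir -p /mnt/nfs-smoke \
  && sudo mount -t nfs 192.168.1.127:/srv/nfs/<app> /mnt/nfs-smoke \
  && findmnt /mnt/nfs-smoke \
  && sudo umount /mnt/nfs-smoke'
```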
## Known consumers
| Consumer | NFS path | Owning stack |
|--------------------------------|---------------------------------|--------------------------|
| `mailserver-backup` | `/srv/nfs/mailserver-backup` | `stacks/mailserver/` |
| `roundcube-backup` | `/srv/nfs/roundcube-backup` | `stacks/mailserver/` |
| `mysql-backup` | `/srv/nfs/mysql-backup` | `stacks/dbaas/` |
| `postgresql-backup` | `/srv/nfs/postgresql-backup` | `stacks/dbaas/` |
| `vaultwarden-backup` | `/srv/nfs/vaultwarden-backup` | `stacks/vaultwarden/` |
Use `grep -rn 'nfs_volume' infra/stacks/` to find all active consumers.
## Why not auto-create?
Two options were considered for automating this:
1. `null_resource` + `local-exec` SSH `mkdir` in the `nfs_volume` module —
works but adds an SSH dependency to every Terraform run, makes the
module non-hermetic, and fails if the operator does not have SSH to
the PVE host.
2. `nfs-subdir-external-provisioner` — handles subdirs automatically but
changes the PV/PVC shape and would require migrating all existing
consumers.
Neither is worth the churn for a one-time operation per new backup stack.
Document + checklist is the current call; re-evaluate if we start adding
one NFS consumer per week.
## Related tasks
- `code-yo4` — this runbook
- `code-z26` — mailserver backup CronJob (first-time setup hit this)
- `code-1f6` — Roundcube backup CronJob (also hit this)


@@ -1,281 +0,0 @@
# pfSense Unbound DNS Resolver
Last updated: 2026-04-19
## Overview
pfSense runs **Unbound** (DNS Resolver) as its sole DNS service, replacing
dnsmasq (DNS Forwarder) as of 2026-04-19 (DNS hardening Workstream D,
bd `code-k0d`).
Unbound AXFR-slaves the `viktorbarzin.lan` zone from the Technitium primary
via the `10.0.20.201` LoadBalancer, so LAN-side `.lan` resolution survives
a full Kubernetes outage. Public queries go to Cloudflare via DNS-over-TLS
(`1.1.1.1` + `1.0.0.1` on port 853, SNI `cloudflare-dns.com`).
## Listeners
Unbound binds on:
| Interface | IP | Purpose |
|-----------|-----|---------|
| WAN | `192.168.1.2:53` | LAN (192.168.1.0/24) clients querying via pfSense WAN |
| LAN | `10.0.10.1:53` | Management VLAN clients |
| OPT1 | `10.0.20.1:53` | K8s VLAN clients (CoreDNS upstream) |
| lo0 | `127.0.0.1:53` | pfSense itself |
The prior WAN NAT `rdr` rule (`192.168.1.2:53 → 10.0.20.201`) was removed in
the same change — Unbound now answers directly on WAN.
## Config Summary
Relevant `<unbound>` keys in `/cf/conf/config.xml`:
| Key | Value | Meaning |
|-----|-------|---------|
| `enable` | flag | Enable Unbound |
| `dnssec` | flag | DNSSEC validation on |
| `forwarding` | flag | Forwarding mode (send recursive queries to upstream) |
| `forward_tls_upstream` | flag | Use DoT for upstream forwarders |
| `prefetch` | flag | Prefetch records near expiry |
| `prefetchkey` | flag | Prefetch DNSKEY records |
| `dnsrecordcache` | flag | `serve-expired: yes` |
| `active_interface` | `lan,opt1,wan,lo0` | Listen interfaces |
| `msgcachesize` | `256` (MB) | Message cache (rrset-cache auto-doubles to 512MB) |
| `cache_max_ttl` | `604800` | 7 days |
| `cache_min_ttl` | `60` | 60 seconds |
| `custom_options` | base64 | Contains `serve-expired-ttl: 259200` + `auth-zone:` block |
Upstream DoT forwarders live in `<system>`:
- `dnsserver[0] = 1.1.1.1`
- `dnsserver[1] = 1.0.0.1`
- `dns1host = cloudflare-dns.com`
- `dns2host = cloudflare-dns.com`
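To read the live `custom_options` value off the box, a quick extraction sketch. This assumes the value sits on a single line inside config.xml (which is how pfSense normally writes it); the base64 decode runs on the workstation side of the pipe.

```bash
ssh admin@10.0.20.1 \
  'sed -n "s|.*<custom_options>\(.*\)</custom_options>.*|\1|p" /cf/conf/config.xml' \
  | base64 -d
```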
## Auth-Zone for viktorbarzin.lan
The custom_options block declares:
```
server:
serve-expired-ttl: 259200
auth-zone:
name: "viktorbarzin.lan"
master: 10.0.20.201
fallback-enabled: yes
for-downstream: yes
for-upstream: yes
zonefile: "viktorbarzin.lan.zone"
allow-notify: 10.0.20.201
```
- `master: 10.0.20.201` — AXFR source (Technitium LoadBalancer)
- `fallback-enabled: yes` — if the zone can't refresh from master, fall back to normal recursion for this name (prevents hard-fail if AXFR breaks)
- `for-downstream: yes` — answer queries for this zone with AA flag
- `for-upstream: yes` — Unbound's internal iterator also uses this zone
- `zonefile` is relative to the chroot (`/var/unbound/viktorbarzin.lan.zone`)
- `allow-notify: 10.0.20.201` — accept NOTIFY from Technitium
## Technitium-side ACL
Zone `viktorbarzin.lan` on Technitium has `zoneTransfer = UseSpecifiedNetworkACL`
with ACL entries:
- `10.0.20.1` (pfSense OPT1)
- `10.0.10.1` (pfSense LAN)
- `192.168.1.2` (pfSense WAN)
Verify via the Technitium API:
```
curl -sk "http://127.0.0.1:5380/api/zones/options/get?token=$TOK&zone=viktorbarzin.lan" | jq .response.zoneTransfer
```
## Operational Checks
```bash
# Is Unbound listening?
ssh admin@10.0.20.1 "sockstat -l -4 -p 53"
# Auth-zone loaded?
ssh admin@10.0.20.1 "unbound-control -c /var/unbound/unbound.conf list_auth_zones"
# Expected: viktorbarzin.lan. serial NNNNN
# LAN record via auth-zone? (aa flag = authoritative / from auth-zone)
dig @192.168.1.2 idrac.viktorbarzin.lan +norec
# Public record via DoT? (ad flag = DNSSEC validated, via 1.1.1.1/1.0.0.1)
dig @192.168.1.2 example.com +dnssec
# Zonefile has all records?
ssh admin@10.0.20.1 "wc -l /var/unbound/viktorbarzin.lan.zone"
```
## K8s Outage Drill
Tests that `.lan` resolution survives a full Technitium outage:
```bash
# Scale Technitium primary to 0
kubectl -n technitium scale deploy/technitium --replicas=0
# Wait ~5 seconds, then test from a LAN client
ssh devvm.viktorbarzin.lan "dig @192.168.1.2 idrac.viktorbarzin.lan +short"
# Expected: 192.168.1.4 (served from Unbound's cached auth-zone)
# Restore immediately
kubectl -n technitium scale deploy/technitium --replicas=1
```
Completed successfully on 2026-04-19 initial deployment.
Note: secondary/tertiary Technitium pods remain up and continue to serve
queries via the `10.0.20.201` LoadBalancer even when the primary is down —
so the strongest proof that Unbound's auth-zone serves locally is to also
scale those down (optional, not part of the routine drill).
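A sketch of that stronger drill, assuming the three deployment names (`technitium`, `technitium-secondary`, `technitium-tertiary`); restore all of them immediately afterwards:
```bash
# Scale every Technitium instance to 0 so only Unbound's cached auth-zone can answer
for d in technitium technitium-secondary technitium-tertiary; do
  kubectl -n technitium scale deploy/$d --replicas=0
done
ssh devvm.viktorbarzin.lan "dig @192.168.1.2 idrac.viktorbarzin.lan +short"   # expect 192.168.1.4
for d in technitium technitium-secondary technitium-tertiary; do
  kubectl -n technitium scale deploy/$d --replicas=1
done
```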
## Backup & Rollback
### Backups
- **On-box**: `/cf/conf/config.xml.2026-04-19-pre-unbound` (created before this
workstream ran — keep for 30 days, then delete)
- **Daily**: PVE `daily-backup` script copies `/cf/conf/config.xml` and a full
pfSense config tar to `/mnt/backup/pfsense/` on the Proxmox host at 05:00
- **Offsite**: Synology `pve-backup/pfsense/` (synced daily by
`offsite-sync-backup`)
### Rollback to dnsmasq
If Unbound misbehaves, revert to dnsmasq + NAT rdr:
```bash
# On pfSense
cp /cf/conf/config.xml.2026-04-19-pre-unbound /cf/conf/config.xml
# Tell pfSense to re-read config and reload services
php -r 'require_once("config.inc"); require_once("config.lib.inc"); disable_path_cache();'
/etc/rc.restart_webgui # reloads PHP config caches
# Restart services
php -r 'require_once("config.inc"); require_once("services.inc"); services_dnsmasq_configure(); services_unbound_configure(); filter_configure();'
/etc/rc.filter_configure # re-applies NAT rules (brings back rdr)
```
Verify:
```bash
sockstat -l -4 -p 53 | grep dnsmasq # expect dnsmasq on 10.0.10.1 and 10.0.20.1
pfctl -sn | grep '53' # expect rdr on wan UDP 53 → 10.0.20.201
```
### Rollback without wiping new changes
If you only want to stop Unbound without restoring the whole config, edit
config.xml and remove `<enable/>` from `<unbound>` + add it back to `<dnsmasq>`,
then re-run `services_unbound_configure()` + `services_dnsmasq_configure()`.
You also need to re-add the WAN NAT rdr in `<nat><rule>` (see the backup XML
for the exact shape — tracker `1775670025`).
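Once config.xml is edited, the reload is the same PHP call pattern as the full rollback above; a sketch:
```bash
# On pfSense, after moving <enable/> from <unbound> to <dnsmasq> and re-adding the rdr rule
php -r 'require_once("config.inc"); require_once("services.inc"); services_unbound_configure(); services_dnsmasq_configure(); filter_configure();'
/etc/rc.filter_configure   # re-applies the NAT rules
```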
## Known Gotchas
1. **pfSense regenerates `/var/unbound/unbound.conf`** on every service reload
from `<unbound>` in `config.xml`. Edits to unbound.conf are NOT durable.
2. **`unbound-control` default config path is wrong**. Always use
`unbound-control -c /var/unbound/unbound.conf <cmd>`.
3. **`custom_options` is base64-encoded** in config.xml. Use `base64 -d` to
decode in a shell, or `base64_decode()` in PHP (see the one-liner after this list).
4. **`interface-automatic: yes` is NOT used** when `active_interface` is
explicitly set to a list — pfSense emits explicit `interface: <ip>` lines.
5. **`auth-zone`'s `zonefile` path is relative to the Unbound chroot**
(`/var/unbound`), NOT absolute. Using an absolute path silently fails.
6. **DoT forwarders need `forward_tls_upstream`** flag AND `dns1host` /
`dns2host` set in `<system>` for SNI — without the hostname, pfSense emits
`forward-addr: 1.1.1.1@853` (no `#`) which Cloudflare rejects with
certificate hostname mismatch.
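For gotcha 3, a decode one-liner; a sketch that assumes the `<custom_options>` element and its value share a single line in config.xml and that pfSense's `base64` supports `-d` (fall back to `base64_decode()` in PHP otherwise):
```bash
ssh admin@10.0.20.1 \
  "sed -n 's|.*<custom_options>\(.*\)</custom_options>.*|\1|p' /cf/conf/config.xml | base64 -d"
```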
## Kea DHCP-DDNS TSIG (WS E, 2026-04-19)
Kea DHCP-DDNS on pfSense signs its RFC 2136 dynamic updates with an
HMAC-SHA256 TSIG key (`kea-ddns`). Technitium's `viktorbarzin.lan` zone
and reverse zones (`10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`,
`1.168.192.in-addr.arpa`) require both a pfSense-source IP (10.0.20.1 /
10.0.10.1 / 192.168.1.2) AND a valid TSIG signature.
### Config locations
| Side | File | Notes |
|------|------|-------|
| pfSense | `/usr/local/etc/kea/kea-dhcp-ddns.conf` | Hand-managed. Pre-WS-E backup: `.2026-04-19-pre-tsig`. Daemon: `kea-dhcp-ddns` (`pkill -x kea-dhcp-ddns && /usr/local/sbin/kea-dhcp-ddns -c /usr/local/etc/kea/kea-dhcp-ddns.conf -d &`) |
| Technitium | Zone options API: `POST /api/zones/options/set?zone=<z>&updateSecurityPolicies=kea-ddns\|*.<z>\|ANY&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&update=UseSpecifiedNetworkACL` | Set on primary; replicates to secondary/tertiary via AXFR |
| Technitium settings | TSIG keys array: `POST /api/settings/set` with `tsigKeys: [{keyName: "kea-ddns", sharedSecret: <b64>, algorithmName: "hmac-sha256"}]` | Must be set on all 3 Technitium instances (primary, secondary, tertiary) |
| Vault | `secret/viktor/kea_ddns_tsig_secret` | Authoritative copy of the base64 secret |
### Rotating the TSIG key
1. Generate a new base64 32-byte secret: `openssl rand -base64 32` (any base64-encoded secret of reasonable length works; HMAC-SHA256 hashes keys longer than its block size and zero-pads shorter ones internally).
2. Write it to Vault: `vault kv patch secret/viktor kea_ddns_tsig_secret=<new-secret>`.
3. Add the new key under a **new name** (e.g., `kea-ddns-v2`) via the Technitium settings API on all 3 instances. Do NOT overwrite `kea-ddns` while Kea still uses it — you'd orphan in-flight updates.
4. Update `/usr/local/etc/kea/kea-dhcp-ddns.conf` on pfSense to reference both keys in `tsig-keys`, set `key-name: kea-ddns-v2` on each `forward-ddns` / `reverse-ddns` domain, restart `kea-dhcp-ddns`.
5. Update each affected zone's `updateSecurityPolicies` to use the new key name.
6. After a lease-renewal cycle (default Kea lease = 7200s / 2h), verify with `kubectl -n technitium exec <primary-pod> -- grep "TSIG KeyName: kea-ddns-v2" /etc/dns/logs/<today>.log`.
7. Remove the old `kea-ddns` key from Technitium settings + Kea config.
### Emergency TSIG bypass (if rotation breaks DDNS)
If DDNS updates are failing and you cannot quickly fix the key, temporarily
downgrade the zone policy to IP-ACL only (pfSense source IPs) without
TSIG:
```bash
kubectl -n technitium port-forward pod/<primary-pod> 5380:5380 &
TOKEN=$(curl -s -X POST http://127.0.0.1:5380/api/user/login \
-d "user=admin&pass=$(vault kv get -field=technitium_password secret/platform)&includeInfo=false" | jq -r .token)
for Z in viktorbarzin.lan 10.0.10.in-addr.arpa 20.0.10.in-addr.arpa 1.168.192.in-addr.arpa; do
curl -s -X POST "http://127.0.0.1:5380/api/zones/options/set?token=$TOKEN&zone=$Z&update=UseSpecifiedNetworkACL&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&updateSecurityPolicies="
done
```
This clears `updateSecurityPolicies` while keeping the IP ACL. Updates
now flow unsigned from pfSense IPs — **weaker** than TSIG but restores
service. Re-enable TSIG as soon as the key issue is resolved.
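To switch back once the key is fixed, a sketch that mirrors the bypass loop and restores the policy shape shown in the config-locations table above (reuses the port-forward and `$TOKEN` from the bypass step):
```bash
for Z in viktorbarzin.lan 10.0.10.in-addr.arpa 20.0.10.in-addr.arpa 1.168.192.in-addr.arpa; do
  curl -s -X POST "http://127.0.0.1:5380/api/zones/options/set?token=$TOKEN&zone=$Z&update=UseSpecifiedNetworkACL&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&updateSecurityPolicies=kea-ddns|*.$Z|ANY"
done
```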
### Verify TSIG is enforced
```bash
# Unsigned update should fail
nsupdate <<EOF
server 10.0.20.201 53
zone viktorbarzin.lan
update delete tsig-test.viktorbarzin.lan.
update add tsig-test.viktorbarzin.lan. 300 A 10.99.99.99
send
EOF
# Expected: "update failed: REFUSED"
# Signed update should succeed
cat > /tmp/kea-ddns.key <<EOF
key "kea-ddns" {
algorithm hmac-sha256;
secret "$(vault kv get -field=kea_ddns_tsig_secret secret/viktor)";
};
EOF
nsupdate -k /tmp/kea-ddns.key <<EOF
server 10.0.20.201 53
zone viktorbarzin.lan
update delete tsig-test.viktorbarzin.lan.
update add tsig-test.viktorbarzin.lan. 300 A 10.99.99.99
send
EOF
dig @10.0.20.201 +short tsig-test.viktorbarzin.lan
# Expected: 10.99.99.99
rm -f /tmp/kea-ddns.key
```
## Related Docs
- `docs/architecture/dns.md` — overall DNS architecture (K8s side, Technitium, CoreDNS)
- `docs/architecture/networking.md` — VLAN layout, pfSense interface mapping
- `.claude/skills/pfsense/skill.md` — SSH / CLI patterns for pfSense management

View file

@ -1,103 +0,0 @@
# Runbook: Proxmox host (pve, 192.168.1.127)
Last updated: 2026-04-19
The Proxmox host is a baremetal hypervisor on the storage LAN
(192.168.1.0/24) with a single IP `192.168.1.127`. It hosts every
Kubernetes node VM and the NFS exports that back PVCs. It does **not**
receive DHCP — its network config is static in
`/etc/network/interfaces` (ifupdown). Because of that, DNS must be
configured manually and stays out of the scope of Kea/DHCP-DDNS.
## DNS configuration
The host uses a plain `/etc/resolv.conf` with two nameservers. No
`systemd-resolved`, no `resolvconf`, no NetworkManager — nothing
manages `/etc/resolv.conf`; it is a regular file owned by root.
### Why plain `/etc/resolv.conf` and not systemd-resolved
1. Installing `systemd-resolved` on an active Proxmox node during
business hours is the kind of change that risks breaking the NFS
server or VM networking. PVE's Debian base does not ship
`systemd-resolved` by default.
2. The ifupdown `/etc/network/interfaces` file does not manage
`/etc/resolv.conf` here — ifupdown's resolvconf integration is
only active if the `resolvconf` package is installed, which it is
not (`dpkg -l resolvconf` returns `un`).
3. A plain file is the simplest mental model and avoids a second
layer of "which tool is running now" confusion during an incident.
If you ever want to migrate to `systemd-resolved`, install the
package, enable the service, symlink `/etc/resolv.conf` to
`/run/systemd/resolve/stub-resolv.conf`, and drop the config in
`/etc/systemd/resolved.conf.d/10-internal-dns.conf` — but do this
during a maintenance window, not reactively.
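A minimal sketch of that migration, assuming Debian 12's split-out `systemd-resolved` package and reusing the nameservers from the current state below:
```sh
# Maintenance window only
apt install systemd-resolved
mkdir -p /etc/systemd/resolved.conf.d
cat > /etc/systemd/resolved.conf.d/10-internal-dns.conf <<'EOF'
[Resolve]
DNS=192.168.1.2
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
EOF
systemctl enable --now systemd-resolved
ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
resolvectl status | head -10
```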
### Current state
```
# /etc/resolv.conf
search viktorbarzin.lan
nameserver 192.168.1.2
nameserver 94.140.14.14
options timeout:2 attempts:2
```
| Field | Value | Purpose |
|---|---|---|
| Primary | `192.168.1.2` | pfSense LAN interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS — recursive only, used if pfSense LAN IP unreachable |
| `search` | `viktorbarzin.lan` | Unqualified names (`technitium`, `idrac`, etc.) resolve against the internal zone |
| `timeout:2 attempts:2` | — | Cap glibc resolver at 2s per server, 2 tries — reasonable fallback latency |
### Verification commands
```sh
ssh root@192.168.1.127 '
cat /etc/resolv.conf # should show the two nameservers
dig +short idrac.viktorbarzin.lan # expect an A record (192.168.1.4)
dig +short github.com # expect an A record
'
```
Simulated failover — force the primary unreachable and verify the
fallback answers:
```sh
ssh root@192.168.1.127 '
ip route add blackhole 192.168.1.2
dig +short +time=3 github.com # glibc times out on primary, tries 94.140.14.14 → A record returned
ip route del blackhole 192.168.1.2 # cleanup
'
```
Expected behaviour: the first `dig` prints a warning about the UDP
setup failing for 192.168.1.2 and then prints the GitHub A record
(answered by 94.140.14.14).
## Rollback
A pre-change backup of `/etc/resolv.conf`, `/etc/network/interfaces`,
and `/etc/network/interfaces.d/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
host. To roll back:
```sh
ssh root@192.168.1.127 '
# pick the backup you want (there may be multiple if this runbook has been applied more than once)
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
tar -xzf "$BACKUP" -C /
cat /etc/resolv.conf
'
```
No service restart is needed — glibc re-reads `/etc/resolv.conf` per
lookup.
## Related docs
- `docs/architecture/dns.md` — where each resolver IP lives and which
subnet it serves.
- `docs/runbooks/nfs-prerequisites.md` — other operations on this
host; read before adding new NFS exports.

View file

@ -146,8 +146,8 @@ qm shutdown 220; sleep 10
for VMID in 102 300 103; do qm shutdown $VMID; done
sleep 20
# TrueNAS (decommissioned 2026-04-13 — VM 9000 should already be stopped; skip if absent)
qm shutdown 9000 2>/dev/null || true
# TrueNAS (wait for ZFS flush)
qm shutdown 9000; sleep 60
# pfSense (last — network gateway)
qm shutdown 101; sleep 15

View file

@ -1,170 +0,0 @@
# Runbook: Rebuild an Image After a Registry Orphan-Index Incident
Last updated: 2026-04-19
## When to use this
Pipelines that pull from `registry.viktorbarzin.me:5050` are failing with
messages like:
- `failed to resolve reference … : not found`
- `manifest unknown`
- `image can't be pulled` (Woodpecker exit 126)
- `error pulling image`: HEAD on a child manifest digest returns 404
…and `skopeo inspect --tls-verify --creds "$USER:$PASS" docker://registry.viktorbarzin.me:5050/<image>:<tag>`
returns an OCI image index whose `manifests[].digest` references are 404
on the registry.
This is the **orphan OCI-index** failure mode documented in
`docs/post-mortems/2026-04-19-registry-orphan-index.md`. The fix is to
rebuild the affected image from source so the registry receives a fresh,
complete push.
If the symptom is different (e.g., registry container down, TLS expiry,
auth failure), use `docs/runbooks/registry-vm.md` instead.
## Phase 1 — Confirm the diagnosis
From any host with `skopeo`:
```sh
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest
# 1. Confirm the index exists.
skopeo inspect --tls-verify --creds "$USER:$PASS" \
--raw "docker://$REG/$IMAGE:$TAG" | jq '.mediaType, .manifests[].digest'
# 2. HEAD each child. Any non-200 = confirmed orphan.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
"docker://$REG/$IMAGE:$TAG" | jq -r '.manifests[].digest'); do
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/$IMAGE/manifests/$d")
echo "$d → $code"
done
```
If every child is 200, the problem is elsewhere — stop here and check
the registry VM, TLS, or auth.
The `registry-integrity-probe` CronJob in the `monitoring` namespace
runs this same check every 15 minutes across every tag in the catalog;
its last run is also a fast way to see which image(s) are affected:
```sh
kubectl -n monitoring logs \
$(kubectl -n monitoring get pods -l job-name -o name \
| grep registry-integrity-probe | head -1)
```
## Phase 2 — Rebuild
### Option A (preferred): rebuild via CI
Find the `build-*.yml` pipeline that produces the image:
| Image | Pipeline | Repo ID |
|---|---|---|
| `infra-ci` | `.woodpecker/build-ci-image.yml` | 1 (infra) |
| `infra` (cli) | `.woodpecker/build-cli.yml` | 1 (infra) |
| `k8s-portal` | `.woodpecker/k8s-portal.yml` | 1 (infra) |
Trigger a manual build. The Woodpecker API expects a numeric repo ID
(paths with `owner/name` return HTML):
```sh
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_admin_token secret/viktor)
# Kick off a manual build against master.
curl -s -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
-H "Content-Type: application/json" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines" \
-d '{"branch":"master"}' | jq .number
# Follow the pipeline at https://ci.viktorbarzin.me/repos/1/pipeline/<number>
```
The pipeline's `verify-integrity` step walks every blob the push
references. If it passes, the registry now has a clean index; pull
consumers will recover on next attempt.
### Option B (fallback): build on the registry VM
Only use this if Woodpecker itself is broken (its own pipeline runs
from the same `infra-ci` image, so a corrupted `infra-ci:latest` can
prevent Option A from recovering).
```sh
ssh root@10.0.20.10 '
cd /tmp
git clone --depth 1 https://github.com/ViktorBarzin/infra
cd infra/ci
docker build -t registry.viktorbarzin.me:5050/infra-ci:manual -t registry.viktorbarzin.me:5050/infra-ci:latest .
docker login -u "$USER" -p "$PASS" registry.viktorbarzin.me:5050
docker push registry.viktorbarzin.me:5050/infra-ci:manual
docker push registry.viktorbarzin.me:5050/infra-ci:latest
'
```
Then re-run any pipelines that failed — Woodpecker UI → Restart, or:
```sh
curl -s -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines/<failed-pipeline-number>"
```
## Phase 3 — Verify
```sh
# 1. Pull the image fresh (bypassing containerd cache) and check its index.
REG=registry.viktorbarzin.me:5050
skopeo inspect --tls-verify --creds "$USER:$PASS" \
--raw "docker://$REG/infra-ci:latest" \
| jq '.manifests[] | {digest, platform}'
# 2. HEAD every child digest — all should be 200.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
"docker://$REG/infra-ci:latest" | jq -r '.manifests[].digest'); do
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/infra-ci/manifests/$d")
[ "$code" = "200" ] || echo "STILL BROKEN: $d → $code"
done
echo "verified"
# 3. Kick off the next scheduled probe for good measure.
JOB=registry-integrity-probe-verify-$(date +%s)   # one timestamp, reused below so the selector matches
kubectl -n monitoring create job --from=cronjob/registry-integrity-probe "$JOB"
kubectl -n monitoring logs -f -l job-name="$JOB"
```
The `RegistryManifestIntegrityFailure` alert clears automatically when
the probe's next run returns zero failures.
## Phase 4 — Investigate orphans
Once the immediate fix is in, check whether any OTHER images on the
registry have orphan children:
```sh
ssh root@10.0.20.10 'python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | grep "ORPHAN INDEX"'
```
Each hit is a separate image that will eventually fail to pull. Rebuild
them in the same way (Option A preferred). If the list is long, open a
beads task — do NOT batch-delete the indexes; that's a destructive
registry operation outside this runbook's scope.
## Related
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — why this
happens.
- `docs/runbooks/registry-vm.md` — VM-level operations (DNS,
`docker compose` restarts).
- `modules/docker-registry/fix-broken-blobs.sh` — the scanner cron
itself, runs nightly and after each GC.
- `stacks/monitoring/modules/monitoring/main.tf`
`registry_integrity_probe` CronJob definition.

View file

@ -1,227 +0,0 @@
# Runbook: Registry VM (docker-registry, 10.0.20.10)
Last updated: 2026-05-07
The registry VM is an Ubuntu 24.04 VM on the cluster LAN subnet
`10.0.20.0/24`, with a static netplan config (no DHCP). Because it
sits on a subnet that only has pfSense as its gateway, its DNS must
be statically configured.
**As of Phase 4 of forgejo-registry-consolidation 2026-05-07** the VM
no longer hosts the private R/W registry. It hosts pull-through
caches only:
| Port | Upstream |
|---|---|
| 5000 | docker.io (Docker Hub) — auth via dockerhub_registry_password |
| 5010 | ghcr.io |
| 5020 | quay.io |
| 5030 | registry.k8s.io |
| 5040 | reg.kyverno.io |
The decommissioned private registry (port 5050) is now hosted on
Forgejo at `forgejo.viktorbarzin.me/viktor/<image>`. See
`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md` for the
migration. Break-glass tarballs of `infra-ci` are still produced on
each build to `/opt/registry/data/private/_breakglass/` — see
`docs/runbooks/forgejo-registry-breakglass.md`.
## DNS configuration
Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
`nameservers`. Netplan writes systemd-networkd or NetworkManager
configs that resolved reads at runtime. There is **no automatic
merging** of netplan DNS with the `[Resolve]` section of
`/etc/systemd/resolved.conf` — per-link settings override the global
ones. So both layers must be in sync:
| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status` |
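A quick way to eyeball both layers at once (a sketch):
```sh
ssh root@10.0.20.10 '
grep -A5 nameservers /etc/netplan/50-cloud-init.yaml
resolvectl status | grep -E "DNS Server|Fallback|DNS Domain"
'
```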
### Current state
`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:
```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```
`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):
```yaml
nameservers:
  addresses:
    - 10.0.20.1
    - 94.140.14.14
  search:
    - viktorbarzin.lan
```
`resolvectl status` output after the change:
```
Global
resolv.conf mode: stub
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1
Fallback DNS Servers: 94.140.14.14
DNS Domain: viktorbarzin.lan
Link 2 (eth0)
Current Scopes: DNS
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1 94.140.14.14
DNS Domain: viktorbarzin.lan
```
| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS — used if pfSense unreachable (e.g., OPT1 flap) |
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |
### Why this matters for the registry
Container builds on this VM reference `.lan` hostnames (Technitium,
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
hardening the netplan had `1.1.1.1` / `8.8.8.8` only, which meant:
1. Internal hostname lookups silently failed (slow timeout) — the
VM could not resolve `idrac.viktorbarzin.lan` or any internal
helper.
2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS
entirely.
With the new config the VM can resolve both zones and keeps working
if the primary DNS server is unreachable.
## Apply / re-apply
```sh
ssh root@10.0.20.10 '
netplan generate
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -20
'
```
`netplan apply` is not disruptive when only `nameservers` change — it
does not bounce the link.
## Verification
```sh
ssh root@10.0.20.10 '
dig +short idrac.viktorbarzin.lan # 192.168.1.4
dig +short github.com # GitHub A record
dig +short registry.viktorbarzin.me # 10.0.20.10 + external A
'
```
Fallback test — blackhole the primary and confirm external lookups
still succeed through 94.140.14.14:
```sh
ssh root@10.0.20.10 '
ip route add blackhole 10.0.20.1
dig +short +time=5 +tries=2 github.com # should still answer
ip route del blackhole 10.0.20.1
'
```
Internal lookups do fail during the blackhole (the fallback is a
public resolver and does not know about the internal zone), which is
expected — the fallback buys availability for external pulls, not
internal hostnames.
## Rollback
A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
and `/etc/netplan/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
VM. To roll back:
```sh
ssh root@10.0.20.10 '
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
tar -xzf "$BACKUP" -C /
rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -10
'
```
## Auto-sync pipeline
Changes to `modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh,
cleanup-tags.sh, nginx_registry.conf, config-private.yml}` deploy
automatically via `.woodpecker/registry-config-sync.yml`:
- Fires on `push` to master touching any of those paths, or via `manual`
event (Woodpecker UI / API).
- SCPs every managed file to `/opt/registry/` on `10.0.20.10`.
- Bounces containers + nginx when a compose-visible file changed; leaves
them alone when only scripts changed (cron picks up automatically).
- Runs a dry-run `fix-broken-blobs.sh` at the end to verify the registry
is still coherent.
SSH credentials: Woodpecker repo-secret `registry_ssh_key` (ed25519,
provisioned 2026-04-19). Public key at `/root/.ssh/authorized_keys` on
`10.0.20.10`. Private key mirrored at `secret/woodpecker/registry_ssh_key`
in Vault (subkeys `private_key` / `public_key` / `known_hosts_entry`).
Manual override if you need to sync right now:
```sh
curl -sf -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines" \
-d '{"branch":"master"}' | jq .number
```
## Bouncing registry containers — the nginx DNS trap
`docker compose up -d` on `/opt/registry/docker-compose.yml` recreates
`registry-*` containers when their image tag changes, which assigns them
new IPs on the `registry` bridge network. **`registry-nginx` resolves its
upstream DNS names (`registry-private`, `registry-dockerhub`, …) ONCE at
startup and caches the results** — it does not re-resolve after a
recreate.
Symptom if you forget: `/v2/_catalog` on `:5050` returns
`{"repositories": []}`, `/v2/` returns 200 without auth, pulls return
the wrong image. nginx is forwarding to a stale IP that now belongs to a
different registry-* backend (commonly the pull-through ghcr or
dockerhub cache, which have empty catalogs from the htpasswd-auth user's
perspective).
**Always follow a registry-* bounce with `docker restart registry-nginx`.**
Or prevent the problem by setting a `resolver` directive in
`nginx_registry.conf` so upstream names are re-resolved per request.
```sh
ssh root@10.0.20.10 '
cd /opt/registry && docker compose up -d
docker restart registry-nginx
sleep 3
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
| grep -E "registry-"
'
```
## Related docs
- `docs/architecture/dns.md` — resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
and `containerd` `hosts.toml` redirects.
- `docs/runbooks/registry-rebuild-image.md` — rebuild an image after an
orphan OCI-index incident (different class of problem than DNS).
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — root cause
+ detection gaps behind the recurring missing-blob incidents.

View file

@ -7,7 +7,7 @@
## Backup Location
- NFS: `/mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db`
- Replicated to Synology NAS (192.168.1.13) via Proxmox host offsite-sync-backup (inotify-driven rsync)
- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
- Retention: 30 days
- Schedule: Daily at 00:00

View file

@ -8,8 +8,8 @@ Last updated: 2026-04-06
- Proxmox host failure requiring fresh VM provisioning
## Prerequisites
- Proxmox host (192.168.1.127) accessible, with NFS exports on `/srv/nfs` and `/srv/nfs-ssd`
- Synology NAS (192.168.1.13) accessible for offsite backup restore if the PVE host backup disk is also lost
- Proxmox host (192.168.1.127) accessible
- TrueNAS NFS server (10.0.10.15) accessible — or Synology NAS (192.168.1.13) for backups
- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
- Git repo with infra code
- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)

View file

@ -130,7 +130,7 @@ kubectl rollout restart deployment -n <namespace>
## Alternative: Restore from sda Backup
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
If TrueNAS NFS is unavailable but the PVE host is accessible:
```bash
# 1. SSH to PVE host
@ -148,17 +148,17 @@ kubectl run mysql-restore --rm -it --image=mysql \
## Alternative: Restore from Synology (if PVE host is down)
If the PVE host itself is unavailable:
If both TrueNAS and PVE host are unavailable:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/nfs/mysql-backup/
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/
# 3. Copy dump to a temporary location accessible from cluster
# (e.g., via rsync to a surviving node, or restore PVE host first)
# (e.g., via rsync to a surviving node, or restore TrueNAS first)
```
## Estimated Time

View file

@ -123,7 +123,7 @@ kubectl rollout restart deployment -n <namespace>
## Alternative: Restore from sda Backup
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
If TrueNAS NFS is unavailable but the PVE host is accessible:
```bash
# 1. SSH to PVE host
@ -142,17 +142,17 @@ kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
## Alternative: Restore from Synology (if PVE host is down)
If the PVE host itself is unavailable:
If both TrueNAS and PVE host are unavailable:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/nfs/postgresql-backup/
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/
# 3. Copy dump to a temporary location accessible from cluster
# (e.g., via rsync to a surviving node, or restore PVE host first)
# (e.g., via rsync to a surviving node, or restore TrueNAS first)
```
## Estimated Time

View file

@ -93,7 +93,7 @@ kubectl get externalsecrets -A | grep -v "SecretSynced"
## Alternative: Restore from sda Backup
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
If TrueNAS NFS is unavailable but the PVE host is accessible:
```bash
# 1. SSH to PVE host
@ -115,17 +115,17 @@ vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
## Alternative: Restore from Synology (if PVE host is down)
If the PVE host itself is unavailable:
If both TrueNAS and PVE host are unavailable:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/nfs/vault-backup/
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/
# 3. Copy snapshot to local workstation
scp Administrator@192.168.1.13:/volume1/Backup/Viki/nfs/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
scp Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
# 4. Restore via port-forward (same as above)
```

View file

@ -104,9 +104,9 @@ lvchange -an pve/$LV_NAME
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
```
## Alternative: Restore from sda Backup Mirror
## Alternative: Restore from sda NFS Mirror
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
If TrueNAS NFS is unavailable but PVE host is accessible:
```bash
# 1. SSH to PVE host

View file

@ -1,51 +0,0 @@
# Runbook: Applying the Technitium Terraform stack
Last updated: 2026-04-19
The `stacks/technitium/` apply has a **post-apply readiness gate** that asserts all three DNS instances are healthy before the apply is allowed to finish. This runbook explains what it checks, how to interpret failures, and how to override it for emergency maintenance.
## What the gate checks
`stacks/technitium/modules/technitium/readiness.tf` defines `null_resource.technitium_readiness_gate`. It runs after the three Technitium deployments, the DNS LoadBalancer service, and the PDB are applied, and performs:
1. **Rollout status** — `kubectl rollout status deploy/<name> --timeout=180s` for `technitium`, `technitium-secondary`, `technitium-tertiary`. Fails if any deployment has not reached its desired pod count within 180s.
2. **Per-pod API health** — for every pod with label `dns-server=true`, executes `wget http://127.0.0.1:5380/api/stats/get` inside the pod and asserts the response contains `"status":"ok"`. Catches Technitium process hangs that TCP probes miss.
3. **Zone-count parity** — queries `technitium-web`, `technitium-secondary-web`, `technitium-tertiary-web` and counts the zones returned. Fails if the three counts differ, which would mean `technitium-zone-sync` has drifted or a replica has lost state.
The gate is re-run whenever any of the deployment container spec, the CoreDNS Corefile, or the apply timestamp changes (see `triggers` in `readiness.tf`).
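A manual equivalent of checks 1 and 2, useful when interpreting a failed apply (a sketch; the authoritative logic lives in `readiness.tf`, and check 3 is easiest to read there):
```bash
for d in technitium technitium-secondary technitium-tertiary; do
  kubectl -n technitium rollout status deploy/$d --timeout=180s
done
for p in $(kubectl -n technitium get pods -l dns-server=true -o name); do
  echo "== $p"
  kubectl -n technitium exec "$p" -- wget -qO- http://127.0.0.1:5380/api/stats/get | grep -o '"status":"ok"'
done
```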
## Emergency override
Set `skip_readiness=true` via terragrunt inputs or pass it directly to the Terraform apply:
```bash
cd infra/stacks/technitium
scripts/tg apply -var skip_readiness=true
```
Only use this when you need to land a Terraform change while one Technitium instance is intentionally offline (e.g., you are replacing its PVC, migrating storage, or recovering a corrupted config DB). Re-apply without the flag once the instance is back.
You can also target around the gate during emergency work:
```bash
scripts/tg apply -target=kubernetes_config_map.coredns
```
`-target` bypasses the `depends_on` chain feeding the gate, so a single-resource push does not need the gate to pass.
## Failure modes and responses
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `rollout status` times out on one deployment | Pod stuck `Pending` (node pressure / anti-affinity with other dns-server pods) or `ImagePullBackOff` | `kubectl describe pod` for events. If anti-affinity is blocking, confirm 3 nodes are Ready. |
| API check fails on a pod but readiness probe passes | Technitium process hung but port 53 still accepting TCP (liveness probe is `tcp_socket` on :53) | `kubectl delete pod <name>` — deployment will recreate it. |
| Zone count differs between instances | `technitium-zone-sync` CronJob is failing or AXFR is blocked | `kubectl logs -n technitium -l job-name=<latest-zone-sync-job>`. Check `TechnitiumZoneSyncFailed` alert. |
| Gate passes but external clients still cannot resolve | Gate only checks in-pod API and intra-cluster zone parity — external path (LoadBalancer → Technitium pod) is not tested | Run the LAN-client drill in `docs/architecture/dns.md` troubleshooting section. |
## What the gate does NOT check
- External reachability through the LoadBalancer IP `10.0.20.201` (that would require a LAN-side probe).
- CoreDNS health (CoreDNS is patched by `coredns.tf`, not this module's deployments — alerts `CoreDNSErrors` / `CoreDNSForwardFailureRate` catch regressions post-apply).
- Upstream resolver health (covered by `CoreDNSForwardFailureRate`).
For broader end-to-end verification, see `docs/architecture/dns.md` → "Verification" section, or run the Uptime Kuma external DNS probe.

View file

@ -1,217 +0,0 @@
# Runbook: Vault Raft Leader Deadlock + Safe Pod Restart
Captures the 2026-04-22 incident pattern. When a Vault raft leader enters a
stuck goroutine state (port 8201 accepts TCP but RPCs never return), the
recovery is *not* `kubectl delete --force`. Force-deleting a Vault pod that
holds a stuck NFS mount leaves kernel NFS client state corrupted, which
blocks all subsequent NFS mounts from the node and usually requires a VM
hard-reset to clear.
**Related**: [post-mortems/2026-04-22-vault-raft-leader-deadlock.md](../post-mortems/2026-04-22-vault-raft-leader-deadlock.md).
## Symptoms
- `https://vault.viktorbarzin.me/v1/sys/health` returns HTTP 503.
- Standbys log `msgpack decode error [pos 0]: i/o timeout` every 2s.
- `kubectl exec` into a standby shows raft thinks the leader is alive
(peers list all `Voter`, leader address populated) but `vault operator
raft autopilot state` stalls or errors.
- The "leader" pod's logs go silent — no heartbeats, no audit writes,
nothing. TCP on 8201 still accepts connections.
- ESO-backed secrets stop refreshing (ExternalSecret `SecretSyncedError`).
- Woodpecker CI pipelines that read from Vault at plan time hang.
## 0. Confirm the diagnosis (before touching anything)
Don't jump to force-delete. Verify the leader is actually stuck, not just
slow:
```sh
# 1. Who does raft think the leader is?
kubectl exec -n vault vault-0 -c vault -- vault status 2>&1 | \
grep -E 'HA Mode|Active Node|Leader|Raft'
# 2. Is the leader's port open but unresponsive?
LEADER_POD=vault-2 # or whichever vault status reports
kubectl exec -n vault $LEADER_POD -c vault -- sh -c \
'timeout 3 nc -zv 127.0.0.1 8200 2>&1; echo; timeout 3 vault status'
# 3. Is the active vault service pointing at a real pod?
kubectl get endpoints -n vault vault-active -o yaml | \
grep -E 'addresses|notReadyAddresses' -A2
# 4. What do standby logs say?
kubectl logs -n vault vault-0 -c vault --tail=40 | grep -iE 'msgpack|decode|rpc'
```
If (2) hangs and (4) shows repeated msgpack errors → stuck leader.
## 1. Identify the stuck pod precisely
```sh
# Find the pod whose vault_core_active would be 1 if it were scraping
# (currently no telemetry — use logs as proxy until telemetry is enabled).
for p in vault-0 vault-1 vault-2; do
echo "=== $p ==="
kubectl logs -n vault $p -c vault --tail=5 2>&1 | head -5
done
```
The pod whose logs have been silent for minutes while the others are
actively erroring is the stuck leader.
## 2. The safe restart sequence (avoids zombie containers)
**DO NOT** `kubectl delete pod --force --grace-period=0` as the first
step. On NFS-backed Vault that's the exact move that leaves the kernel
NFS client corrupted on the node where the stuck pod ran.
Instead:
### 2a. Graceful delete first (30s grace)
```sh
kubectl delete pod -n vault vault-2
```
Wait 30 seconds. Most of the time the TERM → SIGKILL path works and the
new pod schedules cleanly. The remaining leaders re-elect and the external
endpoint recovers.
### 2b. If the pod is Terminating after 60s, find the stuck process
```sh
NODE=$(kubectl get pod -n vault vault-2-<suffix> -o jsonpath='{.spec.nodeName}')
POD_UID=$(kubectl get pod -n vault vault-2-<suffix> -o jsonpath='{.metadata.uid}')
ssh $NODE "sudo ps auxf | grep -A2 $POD_UID | head -20"
# Look for: mount.nfs (D-state), vault (Z-state), or the sh wrapper in do_wait
```
### 2c. Unmount stale NFS before force-deleting
If the old pod's NFS mount is still present, lazy-unmount it FIRST so
the kernel can release NFS session state cleanly:
```sh
ssh $NODE "sudo mount | grep $POD_UID | awk '{print \$3}' | xargs -I{} sudo umount -l {}"
```
Verify no mount.nfs processes are in D-state on the node:
```sh
ssh $NODE "ps -eo state,pid,comm | grep '^D' | head -5"
```
### 2d. Only NOW force-delete if needed
```sh
kubectl delete pod -n vault vault-2-<suffix> --force --grace-period=0
```
## 3. Recovery when the node is already stuck
If you force-deleted before reading this runbook and NFS is now broken
on the node:
**Diagnostic — confirm NFS client state is corrupted:**
```sh
NODE=k8s-node2 # node where the force-delete happened
ssh $NODE "sudo mkdir -p /tmp/nfstest && sudo timeout 30 \
mount -t nfs 192.168.1.127:/srv/nfs /tmp/nfstest && echo MOUNT_OK"
```
If the mount times out at 30-110s, kernel NFS client state is stuck.
No userspace recovery exists — only a VM reboot clears it.
**Workaround before rebooting**: mounting with `nfsvers=4.1` succeeds
on broken nodes (the corruption is NFSv4.2 session-state specific).
This is useful for diagnostic mounts, but does NOT fix CSI pods —
their mount options come from the `nfs-proxmox` StorageClass and can't
be overridden per-pod.
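A sketch of that diagnostic mount, reusing the export from the check above:
```sh
ssh $NODE "sudo mkdir -p /tmp/nfstest41 && sudo timeout 30 \
  mount -t nfs -o nfsvers=4.1 192.168.1.127:/srv/nfs /tmp/nfstest41 \
  && echo MOUNT_OK && sudo umount /tmp/nfstest41"
```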
**Reboot the affected node VM:**
```sh
# Find PVE VM ID — nodes numbered 201-204 for k8s-node1..4
ssh root@192.168.1.127 "qm reset 20<N>"
# If qm reset leaves the VM PID unchanged (it didn't actually reboot),
# use qm stop/start:
ssh root@192.168.1.127 "qm stop 20<N> && qm start 20<N>"
```
Wait for the node to become Ready (`kubectl get node k8s-node<N> -w`)
and CSI driver to register (`kubectl get pods -n nfs-csi -o wide`).
**Gotcha — `qm reset` can be a no-op.** On the 2026-04-22 incident,
`qm reset 201` returned exit 0 but did NOT restart the VM (same QEMU PID
before and after). `qm status` reported "running" throughout. Always
verify by checking the QEMU PID or VM uptime post-reset. If uptime is
unchanged, escalate to `qm stop && qm start`.
**Gotcha — check boot order before stop/start.** Long-running VMs
(630+ day uptime) may have stale `bootdisk:` config that's been hidden
by never rebooting. On 2026-04-22, k8s-node1's config had `bootdisk:
scsi0` but the actual OS disk was on `scsi1`, so the first boot after
stop attempted iPXE and failed. Before stopping, verify:
```sh
ssh root@192.168.1.127 "grep -E 'boot|scsi[0-9]+:' /etc/pve/qemu-server/20<N>.conf"
```
If `bootdisk` references a disk ID that doesn't exist, fix it first
with `qm set 20<N> --boot "order=scsi<ID>"` (use the ID of the main
OS disk).
## 4. Prevent re-infection — the chown loop
After the node comes back, the vault pod's PV chown walk can still
peg kubelet. The durable fix is in `stacks/vault/main.tf`:
```hcl
statefulSet = {
  securityContext = {
    pod = {
      fsGroupChangePolicy = "OnRootMismatch"
    }
  }
}
```
This was applied in commit `2f1f9107` (2026-04-22). If you find
yourself editing this in a kubectl patch for live recovery, follow
up with a Terraform apply the same session — leaving the cluster
ahead of Terraform state is technical debt that re-triggers on the
next apply.
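A sketch of that live patch; it assumes the StatefulSet is named `vault`, and changing the pod template rolls the Vault pods (plan for unseal):
```sh
kubectl -n vault patch statefulset vault --type merge -p \
  '{"spec":{"template":{"spec":{"securityContext":{"fsGroupChangePolicy":"OnRootMismatch"}}}}}'
# Follow up with the Terraform apply in the same session so state matches the cluster.
```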
## 5. Verify end-to-end
```sh
# External endpoint — the user-facing health check
curl -sk -o /dev/null -w "%{http_code}\n" https://vault.viktorbarzin.me/v1/sys/health
# expect: 200
# Raft peers (needs VAULT_TOKEN with operator capability)
kubectl exec -n vault vault-0 -c vault -- vault operator raft list-peers
# All pods 2/2
kubectl get pods -n vault -l app.kubernetes.io/name=vault -o wide
# No alerts fired (once VaultRaftLeaderStuck + VaultHAStatusUnavailable are live)
curl -s https://alertmanager.viktorbarzin.me/api/v2/alerts | \
jq '.[] | select(.labels.alertname | test("Vault"))'
```
## Known limitations
- **No alert for stuck leaders yet.** `VaultRaftLeaderStuck` and
`VaultHAStatusUnavailable` require Vault telemetry enabled
(`telemetry { unauthenticated_metrics_access = true }`) and a
scrape job. Alerts are defined in `prometheus_chart_values.tpl`
but stay silent until telemetry lands — tracked as a beads task.
- **Vault on NFS violates the documented rule.** `infra/.claude/CLAUDE.md`
says critical services must use `proxmox-lvm-encrypted`. The
`dataStorage`/`auditStorage` still use `nfs-proxmox`. Migration
tracked as an epic-level beads task.

View file

@ -1,73 +0,0 @@
# Runbook: Onboarding a new Forgejo repo to Woodpecker
Last updated: 2026-05-07
When you create a new repo on `forgejo.viktorbarzin.me`, Woodpecker
does NOT auto-discover it via the cluster's existing OAuth session.
The `forgejo` user inside Woodpecker (Forgejo-OAuth'd) needs to:
1. Open `https://ci.viktorbarzin.me/` in a browser.
2. Log in via Forgejo OAuth (the "Sign in with Forgejo" button).
3. Click "Add Repository" — your new repo should appear.
4. Click the toggle to activate it. Woodpecker will:
- Add a webhook on the Forgejo repo (push, PR, release events).
- Register the repo's `forge_remote_id` in its DB so subsequent
hooks deserialize correctly.
5. Push a commit (or hit "Run pipeline" in Woodpecker UI) — first
build fires.
## Why API-only doesn't work
The webhook URL contains a JWT signed with a per-server key that's
stored in the DB and only accessible at OAuth-flow time. POST'ing
`/api/repos` as the admin (`ViktorBarzin` GitHub user) returns 500
because the lookup queries forge-side OAuth state for THAT user,
which doesn't exist for the Forgejo `viktor` user. We confirmed:
- Direct `POST /api/repos?forge_remote_id=N` → HTTP 500 server-side.
- Generating a JWT with the agent secret → "token is unverifiable"
on hook delivery (the signing key is repo-specific, not the
global agent secret).
There's no admin endpoint that side-steps the OAuth flow.
## Bootstrap when UI access isn't available
If you absolutely need to bootstrap a new image without UI access
(e.g., during an outage), the workaround is:
1. Build locally:
```bash
docker build -t forgejo.viktorbarzin.me/viktor/<name>:<tag> /path/to/source
docker push forgejo.viktorbarzin.me/viktor/<name>:<tag>
```
2. Or pull from another already-built source and retag:
```bash
docker pull viktorbarzin/<name>:<tag> # DockerHub
docker tag viktorbarzin/<name>:<tag> forgejo.viktorbarzin.me/viktor/<name>:<tag>
docker push forgejo.viktorbarzin.me/viktor/<name>:<tag>
```
3. Flip the cluster `image=` reference and restart deployments (sketch below).
Document the bootstrap in the relevant stack so future maintainers
know the image was put there by hand. After Woodpecker UI onboarding,
the next pipeline run replaces the bootstrap image with a CI-built one.
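A sketch of the step-3 flip for live recovery (hypothetical namespace, deployment, and container names; in normal operation the `image=` lives in the stack's Terraform, so prefer editing `main.tf` and applying):
```bash
kubectl -n <namespace> set image deploy/<name> <container>=forgejo.viktorbarzin.me/viktor/<name>:<tag>
kubectl -n <namespace> rollout status deploy/<name>
```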
## Repos onboarded in flight 2026-05-07
These were created during the forgejo-registry-consolidation but the
UI step above hasn't been done yet — their `.woodpecker.yml` /
`.woodpecker/build.yml` exists on Forgejo but no pipeline fires:
- `viktor/broker-sync` — image bootstrapped via DockerHub (see
`infra/stacks/wealthfolio/main.tf` comment).
- `viktor/fire-planner` — image bootstrapped via local docker build.
- `viktor/hmrc-sync`
- `viktor/freedify`
- `viktor/claude-agent-service`
- `viktor/beadboard` — image bootstrapped via local docker build.
- `viktor/claude-memory-mcp`
Walk through each in the Woodpecker UI to enable. Pipelines for
already-onboarded repos (payslip-ingest, job-hunter, infra) fired
correctly after the v3.13 → v3.14 upgrade.

View file

@ -3,12 +3,8 @@ networks:
driver: bridge
services:
# registry:2 is pinned after the 2026-04-13 + 2026-04-19 orphan-index incidents.
# Floating tags were swapping to regressed versions between GC runs. Upgrade
# path: bump all six registry-* services in lockstep and bounce via
# `systemctl restart docker-compose-registry.service`.
registry-dockerhub:
image: registry:2.8.3
image: registry:2
container_name: registry-dockerhub
restart: always
volumes:
@ -26,7 +22,7 @@ services:
start_period: 10s
registry-ghcr:
image: registry:2.8.3
image: registry:2
container_name: registry-ghcr
restart: always
volumes:
@ -42,7 +38,7 @@ services:
start_period: 10s
registry-quay:
image: registry:2.8.3
image: registry:2
container_name: registry-quay
restart: always
volumes:
@ -58,7 +54,7 @@ services:
start_period: 10s
registry-k8s:
image: registry:2.8.3
image: registry:2
container_name: registry-k8s
restart: always
volumes:
@ -74,7 +70,7 @@ services:
start_period: 10s
registry-kyverno:
image: registry:2.8.3
image: registry:2
container_name: registry-kyverno
restart: always
volumes:
@ -89,26 +85,35 @@ services:
retries: 3
start_period: 10s
# registry-private decommissioned in Phase 4 of
# forgejo-registry-consolidation 2026-05-07 — image migration completed,
# cluster flipped to forgejo.viktorbarzin.me/viktor/<image>. The remaining
# five services on this VM are pull-through caches for upstream registries.
# After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the
# VM frees ~2.6 GB. The tarball break-glass under
# /opt/registry/data/private/_breakglass/ stays — it's how we recover
# infra-ci if Forgejo ever goes fully down.
registry-private:
image: registry:2
container_name: registry-private
restart: always
volumes:
- /opt/registry/data/private:/var/lib/registry
- /opt/registry/config-private.yml:/etc/docker/registry/config.yml:ro
- /opt/registry/htpasswd:/auth/htpasswd:ro
networks:
- registry
healthcheck:
# 401 is expected (auth required) — any HTTP response means the registry is healthy
test: ["CMD", "sh", "-c", "wget -qS -O /dev/null http://127.0.0.1:5000/v2/ 2>&1 | grep -q 'HTTP/'"]
interval: 30s
timeout: 10s
retries: 3
start_period: 10s
nginx:
image: nginx:alpine
container_name: registry-nginx
restart: always
# 5050 dropped Phase 4 of forgejo-registry-consolidation 2026-05-07.
ports:
- "5000:5000"
- "5010:5010"
- "5020:5020"
- "5030:5030"
- "5040:5040"
- "5050:5050"
volumes:
- /opt/registry/nginx.conf:/etc/nginx/nginx.conf:ro
- /opt/registry/tls:/etc/nginx/tls:ro
@ -126,6 +131,8 @@ services:
condition: service_healthy
registry-kyverno:
condition: service_healthy
registry-private:
condition: service_healthy
healthcheck:
test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
interval: 30s

View file

@ -1,33 +1,25 @@
#!/usr/bin/env python3
"""Registry integrity scanner — two classes of brokenness.
"""Finds and removes layer links that point to non-existent blobs.
1. Orphaned layer links: the cleanup-tags.sh + garbage-collect cycle can delete
blob data while leaving _layers/ link files intact. The registry then returns
HTTP 200 with 0 bytes for those layers (it finds the link, trusts the blob
exists, but the data is gone). Containerd sees "unexpected EOF".
Action: delete the orphan link so the next pull re-fetches cleanly.
When the cleanup-tags.sh + garbage-collect cycle runs, it can delete blob data
while leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers (it finds the link, trusts the blob exists, but
the data is gone). This causes containerd to fail with "unexpected EOF".
2. Orphaned OCI-index children: an image index (multi-platform manifest list)
references child manifests by digest. If a child's blob has been deleted —
by a cleanup-tags.sh tag rmtree followed by garbage-collect walking the
children wrong (distribution/distribution#3324 class), or by an incomplete
`buildx --push` whose partial blob was later purged by `uploadpurging`
the index survives but pulls fail with `manifest unknown`.
Action: log loudly. Deleting an index is a conscious decision (the image
was published; removing it breaks downstream consumers), so we surface
the problem and leave repair to a human or to the rebuild runbook.
This script walks all repositories, checks each layer link against the actual
blobs directory, and removes any orphaned links. On next pull, the registry
will re-fetch the missing blobs from the upstream registry.
Run after garbage-collect (Sunday 03:30) and daily (Mon-Sat 02:30).
Run after garbage-collect (e.g., 3:15 AM Sunday) or daily.
"""
import argparse
import json
import os
import sys
sys.stdout.reconfigure(line_buffering=True)
parser = argparse.ArgumentParser(description="Scan registry for orphaned blobs and indexes")
parser = argparse.ArgumentParser(description="Remove orphaned registry layer links")
parser.add_argument("base", nargs="?", default="/opt/registry/data", help="Registry data directory")
parser.add_argument("--dry-run", action="store_true", help="Report but don't delete")
args = parser.parse_args()
@ -35,124 +27,39 @@ args = parser.parse_args()
BASE = args.base
DRY_RUN = args.dry_run
INDEX_MEDIA_TYPES = (
"application/vnd.oci.image.index.v1+json",
"application/vnd.docker.distribution.manifest.list.v2+json",
)
# Only the private R/W registry is authoritative for every child of every
# index it stores — we pushed those indexes ourselves, so a missing child is
# always a bug (the 2026-04-13 + 2026-04-19 failure mode).
#
# Pull-through caches (dockerhub, ghcr, quay, k8s, kyverno) are ALLOWED to
# have missing children: they only fetch what someone actually pulls.
# Uncached arm64 / arm / attestation variants of a multi-platform index are
# normal partial state, not orphans. Scanning them generates hundreds of
# false-positive warnings — noise that would mask the real signal from the
# private registry. Scan 2 is therefore private-only.
INDEX_SCAN_REGISTRIES = ("private",)
total_layer_removed = 0
total_layer_checked = 0
total_index_scanned = 0
total_index_orphans = 0
def load_manifest_blob(blobs_root, digest_hex):
blob_path = os.path.join(blobs_root, digest_hex[:2], digest_hex, "data")
if not os.path.isfile(blob_path):
return None
try:
with open(blob_path, "rb") as f:
raw = f.read(1024 * 1024)
except OSError:
return None
try:
return json.loads(raw)
except (json.JSONDecodeError, UnicodeDecodeError):
return None
total_removed = 0
total_checked = 0
for registry_name in sorted(os.listdir(BASE)):
repos_dir = os.path.join(BASE, registry_name, "docker/registry/v2/repositories")
blobs_root = os.path.join(BASE, registry_name, "docker/registry/v2/blobs/sha256")
blobs_dir = os.path.join(BASE, registry_name, "docker/registry/v2/blobs")
if not os.path.isdir(repos_dir):
continue
for root, _, _ in os.walk(repos_dir):
# --- Scan 1: orphan layer links ----------------------------------------
if root.endswith("/_layers/sha256"):
repo = root.replace(repos_dir + "/", "").replace("/_layers/sha256", "")
for root, dirs, files in os.walk(repos_dir):
if not root.endswith("/_layers/sha256"):
continue
for digest_dir in os.listdir(root):
link_file = os.path.join(root, digest_dir, "link")
if not os.path.isfile(link_file):
continue
repo = root.replace(repos_dir + "/", "").replace("/_layers/sha256", "")
total_layer_checked += 1
blob_data = os.path.join(blobs_root, digest_dir[:2], digest_dir, "data")
if os.path.isfile(blob_data):
continue
for digest_dir in os.listdir(root):
link_file = os.path.join(root, digest_dir, "link")
if not os.path.isfile(link_file):
continue
total_checked += 1
# Check if the actual blob data exists
blob_data = os.path.join(blobs_dir, "sha256", digest_dir[:2], digest_dir, "data")
if not os.path.isfile(blob_data):
prefix = "[DRY RUN] " if DRY_RUN else ""
print(f"{prefix}[{registry_name}/{repo}] removing orphaned layer link: {digest_dir[:12]}...")
if not DRY_RUN:
# Remove the entire digest directory (contains the link file)
import shutil
shutil.rmtree(os.path.join(root, digest_dir))
total_layer_removed += 1
# --- Scan 2: orphan OCI-index children (private registry only) --------
elif root.endswith("/_manifests/revisions/sha256") and registry_name in INDEX_SCAN_REGISTRIES:
repo = root.replace(repos_dir + "/", "").replace("/_manifests/revisions/sha256", "")
for digest_dir in os.listdir(root):
# Manifest revision entry. Load the blob it points to.
manifest = load_manifest_blob(blobs_root, digest_dir)
if manifest is None:
continue
media_type = manifest.get("mediaType", "")
if media_type not in INDEX_MEDIA_TYPES:
continue
total_index_scanned += 1
# Per-repo revision links — serving a child manifest via the API
# requires <repo>/_manifests/revisions/sha256/<child-digest>/link
# to exist. The blob data alone is not enough: cleanup-tags.sh
# rmtrees tag dirs (which on 2.8.x also orphans the per-repo
# revision links for index children), while the upstream blob
# data survives in /blobs/. That's exactly the 2026-04-19
# failure mode — the probe sees 404 even though the blob file
# is still on disk.
revisions_root = os.path.dirname(root) # …/_manifests/revisions
for child in manifest.get("manifests", []):
child_digest = child.get("digest", "")
if not child_digest.startswith("sha256:"):
continue
child_hex = child_digest[len("sha256:"):]
child_link = os.path.join(revisions_root, "sha256", child_hex, "link")
if os.path.isfile(child_link):
continue
platform = child.get("platform", {})
arch = platform.get("architecture", "?")
os_ = platform.get("os", "?")
child_blob = os.path.join(blobs_root, child_hex[:2], child_hex, "data")
blob_state = "blob-data-present" if os.path.isfile(child_blob) else "blob-data-gone"
print(
f"WARNING [{registry_name}/{repo}] ORPHAN INDEX: "
f"{digest_dir[:12]} references missing child {child_hex[:12]} "
f"({arch}/{os_}, {blob_state}) — registry returns 404, rebuild required"
)
total_index_orphans += 1
total_removed += 1
mode = "DRY RUN — " if DRY_RUN else ""
print(f"\n{mode}Layer scan: checked {total_layer_checked} links, removed {total_layer_removed} orphaned.")
print(f"{mode}Index scan: inspected {total_index_scanned} image indexes, found {total_index_orphans} orphaned children.")
if total_index_orphans > 0:
print(f"\nACTION REQUIRED: {total_index_orphans} orphan index child(ren) detected. "
"See docs/runbooks/registry-rebuild-image.md — the affected image must be rebuilt "
"(a registry DELETE on an index is a conscious decision, not an automated repair).")
print(f"\n{mode}Checked {total_checked} layer links, removed {total_removed} orphaned.")

View file

@ -33,9 +33,10 @@ http {
keepalive 32;
}
# `upstream private` removed in Phase 4 of forgejo-registry-consolidation
# 2026-05-07. The /v2/ private registry is now Forgejo at
# forgejo.viktorbarzin.me/viktor/.
upstream private {
server registry-private:5000;
keepalive 32;
}
# --- Docker Hub (port 5000) ---
@ -167,8 +168,37 @@ http {
}
}
# --- Private R/W Registry (port 5050) decommissioned Phase 4 2026-05-07 ---
# The TLS port 5050 server block previously fronted `registry-private`.
# Migrated to Forgejo at forgejo.viktorbarzin.me/viktor/. Both
# docker-compose.yml and this nginx config no longer reference port 5050.
# --- Private R/W Registry (port 5050, TLS) ---
server {
listen 5050 ssl;
server_name registry.viktorbarzin.me;
ssl_certificate /etc/nginx/tls/fullchain.pem;
ssl_certificate_key /etc/nginx/tls/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
client_max_body_size 0;
proxy_request_buffering off;
proxy_buffering off;
chunked_transfer_encoding on;
location /v2/ {
proxy_pass http://private;
proxy_http_version 1.1;
proxy_set_header Host $http_host;
proxy_set_header Connection "";
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 900;
proxy_send_timeout 900;
}
location / {
return 200 'ok';
add_header Content-Type text/plain;
}
}
}

View file

@ -40,9 +40,8 @@ variable "ingress_path" {
default = ["/"]
}
variable "max_body_size" {
type = string
default = null
description = "Maximum request body size, e.g. '5g'. null = no limit (Traefik default). When set, a per-ingress Buffering middleware is created and attached."
type = string
default = "50m"
}
variable "extra_annotations" {
default = {}
@ -149,19 +148,10 @@ locals {
# record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
effective_external_monitor = var.external_monitor != null ? var.external_monitor : (var.dns_type != "none")
# Emit the annotation when effective is true (positive signal), or when the
# caller explicitly set external_monitor=false (opt-out). When the caller
# leaves it null AND dns_type="none", emit nothing; the sync script's
# default opt-in (any *.viktorbarzin.me ingress) keeps monitoring services
# that are publicly reachable via routes we don't manage here (e.g.
# helm-provisioned ingresses, services behind cloudflared tunnel with DNS
# set elsewhere).
external_monitor_annotations = local.effective_external_monitor ? merge(
{ "uptime.viktorbarzin.me/external-monitor" = "true" },
var.external_monitor_name != null ? { "uptime.viktorbarzin.me/external-monitor-name" = var.external_monitor_name } : {},
) : (var.external_monitor == false ?
{ "uptime.viktorbarzin.me/external-monitor" = "false" } : {}
)
) : {}
ns_to_group = {
monitoring = "Infrastructure"
@ -204,17 +194,6 @@ locals {
"gethomepage.dev/href" = "https://${local.effective_host}"
"gethomepage.dev/icon" = "${replace(var.name, "-", "")}.png"
} : {}
# Parse "5g"/"50m"/"1024k"/"42" into bytes. Traefik's Buffering middleware
# takes maxRequestBodyBytes as an integer. Empty unit = bytes.
body_size_match = var.max_body_size == null ? null : regex("^([0-9]+)([kmgKMG]?)$", var.max_body_size)
body_size_unit_multiplier = var.max_body_size == null ? 0 : (
lower(local.body_size_match[1]) == "g" ? 1073741824 :
lower(local.body_size_match[1]) == "m" ? 1048576 :
lower(local.body_size_match[1]) == "k" ? 1024 :
1
)
max_body_size_bytes = var.max_body_size == null ? 0 : tonumber(local.body_size_match[0]) * local.body_size_unit_multiplier
}
@ -257,7 +236,6 @@ resource "kubernetes_ingress_v1" "proxied-ingress" {
var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
var.max_body_size != null ? "${var.namespace}-buffering-${var.name}@kubernetescrd" : null,
], var.extra_middlewares)))
"traefik.ingress.kubernetes.io/router.entrypoints" = "websecure"
}, local.homepage_defaults, var.extra_annotations,
@ -315,27 +293,6 @@ resource "kubernetes_manifest" "custom_csp" {
}
}
# Buffering middleware - created per service when max_body_size is set.
# Traefik default is unlimited; setting maxRequestBodyBytes enforces a limit
# (e.g. Forgejo container pushes can ship multi-GB layer blobs).
resource "kubernetes_manifest" "buffering" {
count = var.max_body_size != null ? 1 : 0
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "buffering-${var.name}"
namespace = var.namespace
}
spec = {
buffering = {
maxRequestBodyBytes = local.max_body_size_bytes
}
}
}
}
# Cloudflare DNS records created automatically when dns_type is set.
# Proxied: CNAME to Cloudflare tunnel. Non-proxied: A + AAAA to public IP.
resource "cloudflare_record" "proxied" {

View file

@ -18,8 +18,4 @@ resource "kubernetes_secret" "tls_secret" {
"tls.key" = var.tls_key == "" ? file("${path.root}/secrets/privkey.pem") : var.tls_key
}
type = "kubernetes.io/tls"
lifecycle {
# KYVERNO_LIFECYCLE_V1: the sync-tls-secret policy stamps generate.kyverno.io/* + app.kubernetes.io/managed-by labels on this generated Secret
ignore_changes = [metadata[0].labels]
}
}
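With ignore_changes removed, any labels the sync-tls-secret policy stamps onto generated Secrets will surface as drift on the next plan. A hedged way to see which TLS Secrets currently carry Kyverno-stamped labels (the generate.kyverno.io/ prefix is taken from the comment above; adjust if the policy uses different keys):
kubectl get secrets -A -o json \
  | jq -r '.items[]
      | select(.type == "kubernetes.io/tls")
      | select((.metadata.labels // {}) | to_entries | any(.key | startswith("generate.kyverno.io/")))
      | "\(.metadata.namespace)/\(.metadata.name)"'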

View file

@ -1,7 +1,7 @@
#!/usr/bin/env bash
# Cluster health check script.
# Runs 42 diagnostic checks against the Kubernetes cluster and prints
# Runs 30 diagnostic checks against the Kubernetes cluster and prints
# a colour-coded report with PASS / WARN / FAIL for each section.
#
# Usage: ./scripts/cluster_healthcheck.sh [--fix] [--quiet|-q] [--json] [--kubeconfig <path>]
@ -26,7 +26,7 @@ JSON=false
KUBECONFIG_PATH="$(pwd)/config"
KUBECTL=""
JSON_RESULTS=()
TOTAL_CHECKS=42
TOTAL_CHECKS=30
# --- Helpers ---
info() { [[ "$JSON" == true ]] && return 0; echo -e "${BLUE}[INFO]${NC} $*"; }
@ -71,16 +71,14 @@ parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--fix) FIX=true; shift ;;
--no-fix) FIX=false; shift ;;
--quiet|-q) QUIET=true; shift ;;
--json) JSON=true; shift ;;
--kubeconfig) KUBECONFIG_PATH="$2"; shift 2 ;;
-h|--help)
echo "Usage: $0 [--fix|--no-fix] [--quiet|-q] [--json] [--kubeconfig <path>]"
echo "Usage: $0 [--fix] [--quiet|-q] [--json] [--kubeconfig <path>]"
echo ""
echo "Flags:"
echo " --fix Auto-remediate safe issues (delete evicted pods)"
echo " --no-fix Disable auto-remediation (default)"
echo " --quiet, -q Only show WARN and FAIL sections"
echo " --json Machine-readable JSON output"
echo " --kubeconfig PATH Override kubeconfig (default: \$(pwd)/config)"
@ -1242,17 +1240,9 @@ check_overcommit() {
HA_CACHE_DIR=""
ha_sofia_available() {
if [[ -z "${HOME_ASSISTANT_SOFIA_URL:-}" ]]; then
export HOME_ASSISTANT_SOFIA_URL="https://ha-sofia.viktorbarzin.me"
if [[ -z "${HOME_ASSISTANT_SOFIA_URL:-}" ]] || [[ -z "${HOME_ASSISTANT_SOFIA_TOKEN:-}" ]]; then
return 1
fi
if [[ -z "${HOME_ASSISTANT_SOFIA_TOKEN:-}" ]]; then
if command -v vault >/dev/null 2>&1 && [[ -n "${VAULT_TOKEN:-}${HOME:-}" ]]; then
local t
t=$(vault kv get -field=haos_api_token secret/viktor 2>/dev/null || true)
[[ -n "$t" ]] && export HOME_ASSISTANT_SOFIA_TOKEN="$t"
fi
fi
[[ -n "${HOME_ASSISTANT_SOFIA_TOKEN:-}" ]] || return 1
return 0
}
@ -1760,616 +1750,6 @@ else:
json_add "hardware_exporters" "$status" "${detail:-All healthy}"
}
# Returns 0 if cert-manager CRDs are installed, 1 otherwise.
cert_manager_installed() {
$KUBECTL get crd certificates.cert-manager.io -o name >/dev/null 2>&1
}
# --- 31. cert-manager: Certificate Readiness ---
check_cert_manager_certificates() {
section 31 "cert-manager — Certificate Readiness"
local certs not_ready detail="" status="PASS"
if ! cert_manager_installed; then
pass "cert-manager not installed — N/A"
json_add "certmanager_certificates" "PASS" "N/A (cert-manager not installed)"
return 0
fi
certs=$($KUBECTL get certificates.cert-manager.io -A -o json 2>/dev/null) || {
warn "cert-manager CRDs installed but API query failed"
json_add "certmanager_certificates" "WARN" "API query failed"
return 0
}
not_ready=$(echo "$certs" | python3 -c '
import json, sys
data = json.load(sys.stdin)
for item in data.get("items", []):
ns = item["metadata"]["namespace"]
name = item["metadata"]["name"]
conds = item.get("status", {}).get("conditions", [])
ready = next((c for c in conds if c.get("type") == "Ready"), None)
if not ready or ready.get("status") != "True":
reason = ready.get("reason", "NoCondition") if ready else "NoCondition"
print(f"{ns}/{name}:{reason}")
' 2>/dev/null) || true
if [[ -z "$not_ready" ]]; then
pass "All Certificate CRs Ready"
json_add "certmanager_certificates" "PASS" "All Ready"
else
[[ "$QUIET" == true ]] && section_always 31 "cert-manager — Certificate Readiness"
local count
count=$(count_lines "$not_ready")
while IFS= read -r line; do
fail "Certificate not Ready: $line"
detail+="$line; "
done <<< "$not_ready"
status="FAIL"
json_add "certmanager_certificates" "$status" "$count not Ready: $detail"
fi
}
# --- 32. cert-manager: Certificate Expiry (<14d) ---
check_cert_manager_expiry() {
section 32 "cert-manager — Certificate Expiry (<14d)"
local certs expiring detail="" status="PASS"
if ! cert_manager_installed; then
pass "cert-manager not installed — N/A"
json_add "certmanager_expiry" "PASS" "N/A (cert-manager not installed)"
return 0
fi
certs=$($KUBECTL get certificates.cert-manager.io -A -o json 2>/dev/null) || {
warn "cert-manager CRDs installed but API query failed"
json_add "certmanager_expiry" "WARN" "API query failed"
return 0
}
expiring=$(echo "$certs" | python3 -c '
import json, sys
from datetime import datetime, timezone, timedelta
data = json.load(sys.stdin)
cutoff = datetime.now(timezone.utc) + timedelta(days=14)
for item in data.get("items", []):
ns = item["metadata"]["namespace"]
name = item["metadata"]["name"]
not_after = item.get("status", {}).get("notAfter")
if not not_after:
continue
try:
expiry = datetime.fromisoformat(not_after.replace("Z", "+00:00"))
if expiry < cutoff:
days = (expiry - datetime.now(timezone.utc)).days
level = "FAIL" if days <= 3 else "WARN"
print(f"{level}:{ns}/{name}:{days}")
except ValueError:
pass
' 2>/dev/null) || true
if [[ -z "$expiring" ]]; then
pass "No Certificate CRs expiring within 14 days"
json_add "certmanager_expiry" "PASS" "None expiring <14d"
else
[[ "$QUIET" == true ]] && section_always 32 "cert-manager — Certificate Expiry (<14d)"
while IFS= read -r line; do
local level cert_name days
level=$(echo "$line" | cut -d: -f1)
cert_name=$(echo "$line" | cut -d: -f2)
days=$(echo "$line" | cut -d: -f3)
if [[ "$level" == "FAIL" ]]; then
fail "Certificate $cert_name expires in ${days}d"
status="FAIL"
else
warn "Certificate $cert_name expires in ${days}d"
[[ "$status" != "FAIL" ]] && status="WARN"
fi
detail+="$cert_name=${days}d; "
done <<< "$expiring"
json_add "certmanager_expiry" "$status" "$detail"
fi
}
# --- 33. cert-manager: Failed CertificateRequests ---
check_cert_manager_requests() {
section 33 "cert-manager — Failed CertificateRequests"
local requests failed detail="" status="PASS"
if ! cert_manager_installed; then
pass "cert-manager not installed — N/A"
json_add "certmanager_requests" "PASS" "N/A (cert-manager not installed)"
return 0
fi
requests=$($KUBECTL get certificaterequests.cert-manager.io -A -o json 2>/dev/null) || {
warn "cert-manager CRDs installed but API query failed"
json_add "certmanager_requests" "WARN" "API query failed"
return 0
}
failed=$(echo "$requests" | python3 -c '
import json, sys
data = json.load(sys.stdin)
for item in data.get("items", []):
ns = item["metadata"]["namespace"]
name = item["metadata"]["name"]
conds = item.get("status", {}).get("conditions", [])
for c in conds:
if c.get("type") == "Ready" and c.get("status") == "False" and c.get("reason") == "Failed":
print(f"{ns}/{name}:{c.get(\"message\", \"\")[:80]}")
break
' 2>/dev/null) || true
if [[ -z "$failed" ]]; then
pass "No failed CertificateRequests"
json_add "certmanager_requests" "PASS" "None failed"
else
[[ "$QUIET" == true ]] && section_always 33 "cert-manager — Failed CertificateRequests"
local count
count=$(count_lines "$failed")
while IFS= read -r line; do
fail "CertificateRequest failed: $line"
detail+="$line; "
done <<< "$failed"
status="FAIL"
json_add "certmanager_requests" "$status" "$count failed: $detail"
fi
}
# --- 34. Backup Freshness: Per-DB Dumps ---
check_backup_per_db() {
section 34 "Backup Freshness — Per-DB Dumps"
local detail="" had_issue=false status="PASS"
# Freshness threshold: 25 hours
local now_epoch max_age_sec
now_epoch=$(date -u +%s)
max_age_sec=$((25 * 3600))
_check_cronjob_fresh() {
local ns="$1" cj="$2" label="$3"
local ts age_sec
ts=$($KUBECTL get cronjob -n "$ns" "$cj" -o jsonpath='{.status.lastSuccessfulTime}' 2>/dev/null || true)
if [[ -z "$ts" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 34 "Backup Freshness — Per-DB Dumps"
fail "$label: CronJob $ns/$cj has no lastSuccessfulTime"
detail+="${label}=no-success; "
had_issue=true
status="FAIL"
return 0
fi
local ts_epoch
ts_epoch=$(date -u -d "$ts" +%s 2>/dev/null || echo 0)
age_sec=$((now_epoch - ts_epoch))
if [[ "$age_sec" -gt "$max_age_sec" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 34 "Backup Freshness — Per-DB Dumps"
local age_h=$((age_sec / 3600))
fail "$label: last success ${age_h}h ago (>25h)"
detail+="${label}=${age_h}h; "
had_issue=true
status="FAIL"
else
local age_h=$((age_sec / 3600))
detail+="${label}=${age_h}h; "
fi
}
_check_cronjob_fresh dbaas mysql-backup-per-db mysql
_check_cronjob_fresh dbaas postgresql-backup-per-db pg
[[ "$had_issue" == false ]] && pass "Per-DB dumps fresh — $detail"
json_add "backup_per_db" "$status" "$detail"
}
# --- 35. Backup Freshness: Offsite Sync ---
check_backup_offsite_sync() {
section 35 "Backup Freshness — Offsite Sync"
local metrics detail="" status="PASS"
metrics=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
wget -qO- "http://prometheus-prometheus-pushgateway:9091/metrics" 2>/dev/null || true)
if [[ -z "$metrics" ]]; then
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
warn "Cannot query Pushgateway"
json_add "backup_offsite_sync" "WARN" "Pushgateway unreachable"
return 0
fi
local age_hours
age_hours=$(echo "$metrics" | python3 -c '
import sys, re, time
ts = None
for line in sys.stdin:
if line.startswith("#"):
continue
if "backup_last_success_timestamp" in line and "offsite-backup-sync" in line:
m = re.search(r"\s([0-9.eE+]+)\s*$", line.strip())
if m:
try:
ts = float(m.group(1))
break
except ValueError:
pass
if ts is None:
print("missing")
else:
age = (time.time() - ts) / 3600
print(f"{age:.1f}")
' 2>/dev/null) || age_hours="error"
if [[ "$age_hours" == "missing" ]]; then
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
fail "backup_last_success_timestamp metric missing for offsite-backup-sync"
json_add "backup_offsite_sync" "FAIL" "Metric missing"
elif [[ "$age_hours" == "error" ]]; then
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
warn "Failed to parse Pushgateway metric"
json_add "backup_offsite_sync" "WARN" "Parse error"
else
local age_int
age_int=$(printf '%.0f' "$age_hours")
if [[ "$age_int" -gt 27 ]]; then
[[ "$QUIET" == true ]] && section_always 35 "Backup Freshness — Offsite Sync"
fail "Offsite sync last success ${age_hours}h ago (>27h)"
status="FAIL"
else
pass "Offsite sync last success ${age_hours}h ago"
fi
detail="age=${age_hours}h"
json_add "backup_offsite_sync" "$status" "$detail"
fi
}
# --- 36. Backup Freshness: LVM PVC Snapshots ---
check_backup_lvm_snapshots() {
section 36 "Backup Freshness — LVM PVC Snapshots"
local snap_output detail="" status="PASS"
snap_output=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
root@192.168.1.127 "lvs -o lv_name,lv_time --noheadings 2>/dev/null | grep _snap" 2>/dev/null || true)
if [[ -z "$snap_output" ]]; then
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
warn "No LVM PVC snapshots found or SSH to 192.168.1.127 failed (BatchMode)"
json_add "backup_lvm_snapshots" "WARN" "SSH failed or no snapshots"
return 0
fi
local newest_age_hours
newest_age_hours=$(echo "$snap_output" | python3 -c '
import sys, re, time
from datetime import datetime
newest = None
for line in sys.stdin:
line = line.strip()
if not line:
continue
parts = line.split(None, 1)
if len(parts) < 2:
continue
date_str = parts[1].strip()
# lv_time format: "2026-04-19 03:00:01 +0000" or similar
for fmt in ("%Y-%m-%d %H:%M:%S %z", "%Y-%m-%d %H:%M:%S"):
try:
dt = datetime.strptime(date_str, fmt)
ts = dt.timestamp()
if newest is None or ts > newest:
newest = ts
break
except ValueError:
continue
if newest is None:
print("parse_error")
else:
age = (time.time() - newest) / 3600
print(f"{age:.1f}")
' 2>/dev/null) || newest_age_hours="error"
if [[ "$newest_age_hours" == "parse_error" || "$newest_age_hours" == "error" ]]; then
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
warn "Could not parse LVM snapshot timestamps"
json_add "backup_lvm_snapshots" "WARN" "Parse error"
else
local count age_int
count=$(count_lines "$snap_output")
age_int=$(printf '%.0f' "$newest_age_hours")
if [[ "$age_int" -gt 25 ]]; then
[[ "$QUIET" == true ]] && section_always 36 "Backup Freshness — LVM PVC Snapshots"
fail "Newest LVM snapshot ${newest_age_hours}h old (>25h); $count total"
status="FAIL"
else
pass "LVM snapshots fresh — $count total, newest ${newest_age_hours}h old"
fi
detail="count=$count newest=${newest_age_hours}h"
json_add "backup_lvm_snapshots" "$status" "$detail"
fi
}
# --- 37. Monitoring: Prometheus + Alertmanager ---
check_monitoring_prom_am() {
section 37 "Monitoring — Prometheus + Alertmanager"
local detail="" had_issue=false status="PASS"
# Prometheus /-/ready
local prom_ready
prom_ready=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
wget -qO- "http://localhost:9090/-/ready" 2>/dev/null || true)
if echo "$prom_ready" | grep -qi "ready"; then
detail+="prometheus=ready; "
else
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 37 "Monitoring — Prometheus + Alertmanager"
fail "Prometheus /-/ready returned no Ready response"
detail+="prometheus=not-ready; "
had_issue=true
status="FAIL"
fi
# Alertmanager running pod count
local am_running
am_running=$($KUBECTL get pods -n monitoring --no-headers 2>/dev/null | \
grep alertmanager | awk '$3 == "Running"' | wc -l | tr -d ' ')
if [[ "$am_running" -gt 0 ]]; then
detail+="alertmanager=${am_running} running; "
else
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 37 "Monitoring — Prometheus + Alertmanager"
fail "Alertmanager: 0 Running pods"
detail+="alertmanager=none-running; "
had_issue=true
status="FAIL"
fi
[[ "$had_issue" == false ]] && pass "Prometheus Ready, $am_running Alertmanager pod(s) Running"
json_add "monitoring_prom_am" "$status" "$detail"
}
# --- 38. Monitoring: Vault Sealed Status ---
check_monitoring_vault() {
section 38 "Monitoring — Vault Sealed Status"
local output detail="" status="PASS"
output=$($KUBECTL exec -n vault vault-0 -- \
sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status' 2>&1 || true)
if [[ -z "$output" ]]; then
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
fail "Cannot exec vault status on vault-0"
json_add "monitoring_vault" "FAIL" "Exec failed"
return 0
fi
if echo "$output" | grep -qi "^Sealed[[:space:]]*false"; then
pass "Vault unsealed"
detail="sealed=false"
json_add "monitoring_vault" "PASS" "$detail"
elif echo "$output" | grep -qi "^Sealed[[:space:]]*true"; then
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
fail "Vault is SEALED — secrets unavailable"
detail="sealed=true"
status="FAIL"
json_add "monitoring_vault" "$status" "$detail"
else
[[ "$QUIET" == true ]] && section_always 38 "Monitoring — Vault Sealed Status"
warn "Cannot parse vault status output"
json_add "monitoring_vault" "WARN" "Parse error"
fi
}
# --- 39. Monitoring: ClusterSecretStore Ready ---
check_monitoring_css() {
section 39 "Monitoring — ClusterSecretStore Ready"
local css not_ready detail="" status="PASS"
css=$($KUBECTL get clustersecretstore -o json 2>/dev/null) || {
[[ "$QUIET" == true ]] && section_always 39 "Monitoring — ClusterSecretStore Ready"
warn "ClusterSecretStore CRD not installed"
json_add "monitoring_css" "WARN" "CRD missing"
return 0
}
not_ready=$(echo "$css" | python3 -c '
import json, sys
data = json.load(sys.stdin)
for item in data.get("items", []):
name = item["metadata"]["name"]
conds = item.get("status", {}).get("conditions", [])
ready = next((c for c in conds if c.get("type") == "Ready"), None)
if not ready or ready.get("status") != "True":
print(f"{name}:{ready.get(\"reason\", \"NoCondition\") if ready else \"NoCondition\"}")
' 2>/dev/null) || true
if [[ -z "$not_ready" ]]; then
local total
total=$(echo "$css" | python3 -c 'import json,sys; print(len(json.load(sys.stdin).get("items",[])))' 2>/dev/null || echo "?")
pass "All $total ClusterSecretStores Ready"
json_add "monitoring_css" "PASS" "$total Ready"
else
[[ "$QUIET" == true ]] && section_always 39 "Monitoring — ClusterSecretStore Ready"
while IFS= read -r line; do
fail "ClusterSecretStore not Ready: $line"
detail+="$line; "
done <<< "$not_ready"
status="FAIL"
json_add "monitoring_css" "$status" "$detail"
fi
}
# --- 40. External Reachability: Cloudflared + Authentik Replicas ---
check_external_replicas() {
section 40 "External — Cloudflared + Authentik Replicas"
local detail="" had_issue=false status="PASS"
# Cloudflared
local cf_json cf_ready cf_desired
cf_json=$($KUBECTL get deployment cloudflared -n cloudflared -o json 2>/dev/null || true)
if [[ -z "$cf_json" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
fail "Cloudflared deployment not found"
detail+="cloudflared=missing; "
had_issue=true
status="FAIL"
else
cf_ready=$(echo "$cf_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",{}).get("readyReplicas",0) or 0)' 2>/dev/null || echo "0")
cf_desired=$(echo "$cf_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("spec",{}).get("replicas",0) or 0)' 2>/dev/null || echo "0")
if [[ "$cf_ready" != "$cf_desired" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
fail "Cloudflared: $cf_ready/$cf_desired ready (external access degraded)"
detail+="cloudflared=${cf_ready}/${cf_desired}; "
had_issue=true
status="FAIL"
else
detail+="cloudflared=${cf_ready}/${cf_desired}; "
fi
fi
# Authentik server (Helm chart names the deployment goauthentik-server)
local auth_json auth_ready auth_desired
auth_json=$($KUBECTL get deployment goauthentik-server -n authentik -o json 2>/dev/null || true)
if [[ -z "$auth_json" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
warn "goauthentik-server deployment not found in authentik namespace"
detail+="authentik=missing; "
had_issue=true
[[ "$status" != "FAIL" ]] && status="WARN"
else
auth_ready=$(echo "$auth_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",{}).get("readyReplicas",0) or 0)' 2>/dev/null || echo "0")
auth_desired=$(echo "$auth_json" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("spec",{}).get("replicas",0) or 0)' 2>/dev/null || echo "0")
if [[ "$auth_ready" != "$auth_desired" ]]; then
[[ "$had_issue" == false && "$QUIET" == true ]] && section_always 40 "External — Cloudflared + Authentik Replicas"
fail "goauthentik-server: $auth_ready/$auth_desired ready (auth degraded)"
detail+="authentik=${auth_ready}/${auth_desired}; "
had_issue=true
status="FAIL"
else
detail+="authentik=${auth_ready}/${auth_desired}; "
fi
fi
[[ "$had_issue" == false ]] && pass "Cloudflared + authentik-server at full replicas ($detail)"
json_add "external_replicas" "$status" "$detail"
}
# --- 41. External Reachability: ExternalAccessDivergence Alert ---
check_external_divergence() {
section 41 "External — ExternalAccessDivergence Alert"
local alerts result detail="" status="PASS"
alerts=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
wget -qO- "http://localhost:9090/api/v1/alerts" 2>/dev/null || true)
if [[ -z "$alerts" ]]; then
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
warn "Cannot query Prometheus alerts"
json_add "external_divergence" "WARN" "Cannot query"
return 0
fi
result=$(echo "$alerts" | python3 -c '
import json, sys
try:
data = json.load(sys.stdin)
alerts = data.get("data", {}).get("alerts", []) if isinstance(data, dict) else data
firing = [a for a in alerts
if a.get("labels", {}).get("alertname") == "ExternalAccessDivergence"
and a.get("state") == "firing"]
if firing:
hosts = [a.get("labels", {}).get("host") or a.get("labels", {}).get("service") or "?" for a in firing]
print(f"{len(firing)}:" + ",".join(hosts))
else:
print("0:")
except Exception as e:
print(f"error:{e}")
' 2>/dev/null) || result="error:parse"
if [[ "$result" == error:* ]]; then
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
warn "Failed to parse alerts JSON: ${result#error:}"
json_add "external_divergence" "WARN" "Parse error"
return 0
fi
local count names
count=$(echo "$result" | cut -d: -f1)
names=$(echo "$result" | cut -d: -f2-)
if [[ "$count" -eq 0 ]]; then
pass "ExternalAccessDivergence not firing"
json_add "external_divergence" "PASS" "Not firing"
else
[[ "$QUIET" == true ]] && section_always 41 "External — ExternalAccessDivergence Alert"
fail "ExternalAccessDivergence firing for $count target(s): $names"
status="FAIL"
detail="$count firing: $names"
json_add "external_divergence" "$status" "$detail"
fi
}
# --- 42. External Reachability: Traefik 5xx Rate ---
check_external_traefik_5xx() {
section 42 "External — Traefik 5xx Rate (15m)"
local query_result detail="" status="PASS"
query_result=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' 2>/dev/null || true)
if [[ -z "$query_result" ]]; then
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
warn "Cannot query Prometheus for traefik 5xx rate"
json_add "external_traefik_5xx" "WARN" "Query failed"
return 0
fi
local parsed
parsed=$(echo "$query_result" | python3 -c '
import json, sys
try:
data = json.load(sys.stdin)
results = data.get("data", {}).get("result", [])
hot = [(r.get("metric", {}).get("service", "?"), float(r.get("value", [0, "0"])[1])) for r in results]
hot = [(s, v) for s, v in hot if v > 0.01] # 1% req/s threshold
hot.sort(key=lambda x: -x[1])
if not hot:
print("0:")
else:
top = [f"{s}={v:.2f}/s" for s, v in hot[:5]]
print(f"{len(hot)}:" + "; ".join(top))
except Exception as e:
print(f"error:{e}")
' 2>/dev/null) || parsed="error:parse"
if [[ "$parsed" == error:* ]]; then
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
warn "Parse failed: ${parsed#error:}"
json_add "external_traefik_5xx" "WARN" "Parse error"
return 0
fi
local count top
count=$(echo "$parsed" | cut -d: -f1)
top=$(echo "$parsed" | cut -d: -f2-)
if [[ "$count" -eq 0 ]]; then
pass "No Traefik services with 5xx rate >0.01 req/s (last 15m)"
json_add "external_traefik_5xx" "PASS" "None above threshold"
else
[[ "$QUIET" == true ]] && section_always 42 "External — Traefik 5xx Rate (15m)"
# WARN at any 5xx; FAIL if top service >1 req/s
local top_rate
top_rate=$(echo "$top" | grep -oE '[0-9.]+/s' | head -1 | tr -d '/s')
if awk "BEGIN{exit !($top_rate > 1.0)}" 2>/dev/null; then
fail "$count Traefik service(s) with elevated 5xx: $top"
status="FAIL"
else
warn "$count Traefik service(s) emitting 5xx: $top"
status="WARN"
fi
detail="$count services: $top"
json_add "external_traefik_5xx" "$status" "$detail"
fi
}
# --- Summary ---
print_summary() {
if [[ "$JSON" == true ]]; then
@ -2452,18 +1832,6 @@ main() {
check_ha_automations
check_ha_system
check_hardware_exporters
check_cert_manager_certificates
check_cert_manager_expiry
check_cert_manager_requests
check_backup_per_db
check_backup_offsite_sync
check_backup_lvm_snapshots
check_monitoring_prom_am
check_monitoring_vault
check_monitoring_css
check_external_replicas
check_external_divergence
check_external_traefik_5xx
print_summary
# Exit code: 2 for failures, 1 for warnings, 0 for clean
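Because the exit code encodes severity, the script slots into CI or cron gates without output parsing. A small usage sketch (path assumed relative to the repo root; the messages are illustrative):
./scripts/cluster_healthcheck.sh --quiet
rc=$?
case "$rc" in
  0) echo "cluster clean" ;;
  1) echo "warnings only, not blocking" ;;
  2) echo "failures detected, blocking" ;;
esac
exit "$rc"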

View file

@ -1,76 +0,0 @@
#!/usr/bin/env bash
# One-shot migration of every private image on registry.viktorbarzin.me to
# Forgejo. Used as a stop-gap when the dual-push CI pipelines aren't
# producing Forgejo images on their own (Forgejo-Woodpecker forge driver
# context-deadline-exceeded issue, see bd code-d3y / 2026-05-07).
#
# Pulls each image from registry.viktorbarzin.me, retags, pushes to
# forgejo.viktorbarzin.me/viktor/<name>:<tag> — preserving the blob bytes
# verbatim so the cluster can flip image= without a rebuild.
#
# Run from any host with docker + network reach to BOTH registries. Auth
# from `docker login` (~/.docker/config.json) — make sure both registries
# are logged in:
# docker login registry.viktorbarzin.me -u viktorbarzin
# docker login forgejo.viktorbarzin.me -u viktor # use viktor PAT, not ci-pusher
#
# (ci-pusher CANNOT push to viktor/<image> — Forgejo container packages
# are scoped to the pushing user. Only viktor's PAT can write to viktor/*.)
#
# After the script, the new image lives at
# forgejo.viktorbarzin.me/viktor/<name>:<tag>
# Phase 3 of the consolidation flips infra/stacks/<svc>/main.tf image=
# to that path.
set -euo pipefail
OLD_REG=registry.viktorbarzin.me
NEW_REG=forgejo.viktorbarzin.me/viktor
# Image list: <name>:<tag>. Generated 2026-05-07 from `grep -rEn 'image\s*=\s*
# "registry\.viktorbarzin\.me'` across infra/stacks/.
#
# Excluded:
# - wealthfolio-sync: registry repo exists but has 0 tags (CronJob has been
# broken for 36+ days, separate decision needed). User to triage before
# migration.
# - fire-planner: registry repo exists but has 0 tags. Dockerfile + CI added
# in this session (commit 8b53d99e); rebuild via Woodpecker before flipping.
IMAGES=(
"chrome-service-novnc:v4"
"chrome-service-novnc:latest"
"payslip-ingest:latest"
"job-hunter:latest"
"claude-agent-service:latest"
"freedify:latest"
"beadboard:latest"
"infra-ci:latest"
)
for img in "${IMAGES[@]}"; do
echo "=== $img ==="
src="$OLD_REG/$img"
dst="$NEW_REG/$img"
if ! docker pull "$src" 2>&1 | tee /tmp/pull-$$ | grep -q 'Status: '; then
if grep -q 'not found' /tmp/pull-$$; then
echo " SKIP — image not present in source registry"
rm -f /tmp/pull-$$
continue
fi
fi
rm -f /tmp/pull-$$
echo " tag → $dst"
docker tag "$src" "$dst"
echo " push $dst"
docker push "$dst" 2>&1 | tail -2
echo " cleanup local copy"
docker rmi "$src" "$dst" 2>&1 | tail -1 || true
done
echo ""
echo "Done. Verify in Forgejo Web UI: https://forgejo.viktorbarzin.me/viktor/-/packages?type=container"
echo "Phase 3 of the plan flips infra/stacks/{wealthfolio,fire-planner}/main.tf image= references."

View file

@ -1,469 +0,0 @@
#!/usr/bin/env bash
# lvm-pvc-snapshot — LVM thin snapshot management for Proxmox CSI PVCs
# Deploy to PVE host at /usr/local/bin/lvm-pvc-snapshot
set -euo pipefail
# --- Configuration ---
VG="pve"
THINPOOL="data"
SNAP_SUFFIX_FORMAT="%Y%m%d_%H%M"
RETENTION_DAYS=7
MIN_FREE_PCT=10
PUSHGATEWAY="${LVM_SNAP_PUSHGATEWAY:-http://10.0.20.100:30091}"
PUSHGATEWAY_JOB="lvm-pvc-snapshot"
LOCKFILE="/run/lvm-pvc-snapshot.lock"
KUBECONFIG="${KUBECONFIG:-/root/.kube/config}"
export KUBECONFIG
# Namespaces to exclude from snapshots (high-churn, have app-level dumps)
# These PVCs cause significant CoW write amplification (~36% overhead)
EXCLUDE_NAMESPACES="${LVM_SNAP_EXCLUDE_NS:-dbaas,monitoring}"
# --- Logging ---
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
warn() { log "WARN: $*" >&2; }
die() { log "FATAL: $*" >&2; exit 1; }
# --- Helpers ---
get_thinpool_free_pct() {
local data_pct
data_pct=$(lvs --noheadings --nosuffix -o data_percent "${VG}/${THINPOOL}" 2>/dev/null | tr -d ' ')
echo "scale=2; 100 - ${data_pct}" | bc
}
build_exclude_lv_list() {
# Query K8s for PVs in excluded namespaces, extract their LV names
if [[ -z "${EXCLUDE_NAMESPACES}" ]] || ! command -v kubectl &>/dev/null; then
return
fi
kubectl get pv -o json 2>/dev/null | jq -r --arg ns "${EXCLUDE_NAMESPACES}" '
($ns | split(",")) as $excl |
.items[] |
select(.spec.csi.driver == "csi.proxmox.sinextra.dev") |
select(.spec.claimRef.namespace as $n | $excl | index($n)) |
.spec.csi.volumeHandle | split("/") | last
' 2>/dev/null || true
}
discover_pvc_lvs() {
# List thin LVs matching PVC pattern, excluding snapshots, pre-restore backups,
# and LVs belonging to excluded namespaces (high-churn databases/metrics)
local all_lvs exclude_lvs
all_lvs=$(lvs --noheadings -o lv_name,pool_lv "${VG}" 2>/dev/null \
| awk -v pool="${THINPOOL}" '$2 == pool { print $1 }' \
| grep -E '^vm-[0-9]+-pvc-' \
| grep -v '_snap_' \
| grep -v '_pre_restore_')
exclude_lvs=$(build_exclude_lv_list)
if [[ -n "${exclude_lvs}" ]]; then
# Filter out excluded LVs
local exclude_pattern
exclude_pattern=$(echo "${exclude_lvs}" | paste -sd'|' -)
echo "${all_lvs}" | grep -vE "(${exclude_pattern})" || true
else
echo "${all_lvs}"
fi
}
list_snapshots() {
lvs --noheadings -o lv_name,pool_lv "${VG}" 2>/dev/null \
| awk -v pool="${THINPOOL}" '$2 == pool { print $1 }' \
| grep '_snap_' || true
}
parse_snap_timestamp() {
# Extract YYYYMMDD_HHMM from snapshot name, convert to epoch
local snap_name="$1"
local ts_str
ts_str=$(echo "${snap_name}" | grep -oE '[0-9]{8}_[0-9]{4}$')
if [[ -z "${ts_str}" ]]; then
echo "0"
return
fi
local ymd="${ts_str:0:8}"
local hm="${ts_str:9:4}"
date -d "${ymd:0:4}-${ymd:4:2}-${ymd:6:2} ${hm:0:2}:${hm:2:2}" +%s 2>/dev/null || echo "0"
}
get_original_lv_from_snap() {
# vm-200-pvc-abc_snap_20260403_1200 -> vm-200-pvc-abc
echo "$1" | sed 's/_snap_[0-9]\{8\}_[0-9]\{4\}$//'
}
push_metrics() {
local status="$1" created="$2" failed="$3" pruned="$4"
local free_pct
free_pct=$(get_thinpool_free_pct)
cat <<METRICS | curl -sf --connect-timeout 5 --max-time 10 --data-binary @- \
"${PUSHGATEWAY}/metrics/job/${PUSHGATEWAY_JOB}" 2>/dev/null || warn "Failed to push metrics to Pushgateway"
# HELP lvm_snapshot_last_run_timestamp Unix timestamp of last snapshot run
# TYPE lvm_snapshot_last_run_timestamp gauge
lvm_snapshot_last_run_timestamp $(date +%s)
# HELP lvm_snapshot_last_status Exit status (0=success, 1=partial failure, 2=aborted)
# TYPE lvm_snapshot_last_status gauge
lvm_snapshot_last_status ${status}
# HELP lvm_snapshot_created_total Number of snapshots created in last run
# TYPE lvm_snapshot_created_total gauge
lvm_snapshot_created_total ${created}
# HELP lvm_snapshot_failed_total Number of snapshot failures in last run
# TYPE lvm_snapshot_failed_total gauge
lvm_snapshot_failed_total ${failed}
# HELP lvm_snapshot_pruned_total Number of snapshots pruned in last run
# TYPE lvm_snapshot_pruned_total gauge
lvm_snapshot_pruned_total ${pruned}
# HELP lvm_snapshot_thinpool_free_pct Thin pool free percentage
# TYPE lvm_snapshot_thinpool_free_pct gauge
lvm_snapshot_thinpool_free_pct ${free_pct}
METRICS
}
# --- Subcommands ---
cmd_snapshot() {
log "Starting PVC LVM thin snapshot run"
# Check thin pool free space
local free_pct
free_pct=$(get_thinpool_free_pct)
log "Thin pool free space: ${free_pct}%"
if (( $(echo "${free_pct} < ${MIN_FREE_PCT}" | bc -l) )); then
warn "Thin pool has only ${free_pct}% free (minimum: ${MIN_FREE_PCT}%). Aborting."
push_metrics 2 0 0 0
exit 1
fi
# Discover PVC LVs
local lvs_list
lvs_list=$(discover_pvc_lvs)
if [[ -z "${lvs_list}" ]]; then
warn "No PVC LVs found matching pattern"
push_metrics 2 0 0 0
exit 1
fi
local count=0 failed=0 total
total=$(echo "${lvs_list}" | wc -l | tr -d ' ')
local snap_ts
snap_ts=$(date +"${SNAP_SUFFIX_FORMAT}")
log "Found ${total} PVC LVs to snapshot"
while IFS= read -r lv; do
local snap_name="${lv}_snap_${snap_ts}"
if lvcreate -s -kn -n "${snap_name}" "${VG}/${lv}" >/dev/null 2>&1; then
log " Created: ${snap_name}"
count=$((count + 1))
else
warn " Failed to create snapshot for ${lv}"
failed=$((failed + 1))
fi
done <<< "${lvs_list}"
log "Snapshot run complete: ${count} created, ${failed} failed out of ${total}"
# Auto-prune
log "Running auto-prune..."
local pruned
pruned=$(cmd_prune_count)
# Determine status
local status=0
if (( failed > 0 && count > 0 )); then
status=1 # partial
elif (( failed > 0 && count == 0 )); then
status=2 # all failed
fi
push_metrics "${status}" "${count}" "${failed}" "${pruned}"
log "Done"
}
cmd_list() {
printf "%-45s %-50s %8s %8s\n" "ORIGINAL LV" "SNAPSHOT" "AGE" "DATA%"
printf "%-45s %-50s %8s %8s\n" "-----------" "--------" "---" "-----"
local now
now=$(date +%s)
local snap_lines
snap_lines=$(lvs --noheadings --nosuffix -o lv_name,lv_size,data_percent "${VG}" 2>/dev/null \
| grep -E '_snap_|_pre_restore_' || true)
if [[ -z "${snap_lines}" ]]; then
echo "(no snapshots found)"
return
fi
echo "${snap_lines}" | while read -r name size data_pct; do
local original age_str ts epoch
if [[ "${name}" == *"_pre_restore_"* ]]; then
original=$(echo "${name}" | sed 's/_pre_restore_[0-9]\{8\}_[0-9]\{4\}$//')
ts=$(echo "${name}" | grep -oE '[0-9]{8}_[0-9]{4}$')
else
original=$(get_original_lv_from_snap "${name}")
ts=$(echo "${name}" | grep -oE '[0-9]{8}_[0-9]{4}$')
fi
epoch=$(parse_snap_timestamp "${name}")
if (( epoch > 0 )); then
local age_s=$(( now - epoch ))
local days=$(( age_s / 86400 ))
local hours=$(( (age_s % 86400) / 3600 ))
age_str="${days}d${hours}h"
else
age_str="unknown"
fi
printf "%-45s %-50s %8s %7s%%\n" "${original}" "${name}" "${age_str}" "${data_pct}"
done
}
cmd_prune() {
local pruned
pruned=$(cmd_prune_count)
log "Pruned ${pruned} expired snapshots"
}
cmd_prune_count() {
# NOTE: stdout of this function is captured by callers (`pruned=$(cmd_prune_count)`),
# so all log/warn output must go to stderr — the only thing on stdout is the count.
local now cutoff pruned=0
now=$(date +%s)
cutoff=$(( now - RETENTION_DAYS * 86400 ))
local snaps
snaps=$(lvs --noheadings -o lv_name,pool_lv "${VG}" 2>/dev/null \
| awk -v pool="${THINPOOL}" '$2 == pool { print $1 }' \
| grep -E '_snap_|_pre_restore_' || true)
if [[ -z "${snaps}" ]]; then
echo "0"
return
fi
while IFS= read -r snap; do
local epoch
epoch=$(parse_snap_timestamp "${snap}")
if (( epoch > 0 && epoch < cutoff )); then
if lvremove -f "${VG}/${snap}" >/dev/null 2>&1; then
log " Pruned: ${snap}" >&2
pruned=$((pruned + 1))
else
warn " Failed to prune: ${snap}"
fi
fi
done <<< "${snaps}"
echo "${pruned}"
}
cmd_restore() {
local pvc_lv="${1:-}" snapshot_lv="${2:-}"
if [[ -z "${pvc_lv}" || -z "${snapshot_lv}" ]]; then
die "Usage: $0 restore <pvc-lv-name> <snapshot-lv-name>"
fi
# Validate LVs exist
if ! lvs "${VG}/${pvc_lv}" >/dev/null 2>&1; then
die "PVC LV '${pvc_lv}' not found in VG '${VG}'"
fi
if ! lvs "${VG}/${snapshot_lv}" >/dev/null 2>&1; then
die "Snapshot LV '${snapshot_lv}' not found in VG '${VG}'"
fi
# Discover K8s context
log "Discovering Kubernetes context for LV '${pvc_lv}'..."
local volume_handle="local-lvm:${pvc_lv}"
local pv_info
pv_info=$(kubectl get pv -o json 2>/dev/null | jq -r \
--arg vh "${volume_handle}" \
'.items[] | select(.spec.csi.volumeHandle == $vh) | "\(.metadata.name) \(.spec.claimRef.namespace) \(.spec.claimRef.name)"' \
) || die "Failed to query PVs (is kubectl configured?)"
if [[ -z "${pv_info}" ]]; then
die "No PV found with volumeHandle '${volume_handle}'"
fi
local pv_name pvc_ns pvc_name
read -r pv_name pvc_ns pvc_name <<< "${pv_info}"
log "Found: PV=${pv_name}, PVC=${pvc_ns}/${pvc_name}"
# Find the workload (Deployment or StatefulSet) that uses this PVC
local workload_type="" workload_name="" original_replicas=""
# Check StatefulSets first (databases use these)
local sts_info
sts_info=$(kubectl get statefulset -n "${pvc_ns}" -o json 2>/dev/null | jq -r \
--arg pvc "${pvc_name}" \
'.items[] | select(
(.spec.template.spec.volumes // [] | .[].persistentVolumeClaim.claimName == $pvc) or
(.spec.volumeClaimTemplates // [] | .[].metadata.name as $vct |
.spec.replicas as $r | range($r) | "\($vct)-\(.metadata.name)-\(.)" ) == $pvc
) | "\(.metadata.name) \(.spec.replicas)"' 2>/dev/null \
) || true
# If not found via simple volume check, try matching VCT naming pattern
if [[ -z "${sts_info}" ]]; then
sts_info=$(kubectl get statefulset -n "${pvc_ns}" -o json 2>/dev/null | jq -r \
--arg pvc "${pvc_name}" \
'.items[] | .metadata.name as $sts | .spec.replicas as $r |
select(.spec.volumeClaimTemplates != null) |
.spec.volumeClaimTemplates[].metadata.name as $vct |
[range($r)] | map("\($vct)-\($sts)-\(.)") |
if any(. == $pvc) then "\($sts) \($r)" else empty end' 2>/dev/null \
) || true
fi
if [[ -n "${sts_info}" ]]; then
read -r workload_name original_replicas <<< "${sts_info}"
workload_type="statefulset"
else
# Check Deployments
local deploy_info
deploy_info=$(kubectl get deployment -n "${pvc_ns}" -o json 2>/dev/null | jq -r \
--arg pvc "${pvc_name}" \
'.items[] | select(
.spec.template.spec.volumes // [] | .[].persistentVolumeClaim.claimName == $pvc
) | "\(.metadata.name) \(.spec.replicas)"' 2>/dev/null \
) || true
if [[ -n "${deploy_info}" ]]; then
read -r workload_name original_replicas <<< "${deploy_info}"
workload_type="deployment"
fi
fi
if [[ -z "${workload_type}" ]]; then
warn "Could not auto-discover workload for PVC '${pvc_name}' in namespace '${pvc_ns}'."
warn "You may need to scale down the pod manually."
echo ""
read -rp "Continue with LV swap anyway? (yes/no): " confirm
[[ "${confirm}" == "yes" ]] || die "Aborted by user"
workload_type="manual"
fi
# Dry-run output
local backup_name="${pvc_lv}_pre_restore_$(date +"${SNAP_SUFFIX_FORMAT}")"
echo ""
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║ RESTORE DRY-RUN ║"
echo "╠══════════════════════════════════════════════════════════════╣"
echo "║ PVC: ${pvc_ns}/${pvc_name}"
echo "║ PV: ${pv_name}"
if [[ "${workload_type}" != "manual" ]]; then
echo "║ Workload: ${workload_type}/${workload_name} (replicas: ${original_replicas}→0→${original_replicas})"
fi
echo "║"
echo "║ Actions:"
if [[ "${workload_type}" != "manual" ]]; then
echo "║ 1. Scale ${workload_type}/${workload_name} to 0 replicas"
echo "║ 2. Wait for pod termination"
fi
echo "║ 3. Rename ${pvc_lv}${backup_name}"
echo "║ 4. Rename ${snapshot_lv}${pvc_lv}"
if [[ "${workload_type}" != "manual" ]]; then
echo "║ 5. Scale ${workload_type}/${workload_name} back to ${original_replicas} replicas"
fi
echo "╚══════════════════════════════════════════════════════════════╝"
echo ""
# Interactive confirmation
read -rp "Type 'yes' to proceed with restore: " confirm
if [[ "${confirm}" != "yes" ]]; then
die "Aborted by user"
fi
# Scale down
if [[ "${workload_type}" != "manual" ]]; then
log "Scaling ${workload_type}/${workload_name} to 0 replicas..."
kubectl scale "${workload_type}/${workload_name}" -n "${pvc_ns}" --replicas=0
log "Waiting for pod termination (timeout: 120s)..."
kubectl wait --for=delete pod -l "app.kubernetes.io/name=${workload_name}" -n "${pvc_ns}" --timeout=120s 2>/dev/null || \
kubectl wait --for=delete pod -l "app=${workload_name}" -n "${pvc_ns}" --timeout=120s 2>/dev/null || \
warn "Timeout waiting for pods — continuing anyway (LV may still be in use)"
sleep 5 # extra grace period for device detach
fi
# Verify LV is not active
local lv_active
lv_active=$(lvs --noheadings -o lv_active "${VG}/${pvc_lv}" 2>/dev/null | tr -d ' ')
if [[ "${lv_active}" == "active" ]]; then
warn "LV ${pvc_lv} is still active. Attempting to deactivate..."
# Close any LUKS mapper on the LV before deactivation
if dmsetup ls 2>/dev/null | grep -q "${pvc_lv}"; then
log "Closing LUKS mapper for ${pvc_lv}..."
cryptsetup luksClose "${pvc_lv}" 2>/dev/null || true
fi
lvchange -an "${VG}/${pvc_lv}" 2>/dev/null || warn "Could not deactivate — proceeding with caution"
fi
# LV swap
log "Renaming ${pvc_lv}${backup_name}"
lvrename "${VG}" "${pvc_lv}" "${backup_name}" || die "Failed to rename original LV"
log "Renaming ${snapshot_lv}${pvc_lv}"
lvrename "${VG}" "${snapshot_lv}" "${pvc_lv}" || die "Failed to rename snapshot LV"
# Scale back up
if [[ "${workload_type}" != "manual" ]]; then
log "Scaling ${workload_type}/${workload_name} back to ${original_replicas} replicas..."
kubectl scale "${workload_type}/${workload_name}" -n "${pvc_ns}" --replicas="${original_replicas}"
log "Waiting for pod to become Ready (timeout: 300s)..."
kubectl wait --for=condition=Ready pod -l "app.kubernetes.io/name=${workload_name}" -n "${pvc_ns}" --timeout=300s 2>/dev/null || \
kubectl wait --for=condition=Ready pod -l "app=${workload_name}" -n "${pvc_ns}" --timeout=300s 2>/dev/null || \
warn "Timeout waiting for pod Ready — check manually"
fi
echo ""
log "Restore complete!"
log "Old data preserved as: ${backup_name}"
log "To delete old data after verification: lvremove -f ${VG}/${backup_name}"
}
# --- Main ---
usage() {
cat <<EOF
Usage: $(basename "$0") <command> [args]
Commands:
snapshot Create thin snapshots of all PVC LVs
list List existing snapshots with age and data%
prune Remove snapshots older than ${RETENTION_DAYS} days
restore <lv> <snap> Restore a PVC from a snapshot (interactive)
Environment:
LVM_SNAP_PUSHGATEWAY Pushgateway URL (default: ${PUSHGATEWAY})
KUBECONFIG Kubeconfig path (default: /root/.kube/config)
EOF
}
main() {
local cmd="${1:-}"
shift || true
# Acquire lock (except for list which is read-only)
if [[ "${cmd}" != "list" && "${cmd}" != "" && "${cmd}" != "help" && "${cmd}" != "--help" && "${cmd}" != "-h" ]]; then
exec 200>"${LOCKFILE}"
if ! flock -n 200; then
die "Another instance is already running (lockfile: ${LOCKFILE})"
fi
fi
case "${cmd}" in
snapshot) cmd_snapshot ;;
list) cmd_list ;;
prune) cmd_prune ;;
restore) cmd_restore "$@" ;;
help|--help|-h|"") usage ;;
*) die "Unknown command: ${cmd}. Run '$0 help' for usage." ;;
esac
}
main "$@"

View file

@ -1,236 +0,0 @@
<?php
// pfSense HAProxy bootstrap — configures the mailserver PROXY-v2 path
// (bd code-yiu, Phases 2/3 + 5).
//
// WHY THIS EXISTS
// pfSense HAProxy config is stored XML-in-`/cf/conf/config.xml` under
// `<installedpackages><haproxy>`. That file IS picked up by the nightly
// `daily-backup` on the PVE host (see `scripts/daily-backup.sh` → `scp
// root@10.0.20.1:/cf/conf/config.xml`) and synced to Synology. This script
// is the canonical reproducer: run it to rebuild the pfSense HAProxy config
// from scratch (DR restore, fresh pfSense install, etc.).
//
// WHAT IT BUILDS
// 4 backend pools — one per mail port:
// mailserver_nodes_smtp → k8s-node1..4:30125 (container :2525 postscreen)
// mailserver_nodes_smtps → k8s-node1..4:30126 (container :4465 smtps)
// mailserver_nodes_sub → k8s-node1..4:30127 (container :5587 submission)
// mailserver_nodes_imaps → k8s-node1..4:30128 (container :10993 IMAPS)
// Each server uses `send-proxy-v2` and TCP health-check every 120s.
// 4 frontends on pfSense 10.0.20.1:{25,465,587,993} TCP mode.
// + 1 legacy test frontend on :2525 (kept for validation; safe to remove later).
//
// USAGE (on pfSense host, via SSH as admin)
// scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/
// ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'
//
// IDEMPOTENCY
// Removes any existing entries named mailserver_* before re-adding, so
// repeat runs are safe and behave as reset-to-declared.
require_once('/etc/inc/config.inc');
require_once('/usr/local/pkg/haproxy/haproxy.inc');
require_once('/usr/local/pkg/haproxy/haproxy_utils.inc');
global $config;
parse_config(true);
if (!is_array($config['installedpackages']['haproxy'])) {
$config['installedpackages']['haproxy'] = [];
}
$h = &$config['installedpackages']['haproxy'];
$h['enable'] = 'yes';
$h['maxconn'] = '1000';
// Our declared object names (anything starting with mailserver_ is ours)
$POOL_NAMES = [
'mailserver_nodes', // legacy (Phase 2/3 test)
'mailserver_nodes_smtp',
'mailserver_nodes_smtps',
'mailserver_nodes_sub',
'mailserver_nodes_imaps',
];
$FRONTEND_NAMES = [
'mailserver_proxy_test', // legacy (Phase 2/3 test, :2525)
'mailserver_proxy_25',
'mailserver_proxy_465',
'mailserver_proxy_587',
'mailserver_proxy_993',
];
// k8s workers. Not in the cluster: master (control-plane) and node5
// (doesn't exist in this topology).
$NODES = [
['k8s-node1', '10.0.20.101'],
['k8s-node2', '10.0.20.102'],
['k8s-node3', '10.0.20.103'],
['k8s-node4', '10.0.20.104'],
];
// Build a pool with optional split healthcheck path.
//
// $check_port: if non-null, HAProxy sends health probes to that NodePort
// (which Service `mailserver-proxy` maps to the pod's stock no-PROXY
// listener — see infra/stacks/mailserver/.../mailserver_proxy ports
// 30145/30146/30147). Real client traffic still goes to $nodeport with
// PROXY v2 framing.
// $check_type: 'TCP' for plain accept-on-port checks, 'ESMTP' for
// `option smtpchk EHLO <monitor_domain>` (real SMTP banner+EHLO+250).
//
// Why split: smtpd-proxy587/4465 fatal on every PROXY-v2-aware health
// probe with `smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported`
// — the daemon respawns get throttled by Postfix master and real clients
// land mid-respawn → 6s TCP timeout. Routing health probes to the stock
// no-PROXY port sidesteps the bug entirely while data path still gets
// PROXY v2 for CrowdSec/Postfix client-IP visibility. The HAProxy package
// has no `checkport` field, so `port N` is appended via the server's
// `advanced` string (HAProxy parses server keywords in any order).
function build_pool(
string $name,
string $nodeport,
array $nodes,
string $check_type = 'TCP',
?string $check_port = null,
string $monitor_domain = ''
): array {
$advanced_check = $check_port !== null
? "send-proxy-v2 port {$check_port}"
: 'send-proxy-v2';
$servers = [];
foreach ($nodes as $n) {
$servers[] = [
'name' => $n[0],
'address' => $n[1],
'port' => $nodeport,
'weight' => '10',
'ssl' => '',
// 5s = sub-block-window failover when a NodePort goes sour.
// Safe to be aggressive once health probes don't fatal smtpd.
'checkinter' => '5000',
'advanced' => $advanced_check,
'status' => 'active',
];
}
return [
'name' => $name,
'balance' => 'roundrobin',
'check_type' => $check_type,
'monitor_domain' => $monitor_domain,
'checkinter' => '5000',
'retries' => '3',
'ha_servers' => ['item' => $servers],
'advanced_bind' => '',
'persist_cookie_enabled' => '',
'transparent_clientip' => '',
'advanced' => '',
];
}
function build_frontend(string $name, string $descr, string $extaddr, string $port, string $pool): array {
return [
'name' => $name,
'descr' => $descr,
'status' => 'active',
'secondary' => '',
'type' => 'tcp',
'a_extaddr' => ['item' => [[
'extaddr' => $extaddr,
'extaddr_port' => $port,
'extaddr_ssl' => '',
'extaddr_advanced' => '',
]]],
'backend_serverpool' => $pool,
'ha_acls' => '',
'dontlognull'=> '',
'httpclose' => '',
'forwardfor' => '',
'advanced' => '',
];
}
// ── Backend pools ───────────────────────────────────────────────────────
if (!is_array($h['ha_pools'])) $h['ha_pools'] = ['item' => []];
if (!is_array($h['ha_pools']['item'])) $h['ha_pools']['item'] = [];
$h['ha_pools']['item'] = array_values(array_filter(
$h['ha_pools']['item'],
fn($p) => !in_array($p['name'] ?? '', $POOL_NAMES, true)
));
// Legacy test pool (still used by the :2525 test frontend for manual SMTP roundtrip).
$h['ha_pools']['item'][] = build_pool('mailserver_nodes', '30125', $NODES);
// Production pools — one per mail port.
//
// All SMTP/SMTPS/Submission backends use plain TCP checks against
// dedicated non-PROXY healthcheck NodePorts (30145/30146/30147 → pod
// stock 25/465/587) so probes hit the no-PROXY listeners and avoid
// the smtpd_peer_hostaddr_to_sockaddr fatal that fires on PROXY-v2
// LOCAL frames. Real client traffic still goes to 30125-30128 with
// PROXY v2 for client-IP visibility.
//
// We tried `option smtpchk EHLO` initially — it works on the plain
// `submission` daemon (587) but flaps the `postscreen` listener on
// port 25 (multi-line greet + DNSBL silence + anti-pre-greet
// detection makes HAProxy's simple smtpchk parser hit L7RSP). A
// plain TCP accept-on-port check is enough for both: HAProxy still
// gets fast failover when the listener actually goes away, and we
// stop triggering the Postfix fatal entirely.
//
// IMAPS stays on its existing TCP-check-with-PROXY-frame for now —
// Dovecot's PROXY parser doesn't show the same fatal pattern; adding
// a separate IMAP healthcheck path would require another svc port.
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_smtp', '30125', $NODES, 'TCP', '30145');
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_smtps', '30126', $NODES, 'TCP', '30146');
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_sub', '30127', $NODES, 'TCP', '30147');
$h['ha_pools']['item'][] = build_pool('mailserver_nodes_imaps', '30128', $NODES);
// ── Frontends ───────────────────────────────────────────────────────────
if (!is_array($h['ha_backends'])) $h['ha_backends'] = ['item' => []];
if (!is_array($h['ha_backends']['item'])) $h['ha_backends']['item'] = [];
$h['ha_backends']['item'] = array_values(array_filter(
$h['ha_backends']['item'],
fn($f) => !in_array($f['name'] ?? '', $FRONTEND_NAMES, true)
));
// Legacy test frontend — :2525 — retained so SMTP roundtrip tests keep working
// without touching the real :25. Safe to remove once fully validated.
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_test',
'code-yiu Phase 2/3 test — PROXY v2 to k8s mailserver NodePort 30125 (alt port :2525)',
'10.0.20.1', '2525',
'mailserver_nodes'
);
// Production frontends — 4 ports listening on pfSense VLAN20 IP 10.0.20.1.
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_25',
'code-yiu Phase 4/5 — external SMTP (:25) via PROXY v2 → pod :2525 postscreen',
'10.0.20.1', '25',
'mailserver_nodes_smtp'
);
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_465',
'code-yiu Phase 4/5 — external SMTPS (:465) via PROXY v2 → pod :4465 smtpd',
'10.0.20.1', '465',
'mailserver_nodes_smtps'
);
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_587',
'code-yiu Phase 4/5 — external submission (:587) via PROXY v2 → pod :5587 smtpd',
'10.0.20.1', '587',
'mailserver_nodes_sub'
);
$h['ha_backends']['item'][] = build_frontend(
'mailserver_proxy_993',
'code-yiu Phase 4/5 — external IMAPS (:993) via PROXY v2 → pod :10993 Dovecot',
'10.0.20.1', '993',
'mailserver_nodes_imaps'
);
write_config('code-yiu: mailserver HAProxy — 4 production frontends + legacy :2525 test');
$messages = '';
$rc = haproxy_check_and_run($messages, true);
echo 'haproxy_check_and_run rc=' . ($rc ? 'OK' : 'FAIL') . "\n";
echo "messages: $messages\n";

View file

@ -1,68 +0,0 @@
<?php
// pfSense NAT redirect flip — mail ports 25/465/587/993 from
// <mailserver> alias (10.0.20.202 MetalLB LB) to pfSense's own HAProxy
// listener (10.0.20.1). bd code-yiu.
//
// THIS IS THE CUTOVER. After this script:
// Internet → pfSense WAN:{25,465,587,993} → rdr → 10.0.20.1:{...}
// (pfSense HAProxy) → send-proxy-v2 → k8s-node:{30125..30128} NodePort
// → kube-proxy → mailserver pod alt listeners (2525/4465/5587/10993)
// → Postfix/Dovecot parse PROXY v2 → real client IP recovered.
//
// Internal clients (Roundcube, email-roundtrip-monitor CronJob) continue
// using the existing mailserver ClusterIP Service on the stock ports
// (25/465/587/993) which hit container stock listeners WITHOUT PROXY.
// No change to internal traffic paths.
//
// USAGE
// scp infra/scripts/pfsense-nat-mailserver-haproxy-flip.php admin@10.0.20.1:/tmp/
// ssh admin@10.0.20.1 'php /tmp/pfsense-nat-mailserver-haproxy-flip.php'
//
// REVERT — run pfsense-nat-mailserver-haproxy-unflip.php (companion script).
//
// IDEMPOTENT — re-runs converge. Flips nothing if already pointed at 10.0.20.1.
require_once('/etc/inc/config.inc');
require_once('/etc/inc/filter.inc');
global $config;
parse_config(true);
$PORTS_TO_FLIP = ['25', '465', '587', '993'];
$OLD_TARGET = 'mailserver';
$NEW_TARGET = '10.0.20.1';
$changed = 0;
foreach ($config['nat']['rule'] as $i => &$r) {
$iface = $r['interface'] ?? '';
$lport = $r['local-port'] ?? '';
$tgt = $r['target'] ?? '';
if ($iface !== 'wan') continue;
if (!in_array($lport, $PORTS_TO_FLIP, true)) continue;
if ($tgt !== $OLD_TARGET) {
printf("rule %d (dport=%s) target=%s — not flipping (already %s or unexpected)\n",
$i, $lport, $tgt, $NEW_TARGET);
continue;
}
$r['target'] = $NEW_TARGET;
// Note: the linked filter rule ('associated-rule-id') does not need touching;
// pfSense regenerates the associated rule from the NAT rule on apply,
// so leaving associated-rule-id intact is fine.
$changed++;
printf("rule %d (dport=%s): target %s → %s\n", $i, $lport, $OLD_TARGET, $NEW_TARGET);
}
unset($r);
if ($changed === 0) {
echo "No changes. (Already flipped? Run unflip script to revert.)\n";
exit(0);
}
write_config("code-yiu: NAT rdr — mail ports {$changed} flipped to HAProxy (10.0.20.1)");
// Rebuild pf rules & reload.
$rc = filter_configure();
printf("filter_configure rc=%s\n", var_export($rc, true));
echo "done.\n";

View file

@ -1,48 +0,0 @@
<?php
// REVERT of pfsense-nat-mailserver-haproxy-flip.php.
// Moves mail-port NAT rdr target from 10.0.20.1 (pfSense HAProxy) back to
// <mailserver> alias (10.0.20.202 MetalLB LB IP). bd code-yiu rollback.
//
// USE THIS IF: external mail breaks after the flip, any postscreen
// PROXY timeouts show up in logs, or you need to back out before Phase 6.
require_once('/etc/inc/config.inc');
require_once('/etc/inc/filter.inc');
global $config;
parse_config(true);
$PORTS_TO_REVERT = ['25', '465', '587', '993'];
$OLD_TARGET = '10.0.20.1';
$NEW_TARGET = 'mailserver';
$changed = 0;
foreach ($config['nat']['rule'] as $i => &$r) {
$iface = $r['interface'] ?? '';
$lport = $r['local-port'] ?? '';
$tgt = $r['target'] ?? '';
if ($iface !== 'wan') continue;
if (!in_array($lport, $PORTS_TO_REVERT, true)) continue;
if ($tgt !== $OLD_TARGET) {
printf("rule %d (dport=%s) target=%s — not reverting (already %s or unexpected)\n",
$i, $lport, $tgt, $NEW_TARGET);
continue;
}
$r['target'] = $NEW_TARGET;
$changed++;
printf("rule %d (dport=%s): target %s → %s\n", $i, $lport, $OLD_TARGET, $NEW_TARGET);
}
unset($r);
if ($changed === 0) {
echo "No changes. (Already reverted.)\n";
exit(0);
}
write_config("code-yiu: NAT rdr — mail ports {$changed} reverted to <mailserver> alias");
$rc = filter_configure();
printf("filter_configure rc=%s\n", var_export($rc, true));
echo "done.\n";

View file

@ -39,43 +39,26 @@ if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
fi
echo "Vault authenticated"
# 5. Fetch API token for claude-agent-service
AGENT_TOKEN=$(curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/claude-agent-service | \
jq -r '.data.data.api_bearer_token')
if [ -z "$AGENT_TOKEN" ] || [ "$AGENT_TOKEN" = "null" ]; then
echo "ERROR: Failed to fetch agent API token"
# 5. Fetch DevVM SSH key from Vault
curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
chmod 600 /tmp/devvm-key
if [ ! -s /tmp/devvm-key ]; then
echo "ERROR: Failed to fetch DevVM SSH key"
exit 1
fi
echo "Agent token fetched"
echo "SSH key fetched"
# 6. Submit to claude-agent-service
# 6. SSH to DevVM and run Claude Code headless
TODOS=$(cat /tmp/todos.json)
PAYLOAD=$(jq -n \
--arg prompt "Implement the auto-implementable TODOs from $PM_FILE. Parsed TODO list: $TODOS" \
--arg agent ".claude/agents/postmortem-todo-resolver" \
'{prompt: $prompt, agent: $agent, max_budget_usd: 5, timeout_seconds: 900}')
ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
"cd ~/code && git -C infra stash && git -C infra pull && git -C infra stash pop 2>/dev/null; ~/.local/bin/claude -p \
--agent infra/.claude/agents/postmortem-todo-resolver \
--dangerously-skip-permissions \
--max-budget-usd 5 \
'Implement the auto-implementable TODOs from $PM_FILE. Parsed TODO list: $TODOS'"
RESP=$(curl -sf -X POST \
-H "Authorization: Bearer $AGENT_TOKEN" \
-H "Content-Type: application/json" \
-d "$PAYLOAD" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
JOB_ID=$(echo "$RESP" | jq -r '.job_id')
echo "Job submitted: $JOB_ID"
# 7. Poll for completion (15min max)
for i in $(seq 1 60); do
sleep 15
RESULT=$(curl -sf \
-H "Authorization: Bearer $AGENT_TOKEN" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID)
STATUS=$(echo "$RESULT" | jq -r '.status')
echo "[$i/60] Status: $STATUS"
if [ "$STATUS" != "running" ]; then
echo "$RESULT" | jq .
if [ "$STATUS" = "completed" ]; then exit 0; else exit 1; fi
fi
done
echo "ERROR: Job timed out after 15 minutes"
exit 1
# 7. Cleanup
rm -f /tmp/devvm-key
echo "Pipeline complete"
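Both the removed token fetch and the new SSH-key fetch rely on the Vault KV v2 response shape, where the secret payload sits under .data.data. A quick way to see which keys a path exposes (sketch; assumes VAULT_TOKEN is already set by the earlier auth step):

    # KV v2 wraps the secret in data.data; list the keys available at the path
    curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
      http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | jq -r '.data.data | keys[]'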

View file

@ -1,59 +0,0 @@
#!/usr/bin/env bash
# One-shot deployment of the forgejo.viktorbarzin.me containerd hosts.toml
# entry across every k8s node. Cloud-init only fires on VM provision, so
# existing nodes need this manual rollout.
#
# What it does, per node:
# 1. drain (ignore-daemonsets, delete-emptydir-data)
# 2. ssh in: mkdir + write /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml
# 3. systemctl restart containerd
# 4. uncordon
#
# hosts.toml is documented as hot-reloaded but the post-2026-04-19
# containerd corruption playbook calls for an explicit restart so the
# config is unambiguously in effect. Running drain/uncordon around it
# avoids pulling against an in-flight containerd restart.
#
# Re-run is safe: writes are idempotent.
set -euo pipefail
CERTS_DIR=/etc/containerd/certs.d/forgejo.viktorbarzin.me
HOSTS_TOML='server = "https://forgejo.viktorbarzin.me"
[host."https://10.0.20.200"]
capabilities = ["pull", "resolve"]
'
NODES=$(kubectl get nodes -o name | sed 's|^node/||')
if [[ -z "$NODES" ]]; then
echo "ERROR: no nodes returned from kubectl get nodes" >&2
exit 1
fi
for n in $NODES; do
echo "=== $n ==="
kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data --force --grace-period=60
ssh -o StrictHostKeyChecking=accept-new "wizard@$n" sudo bash <<EOF
set -euo pipefail
mkdir -p "$CERTS_DIR"
cat > "$CERTS_DIR/hosts.toml" <<'TOML'
$HOSTS_TOML
TOML
systemctl restart containerd
EOF
kubectl uncordon "$n"
# Wait for the node to report Ready before moving to the next one.
for i in {1..30}; do
if kubectl get node "$n" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' | grep -q True; then
echo " node Ready"
break
fi
sleep 2
done
done
echo "All nodes updated."
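After the rollout, a read-back pass over the same node list confirms every node carries the mirror config (sketch, reusing the script's SSH pattern):

    # each node should print the hosts.toml with the 10.0.20.200 mirror host
    for n in $(kubectl get nodes -o name | sed 's|^node/||'); do
      echo "=== $n ==="
      ssh "wizard@$n" sudo cat /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml
    done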

View file

@ -72,23 +72,12 @@ if [ -n "$STACK_NAME" ]; then
else
# Tier 1: PG backend — fetch credentials from Vault
if [ -z "${PG_CONN_STR:-}" ]; then
# Pre-flight: vault CLI must be available. Previously CI failed with a
# misleading "Cannot read PG credentials" message because the Alpine CI
# image lacked the vault binary — the 2>/dev/null below swallowed the
# real "vault: not found" error. Fail fast with a clear message instead.
if ! command -v vault >/dev/null 2>&1; then
echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2
exit 1
fi
VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || {
echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2
echo "$VAULT_OUT" >&2
echo "" >&2
echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2
PG_CREDS=$(vault read -format=json database/static-creds/pg-terraform-state 2>/dev/null) || {
echo "ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc" >&2
exit 1
}
PG_USER=$(echo "$VAULT_OUT" | jq -r .data.username)
PG_PASS=$(echo "$VAULT_OUT" | jq -r .data.password)
PG_USER=$(echo "$PG_CREDS" | jq -r .data.username)
PG_PASS=$(echo "$PG_CREDS" | jq -r .data.password)
export PG_CONN_STR="postgres://${PG_USER}:${PG_PASS}@10.0.20.200:5432/terraform_state?sslmode=disable"
fi
fi
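To sanity-check the Tier 1 backend outside CI, the same static creds can be fed to psql (a sketch; assumes a logged-in vault CLI and a local psql, neither of which this script requires):

    # read the rotating static creds and run a trivial query against the state DB
    PG_CREDS=$(vault read -format=json database/static-creds/pg-terraform-state)
    PG_USER=$(echo "$PG_CREDS" | jq -r .data.username)
    PG_PASS=$(echo "$PG_CREDS" | jq -r .data.password)
    psql "postgres://${PG_USER}:${PG_PASS}@10.0.20.200:5432/terraform_state?sslmode=disable" -c 'select 1;'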

Binary file not shown.

Binary file not shown.

Binary file not shown.

setup-monitoring.sh (new executable file, 29 lines)
View file

@ -0,0 +1,29 @@
#!/bin/bash
# Setup script for automated monitoring environment
# Ensures health check scripts have access to kubeconfig
echo "=== Setting up automated monitoring environment ==="
# Copy kubeconfig to location expected by health check scripts
if [ -f /home/node/.openclaw/kubeconfig ]; then
cp /home/node/.openclaw/kubeconfig /workspace/infra/config
echo "✅ Kubeconfig copied to /workspace/infra/config"
else
echo "❌ Source kubeconfig not found at /home/node/.openclaw/kubeconfig"
exit 1
fi
# Test health check access
echo ""
echo "Testing health check script access..."
cd /workspace/infra
if KUBECONFIG="" timeout 30 bash .claude/cluster-health.sh --quiet > /dev/null 2>&1; then
echo "✅ Health check script can access cluster"
else
echo "❌ Health check script cannot access cluster"
exit 1
fi
echo ""
echo "✅ Automated monitoring environment setup complete"
echo "📊 Cron health checks will now work properly"
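The cron side itself is outside this script; a hypothetical crontab entry wired to the copied kubeconfig could look like this (schedule and log path are illustrative only):

    # illustrative schedule and log path; the health check script ships with the repo
    */30 * * * * cd /workspace/infra && KUBECONFIG=/workspace/infra/config bash .claude/cluster-health.sh --quiet >> /tmp/cluster-health.log 2>&1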

View file

@ -63,7 +63,7 @@ resource "kubernetes_deployment" "app" {
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
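The ignore_changes entry exists because a Kyverno mutating webhook injects dnsConfig into pod specs at admission time (ndots=2, per the comments removed elsewhere in this diff). One way to confirm the mutation on a live pod (sketch; the pod name is a placeholder):

    # expect the Kyverno-injected options (ndots "2") to show up here
    kubectl get pod <some-pod> -o jsonpath='{.spec.dnsConfig}{"\n"}'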

View file

@ -116,10 +116,6 @@ resource "kubernetes_deployment" "actualbudget" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "actualbudget" {
@ -218,10 +214,6 @@ resource "kubernetes_deployment" "actualbudget-http-api" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "actualbudget-http-api" {
@ -312,8 +304,4 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

View file

@ -46,7 +46,7 @@ locals {
# To create a new deployment:
/**
1. Create a subdirectory for {name} under /srv/nfs on the Proxmox host (192.168.1.127)
1. Export a new nfs share with {name} in truenas
2. Add {name} as proxied cloudflare route (tfvars)
3. Add module here
*/
@ -59,10 +59,6 @@ resource "kubernetes_namespace" "actualbudget" {
tier = local.tiers.edge
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -76,14 +72,13 @@ module "tls_secret" {
module "viktor" {
source = "./factory"
name = "viktor"
tag = "26.4.0"
tag = "26.3.0"
tls_secret_name = var.tls_secret_name
nfs_server = var.nfs_server
depends_on = [kubernetes_namespace.actualbudget]
tier = local.tiers.edge
enable_http_api = true
enable_bank_sync = true
storage_size = "4Gi"
budget_encryption_password = lookup(local.credentials["viktor"], "password", null)
sync_id = lookup(local.credentials["viktor"], "sync_id", null)
homepage_annotations = {
@ -100,7 +95,7 @@ module "viktor" {
module "anca" {
source = "./factory"
name = "anca"
tag = "26.4.0"
tag = "26.3.0"
tls_secret_name = var.tls_secret_name
nfs_server = var.nfs_server
depends_on = [kubernetes_namespace.actualbudget]
@ -123,7 +118,7 @@ module "anca" {
module "emo" {
source = "./factory"
name = "emo"
tag = "26.4.0"
tag = "26.3.0"
tls_secret_name = var.tls_secret_name
nfs_server = var.nfs_server
depends_on = [kubernetes_namespace.actualbudget]

View file

@ -90,10 +90,6 @@ resource "kubernetes_namespace" "affine" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -201,7 +197,7 @@ resource "kubernetes_deployment" "affine" {
annotations = {
"diun.enable" = "true"
"diun.include_tags" = "^\\d+\\.\\d+\\.\\d+$"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis-master.redis:6379"
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432,redis.redis:6379"
}
}
spec {
@ -323,10 +319,6 @@ resource "kubernetes_deployment" "affine" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "affine" {

View file

@ -1,81 +0,0 @@
# goauthentik/authentik Terraform provider.
#
# Adopted 2026-04-18 (Wave 6a of the state-drift consolidation plan) to bring
# the catch-all Proxy Provider previously managed only via the Authentik UI
# under Terraform management. API token lives in Vault
# `secret/authentik/tf_api_token` (token identifier `terraform-infra-stack`,
# intent API, user akadmin, no expiry). Required-providers declaration sits
# in the central terragrunt.hcl so every stack has it available; only this
# stack configures a provider block.
data "vault_kv_secret_v2" "authentik_tf" {
mount = "secret"
name = "authentik"
}
provider "authentik" {
url = "https://authentik.viktorbarzin.me"
token = data.vault_kv_secret_v2.authentik_tf.data["tf_api_token"]
}
data "authentik_flow" "default_authorization_implicit_consent" {
slug = "default-provider-authorization-implicit-consent"
}
data "authentik_flow" "default_provider_invalidation" {
slug = "default-provider-invalidation-flow"
}
# -----------------------------------------------------------------------------
# Catch-all Proxy Provider + Application.
#
# Created via the Authentik UI ~a year ago; adopted into Terraform 2026-04-18
# (Wave 6a). The proxy provider is consumed by the embedded outpost
# (uuid 0eecac07-97c7-443c-8925-05f2f4fe3e47) via an outpost-level binding
# that stays in the UI; it's a single toggle with no drift risk.
# -----------------------------------------------------------------------------
resource "authentik_application" "catchall" {
name = "Domain wide catch all"
slug = "domain-wide-catch-all"
protocol_provider = authentik_provider_proxy.catchall.id
lifecycle {
ignore_changes = [meta_description, meta_launch_url, meta_icon, group, backchannel_providers, policy_engine_mode, open_in_new_tab]
}
}
resource "authentik_provider_proxy" "catchall" {
name = "Provider for Domain wide catch all"
mode = "forward_domain"
external_host = "https://authentik.viktorbarzin.me"
cookie_domain = "viktorbarzin.me"
# Flow UUIDs resolved dynamically so a flow re-creation (keeping the slug)
# doesn't require an HCL edit.
authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
invalidation_flow = data.authentik_flow.default_provider_invalidation.id
lifecycle {
ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth, access_token_validity]
}
}
# -----------------------------------------------------------------------------
# Default User Login stage bound to default-authentication-flow.
# Adopted into Terraform 2026-05-01 to set session_duration=weeks=4 so users
# stay logged in across browser restarts. There is no Brand.session_duration
# in authentik 2026.2.x; UserLoginStage is the correct knob.
# -----------------------------------------------------------------------------
resource "authentik_stage_user_login" "default_login" {
name = "default-authentication-login"
session_duration = "weeks=4"
lifecycle {
# Pin only session_duration; everything else stays UI-managed so the
# plan doesn't churn unrelated knobs (e.g. remember_me_offset toggles).
ignore_changes = [
remember_me_offset,
terminate_other_sessions,
geoip_binding,
network_binding,
]
}
}
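Whether the adopted objects still match what the UI shows can be spot-checked against the Authentik API using the same token (a sketch; assumes applications are addressed by slug, as in current authentik API versions):

    # AUTHENTIK_TOKEN is assumed to hold the tf_api_token read from Vault above
    curl -sf -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
      https://authentik.viktorbarzin.me/api/v3/core/applications/domain-wide-catch-all/ | jq '{name, slug, provider}'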

View file

@ -31,10 +31,6 @@ resource "kubernetes_namespace" "authentik" {
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_resource_quota" "authentik" {

View file

@ -74,36 +74,6 @@ resource "kubernetes_deployment" "pgbouncer" {
container_port = 6432
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "512Mi"
}
}
readiness_probe {
tcp_socket {
port = 6432
}
initial_delay_seconds = 5
period_seconds = 10
timeout_seconds = 3
failure_threshold = 3
}
liveness_probe {
tcp_socket {
port = 6432
}
initial_delay_seconds = 30
period_seconds = 30
timeout_seconds = 5
failure_threshold = 3
}
volume_mount {
name = "config"
mount_path = "/etc/pgbouncer/pgbouncer.ini"
@ -145,29 +115,6 @@ resource "kubernetes_deployment" "pgbouncer" {
}
}
depends_on = [kubernetes_secret.pgbouncer_auth]
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
# --- 3b PodDisruptionBudget ---
# Protects auth against simultaneous node drains. With 3 replicas and
# minAvailable=2, a single drain rolls cleanly; a simultaneous two-node
# outage is correctly blocked.
resource "kubernetes_pod_disruption_budget_v1" "pgbouncer" {
metadata {
name = "pgbouncer"
namespace = "authentik"
}
spec {
min_available = 2
selector {
match_labels = {
app = "pgbouncer"
}
}
}
}
# --- 4 Service ---
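While the PodDisruptionBudget exists, the drain math can be confirmed directly (sketch):

    # with 3 replicas and minAvailable=2, allowed disruptions should read 1
    kubectl -n authentik get pdb pgbouncer
    kubectl -n authentik get pods -l app=pgbouncer -o wide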

View file

@ -14,38 +14,9 @@ authentik:
port: 6432
user: authentik
password: ""
# Persistent client-side connections (safe with PgBouncer session mode;
# must be < pgbouncer server_idle_timeout=600s). Cuts Django connection
# setup overhead off the ~70 sequential ORM ops per flow stage.
conn_max_age: 60
conn_health_checks: true
cache:
# Cache flow plans for 30m and policy evaluations for 15m. Authentik 2026.2
# moved cache storage from Redis to Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-evaluating PolicyBindings.
timeout_flows: 1800
timeout_policies: 900
web:
# Gunicorn: 3 workers × 4 threads per server pod (default 2×4).
# Pairs with the server memory bump to 2Gi (each worker preloads Django ~500Mi).
workers: 3
threads: 4
worker:
# Celery-equivalent worker threads per pod (default 2, renamed from
# AUTHENTIK_WORKER__CONCURRENCY in 2025.8).
threads: 4
server:
replicas: 3
# Anonymous Django sessions (no completed login: bots, healthcheckers,
# partial flows) expire in 2h. Default is days=1. Once login completes,
# UserLoginStage.session_duration takes over via request.session.set_expiry.
# Injected via server.env (not authentik.sessions.*) because we use
# authentik.existingSecret.secretName, which makes the chart skip
# rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
env:
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
value: "hours=2"
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
@ -56,7 +27,7 @@ server:
cpu: 100m
memory: 1.5Gi
limits:
memory: 2Gi
memory: 1.5Gi
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
@ -73,17 +44,12 @@ server:
diun.include_tags: "^202[0-9].[0-9]+.*$" # no need to annotate the worker as it uses the same image
pdb:
enabled: true
minAvailable: 2
minAvailable: 1
global:
addPrometheusAnnotations: true
worker:
replicas: 3
# Same unauthenticated_age cap as server — both the server (Django session
# middleware) and worker (cleanup tasks) need to see the value.
env:
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
value: "hours=2"
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
@ -94,7 +60,7 @@ worker:
cpu: 100m
memory: 1.5Gi
limits:
memory: 2Gi
memory: 1.5Gi
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
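Because the session cap is injected via server.env rather than the chart's secret rendering, it is worth confirming it actually reaches the pods (sketch; 'authentik-server' is the chart's default deployment name and may differ in this release):

    # the variable should print hours=2 on both server and worker pods
    kubectl -n authentik exec deploy/authentik-server -- env | grep AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE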

View file

@ -3,27 +3,6 @@ variable "tls_secret_name" {
sensitive = true
}
variable "beadboard_image_tag" {
type = string
default = "17a38e43"
}
# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf; keep it in
# sync when the claude-agent-service image is rebuilt. Reused here because the
# dispatcher + reaper CronJobs only need bd, curl, and jq, which that image
# already ships.
variable "claude_agent_service_image_tag" {
type = string
default = "2fd7670d"
}
# Kill switch for auto-dispatch. When false, both CronJobs are suspended. The
# manual BeadBoard Dispatch button keeps working either way.
variable "beads_dispatcher_enabled" {
type = bool
default = true
}
resource "kubernetes_namespace" "beads" {
metadata {
name = "beads-server"
@ -31,10 +10,6 @@ resource "kubernetes_namespace" "beads" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
resource "kubernetes_persistent_volume_claim" "dolt_data" {
@ -170,7 +145,7 @@ resource "kubernetes_deployment" "dolt" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config
]
}
}
@ -374,7 +349,7 @@ resource "kubernetes_deployment" "workbench" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config
]
}
}
@ -411,13 +386,13 @@ module "tls_secret" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "dolt-workbench"
tls_secret_name = var.tls_secret_name
protected = false
exclude_crowdsec = true
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "dolt-workbench"
tls_secret_name = var.tls_secret_name
protected = false
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Dolt Workbench"
@ -488,38 +463,6 @@ resource "kubernetes_config_map" "beadboard_config" {
}
}
# Pulls the claude-agent-service bearer token from Vault so BeadBoard can
# dispatch agent jobs via the in-cluster HTTP API.
resource "kubernetes_manifest" "beadboard_agent_service_secret" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"
kind = "ExternalSecret"
metadata = {
name = "beadboard-agent-service"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "beadboard-agent-service"
}
data = [
{
secretKey = "api_bearer_token"
remoteRef = {
key = "claude-agent-service"
property = "api_bearer_token"
}
},
]
}
}
}
resource "kubernetes_deployment" "beadboard" {
metadata {
name = "beadboard"
@ -528,9 +471,6 @@ resource "kubernetes_deployment" "beadboard" {
app = "beadboard"
tier = local.tiers.aux
}
annotations = {
"reloader.stakater.com/auto" = "true"
}
}
spec {
replicas = 1
@ -567,29 +507,13 @@ resource "kubernetes_deployment" "beadboard" {
container {
name = "beadboard"
# Phase 3 cutover 2026-05-07: Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/beadboard:${var.beadboard_image_tag}"
image = "registry.viktorbarzin.me:5050/beadboard:latest"
port {
name = "http"
container_port = 3000
}
env {
name = "CLAUDE_AGENT_SERVICE_URL"
value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
}
env {
name = "CLAUDE_AGENT_BEARER_TOKEN"
value_from {
secret_key_ref {
name = "beadboard-agent-service"
key = "api_bearer_token"
}
}
}
volume_mount {
name = "beads-writable"
mount_path = "/app/.beads"
@ -646,7 +570,7 @@ resource "kubernetes_deployment" "beadboard" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config
]
}
}
@ -672,13 +596,13 @@ resource "kubernetes_service" "beadboard" {
}
module "beadboard_ingress" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "beadboard"
tls_secret_name = var.tls_secret_name
protected = true
exclude_crowdsec = true
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "beadboard"
tls_secret_name = var.tls_secret_name
protected = true
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "BeadBoard"
@ -688,275 +612,3 @@ module "beadboard_ingress" {
"gethomepage.dev/pod-selector" = ""
}
}
# Beads auto-dispatch (dispatcher + reaper CronJobs)
#
# Flow:
# user: bd assign <id> agent
# > CronJob: beads-dispatcher (every 2 min)
# 1. GET BeadBoard /api/agent-status; skip if claude-agent-service is busy
# 2. bd query 'assignee=agent AND status=open'; pick highest priority
# 3. bd update -s in_progress (claim; next tick won't re-pick)
# 4. POST BeadBoard /api/agent-dispatch; reuses prompt-build + bearer flow
# 5. bd note "dispatched: job=<id>" (or rollback + note on failure)
#
# CronJob: beads-reaper (every 10 min)
# for bead (assignee=agent, status=in_progress, updated_at > 30m):
# bd update -s blocked + bd note (recover from pod crashes mid-run)
#
# The claude-agent-service image ships bd + jq + curl; no separate image is built.
resource "kubernetes_config_map" "beads_metadata" {
metadata {
name = "beads-metadata"
namespace = kubernetes_namespace.beads.metadata[0].name
}
data = {
"metadata.json" = jsonencode({
database = "dolt"
backend = "dolt"
dolt_mode = "server"
dolt_server_host = "${kubernetes_service.dolt.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
dolt_server_port = 3306
dolt_server_user = "beads"
dolt_database = "code"
project_id = "a8f8bae7-ce65-4145-a5db-a13d11d297da"
})
}
}
locals {
# Phase 3 cutover 2026-05-07: Forgejo registry consolidation.
claude_agent_service_image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
beadboard_internal_url = "http://${kubernetes_service.beadboard.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
beads_script_prelude = <<-EOT
set -euo pipefail
# bd with Dolt server mode needs metadata.json in a directory it can walk.
# ConfigMap mounts are read-only; copy to a writable location before use.
mkdir -p /tmp/.beads
cp /etc/beads-metadata/metadata.json /tmp/.beads/metadata.json
EOT
}
resource "kubernetes_cron_job_v1" "beads_dispatcher" {
metadata {
name = "beads-dispatcher"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
schedule = "*/2 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 60
suspend = !var.beads_dispatcher_enabled
job_template {
metadata {}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 600
template {
metadata {
labels = {
app = "beads-dispatcher"
}
}
spec {
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "dispatcher"
image = local.claude_agent_service_image
command = ["/bin/sh", "-c", <<-EOT
${local.beads_script_prelude}
BUSY=$(curl -sf "$${BEADBOARD_URL}/api/agent-status" | jq -r '.busy // false')
if [ "$BUSY" != "false" ]; then
echo "claude-agent-service is busy — skipping tick"
exit 0
fi
BEAD=$(bd --db /tmp/.beads query 'assignee=agent AND status=open' --json \
| jq -r '[.[] | select(.acceptance_criteria and (.acceptance_criteria | length) > 0)]
| sort_by(.priority, .updated_at)[0].id // empty')
if [ -z "$BEAD" ]; then
echo "no eligible beads (assignee=agent, status=open, has acceptance_criteria)"
exit 0
fi
echo "picked bead: $BEAD"
bd --db /tmp/.beads update "$BEAD" -s in_progress
bd --db /tmp/.beads note "$BEAD" "auto-dispatcher claimed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
RESP=$(curl -sS -w '\n%%{http_code}' -X POST \
-H 'Content-Type: application/json' \
-d "{\"taskId\":\"$BEAD\"}" \
"$${BEADBOARD_URL}/api/agent-dispatch")
CODE=$(printf '%s' "$RESP" | tail -n1)
BODY=$(printf '%s' "$RESP" | sed '$d')
if [ "$CODE" = "200" ]; then
JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // "unknown"')
bd --db /tmp/.beads note "$BEAD" "dispatched: job=$JOB_ID"
echo "dispatched $BEAD as job $JOB_ID"
else
# Roll the claim back so the next tick can retry.
bd --db /tmp/.beads update "$BEAD" -s open
bd --db /tmp/.beads note "$BEAD" "dispatch failed HTTP $CODE: $BODY"
echo "dispatch FAILED for $BEAD: HTTP $CODE — $BODY" >&2
exit 1
fi
EOT
]
env {
name = "BEADBOARD_URL"
value = local.beadboard_internal_url
}
env {
name = "API_BEARER_TOKEN"
value_from {
secret_key_ref {
name = "beadboard-agent-service"
key = "api_bearer_token"
}
}
}
env {
name = "BEADS_ACTOR"
value = "beads-dispatcher"
}
env {
name = "HOME"
value = "/tmp"
}
volume_mount {
name = "beads-metadata"
mount_path = "/etc/beads-metadata"
read_only = true
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
volume {
name = "beads-metadata"
config_map {
name = kubernetes_config_map.beads_metadata.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_cron_job_v1" "beads_reaper" {
metadata {
name = "beads-reaper"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
schedule = "*/10 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 60
suspend = !var.beads_dispatcher_enabled
job_template {
metadata {}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 600
template {
metadata {
labels = {
app = "beads-reaper"
}
}
spec {
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "reaper"
image = local.claude_agent_service_image
command = ["/bin/sh", "-c", <<-EOT
${local.beads_script_prelude}
THRESHOLD_MIN=30
NOW=$(date -u +%s)
bd --db /tmp/.beads query 'assignee=agent AND status=in_progress' --json \
| jq -c '.[]' \
| while read -r BEAD_JSON; do
ID=$(printf '%s' "$BEAD_JSON" | jq -r '.id')
LAST_UPDATE=$(printf '%s' "$BEAD_JSON" | jq -r '.updated_at')
# Alpine's busybox date lacks GNU -d; parse ISO-8601 with python3.
LAST_TS=$(python3 -c "from datetime import datetime; print(int(datetime.fromisoformat('$LAST_UPDATE'.replace('Z','+00:00')).timestamp()))")
AGE_MIN=$(( (NOW - LAST_TS) / 60 ))
if [ "$AGE_MIN" -gt "$THRESHOLD_MIN" ]; then
bd --db /tmp/.beads note "$ID" "reaper: no progress for $${AGE_MIN}m (threshold $${THRESHOLD_MIN}m) — blocking"
bd --db /tmp/.beads update "$ID" -s blocked
echo "REAPED $ID (stale $${AGE_MIN}m)"
else
echo "keeping $ID (age $${AGE_MIN}m < $${THRESHOLD_MIN}m)"
fi
done
EOT
]
env {
name = "BEADS_ACTOR"
value = "beads-reaper"
}
env {
name = "HOME"
value = "/tmp"
}
volume_mount {
name = "beads-metadata"
mount_path = "/etc/beads-metadata"
read_only = true
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
volume {
name = "beads-metadata"
config_map {
name = kubernetes_config_map.beads_metadata.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
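For a manual end-to-end test of the dispatcher outside its 2-minute schedule (the namespace, CronJob, and pod label all come from the resources above; the job name is arbitrary):

    # fire one dispatcher run immediately and read its output
    kubectl -n beads-server create job dispatch-manual --from=cronjob/beads-dispatcher
    kubectl -n beads-server logs -l app=beads-dispatcher --tail=100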

View file

@ -12,10 +12,6 @@ resource "kubernetes_namespace" "website" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
@ -75,10 +71,6 @@ resource "kubernetes_deployment" "blog" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
resource "kubernetes_service" "blog" {

View file

@ -14,10 +14,6 @@ resource "kubernetes_namespace" "broker_sync" {
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Secrets for all providers. Seeded in Vault at `secret/broker-sync`:
@ -105,7 +101,7 @@ resource "kubernetes_cron_job_v1" "version_probe" {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
ttl_seconds_after_finished = 300
template {
metadata {
labels = { app = "broker-sync", component = "version-probe" }
@ -126,10 +122,6 @@ resource "kubernetes_cron_job_v1" "version_probe" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Trading212 steady-state daily sync. Phase 1 deliverable.
@ -226,10 +218,6 @@ resource "kubernetes_cron_job_v1" "trading212" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# IMAP ingest: InvestEngine + Schwab email parsers, one combined pod.
@ -246,12 +234,7 @@ resource "kubernetes_cron_job_v1" "imap" {
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
# Unsuspended 2026-04-19 for RSU vest ground-truth ingestion: the parser
# now detects Schwab Release Confirmations and scaffolds VestEvents; the
# postgres sink that persists them into payslip_ingest.rsu_vest_events is
# pending a real-email fixture and cross-service DB grant (see
# follow-up beads task filed under the RSU tax spike fix epic).
suspend = false
suspend = true # enable in Phase 2
job_template {
metadata {}
spec {
@ -360,10 +343,6 @@ resource "kubernetes_cron_job_v1" "imap" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# CSV drop-folder processor: Scottish Widows, Fidelity quarterly, Freetrade, etc.
@ -452,10 +431,6 @@ resource "kubernetes_cron_job_v1" "csv_drop" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Monthly HMRC FX reconciliation rewrites last-month activities with official
@ -544,10 +519,6 @@ resource "kubernetes_cron_job_v1" "fx_reconcile" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Backup: snapshot sync.db / fx.db / csv-archive into NFS daily, keep 30 days.
@ -625,170 +596,4 @@ resource "kubernetes_cron_job_v1" "backup" {
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# -----------------------------------------------------------------------------
# Fidelity UK PlanViewer monthly pension contribution sync
#
# Architecture notes:
# - The CLI (`broker-sync fidelity-ingest`) loads storage_state.json, boots
# headless Chromium, scrapes the transaction history + valuation JSON, and
# posts DEPOSIT activities to Wealthfolio. See
# broker-sync/docs/providers/fidelity-planviewer.md for the seed workflow.
# - storage_state is staged in Vault (`secret/broker-sync`, key
#   `fidelity_storage_state`). ESO projects all broker-sync keys into the
# shared `broker-sync-secrets` K8s Secret; an init container writes the
# JSON blob to the PVC so the main container can load it.
# - Image needs Chromium baked in; add the `fidelity-capable: "true"` label
# so the Dockerfile/CI treats this CronJob's pod spec as the Playwright
# variant. Until the Playwright image ships, keep `suspend = true`.
# - Schedule: 05:00 UK on the 20th of each month, well after Viktor's mid-
# month payroll contribution has settled (finance history shows credits
# landing 13th-18th).
resource "kubernetes_cron_job_v1" "fidelity" {
metadata {
name = "broker-sync-fidelity"
namespace = kubernetes_namespace.broker_sync.metadata[0].name
labels = { app = "broker-sync", component = "fidelity" }
}
spec {
schedule = "0 5 20 * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
# Suspended until the broker-sync image ships with Playwright + Chromium.
suspend = true
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
template {
metadata {
labels = { app = "broker-sync", component = "fidelity" }
}
spec {
restart_policy = "OnFailure"
# Materialise the JSON storage_state from the projected Secret
# onto the PVC where Playwright expects to read it. Init container
# runs as root; the main broker-sync container runs as uid 10001,
# so we chown+chmod 600 to grant read access to the broker user.
init_container {
name = "stage-storage-state"
image = "busybox:1.36"
command = ["/bin/sh", "-c", <<-EOT
set -eu
mkdir -p /data
cp /secrets/fidelity_storage_state /data/fidelity_storage_state.json
chown 10001:10001 /data/fidelity_storage_state.json
chmod 600 /data/fidelity_storage_state.json
EOT
]
volume_mount {
name = "secrets"
mount_path = "/secrets"
read_only = true
}
volume_mount {
name = "data"
mount_path = "/data"
}
resources {
requests = { cpu = "5m", memory = "8Mi" }
limits = { memory = "32Mi" }
}
}
container {
name = "broker-sync"
image = local.broker_sync_image
command = ["broker-sync", "fidelity-ingest"]
env {
name = "BROKER_SYNC_DATA_DIR"
value = "/data"
}
env {
name = "WF_SESSION_PATH"
value = "/data/wealthfolio_session.json"
}
env {
name = "FIDELITY_STORAGE_STATE_PATH"
value = "/data/fidelity_storage_state.json"
}
env {
name = "FIDELITY_PLAN_ID"
value_from {
secret_key_ref {
name = "broker-sync-secrets"
key = "fidelity_plan_id"
}
}
}
env {
name = "WF_BASE_URL"
value_from {
secret_key_ref {
name = "broker-sync-secrets"
key = "wf_base_url"
}
}
}
env {
name = "WF_USERNAME"
value_from {
secret_key_ref {
name = "broker-sync-secrets"
key = "wf_username"
}
}
}
env {
name = "WF_PASSWORD"
value_from {
secret_key_ref {
name = "broker-sync-secrets"
key = "wf_password"
}
}
}
volume_mount {
name = "data"
mount_path = "/data"
}
resources {
# Chromium is hungry: headless shell + page rendering sit
# comfortably under 1Gi, spiking up to 1.2Gi during full-page
# screenshots.
requests = { cpu = "50m", memory = "512Mi" }
limits = { memory = "1280Mi" }
}
}
volume {
name = "secrets"
secret {
secret_name = "broker-sync-secrets"
items {
key = "fidelity_storage_state"
path = "fidelity_storage_state"
}
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data_encrypted.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
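Before unsuspending, the Playwright session blob has to exist in Vault so ESO can project it into broker-sync-secrets. A hedged seeding sketch (vault kv patch keeps the other broker-sync keys intact; the local filename is illustrative):

    # storage_state.json is the locally captured Playwright session (illustrative name)
    vault kv patch secret/broker-sync fidelity_storage_state=@storage_state.json
    # once the Playwright-capable image ships, trigger a one-off run instead of waiting for the 20th
    kubectl -n broker-sync create job fidelity-smoke --from=cronjob/broker-sync-fidelity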

View file

@ -1,67 +0,0 @@
# Calico CNI
#
# Calico has underpinned this cluster's pod networking since 2024-07-30, installed
# as raw kubectl manifests (tigera-operator Deployment + CRDs + Installation CR).
# Bringing the full stack under Terraform is high-blast: the operator and its
# Deployment must never flap during node pressure or during any apply, because
# new pod scheduling breaks within ~seconds of a CNI outage.
#
# This stack (created 2026-04-18 Wave 5b) adopts the three namespaces only:
# calico-system, calico-apiserver, tigera-operator. The `tigera-operator`
# Deployment, the 20+ CRDs it manages, and the `Installation` CR itself are
# intentionally *not* adopted yet they require a low-traffic window and a
# careful ignore_changes set to cover operator-generated defaults on the
# Installation CR. Follow-up tracked in beads code-3ad.
#
# The namespaces are safe to adopt (no networking impact; they're just label
# containers) and give TF an audit trail entry for the labels/tier Kyverno
# cares about.
resource "kubernetes_namespace" "calico_system" {
metadata {
name = "calico-system"
labels = {
name = "calico-system"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode label on every namespace.
# pod-security.kubernetes.io/* labels are applied by the tigera-operator
# reconciler on calico-system + calico-apiserver for PSA 'privileged'.
ignore_changes = [
metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
metadata[0].labels["pod-security.kubernetes.io/enforce"],
metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
]
}
}
resource "kubernetes_namespace" "calico_apiserver" {
metadata {
name = "calico-apiserver"
labels = {
name = "calico-apiserver"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1 + PSA labels applied by tigera-operator (see calico_system).
ignore_changes = [
metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
metadata[0].labels["pod-security.kubernetes.io/enforce"],
metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
]
}
}
resource "kubernetes_namespace" "tigera_operator" {
metadata {
name = "tigera-operator"
labels = {
name = "tigera-operator"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
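Adopting an already-existing namespace is a state import rather than a create; with the plain Terraform workflow that looks roughly like the following (adjust for this repo's terragrunt wrapper):

    # the kubernetes provider imports namespaces by name
    terraform import kubernetes_namespace.calico_system calico-system
    terraform import kubernetes_namespace.calico_apiserver calico-apiserver
    terraform import kubernetes_namespace.tigera_operator tigera-operator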

View file

@ -1,6 +0,0 @@
include "root" {
path = find_in_parent_folders()
}
# No platform dependency: Calico provides the cluster network the rest
# of the platform runs on. This stack must not introduce a dep cycle.

Some files were not shown because too many files have changed in this diff.