Compare commits

...

224 commits

Author SHA1 Message Date
a3bcb5e12f fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard
Operational layer for the new col_snapshot cache shipped in
fire-planner@e72fd22:

stacks/fire-planner:
- fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows
  age toward the 1-year TTL boundary (within 7 days). Calls
  python -m fire_planner col-refresh-stale, upserts via cache.upsert.

monitoring/dashboards/cost-of-living.json (Finance folder):
- Two template variables: $city (single-select from col_snapshot),
  $baseline_city (for COL ratio computation, defaults London).
- Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded).
- All-cities ranked table with gradient-gauged total + colored ratio.
- Cache-freshness table flags rows approaching TTL expiry.

Initial population needs a one-shot: post-Keel-rollout,
  kubectl -n fire-planner exec deploy/fire-planner -- \\
    python -m fire_planner col-seed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
d4c76a07a2 openclaw: revert model swap + document codex re-auth path
The previous commit promoted modelrelay/auto-fastest to primary as a
workaround for the expired openai-codex OAuth token. But modelrelay
routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash)
that hallucinate answers instead of using ssh / curl / etc. — exactly
what the v4 learning loop is supposed to leverage.

Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the
only mini variant the Codex backend accepts for ChatGPT Plus tier),
and inline the re-auth command in the model-block comment so future
sessions know exactly what to do when the OAuth token expires:

  kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \
    -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
    -c openclaw -- node /app/openclaw.mjs models auth login \
    --provider openai-codex

modelrelay/auto-fastest stays in the fallback chain so the agent
remains partially usable while the token is expired.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
6457aa6d8f cluster-health skill: document tightened #43 thermal threshold (65 C) 2026-05-22 14:17:01 +00:00
Viktor Barzin
6950b8f197 cluster-health #43: tighten PVE thermal threshold to 65 C
Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a
signal a VM/workload is using too much CPU and warrants investigation.

Previous thresholds were calibrated to the hardware's TjMax (75/83 C) —
that was too lax, since cluster-load-driven elevation arrives a long
time before throttling. The 65 C cutoff matches the live Prometheus
baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the
session-observed correlation: above 65 C means the cluster is doing
sustained work that should be looked at, even if hardware is still
nowhere near its limit.

Updated:
  PASS  < 65 C   (within 55-65 baseline)
  WARN  65-82 C  (elevated; check top kvm processes for the culprit)
  FAIL  >= 83 C  (at/above TjMax — throttling imminent)

Verified live: 67 C now WARN (was PASS under the 75 C threshold).
2026-05-22 14:17:01 +00:00
Viktor Barzin
dbb3dc04d3 openclaw: engrain the learning loop at the identity level
User feedback: "this should work for any task, not just calendar.
this learning flow must be strongly engrained to ensure openclaw
gets better over time."

The v3 rules were buried at the bottom of TOOLS.md and only stated
in workflow language. Three changes to make the rule unavoidable:

1. **SOUL.md** — new marker-delimited section "Learning is your
   identity" inserted before ## Boundaries. AGENTS.md tells the
   agent to read SOUL.md first every session, so this is now the
   FIRST thing the agent loads about itself. Frames learning as
   character, not procedure.

2. **TOOLS.md v4** — section moved from the END of the file to
   right after the `# TOOLS.md` title (first substantive content
   on file load). Title strengthened: "THE FLOW — run this on
   EVERY task. Not just hard ones." Concrete examples explicitly
   call out diverse domains (calendar, frigate restart, disk
   usage, inbox summary, deploys) so the universality is
   unmistakable.

3. **learn-from-tasks skill** — opens with "This is universal.
   EVERY task runs through this flow — not just hard ones, not
   just unfamiliar ones. The save at the end is mandatory."

The actual flow (know → ask devvm → save) is unchanged. What
changed is salience: the rule is now the first thing the agent
encounters in three independent surfaces, with stronger framing
that makes "skipping the save" feel like a violation of identity
rather than a missed optimisation.

Marker bumped v3 → v4. Stripper handles v1-v9 idempotently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
854817e2e3 trading-bot: revive K8s stack + add meet-kevin-watcher
Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.

Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
  root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
  news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
  viktorbarzin/trading-bot-service:latest, command
  python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
  TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
  secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
  (poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices

vault: re-enable pg-trading static role

- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
  (was disabled 2026-04-06 with the rest of trading-bot stack)

kyverno: add postgres* to trusted-registries allowlist

- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
  alpine*, nginx*, python* which were already there)

Final workers Pod containers (in order):
  [0] signal-generator
  [1] learning-engine
  [2] market-data
  [3] meet-kevin-watcher (NEW)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
d0a4876825 openclaw: v3 flow — know → ask devvm → (rarely) try yourself
Refines the devvm-fallback into an explicit triage flow that the
agent runs on every task. The default path is to ASK devvm-claude
when uncertain — don't brute-force. Most tasks are solvable there.

## The flow

1. Do I KNOW how? Check `memory_recall` and INDEX.md.
2. If not, SSH devvm and ask claude — and crucially, ask it to
   share the steps + credentials needed so I can do it on my own
   next time. Save the answer in openclaw memory.
3. (RARE) If devvm-claude says no, try in-pod. Most likely fail —
   that's OK.

## Storage moved to memory-indexed location

Learnings now live under
`/workspace/memory/projects/openclaw-learned/` (was
`/workspace/learned/`) so memory-core indexes them and
`memory_recall` surfaces them. Layout:

- `scripts/<task>.md`       runnable recipes
- `knowledge/<topic>.md`    decisions, paths, gotchas
- `credentials/<name>.md`   **POINTERS to Vault, never values**

## Credentials = Vault pointers only

Previous v2 design saved cred values to plaintext NFS files. v3
flips to pointer-only: cred file documents the Vault path + fetch
command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the
consumer, and rotation expectations. The secret stays in Vault.

## Init container also migrates

Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3,
moves any files from the legacy `/workspace/learned/` tree into
the new location, removes the empty legacy dir. User edits
outside the markers always survive.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
ef67a53676 openclaw: explicit "use devvm + learn" default behaviour
Refine the init container's devvm-fallback seeding so the OpenClaw
agent treats devvm as its DEFAULT teacher and saves recipes locally
to become independent over time:

1. TOOLS.md v2 section now has two emphatic CRITICAL rules:
   - "TRY DEVVM before giving up" — when stuck, ssh devvm before
     telling the user "I can't do that".
   - "After every task, introspect → save a faster way" — for any
     non-trivial task (especially recurring ones), save the recipe
     to /workspace/learned/ and update INDEX.md.

2. New cc-skill `learn-from-tasks` at
   /home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises
   both triggers: (A) you're stuck → check INDEX → ask devvm → save;
   (B) you just finished → introspect → save if recurring.

3. /workspace/learned/ scaffold: INDEX.md table-of-contents +
   scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks
   INDEX.md BEFORE reaching for devvm, so saved recipes are
   findable on the next run.

4. Marker migration: strips both v1 and v2 markers before re-inserting
   so user edits outside the markers always survive future restarts.

Security caveat documented inline: credentials in
/workspace/learned/credentials/ are NFS plaintext — acceptable for
home-lab personal scope, NOT for anything more sensitive than what
`ssh devvm` already gives the pod (wizard's access).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
43802d2452 openclaw: also write devvm section to /workspace/TOOLS.md
The OpenClaw agent reads TOOLS.md on every session per AGENTS.md
("environment-specific notes"), but it does NOT auto-search the
memory-core index for "devvm" before answering. Result: the agent
said "I don't have access to the devvm" even though ssh + the
openclaw-task wrapper were fully wired up (verified e2e in
9ad52dfd).

Updated init 6 (seed-devvm-memory-note) to ALSO append a
marker-delimited section to /workspace/TOOLS.md describing the
devvm SSH capability + openclaw-task usage. Idempotent: strips
any prior v1 section before re-inserting, so user edits outside
the markers survive future pod restarts.

The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md
memory note stays in place — it's still indexed by memory-core
and surfaces for memory_recall queries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
7e558de8f0 openclaw: SSH + tmux task fallback to devvm
Give the OpenClaw pod two new capabilities:

1. Host-tools bundle. New init container `install-host-tools` extracts
   openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
   friends into /tools/host-tools/, with the bookworm-slim libs the
   binaries need. PATH + LD_LIBRARY_PATH on the main container point
   ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
   marker; smoke test (ldd-based) fails the init at deploy time if any
   binary has unresolved deps. Bundle is ~558 MB on the existing
   /srv/nfs/openclaw/tools NFS.

2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
   id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
   container startup symlinks /home/node/.ssh → there. New
   /usr/local/bin/openclaw-task wrapper on devvm manages long-running
   work as tmux sessions on devvm (sessions and logs survive pod
   restarts — they live on devvm, not in the pod). New init container
   `seed-devvm-memory-note` drops a markdown note teaching the pattern;
   main container startup now runs `openclaw memory index --force` so
   the note is searchable on first boot.

Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.

Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
d9ad973621 state(vault): update encrypted state 2026-05-22 14:17:01 +00:00
Viktor Barzin
1979c2b213 cluster-health: add checks 43 + 44 (PVE host thermals + load)
Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL
via the standard healthcheck output + JSON. They run alongside the
existing 42 checks and surface the same alerts the 2026-05-20/21
optimization session had to gather by hand.

#43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps
  Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip.
  Thresholds tuned to the live TjMax=83 / Tcrit=93:
    PASS  < 75 °C package
    WARN  75-82 °C  (approaching max, action time)
    FAIL  >= 83 °C  (at/above TjMax, throttling imminent)
  Reports hottest core label too so a single hot core doesn't hide in
  the package average.

#44 PVE Host Load — load avg vs 44-thread capacity
  Reads /proc/loadavg, compares 5-min to thread count (44):
    PASS  load_5 < 30   (< 70% threads busy)
    WARN  30-37         (oversubscribed but not saturating)
    FAIL  >= 38         (~85%+ threads busy — scheduler saturation)
  Uses 5-min so brief work spikes don't false-fail.

Both gracefully WARN-degrade if SSH BatchMode fails, matching the
existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped
42 -> 44 and the dispatcher updated.
2026-05-22 14:17:01 +00:00
Viktor Barzin
61f7539de2 postiz: disable unused providers + pin temporal vs Keel force-policy
Two changes in one commit because they are coupled — the DISABLED_PROVIDERS
addition cannot land safely without the Keel exclusion on temporal:

1. Add DISABLED_PROVIDERS env on postiz Helm chart. Live DB audit showed
   only 'instagram-standalone' connected; all other Postiz providers
   were idle-polling Temporal task queues. List excludes x, linkedin,
   reddit, threads, youtube, tiktok, pinterest, dribbble, slack,
   discord, mastodon, bluesky, lemmy, warpcast, vk, beehiiv, telegram,
   wordpress, nostr, farcaster. Keeps facebook + instagram + the
   standalone variant active.

2. temporal deployment needs keel.sh/policy=never (set live via kubectl
   annotate). Keel was rolling temporalio/auto-setup 1.28.1 -> 0.20.0
   on every helm reconcile because :0.20.0 is published in the same
   registry path but is a DIFFERENT (legacy Cassandra-based) image
   stream. Memory id 1933 trap; new variant captured in id 2315-2319.

   The annotation is set live (not in TF) because the existing TF block
   has lifecycle.ignore_changes = [keel.sh/policy] so the chart
   reconcile won't reset it. Long-term fix: add temporal to the
   Kyverno keel-mutate-existing exclude list so it survives a
   namespace re-label.
2026-05-22 14:17:00 +00:00
Viktor Barzin
052404301b docs: HA control plane design (3 masters)
Captures today's k8s-upgrade-pipeline session findings — root cause
of repeated upgrade failures is the single-master apiserver outage
window cascading into operator crashloops + storm I/O. HA control
plane with 3 masters + apiserver LB removes the cascade entirely.

Tracked in beads code-n0ow. Plan doc to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
Viktor Barzin
eca1cc7e2e k8s-version-upgrade: retry kubeadm apply on static-pod-hash timeout
kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap
to be picked up by the kubelet (it polls the pod's
`kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted
master with apiserver-to-kubelet status sync lagging, that 5min isn't
enough — kubeadm declares the upgrade failed and rolls back.

The thing is: the etcd container HAS already been swapped to the new
image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0
when this fires). kubeadm's check is just slow to notice. The 2nd
attempt sees etcd already on target, skips it, and proceeds cleanly.

Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between.
Worker phase doesn't need this — `kubeadm upgrade node` has no
static-pod-hash waits.

Today's autonomous-pipeline session: master phase Failed at 5m on
attempt #1 with this exact error, retried, hit same timeout, gave up
(backoffLimit=1). The wrapper turns this from a fatal pipeline halt
into a "wait a bit, try again" that usually completes on attempt #2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
Viktor Barzin
6dd1f15881 k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window
Three changes from today's autonomous-pipeline validation session:

1. **Kill-switch ConfigMap** — chain checks for `k8s-upgrade-killswitch`
   ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the
   start of version-check. Existence halts the chain (exit 0) with a Slack
   message. Single-command emergency stop:
       kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
           --from-literal=reason="storm response"
   Resume:  kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch
   Role rule for `configmaps` get/list/watch added (resourceName-scoped).

2. **Ignore RecentNodeReboot in halt_on_alert_query everywhere** — the
   chain itself causes reboots. The pre-drain master check, post-upgrade
   worker check, postflight check, and preflight halt-on-alert all now
   pass `RecentNodeReboot` as the extra-ignore. Previously only worker
   phase's post-upgrade gate did this. Master Failed silently this morning
   on the pre-drain check after my own master reboot.

3. **Preflight quiet-baseline 3600s → 600s** — the 1h cooldown after any
   Ready transition meant the chain refused to run for an hour after
   every kured reboot. 10 min is enough for kubelet/control-plane to
   settle; the 24h-between-cluster-reboots invariant lives in
   kured-sentinel-gate, not here.

Validated by running the chain end-to-end: preflight passed in 5s,
master phase now in drain. Today's storm post-mortem (snapshot CoW
amplification + tigera-operator crashloop feedback loop) drove the
kill-switch design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
Viktor Barzin
899c7adaa0 authentik: worker replicas 3 -> 2
Workers handle background tasks only (LDAP sync, email, certificate
renewal) — no user-facing traffic, so 2-of-3 redundancy isn't load-
bearing. Reduces sustained CPU by ~100m.

Server replicas unchanged at 3 (PDB minAvailable=2 — user-facing).
PgBouncer pool unchanged at 3 (DB connection pooling).
2026-05-22 14:17:00 +00:00
Viktor Barzin
701b73bf53 forgejo: disable source archive ZIP/TAR downloads
Bot crawlers were hitting /<owner>/<repo>/archive/<sha>.zip on the
dot_files repo (vim-plugin source trees) — each request synthesised a
fresh ZIP from git history, taking 9.9s and returning 500 under
sustained load. Cost: ~440m sustained forgejo CPU.

Toggle: FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES=true.
/archive/* URLs now 404; git clone / OCI registry / API unaffected.

Measured: forgejo pod 440-573m -> 60m steady-state (~85% drop).

(Pod rollout took ~7min on the new RS due to kubelet's recursive
chown of the 2700+ files in the data PVC — fsGroupChangePolicy is
unset and defaults to Always; could be set to OnRootMismatch later.)
2026-05-22 14:17:00 +00:00
Viktor Barzin
b92e1166a8 monitoring: prometheus global scrape 1m -> 2m + UPS pinned 30s
Halves sample volume on all default-scrape jobs (cAdvisor, node-exporter,
service-endpoints, etc.). Memory id 559's earlier scrape-2m tuning was
applied live but not codified — this restores the Helm template.

Companion changes to keep alerting fidelity:
- evaluation_interval kept at 1m (alerts evaluate every minute)
- snmp-ups job pinned to scrape_interval=30s so PowerOutage /
  LowUPSBattery detect within ~30s instead of 2m
- 3 alerts bumped from for:1m to for:3m (HighGPUTemp, LowUPSBattery,
  PowerOutage) for stability above the new 2m global cadence

Other jobs that already had per-job overrides (snmp-idrac 1m,
redfish-idrac 3m, kubernetes-pods 5m, kubernetes-services 5m) unaffected.

Expected: 50-150m sustained CPU saving on Prometheus + apiserver.
Verification ongoing — apiserver settles ~minutes after Prometheus
config reload due to initial-target-scrape burst.
2026-05-22 14:17:00 +00:00
Viktor Barzin
5bc98851b9 alloy: switch pod log shipping from apiserver to file-tail
Replaced 'loki.source.kubernetes' with 'loki.source.file' in alloy DS
config. discovery.relabel.pod_logs already sets __path__ to the kubelet
log path (/var/log/pods/*<uid>/<container>/*.log) and varlog host-mount
was already present, so this is a one-line swap.

Why: apiserver was burning ~700m sustained on 'CONNECT pods/log' streams
(13 req/s, ~2200 sec/s of long-lived TCP connections). Streaming pod
logs through the apiserver instead of tailing kubelet's log files was
the dominant residual cost after the recent Loki/Alloy onboarding.

Measured before/after:
- Alloy DS: ~620m total (5 x ~125m) -> ~92m total (5 x ~18m)
- kube-apiserver: peak 1959m midnight burst, settled 632m

(Stuck-pod recovery: alloy-7zg7t on k8s-master needed --force delete
during rollout — FailedKillPod 'unable to signal init: permission denied'
on runc, transient runtime issue, unrelated to this change.)
2026-05-22 14:17:00 +00:00
Viktor Barzin
48e7c309fc vault: add pg-matrix + pg-technitium static roles to allowed_roles
Both static-roles existed in Vault state (created out-of-band) but
were missing from the postgresql connection's allowed_roles list. Vault
was logging 'is not an allowed role' rotation errors every 10s for both,
sustained CPU waste ~40-70m.

Adopted both via 'import {}' (import blocks removed after first apply
per the canonical adoption pattern).

- pg-matrix: username=matrix, rotation_period=86400 (1d)
- pg-technitium: username=technitium, rotation_period=604800 (7d)

Verified: 'is not an allowed role' errors stopped in vault-0 logs
immediately after apply.
2026-05-22 14:17:00 +00:00
Viktor Barzin
00736a9f85 state(vault): update encrypted state 2026-05-22 14:17:00 +00:00
Viktor Barzin
94ca849379 k8s-version-upgrade: grant get/list on apps resources for drain
kubectl drain --ignore-daemonsets needs to GET each pod's owner
reference (DaemonSet/StatefulSet/ReplicaSet/Deployment) to classify
which pods can be drained vs ignored. Without these RBAC verbs, drain
bails with 'cannot delete daemonsets ... is forbidden' for every
daemonset-managed pod on the node.
2026-05-22 14:17:00 +00:00
Viktor Barzin
a90ce27923 infra: add kubectl + authentik providers across 6 stacks
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.

- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
  workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
  manage their own Authentik objects
2026-05-22 14:17:00 +00:00
fa2b57f177 openclaw: enable recruiter-api plugin (allowlist + manifest contracts)
Plugin needs three things to load under OpenClaw 2026.5.x:
1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the
   ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin'
   in the startup command after doctor runs).
2. 'openclaw plugins enable recruiter-api' to flip its registry entry.
3. manifest declares contracts.tools (added in recruiter-responder commit
   83ffd9fa).

Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the
plugin's polling loop knows which Telegram chat to deliver into.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
4bc0c5f27e recruiter-responder: deploy d7892396 — OpenClaw-driven flow
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
6417c770c1 recruiter-responder + openclaw: wire gpt-mini secret keys + VIKTOR_CHAT_ID
recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL
(NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and
not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID
env consumed by the recruiter-api plugin's announcement loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
Viktor Barzin
8aff0ba1a2 k8s-version-upgrade: fix two more grep-pipefail bugs
Same `grep -v` / `set -o pipefail` interaction as commit 10b261d2,
in two more callsites the previous fix didn't cover:

  Line 354 (phase_master): control-plane Running check —
    `grep -v Running | wc -l` returns 1 when all pods are Running
    (the happy path), aborting the chain right after master upgrades.

  Line 419 (phase_postflight): on-target node check —
    `grep -v ":v$TARGET_VERSION$" | wc -l` returns 1 when all nodes
    are on the target version (the happy path, exactly when postflight
    should succeed). Aborts at the moment of victory.

Forensics on yesterday's master Job failure (see commit message of
10b261d2 for context): the master Job spawned 16s after the previous
fix's TF apply, before configmap propagation completed on the kubelet.
With those two latent bugs also looming, the chain would have died
post-master-upgrade and again at postflight even if propagation had
been timely.

Wrapping each grep in `{ ... || true; }` so a no-matches result
returns success.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
Viktor Barzin
83fc15c22b k8s-version-upgrade: fix pipefail abort when no alerts are firing
halt_on_alert_query() ends with `grep -vE "$regex" | sort -u`. When
zero alerts are firing (the desired healthy state), grep matches
nothing and exits 1. Under `set -o pipefail`, the whole pipeline
returns 1; under `set -e`, the caller's `alerts=$(...)` assignment
fails and aborts the script in ~1s with no diagnostic output.

The chain effectively required at least one non-meta alert to be
firing to make any forward progress. Today (2026-05-19) the cluster
is fully clean post-MySQL recovery, the daily 12:00 UTC detection
spawned the preflight Job, and it died instantly — blocking the
1.34.7 → 1.34.8 patch chain.

Fix: wrap the grep in `{ ... || true; }` so a no-matches result
returns success. Preflight verified end-to-end after the fix — the
chain is now in flight (preflight ✓, master phase running).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
612a83f8ce security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces)
## Change
- Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with
  kubectl_manifest.wave1_egress_observe_tier34
- namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'`
  to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux)
- Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted
  (apply_only=true means TF rename does NOT destroy the live old resource;
  cleanup done manually)
- Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan
  (cluster infra + GPU workloads, deferred)

## Verification (live cluster, 2026-05-19)
- 82 namespaces match `tier in (3-edge,4-aux)`
- Felix translated the new policy into iptables LOG rule in cali-po-* chain
- LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata
  from multiple namespaces with distinct destinations:
  - east-west pod-to-pod (10.10.108.48, 10.10.122.131)
  - in-cluster service VIP (10.96.0.10 — kube-dns)
  - external (149.154.166.110 — Telegram API from recruiter-responder)

## W1.7 next step (calendar-bound, ~1 week)
- Let observation run for ~1 week
- Aggregate distinct destinations per namespace via LogQL
- Build per-namespace egress allowlist module `tier3_egress_baseline`
- Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`
- Phased per-namespace as originally planned

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
2f9ac0110a security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder
Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico
Enterprise-only field, rejected by OSS v3.26) with the supported primitive:
Calico GlobalNetworkPolicy with `action: Log`.

## Mechanics (verified end-to-end on 2026-05-19)
1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder`
   with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`,
   `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in
   `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing
   SRC/DST/PROTO/PORT for every NEW egress connection.

## Verified output sample
`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132
DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`

The Allow rule in the GNP keeps egress functional (recruiter-responder
remained 1/1 Running through the apply — verified Python TCP connections to
1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).

## Wave 1 status
W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7
remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"`
samples, build empirical egress allowlist, flip the GNP rules from
`[Log, Allow]` to `[Allow <specific dests>, Deny]`.

Expand observation to additional namespaces by adding entries to
`spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
Viktor Barzin
aa05942fa5 upgrade-state: filter transient registry digest-check errors
Keel polls ~175 image manifests hourly against public registries.
Transient i/o timeouts and registry 5xx responses are inherent at
that scale and auto-recover on the next poll, but they were tripping
the Apps row into ⚠ attn — pure noise.

Extend benign_re to cover:
  - failed to check digest + (i/o timeout | connection refused
    | connection reset | context deadline exceeded | TLS handshake
    timeout | no such host | EOF)
  - failed to check digest + non-successful response (status=5xx)

Real actionable digest-check failures (HTTP 401 auth, 404 removed
tag) still surface. Persistent registry-side 5xx is owned by the
registry's own monitoring (forgejo-integrity-probe +
RegistryCatalogInaccessible), not by Keel logs.

Tested locally: Apps row flips from ⚠ attn → ✓ healthy after the
filter is in place; remaining errors-line drops to "(none in last
24h)".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
Viktor Barzin
a5772060f8 dbaas: opt MySQL out of Keel + add do-not-bump warning
Two changes to make the 8.4.8 pin durable:

1. Add `keel.sh/policy: never` annotation on the mysql-standalone
   StatefulSet. The dbaas namespace was already excluded from the
   Kyverno mutate, but the StatefulSet carried orphan Keel annotations
   (force/poll/match-tag) from an earlier policy version that lacked
   the exclusion list. Keel kept watching :8.4.8 for digest changes.
   Now explicitly opted out; Keel logged "image no longer tracked".

2. Expand the inline comment to a banner pointing at the upgrade plan
   docs and the gating beads task. Anyone touching this line sees the
   warning + the path to do it right.

Closes the loop on the 2026-05-18 outage. Real upgrade tracked in
code-963q + docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:59 +00:00
Viktor Barzin
866cf8331c state(dbaas): update encrypted state 2026-05-22 14:16:59 +00:00
Viktor Barzin
e4b9e97ac9 docs: design + plan for MySQL 8.4.8 → 8.4.9 upgrade
Captures the wipe+reinit strategy (sidestep the broken DD upgrade
path), the IO config bump (innodb_io_capacity 100→2000), root-cause
analysis with explicit uncertainty, verification gates, and rollback.

Not scheduled yet. Tracked in beads code-963q.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:59 +00:00
a048b37f60 security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on
  k8s-master node. Verified alloy-7zg7t scheduled on master, tailing
  /var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
  (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist
  (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
  100.64-127 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query
  {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.

## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with explicit allowlist
  built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
  names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)

## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding FelixConfiguration with flowLogsFileEnabled=true via
  kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field
  spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf
  comment block: GNP with action=Log (iptables NFLOG → journal), Cilium
  migration, eBPF tooling, or Tigera Operator adoption.

## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
  W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
  prior session before today's apply)

## Cleanup
- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed
  their job in the 2026-05-18 apply; should not stay in tree per TF docs)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:59 +00:00
51365937b1 recruiter-responder: bump image to 444fa58c (header CRLF fix)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:59 +00:00
Viktor Barzin
fd1490ae15 docs: update MySQL restore runbook + CLAUDE.md after 8.4.9 recovery
Runbook rewritten for the standalone setup (InnoDB Cluster gone since
2026-04-16) and now covers the full disaster-recovery flow we just
executed: stop pod, wipe PVC (incl. PV reclaim-policy flip from Retain
→ Delete), re-apply TF, restore via in-namespace Job, drop+create
static users with fresh Vault passwords, restart dependents.

CLAUDE.md MySQL row notes the 8.4.8 pin + links the runbook.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:59 +00:00
Viktor Barzin
efe8c9625b dbaas: pin MySQL to 8.4.8, recover from broken 8.4.9 DD upgrade
The mysql:8.4 floating tag let Keel auto-bump to 8.4.9, whose
data-dictionary upgrade got stuck mid-flight on every attempt
(no progress, no CPU, never completing). Pinning to 8.4.8 +
restoring from the 2026-05-18 00:30 UTC mysqldump puts us back
on a known-good binary.

Closes: code-eme8
Closes: code-k40p

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:59 +00:00
Viktor Barzin
8ee0ea55cf state(dbaas): update encrypted state 2026-05-22 14:16:58 +00:00
1082cba0fb kyverno(wave1): swap kubernetes_manifest → kubectl_manifest + flip 3 security policies to Enforce
## Resolves code-e2dp (Kyverno TF apply blocked)
Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of
kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large
CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which
treats manifests as opaque YAML strings.

## Migration mechanics
- Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all
  stacks get it generated in providers.tf.
- stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source
  declaration (required for kubectl_manifest in a child module).
- Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest
  with yaml_body = yamlencode({...}). depends_on chains preserved.
- terraform state rm for all 17 old kubernetes_manifest entries.
- stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each
  kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID.
- One resource (policy_inject_keel_annotations) needed kubectl delete + recreate
  because the kubectl provider couldn't patch it cleanly (resourceVersion=0
  invalid for update — gotcha when adopting a resource previously
  kubernetes_manifest-owned).

## W1.4 — security policies Audit → Enforce (LIVE)
Three policies flipped: deny-privileged-containers, deny-host-namespaces,
restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved.

## Shared exclude list (35 namespaces)
local.security_policy_exclude_namespaces in security-policies.tf.
- 31 critical from memory id=1970 (Keel rollout list)
- + frigate (camera HW transcoding needs host access)
- + kured (privileged DaemonSet for node reboots)
- + default (etcd backup/defrag CronJobs use hostNetwork)
- + changedetection (uses SYS_ADMIN for chromium sandbox)

## W1.5 — require-trusted-registries stays Audit
Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply
chain. Tracked under beads code-8ywc as follow-up.

## TF import-blocks
The imports.tf file should be removed in a follow-up cleanup commit once
verified — TF doesn't auto-clean these.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes: code-e2dp
2026-05-22 14:16:58 +00:00
83079758bb monitoring(wave1): re-enable Loki+Alloy, deploy wave1 alert rules, add #security Slack lane
## Loki + Alloy re-enabled (code-146x)
- Uncommented helm_release.loki, helm_release.alloy, kubernetes_daemon_set_v1.sysctl-inotify,
  kubernetes_config_map.loki_alert_rules, kubernetes_config_map.grafana_loki_datasource
- Reverses the documented "operational overhead vs benefit after node2 incident"
  decision. Re-evaluated because wave 1 security detection layer (beads code-8ywc)
  needs Loki + ruler + alert routing.
- SingleBinary mode, 2-4Gi mem, 50Gi proxmox-lvm PVC, 30-day retention, ruler enabled
  pointed at prometheus-alertmanager.monitoring.svc:9093
- Alloy DaemonSet (4 pods on worker nodes) discovers pod logs via K8s API + pushes
  to Loki
- Loki canaries running (4)
- Vault audit-tail sidecar logs now flowing to Loki: queried
  {namespace="vault",container="audit-tail"} returns live audit JSON

## Wave 1 alert rules deployed (W1.3 partial)
Added "Security Wave 1" rule group to loki_alert_rules configmap:
- V1: VaultRootTokenCreated — auth/token/create with policies=[root]
- V2: VaultAuditDeviceModified — sys/audit/* create/delete/update
- V3: VaultSealChanged — sys/seal update
- V4: VaultPolicyModified — sys/policies/acl/* create/update/delete
- V5: VaultAuthFailureSpike — >10 permission denied/min
- V7: VaultViktorFromUnexpectedIP — auth as me@viktorbarzin.me from non-allowlist source IP
  (allowlist: 10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet)
- S1: PVEsshLoginFromUnexpectedIP — sshd "Accepted" from non-allowlist IP (rule defined,
  fires once promtail/Alloy ships sshd journal with job=sshd-pve)

Verified rules visible via /loki/api/v1/rules. K2-K9 (K8s API audit) deferred to W1.1
which needs the audit policy + apiserver log shipping codified.

## #security Slack lane (Alertmanager)
- New `slack-security` receiver in prometheus_chart_values.tpl, channel #security
- Higher-priority route at top of routes list: matchers `lane = security` →
  slack-security, continue: false (so wave 1 alerts never fall through to #alerts)
- Slack message format includes summary + description + runbook link annotation
- All wave 1 rules set `lane = "security"` label

## Resource summary
- 6 added: helm_release.loki, helm_release.alloy, kubernetes_config_map.grafana_loki_datasource,
  kubernetes_config_map.loki_alert_rules, kubernetes_daemon_set_v1.sysctl-inotify,
  + 1 other
- 5 changed: helm_release.prometheus (alertmanager config — new receiver + route),
  4 deployments (image tag drift from Keel-managed images, unrelated)
- 1 destroyed: null_resource grafana_admin_only_folder_acl["Finance (Personal)"]
  (timestamp-triggered always recreates — not destructive)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes: code-146x
2026-05-22 14:16:58 +00:00
Viktor Barzin
1cdccc1ad6 upgrade-state: suppress known-benign Keel slack-bot-not-configured noise
Keel 1.2.0 registers a Slack socket-mode bot whenever SLACK_BOT_TOKEN is
set, then fails because we don't supply an `xapp-` app-level token:

    bot.slack.Configure(): SLACK_APP_TOKEN must have the prefix "xapp-".
    bot.Run(): can not get configuration for bot [slack]

We don't want the interactive bot — opt-out auto-update + no approval flow
(see stacks/keel/main.tf comment). The Slack NOTIFICATION sender works
independently and continues posting rollout messages to #general fine.

But /upgrade-state's broad `grep level=error` was counting these as real
errors → ⚠ on the Apps row every run. Add a small skip-pattern list so the
two recurring benign lines drop out; any new genuine Keel error still
shows. Reuses `bot.Run()` + `SLACK_APP_TOKEN must have the prev?if|prefix`
(typo in Keel's actual log message preserved as alternation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
c9289192c7 security(wave1): Vault audit-tail sidecar (live) + doc reality-check
## Vault audit-tail sidecar (APPLIED + VERIFIED)
- Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with
  `tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume
  from the chart's auditStorage), emits JSON audit events to stdout. kubelet
  captures the stdout; once Loki+Alloy are deployed (blocked on code-146x),
  these logs flow automatically to Loki with `container="audit-tail"`.
- Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly.
- Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled
  cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s).
- Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON
  audit lines from ESO token issuance, KV reads, etc.

## Doc reality-check
While verifying logs reached Loki, discovered Loki is NOT actually deployed.
`stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but
has a self-referencing `depends_on = [helm_release.loki]` that prevented apply.
No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The
monitoring.md "Loki: deployed" claim was aspirational.

- security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on
  code-146x)
- security.md W1.3 row: gated on code-146x added
- monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x

## New beads task
- code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug,
  investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0).

## Wave 1 status update
- W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x
- W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR)
- W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
Viktor Barzin
0a26364e4f state(vault): update encrypted state 2026-05-22 14:16:57 +00:00
ae0c1701ec security(wave1): W1.2 Vault XFF (applied) + W1.4/W1.5 Kyverno code prep (apply blocked on provider crash)
## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED)
- Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config.
  Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every
  vault audit log entry shows Traefik's pod IP instead of the real client IP —
  the V7 alert rule (Viktor identity from non-allowlist source IP) needs the
  real client IP to be meaningful.
- Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing
  for_each unknown issues unrelated to this change; -target documented in error
  message itself as the workaround).
- Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete
  update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed
  within ~10s each. Verified XFF config visible in pod's
  /vault/config/extraconfig-from-values.hcl.
- The `vault_audit "file"` resource was already in TF at line 287 (writing to
  /vault/audit/vault-audit.log) — no change needed.

## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED)
- Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces
  from memory id=1970 + `frigate, kured, default, changedetection` discovered
  during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN
  pods that would be blocked by Enforce).
- Flipped 3 security policies Audit → Enforce: deny-privileged-containers,
  deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at
  chart level.
- `require-trusted-registries` STAYS in Audit mode pending allowlist tightening
  (current pattern includes `*/*` which matches anything-with-a-slash, so Enforce
  would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5.

**Apply blocker**: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0`
crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use
tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash
reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not
caused by these changes. Resolving it requires either upgrading the provider or
finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`.

## Wave 1 status after this commit
- W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place)
- W1.4 + W1.5: code ready, apply blocked on provider crash
- W1.1, W1.3, W1.6, W1.7: not started in this session

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
87961e9ef8 monitoring(wealth): drop 6y timeFrom override on META vest cadence 2026-05-22 14:16:57 +00:00
Viktor Barzin
7e4d4ac4a6 state(vault): update encrypted state 2026-05-22 14:16:57 +00:00
Viktor Barzin
dd24ace480 realestate-crawler: dockerhub pull-secret + lift image-pin on ui/api
Companion to the GHA migration in immovika/realestate-crawler@c2acbf5.

Apps row of /upgrade-state was flagging ⚠ because Keel poll on the four
Deployments returned 401 — DockerHub repo viktorbarzin/realestatecrawler
is private, the Deployments had no imagePullSecrets, and Keel's poll-secret
discovery list came up empty. Pods kept running only because the image
landed in containerd cache months ago.

Adds:
- ExternalSecret `dockerhub-pull-secret` synced from Vault
  secret/viktor.dockerhub_registry_password. ESO template renders the
  dockerconfigjson server-side (Sprig b64enc) so the PAT never sits in
  cleartext in any K8s manifest.
- image_pull_secrets { name = "dockerhub-pull-secret" } on all 4
  Deployments (ui, api, celery, celery-beat).
- Lifts `ignore_changes=[container[0].image]` on ui+api so TF re-asserts
  :latest. CI no longer patches the image to a numeric tag — Keel now
  drives rollouts from digest changes on :latest.

Live state after apply: all 4 Deployments on :latest with
imagePullSecrets=dockerhub-pull-secret; ExternalSecret SecretSynced=True.
Once a GHA build pushes a new digest, Keel will roll all four within ~1h.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
99127939a8 monitoring(wealth): keep only FIFO-realized PNL table; pair Positions + vest-cadence side-by-side
- Removed panel 27 (META RSU vest value over time) — superseded by
  vest-cadence chart which carries the same value signal plus the
  share-count overlay.
- Removed panel 28 (per-vest value at vest vs today) — duplicative with
  panel 31's FIFO realized PNL.
- Removed panel 29 (per-sell realized PNL) — same data as panel 31,
  just rolled up by sell date instead of vest date.
- Resized panel 26 (Positions) to w=12 and moved panel 30
  (META vest cadence) to (y=32, x=12, w=12) so they sit side-by-side
  next to the Positions table.
- Moved panel 31 (FIFO realized PNL) to y=118, where the deleted RSU
  chart used to live.
2026-05-22 14:16:57 +00:00
b3cf75dc61 docs(security): wave 1 plan — Kyverno enforce, NetworkPolicy egress, audit logging, source-IP anomaly
Locked design for wave 1 of cluster security hardening. Plan only — implementation lives in beads
code-8ywc and follow-up commits. Captures:

- security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies
  with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the
  K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7,
  S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy
  Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4.
- monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1)
  and the Loki ruler → Alertmanager → #security routing path.
- runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action
  steps, false-positive triage, and SEV1 escalation.
- .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity
  allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy,
  rationale for not adopting canary tokens.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
b879481d71 monitoring(wealth): per-vest realized PNL via FIFO sell-match
New table panel below the per-sell breakdown. For each vest, FIFO-match
its shares against the subsequent sells (shares from earlier vests get
sold first), and aggregate the matched portions:

  realized_pnl = SUM(matched_qty * (sell_price - vest_price))
  pnl_pct      = realized_pnl / SUM(matched_qty * vest_price) * 100
  days_held    = AVG(sell_date - vest_date) per matched portion

Footer reducer sums shares, vest value, sell value, and realized PNL
so the bottom row is the full-portfolio realized take.
2026-05-22 14:16:57 +00:00
8b60e6bb6d monitoring(wealth): META vest cadence chart — value vs shares (dual axis)
Per-vest event line chart. Left Y axis (blue): vest value at the
time = SUM(quantity * unit_price), in USD. Right Y axis (orange):
number of shares vested. One point per vest date (aggregated when
multiple BUY rows share a date, e.g. 2021-05-18 was 18 + 2 shares).

Lets Viktor see how vest sizes ramped (initial 18 shares -> 38 ->
60s) and how the per-vest USD value tracked META's price ride
across 2020-2026. timeFrom='6y' override pins the panel to the full
vesting window.
2026-05-22 14:16:57 +00:00
af077112cb monitoring(wealth): META vest + sell PNL tables with FIFO cost basis
Two new bottom-of-dashboard tables:

Panel 28 'META vests — value at vest vs today': one row per BUY
activity. Shows vest-day price * shares + what those same shares
would be worth at today's META quote, plus the hypo P&L if Viktor
had held everything (color-text on the gain columns).

Panel 29 'META sells — realized PNL vs if held until today':
one row per SELL with FIFO-matched cost basis (LEAST/GREATEST
overlap in cumulative-share space). Shows realized P&L, the
counterfactual P&L had he held until today, and the
'missed by' delta = (today_price - sell_price) * shares.

Both pull today_price dynamically from quote_latest via a CTE so
they self-update as Yahoo updates the META quote. Schwab account
is empty so no live activity is expected.
2026-05-22 14:16:57 +00:00
20c5965f95 monitoring(wealth): pin META RSU panel to 6y window
Dashboard default time range is now-180d, but the META vesting + sell
arc spans 2020-11 → 2026-02. With the default window the panel just
showed a flat line at $64 (the empty post-sell residual). timeFrom='6y'
override makes panel 27 always render the full vesting curve regardless
of the dashboard-level time selector.
2026-05-22 14:16:57 +00:00
Viktor Barzin
3d43d96a5e k8s-version-upgrade: switch detection cron from weekly to daily
Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days
before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC,
still outside kured's 02:00-06:00 London window). Concurrency is
bounded by Forbid + deterministic job-name idempotency (the detection
job exits early if a preflight Job for the same target already exists),
so back-to-back days can't pile up parallel runs.

- stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment
- scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc
  (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label
  to "(daily cron)"
- .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
018ef3790f monitoring(wealth): META RSU vest value panel (Schwab account)
Daily total_value timeseries for the Schwab workplace account
(account_id 72d34e09-...). Single-asset account holding META RSUs
that vested 2020-11 → 2026-02 and were sold opportunistically over
the same window. Currency USD (account_currency). Yahoo quote on
META powers WF's daily mark; the historical DAV mirrored into
wealthfolio_sync via pg-sync gives us ~2k days of vesting curve.
2026-05-22 14:16:57 +00:00
Viktor Barzin
b107c0be8c upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit
Three autonomous-upgrade pipelines run independently — Keel for apps
(hourly registry polling), unattended-upgrades+kured for OS, and the
k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there
was no single place to see whether each was healthy, what's pending,
or whether anything's stuck. The /upgrade-state skill collapses the
state of all three into one table you can run before each Sunday's
k8s-version-check fires.

- stacks/keel/main.tf: add Prometheus pod-annotation scrape on
  container port 9300. Surfaces pending_approvals,
  poll_trigger_tracked_images, and registries_scanned_total{image}
  so the skill has a real timeseries (also opens the door to a
  future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table
  (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2.
  SSH fan-out (parallel subshells) to all five nodes for apt
  state + reboot-required + uu log; Prometheus query for Keel;
  Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to
  ~/.claude/skills/upgrade-state/SKILL.md so the skill is
  discoverable from both monorepo-rooted and global sessions.

Verification: ran the script, stress-tested the ✗ stalled path by
pushing in_flight=1 + started_timestamp=-100min to Pushgateway and
resetting after — script correctly raised ✗ and exit 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
Viktor Barzin
309f83ec8c beads-server: codify Keel annotations on Dolt deployment (drift cleanup)
Task 1's recovery from the broken `:latest` image rollout left
keel.sh/policy=never set imperatively via `kubectl annotate` — out of
TF, which violates the "all infra via TF" rule. Now codified alongside
match-tag, trigger, pollSchedule. Removed those three keys from
ignore_changes (was the original "Keel manages these" pattern, no
longer correct for this deployment).

Also added KYVERNO_LIFECYCLE_V1 ignore_changes on the presence_schema
migration Job so future applies don't try to replace it over the
Kyverno-injected ndots dns_config.

Verified: 0 added, 3 changed (unrelated pre-existing drift on
beadboard/workbench/service), 0 destroyed. Dolt pod uninterrupted
(revision 13 preserved).
2026-05-22 14:16:57 +00:00
Viktor Barzin
5482f46125 RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight
The 24h kubelet-uptime threshold (process_start_time_seconds < 86400)
was a defense-in-depth duplicate of the 24h-since-Ready-transition
check in kured-sentinel-gate Check 4 — but they used different
signals (kubelet process start vs node Ready transition). Whenever
the cluster cycled through reboots, the alert kept firing for a full
day even after sentinel-gate's check passed, and blocked anything
querying halt-on-alert (kured, K8s version-upgrade preflight).

Tightened to 1h (3600s) for "node just rebooted, give it a settle
window". The cluster-wide 24h-between-reboots invariant lives
exclusively in kured-sentinel-gate Check 4 from now on (independent,
uses lastTransitionTime).

Matched the preflight's own 24h-quiet check in upgrade-step.sh
(86400 → 3600) so it doesn't act as a second blocker.

Empirically verified: all 5 kubelets are >10h up, alert cleared on
next eval after the rule reload.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
Viktor Barzin
d1dcc5d12d beads-server: add presence_claims table for agent coordination
Adds the schema for the new agent presence board. Live Dolt is updated
via a hashed-named one-shot Job; the ConfigMap entry preserves fresh-PVC
init.

Also pins the Dolt image to 2.0.3 — :latest on dolthub/dolt-sql-server
currently resolves to 0.50.10, whose docker-entrypoint.sh references an
undefined docker_process_sql function and crash-loops on every init
script in /docker-entrypoint-initdb.d. Keel can still bump this tag
in-cluster (image is in lifecycle.ignore_changes).
2026-05-22 14:16:57 +00:00
Viktor Barzin
e4e2babd6a k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst
Two latent bugs in the K8s-version-upgrade pipeline surfaced when a
real detection run ran post-26.04 upgrade today:

1. **DNS**: pod's CoreDNS search path is `<ns>.svc.cluster.local
   svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation).
   Unqualified `k8s-master` falls through all of those and then queries
   upstream Technitium for the bare name → NXDOMAIN. The FQDN
   `k8s-master.viktorbarzin.lan` is what Technitium actually serves.
   Suffix every node SSH target with `$NODE_DOMAIN`.

2. **envsubst missing**: claude-agent-service image doesn't ship
   `gettext-base`. Replace `envsubst <template | apply` with
   `python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(
   sys.stdin.read()))' <template | apply`. Same semantics, image
   already has python3. Multi-line $SCHEDULING_BLOCK is preserved
   correctly through expandvars.

Verified by manually triggering `k8s-version-check` post-fix:
detection now reads `Latest patch: v1.34.8` (currently running 1.34.7)
and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and
started; killed before it touched the cluster (will land on Sunday
2026-05-24 12:00 UTC like the schedule says).

Root cause of why these bugs lay dormant: yesterday's first
manual-test detection found "no upgrade needed" so neither code path
exercised SSH or envsubst. Today's apt-source restore (do-release-
upgrade had mangled them) unmasked the v1.34.8 candidate, which made
detection finally proceed past the SSH step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
6de4549a96 docs/plans: add agent presence implementation plan (2026-05-17)
15-task plan for a shared presence board so Claude Code sessions can
see which shared infra resources are being actively mutated by other
sessions. Resource-scoped claims on the existing Dolt server,
heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.
2026-05-22 14:16:56 +00:00
Viktor Barzin
23d8aa89c4 keel: enroll 11 more namespaces (operators + critical infra)
Per user decision, removed authentik, kyverno, metallb-system,
external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets,
infra-maintenance from the policy-level exclude list, and added
keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite
being earlier flagged as scaled-to-0) and woodpecker.

Net cluster coverage: 197/227 workloads on safe-force (86%), up from
170/227 (74%). All 197 are paired with match-tag=true (digest-only).

Remaining 7 namespaces in Kyverno exclude list (irreducible):
- keel (self-update)
- calico-system + tigera-operator (operator-managed Installation CR)
- cnpg-system + dbaas (state-coupled)
- nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships
  ubuntu26.04 driver images)
- kube-system (k8s built-ins)

Files:
- stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list
  trimmed from 16 → 7
- stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets,
  servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker —
  added keel.sh/enrolled=true label on kubernetes_namespace resource
- infra-maintenance was in the policy exclude but the namespace doesn't
  actually exist in the cluster; the removal is a no-op there

Applied via kubectl patch on the live ClusterPolicy + kubectl label on
namespaces because the kubernetes provider v3.1.0 panics on Kyverno
ClusterPolicy refresh — TF source has the desired state for next clean
apply on a fixed provider.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
3bdba9f388 keel: enroll 15 critical-path namespaces for digest-only auto-update
Per user decision today: monitoring, mailserver, vault, descheduler,
metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy,
reloader, headscale, wireguard, xray, cloudflared now participate in
the same `force + match-tag` regime as the rest of the cluster — Keel
watches the deployment's CURRENT tag for digest changes only and rolls
on push, never rewriting tag strings.

Two-part change:

stacks/kyverno/modules/kyverno/keel-annotations.tf
  Trim the policy-level namespace exclude list from 31 → 16. The 16
  remaining exclusions are the irreducible cluster-operator + state-
  coupled set: keel itself, calico-system + tigera-operator (operator
  loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system +
  dbaas (state-coupled), kyverno, metallb-system, external-secrets,
  proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned),
  kube-system, vpa, sealed-secrets, infra-maintenance.

stacks/<each-of-15>/.../main.tf
  Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace`
  resource so the Kyverno mutate policy can target the workloads via
  its namespaceSelector matchLabels.

Note on the apply path: the live ClusterPolicy was patched via
`kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics
during state refresh on Kyverno ClusterPolicy schemas with deeply
nested optional `context.celPreconditions` / `imageRegistry` fields
(see crash dump). The TF source above has the desired state, so any
clean future apply on a fixed provider version will be a no-op against
the live cluster.

Floating-tag workloads in the newly-enrolled set (will roll on every
upstream digest update — acceptable risk per user):
  - wireguard: sclevine/wg:latest (image fixed today via iptables-nft
    postStart shim)
  - xray: teddysun/xray
  - crowdsec-web: viktorbarzin/crowdsec_web
  - monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter
  - traefik: nginx:1-alpine, openresty/openresty:alpine,
    ghcr.io/tarampampam/error-pages:3
  - redis: haproxy:3.1-alpine, redis:8-alpine

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
f5cf6ec051 nvidia: bump driver container memory limit 128Mi → 2Gi
After rolling back k8s-node1's kernel to 6.8.0-117 + spoofing
/etc/os-release to 24.04 so the operator picked the matching
ubuntu24.04 driver image (everything per the workaround documented in
docs/known-issues.md), the driver container still went into a restart
loop. Container status:

    lastState.terminated: { reason: "OOMKilled", exitCode: 137 }

The driver-installer was hitting the namespace LimitRange default of
128Mi during `apt-get install linux-headers-6.8.0-117-generic` — the
last log line on every restart was "Installing Linux kernel
headers..." before SIGKILL. 2Gi gives apt + the DKMS compile step
enough headroom; peak observed during a successful compile in a test
container was ~1.4Gi.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
63cbd0aba5 docs: known-issues entry for the Ubuntu 26.04 / NVIDIA driver gap
Captures the workaround applied on k8s-node1 today (kernel rolled back
to 6.8.0-117-generic, apt-mark hold on kernel meta-packages,
/etc/os-release spoofed to 24.04 so NFD reports VERSION_ID=24.04 and
the gpu-operator picks an existing ubuntu24.04 driver image), plus the
trigger that lets us un-mitigate: any ubuntu26.04 tag appearing on
nvcr.io/nvidia/driver.

Linked from the post-mortem and from beads code-8vr0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
22c18dc061 paperless-mcp: deploy MCP for AI document search
- New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET,
  HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed.
- In-cluster only egress to paperless-ngx svc; no Cloudflare hop on
  MCP-internal traffic.
- Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser)
  in new `claude-mcp-readers` group with view-only Django perms; existing
  279 docs bulk-granted view perm via /api/documents/bulk_edit/;
  workflow #2 auto-grants the group on new docs (Consumption Added).
- Gateway-level bearer auth via new Traefik plugin
  Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack
  alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth`
  pulls token list from Vault `secret/paperless-mcp/bearer_tokens`.
- Vault `secret/paperless-mcp` holds: paperless_api_token (synced to
  K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens
  (JSON array, read at plan time), bearer_token_viktor_laptop (mirror
  for laptop wiring), paperless_user_password (paperless UI fallback).
- Image auto-update via Keel (semver minor policy, hourly poll).
- Ingress dns_type=proxied → Uptime Kuma external monitor auto-created
  by external-monitor-sync CronJob.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
d1e7121115 recruiter-responder: bump image to 05b95943 (split callback routes)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
c72b839a2f nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images
k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet
(skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04
tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04).

Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 +
driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart
applied cleanly but the v26.3.1 operator auto-detects host OS via NFD
labels and constructs `<version>-ubuntu26.04` image tags, which 404 on
pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
future `terraform apply` doesn't surface the same trap again.

Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) + `apt-mark hold` to
prevent re-upgrade. That step needs explicit user authorization for a
node reboot — left as the next action item on code-8vr0.

Files:
  - stacks/nvidia/modules/nvidia/main.tf — explicit version pin,
    explanatory comment
  - stacks/nvidia/modules/nvidia/values.yaml — comment block
    documenting the situation; driver pinned at 570.195.03
  - docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md —
    full timeline, root causes, recovery procedure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
62efded1b6 wireguard: switch to iptables-nft so PostUp MASQUERADE works
Wireguard pod CrashLoopBackOff'd for hours with wg-quick's PostUp failing:

    iptables v1.8.4 (legacy): can't initialize iptables table `nat':
    Table does not exist (do you need to insmod?)

sclevine/wg's default `iptables` symlink points to iptables-legacy, which
talks to the kernel's xt-tables. K8s nodes nowadays initialize their
nat table via nftables (calico-node sets it up), so iptables-legacy in
the container sees "no nat table" and bails. Reproduced by ephemerally
debugging the live pod's namespaces (kubectl debug --copy-to + same
mounts as the real pod) — wg-quick output matched verbatim.

Fix: postStart now calls update-alternatives to point iptables and
ip6tables at iptables-nft/ip6tables-nft (already present in the image)
before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes
to the nftables-backed nat table calico already populated. Verified:
new pod went 2/2 Running with 0 restarts after apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
45c8e88e89 terminal: probe + alerts after Traefik replica routing-table skew
User reported "site loads but failed to connect on the tmux session". Root
cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing
the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only
the IngressRoute CRDs registered. About 1/3 of /token preflight requests
landed on that replica and got 404 with router="-", and WS upgrades
intermittently failed the same way, so the lobby iframe stayed stuck on
"Failed to connect. Retrying...". `kubectl delete pod` on the bad replica
restored the missing router and unblocked the user.

This commit adds the long-term mitigation:

stacks/terminal/main.tf
  - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits
    /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes
    4 gauges to Pushgateway (token_status, ws_status, ttyd_status,
    last_success_timestamp). Verified the probe end-to-end:
      token=302 ws=302 ttyd=200 ok=1

stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
  - Webterminal group: WebterminalTokenDegraded (warning, 10m),
    WebterminalWebsocketDegraded (critical, 10m),
    WebterminalTtydUnreachable (critical, 10m),
    WebterminalProbeStale (warning, 15m).
  - Traefik Router Parity group: TraefikRouterCountSkew fires when any
    Traefik replica's router count diverges from siblings for >10m —
    catches the same class of issue cluster-wide, not just for terminal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
d828b51670 recruiter-responder: bump image_tag to 50f43004 (backtest --persist)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
0480477f44 nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem)
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.

- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
  the broken chart (defense in depth; nfs-csi namespace is already in the
  Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
  nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
  captures the root cause + recovery procedure (kubelet restart via
  nsenter is the escalation path when crictl rmp -f fails)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
e398e717f1 broker-sync(fidelity): un-suspend monthly CronJob
The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729)
which is the simple accumulate-gains approach Viktor signed off on:
each monthly scrape captures (current_pot, real_contribs), and we emit
a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape.
dav_corrected handles the dashboard math.

Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via
'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.
2026-05-22 14:16:56 +00:00
Viktor Barzin
195b5e4061 keel: use +() anchors on policy/match-tag so per-workload overrides stick
Without the anchor, each policy update fires mutateExistingOnPolicyUpdate,
which OVERWRITES existing keel.sh/policy annotations back to 'force'. That
broke the phased rollout — bulk-setting workloads to 'never' didn't stick
because the next policy update reset them.

With +() anchors, the mutate only adds the annotation if missing. New
workloads (in enrolled namespaces) get force+match-tag; existing workloads
with explicit policy=never (out-of-band, for phased rollout) stay never.

Phase 1 rollout state (2026-05-17):
  - 10 workloads on force+match-tag in 10 namespaces (Phase 1)
    enrolled via keel.sh/enrolled=true namespace label:
      linkwarden, excalidraw, diun, echo, foolery, city-guesser,
      jsoncrack, privatebin, ntfy, speedtest
  - 216 workloads on policy=never (out-of-band kubectl annotate)
  - 31 critical namespaces excluded at policy level

Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true`
and clearing the `never` annotation off their workloads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
root
b8ab4613e4 Woodpecker CI Update TLS Certificates Commit 2026-05-22 14:16:55 +00:00
Viktor Barzin
25fcf80651 keel: expand critical-namespace exclude list — protects vault/cnpg/authentik/etc.
2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36.
The force+match-tag pairing should have constrained Keel to digest-only on
the current tag (not switch to a new tag), but a race between Kyverno's
mutate (injecting match-tag) and Keel's hourly poll caused the workload to
still have the old `force`-only annotation when Keel acted. Result: tag
rewrite, pods cycled, pgbouncer connection failures, login broken.

Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back
to 2026.2.2. Auth restored within ~5 min.

Going forward, critical-namespace workloads are excluded at the policy level
so this race can't recur. They get upgraded via TF (Helm chart version bumps)
on a deliberate cadence, never by Keel.

Live state: 36 workloads on policy=never (35 critical + chrome-service pin
+ 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for
opt-out-pure auto-update on the remaining stateless apps.

This matches user direction (2026-05-17): "upgrading is fine as long as we
upgrade correctly and the latest version is healthy" + "keel responsible
for the latest version, phased rollout, graceful".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
74ecda6cc3 keel: bump default policy patch → major (user wants latest version)
User: 'i'm happy with occasional breakages. we have alerts.'

Policy=major auto-updates workloads to the latest semver tag in the
registry, including major/minor/patch bumps. Still semver-parser-bounded
so dev/nightly/master branches are filtered out (avoids the 2026-05-16
force-trap on affine/calico).

Live: 217 patch-annotated workloads re-annotated to major. Next Keel
poll (~1h) will pick up any pending major/minor releases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
6c8546bb84 recruiter-responder: bump image_tag to 94b37a9c (follow-up detection)
Replies from recruiters to our sent decline / engage / ignored threads
are now attached to the existing thread, surface with a 🔁 follow-up
marker in Telegram ("you previously sent"), and re-open thread status
to pending so they show up in recruiter_list status=pending.

Smoke-tested live: Rachel-style follow-up referencing our outbound
msgid + the original recruiter msgid in References → correctly
attached to thread #87, status flipped sent→pending, 3 messages
persisted (in/out/in).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
7e540292ad kyverno: bump background-controller memory 384Mi → 2Gi (OOMKilled processing keel URs)
The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced
176 UpdateRequests for the initial bulk scan across enrolled namespaces.
At the existing 384Mi limit, kyverno-background-controller OOMKilled while
processing them — no annotations got injected on existing workloads (count
stuck at 30).

Live state already bumped via kubectl set resources; this commit makes it
durable through Terraform. Also lowered the request to 256Mi (the 384Mi
floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady
state).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
257679166b recruiter-responder: bump image_tag to 02a01c9a (Reply-To + quoted body in replies)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
7e1ecaf74c kyverno: codify aggregated ClusterRole for keel mutate-existing
The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true
to the inject-keel-annotations ClusterPolicy but Kyverno's validate
webhook rejected it: the background-controller SA needs update/patch
on apps/v1 Deployment/StatefulSet/DaemonSet.

Created live via kubectl + now in TF so the next apply is idempotent.
The ClusterRole aggregates into kyverno:background-controller via the
rbac.kyverno.io/aggregate-to-background-controller label.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
bede247e98 recruiter-responder: bump image_tag to 59df5f8a (Reply-To honoured)
Reply-To header now extracted on inbound and used for outbound replies.
Verified with a synthetic email From: noreply-careers@megacorp.example
Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and
threaded under the original (Re: subject + In-Reply-To + References).

Alembic 0003 added messages.reply_to_addr column.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
a2e8afc3ed kyverno: add mutateExistingOnPolicyUpdate=true so existing workloads get annotated
Before this, the inject-keel-annotations policy only fired on admission
events. Workloads that existed BEFORE their namespace got labeled
keel.sh/enrolled=true never received the annotation, so Keel didn't
watch them. Live state was 30 of 226 workloads auto-updating.

With mutateExistingOnPolicyUpdate=true and the required mutate.targets
block, Kyverno's BackgroundScan controller applies the mutate to
existing matching Deployments/StatefulSets/DaemonSets on policy update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
cdeb89d5f1 final wave: enroll immich + status-page, retrigger 17 pending Bucket A
* immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1
    skipped — has non-standard lifecycle from earlier work).
  * status-page: enrolled (was missing from original sweep).
  * v6 retrigger marker on 17 stacks that never reached terragrunt
    apply (#704 exit-1 halted mid-loop).

After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks.
The remaining ~22 are operator/Helm-managed and intentionally excluded
(same fight-loop risk as Calico — bump via Helm chart version, not
Keel).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
root
caba0e811f Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:55 +00:00
Viktor Barzin
4944e508aa Bucket C: enroll 5 raw-deploy stacks in Keel auto-update
* beads-server: 3 Deployments — extended V1 lifecycle blocks to V2
    + KEEL_IGNORE_IMAGE; namespace label.
  * llama-cpp: 1 Deployment — extended V1→V2; namespace label.
  * novelapp: namespace label only (Deployment has non-standard
    lifecycle without V1 dns_config — drift expected, accept for now).
  * plotting-book: namespace label only (same as novelapp).
  * trading-bot: namespace label only (same as novelapp).

immich deferred — the bulk-add script's brace-counter got confused by
a HEREDOC in the file, inserting a lifecycle block in the wrong
position. Needs manual per-Deployment editing.

The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see
their Deployments mutated by Kyverno but their TF lifecycle doesn't
yet ignore the keel annotations. Expected behavior: drift visible in
terragrunt plan, applied-state oscillates with Kyverno re-injecting.
Acceptable starting point; per-Deployment lifecycle work to fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
b57596d930 Bucket A retrigger + Bucket D enrollment (5 module-nested stacks)
After fixing the postgresql-lb MetalLB flap (deleted stuck
ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined
commit:

  * Bucket A (16 stacks): re-append CI retrigger marker so the
    previously-pending applies pick up:
      blog calico cyberchef descheduler f1-stream homepage jsoncrack
      k8s-dashboard k8s-version-upgrade kms local-path osm_routing
      real-estate-crawler travel_blog vault webhook_handler

  * Bucket D (5 module-nested stacks): keel.sh/enrolled label on
    namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module:
      postiz instagram-poster k8s-portal uptime-kuma vaultwarden

Bucket C (raw-deploy apps without V1 marker on their Deployment
lifecycles) deferred — needs per-Deployment lifecycle block additions
that the bulk script can't safely automate:
  beads-server immich llama-cpp novelapp plotting-book trading-bot

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
ec60af5fd4 kyverno: exclude calico-system from inject-keel-annotations
Stop the hourly Keel-vs-tigera-operator fight loop on calico-node
DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system
workloads with keel.sh/policy=never; TF: added calico-system to the
namespaces exclude list so any future mutate run won't re-inject.

The previous calico unenrollment (label removal from namespace)
wasn't enough — once Kyverno had stamped the policy=patch annotation
on the Deployments/DaemonSets, removing the namespace label didn't
strip the annotation, so Keel kept watching them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:54 +00:00
Viktor Barzin
b48ddc09d6 ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them) 2026-05-22 14:16:54 +00:00
Viktor Barzin
978237441e ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-22 14:16:54 +00:00
Viktor Barzin
1dd8f4e2bf openclaw: native MCP servers + daily claude-memory sync
Wire ha-mcp, context7, and the in-pod playwright sidecar as native
MCP servers on OpenClaw via `mcp set` in the container startup
(ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set
entries persist). HA URL pulled from new Vault key
secret/openclaw.ha_sofia_mcp_url and passed via the
HA_SOFIA_MCP_URL env var.

Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw
namespace: pulls all non-sensitive memories from
claude-memory.claude-memory.svc:80/api/memories, groups by category,
writes 18 Markdown files into /workspace/memory/projects/claude-
memory-sync/ (the path memory-core indexes), then triggers
`openclaw memory index --force` via kubectl exec. Reuses the
existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488
memories synced, 25/25 files indexed, search returns hits.

Also drops the legacy /app/extensions entry from
plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env,
and one-shot deletes the stale 2026-02-28 metaclaw-export.json from
the openclaw home volume.

claude_memory MCP intentionally NOT wired — its /mcp/mcp transport
404s on the deployed claude-memory-mcp:17 image (tracked as
code-z1so). Shared knowledge is delivered via the CronJob's REST
sync instead. Adding claude_memory to mcp.servers is a one-line
follow-up once that's fixed.
2026-05-22 14:16:53 +00:00
Viktor Barzin
0c73974362 ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-22 14:16:53 +00:00
root
87f7b25a13 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:53 +00:00
126cfb7022 wealth: dav_corrected view fixes pension gains-offset miscategorisation
The broker-sync Fidelity provider emits 'unrealised-gains-offset'
DEPOSIT activities to reconcile Wealthfolio's total with the
PlanViewer reported pot, because Wealthfolio doesn't track pension
fund units directly. Wealthfolio's data model treats that DEPOSIT as
a cash contribution, which double-inflates net_contribution and
zeroes out the implied growth.

Add a Postgres view 'dav_corrected' in wealthfolio_sync that
subtracts the cumulative gains-offset from net_contribution per
account per date (re-exporting as 'net_contribution' so it's a
drop-in replacement). All 17 wealth dashboard panels that compute
contribution/growth/ROI now read from the view. Total impact:
portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly
the £35,721.20 Fidelity offset that was previously miscategorised).
2026-05-22 14:16:52 +00:00
Viktor Barzin
6769526e1e ci: retrigger apply for pending Keel enrollment (~58 stacks)
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.

Idempotent — re-applying a stack whose state already matches is a no-op.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:52 +00:00
Viktor Barzin
137b4cbcf7 ci: retry after Keel rollout cascade settled 2026-05-22 14:16:51 +00:00
Viktor Barzin
e5f6d16b2e enrolled-patch stacks: ignore image drift from Keel auto-update
For Deployments enrolled in Keel with policy=patch, the image tag is
updated by Keel as new patches release upstream. Without
ignore_changes on the image field, terragrunt apply would fight Keel
in an endless loop (TF reverts → Keel re-rolls → repeat — same shape
as the calico/tigera-operator fight from earlier).

Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks.
Image string in TF becomes the initial seed; Keel rolls it forward.

Stacks: actualbudget, broker-sync, changedetection, city-guesser,
coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo,
excalidraw, foolery, forgejo, freedify.

CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest,
recruiter-responder, claude-agent-service, claude-memory) keep TF
ownership of image and policy=never — their image_tag is set by CI
via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes
on those would break the CI deploy flow.

Caveat: only container[0].image is added. Multi-container Deployments
(immich, beads, etc.) will need additional container[N].image lines
for any container Keel rolls. Those stacks are not currently enrolled.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:51 +00:00
Viktor Barzin
8519e37975 calico: unenroll from Keel — tigera-operator owns DaemonSet spec
Keel kept rewriting calico-node + calico-kube-controllers images to
v3.26.5 (proper patch update); tigera-operator immediately reverted
to v3.26.1 because the Installation CR is the source of truth.
Endless churn but no data loss — Calico stayed healthy throughout.

Removing keel.sh/enrolled label and live label from calico-system ns.
Calico upgrades go through the tigera-operator's Installation CR
manually, not Keel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
Viktor Barzin
8f18621dd5 keel: default policy → patch (semver-bounded opt-out auto-update)
Move from `never` (no auto-update) to `patch` for the cluster-wide
default. Keel only auto-updates PATCH versions within the current
major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked.
Tag-rewrites that broke calico (v3.26.1 → :master) and affine
(0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch.

Caveats:
  * Patch causes Terraform image drift for semver-pinned services —
    drift-detection pipeline will surface it; lifecycle ignore_changes
    on container[].image can be added per stack later if drift is
    noisy.
  * Tags that aren't parseable as semver (:latest, :11, :nightly,
    SHA tags) are ignored by patch — those workloads stay on their
    current image until promoted to `force` policy individually.

Self-hosted CI-driven services + chrome-service kept on `never`
(deliberate pins / CI controls the tag):
  recruiter-responder, claude-agent-service, claude-memory,
  chrome-service, fire-planner, job-hunter, payslip-ingest

Live state already updated via kubectl apply + per-workload patches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
662695908a recruiter-triage: AI culture & tooling section + warm-engage AI ask
- claude-agent-service bumped to 191ed5dd (new AI section in agent
  template — leadership stance, approved tools, usage limits / quotas,
  code-gen safety, product-side AI depth, follow-up questions for the
  recruiter when the web is sparse).
- recruiter-responder bumped to ab59eeab (deep_research prompt asks
  for AI culture; warm_engage template adds a written-only ask for
  IDE assistants, chat tools, per-seat limits, source-to-external
  model policy).

Smoke-tested 2026-05-16: forced fresh research on Datadog, agent
returned full structured AI section with 7 explicit recruiter
questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
Viktor Barzin
0a6b2489f7 keel: default policy → never (post-incident safe default)
2026-05-16 incident: Keel's `force` policy switched semver-pinned
images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master)
instead of digest-tracking. Force is documented as "always update
to the newest tag in the registry" — only safe on already-mutable
tags like :latest.

Changing the cluster-wide default in inject-keel-annotations to
`never`. The namespace enrollment label + V2 lifecycle suppression
stay in place so opt-in is one annotation per Deployment, but no
service auto-updates until explicitly approved.

To opt in a workload now:
  1. Verify the Deployment image is on a mutable tag (:latest,
     :<major>, or a vendor "stable" tag) — change in Terraform first
     if needed.
  2. Add to the Deployment's metadata.annotations:
       "keel.sh/policy" = "force"   (digest tracking)
       OR
       "keel.sh/policy" = "patch"   (semver patch bumps — also
       requires ignore_changes on the image)

Live policy already updated via kubectl apply + per-workload
override (force → never).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
Viktor Barzin
9765f6b9a4 keel: enable Slack notifications on every upgrade
Wire Keel's Slack notifier to the existing bot token in Vault
(secret/viktor -> slack_bot_token). Posts to #general by default;
override via slack.channel in the Helm values if you want a dedicated
channel like #keel-notifications.

Notification level is "info" so we get every rollout event, not just
errors. Approval flow is OFF — opt-out-pure means all updates apply
unattended. If we later introduce approvals, add slack.approvalsChannel.

Resolves user request: 'keel should send notifications to slack everytime
it upgrades an app'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
3027ab85a8 recruiter-responder: bump image_tag to 189ef901
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:49 +00:00
Viktor Barzin
be3b94da85 keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist)
The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5,
1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed
silently. Bump to 1.2.0 (app version 0.21.1, latest stable).
2026-05-22 14:16:48 +00:00
Viktor Barzin
411524a10d kured: drop Mon-Fri restriction, reboot any day
The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.

Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
2e52583abd Phase 1a: enroll 4 self-hosted services in Keel auto-update
Enrolls the cleanest Woodpecker-build-only self-hosted services into
the inject-keel-annotations ClusterPolicy by labeling their namespaces
keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on
each, so Keel will detect the current upstream digest and trigger a
rolling restart when polling starts (1h cadence).

Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress
the annotation drift Kyverno will inject (keel.sh/policy, /trigger,
/pollSchedule).

Services included:
  - fire-planner
  - job-hunter
  - payslip-ingest
  - recruiter-responder

Skipped from Phase 1 for follow-up:
  - claude-agent-service (user has WIP on main.tf)
  - claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs)
  - kms (two Deployments; needs per-resource review)
  - wealthfolio (sync sidecar pattern; needs review)
  - chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label)
  - GHA-migrated repos (10) (need per-repo CI cleanup)
  - beadboard, freedify (no CI)

See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
5acfab5bb9 recruiter-responder: bump image_tag to f3cb91ff (180d research_cache TTL)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
e5a65c11a9 recruiter-triage v3: Perks & Office Life section + cache-first deep_research
- claude-agent-service bumped to f764fef6 (agent system prompt adds
  the Perks block: food/health/pension/equity/PTO/parental/equipment/
  learning/wellness/amenities/commuter). 1200-word cap.
- recruiter-responder bumped to 38a2cdaa (cache-first deep_research:
  serves cached payload if fetched_at + ttl_seconds > now; cache
  writes upsert; new force flag bypasses).

Verified end-to-end: deep_research on Datadog now returns full Perks
section (~220s, $0.60, 23 turns). Earlier 500 fixed (was
uq_research_company_tier dup-key on re-run).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
020f62555b Phase 0: install Keel + Kyverno auto-update annotation injector
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
  Polls registries hourly per design decision #8. Default schedule
  overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
  StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
  with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
  self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
  runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
  too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
  ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
9476649539 docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16)
Captures the May 10–16 kured-vs-sentinel-gate hostPath mismatch (chart
derived hostPath from configuration.rebootSentinel) and the companion
work to harden the rolling-reboot pipeline against single-replica
PDB deadlocks: Anubis 1→2 replicas with shared Valkey store, kured
drainTimeout=30m, CNPG pg-cluster 2→3 instances. Includes the
mysql-standalone-PDB orphan cleanup and the k8s-node1 containerd-source
drift audit (benign).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
3ef860b2be kured + cnpg: drain-safe defaults ahead of Monday reboot wave
Three defensive moves to make the kured rolling-reboot cycle survive
edge cases without operator intervention:

kured (stacks/kured/main.tf):
  - Set `configuration.drainTimeout = "30m"`. Default is unlimited; if
    a future PDB or finalizer stalls drain, kured retries forever and
    the node stays cordoned silently. 30m caps the silent-failure
    window — after timeout kured logs the abort and waits for the
    next period; the node stays Schedulable so cluster capacity isn't
    lost. Lets us fail closed instead of fail-silent.

CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf):
  - Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the
    failover during a primary-node drain depended on the lone replica
    being caught up; a WAL backlog would stall the drain until the
    replica was current. With 3 instances CNPG always has at least one
    fully-current replica to promote, and the PDB's
    `minAvailable=1` on the primary selector is satisfied throughout
    the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about
    35Gi after autoresize). Memory: +3Gi pod limit.
  - Updated the `triggers.instances` so the null_resource's local-exec
    actually re-applies the YAML (kubectl apply with the new spec). The
    YAML is the source-of-truth but the trigger is what tells terraform
    to re-run the provisioner.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
4ff3638065 state(dbaas): update encrypted state 2026-05-22 14:16:48 +00:00
Viktor Barzin
08bf5e47b7 state(dbaas): update encrypted state 2026-05-22 14:16:48 +00:00
Viktor Barzin
5768216d0e anubis: HA with shared valkey/redis store + replicas=2
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.

Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:

- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
  When set, appends a `store: { backend: valkey, parameters: { url } }`
  block to the rendered policy YAML and defaults replicas to 2 (capped
  at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
  drains can take down one pod at a time. topologySpreadConstraint
  tightened to `DoNotSchedule` so the two replicas land on different
  nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
  travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
  unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
  Redis already runs HA via Sentinel + haproxy, no new infra needed.

Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.

Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
3025879478 claude-agent-service: ship recruiter-triage agent + restore missing terragrunt.hcl
- main.tf: bump image_tag to 1b3350c0 (carries the new agent),
  init container  also copies recruiter-triage.md
  into /home/agent/.claude/agents/.
- terragrunt.hcl: restored (file was missing — apply was blocked).
  Standard root include + platform/vault/external-secrets dependencies.

Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42)
via recruiter-responder REST API → 102.5s, $0.43, structured
markdown report with comp bands vs £600k floor, culture signals,
remote policy, recent news, sources cited. End-to-end Tier-2 is live.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
ea2342b8e2 docs: add CONTEXT.md domain glossary [ci skip]
Adds the per-repo domain glossary that engineering skills
(diagnose, tdd, improve-codebase-architecture, grill-with-docs)
read before working in this repo. Terms only — no implementation
detail. Six clusters (code organization, cluster, networking,
storage, secrets, CI/CD), 22 terms, plus relationships, an example
dialogue, and five flagged ambiguities.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
ce5f3ec209 recruiter-responder: expose Gmail IMAP creds for backtest CLI
Pulls vbarzin@gmail.com app password from secret/recruiter-responder
(seeded from secret/wealthfolio.imap_password — same Gmail credential
that wealthfolio uses for broker-statement ingestion). Env vars
GMAIL_IMAP_USER + GMAIL_IMAP_PASS, consumed by 'backtest gmail'.

Backtest verified 2026-05-16 against folder
'companies-I-dont-take-seriously': 20/20 recruiter, 100% company
extraction (9 stated, 6 subject, 4 sender_domain, 1 body), 30% comp,
avg 12s latency.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
065982d978 kured: fix sentinel path mismatch that stalled rolling reboots
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. Previously
rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at
`/sentinel/` (an empty auto-created directory on every host) while the
kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required.

Two different host directories → kured never saw the open gate, even
though the gate's checks were all green every 5 min on every node.
Result: unattended-upgrades has packages waiting on every node since
2026-05-10 (when uu was re-enabled) and kured's hourly log says
"Reboot not required" for the entire period.

Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts
hostPath /var/run — same directory the gate writes to. The in-pod
mountPath (/sentinel) is hardcoded by the chart and doesn't matter,
the symlink chain works out: /sentinel/<file> inside the pod resolves
to /var/run/<file> on the host.

Verified: kured pod can now list /sentinel/gated-reboot-required
(0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15).
First gated reboot will land Mon 2026-05-18 02:00 London.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
80e6314bf0 recruiter-responder: bump image_tag to 559e5c57
PDF extraction, tech_stack list, aggressive company/comp inference,
no-phone-call drafts, backtest CLI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
8e11caff8d recruiter-responder: bump image_tag to bbd178da (structured Telegram + comp floor)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
391c002f9a service-catalog: add aiostreams entry
Stremio stream aggregator now has its own row in the Active Use tier.
Captures the auth model (own UUID+password, not Authentik), monitoring
posture (canary probe + 3 alerts), and backup pipeline (weekly NFS
dumps of both decrypted config and the Stremio account addon
collection).

Follow-up from the 2026-05-15/16 hardening session: 5 commits on
servarr/aiostreams, none previously catalogued.
2026-05-22 14:16:47 +00:00
Viktor Barzin
24ce3e267d aiostreams: weekly backup of Stremio account addon collection
Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from
the AIOStreams config-backup at 03:00):

- Logs into api.strem.io with credentials from Vault
  (secret/viktor.stremio_email + stremio_password, now also synced
  into the aiostreams-probe-secrets ExternalSecret)
- Fetches the full addonCollection via addonCollectionGet
- Writes timestamped JSON to the existing aiostreams-backup PVC
  (NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600)
- 90-day retention, logs out to invalidate the auth key
- Pushgateway metrics: stremio_account_backup_{success,bytes,
  addon_count,duration_seconds,last_run_timestamp}

Protects against: accidental "uninstall all" / API regression / wrong
account login wiping the curated set of 22 addons (Cinemeta + 16
MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local).

Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.
2026-05-22 14:16:47 +00:00
aa6e9b0242 recruiter-responder: public /cb ingress for Telegram URL-button callbacks
- Add ingress_factory module (auth=none, HMAC + expiry are the gate);
  ingress_path=["/cb"] only — /api stays internal, /healthz cluster.
  dns_type=proxied. anti_ai_scraping=false.
- Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret`
  auto-clones the wildcard cert into every namespace.
- Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS
  hostname relax).
- Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me.
- Drop git-crypt-encrypted wildcard cert files into
  stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new
  .gitleaksignore — git-crypt encrypts at rest but the working-tree
  copy is plaintext, so gitleaks can't tell.

Smoke-tested end-to-end 2026-05-15 23:45:
  synthetic email -> Telegram with / buttons ->  tapped via curl
  -> 'Sent' HTML page -> thread.status=sent, decision row recorded
  with decided_via=telegram_button, outbound message threaded correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
77010b769a aiostreams: whitelist Vidhin + Tamtaro sync URLs
Adds two env vars on the AIOStreams deployment:
- WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex
  (TRaSH-aligned) so syncedRankedRegexUrls works for the user
- WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions +
  Tamtaro's ISE/PSE/ESE-standard

Gotcha: AIOStreams validates each synced* field against the matching
whitelist — stream-expression files (incl. Vidhin's expressions.json)
go in WHITELISTED_SEL_URLS, not the regex one, even though they live
in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG.

User config: enabled Vidhin's regex + ranked expressions + Tamtaro's
ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering;
can be added later from the same whitelist.
2026-05-22 14:16:47 +00:00
Viktor Barzin
c396092c86 aiostreams: weekly NFS backup of decrypted user config
Adds aiostreams-config-backup CronJob (Sun 03:00 weekly):
- Pulls /api/v1/user via internal ClusterIP with UUID + password from
  the existing aiostreams-probe-secrets ExternalSecret
- Writes timestamped JSON to nfs-backup PVC mounted at /backup
- 90-day retention, prunes older files
- Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp}

NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite
to Synology via the existing offsite-sync-backup CronJob).

Complements the daily postgresql-backup-per-db pipeline (which dumps
the encrypted blob) by storing the decrypted JSON — usable for human
inspection / disaster recovery even without the AIOStreams password.

Verified: manual job wrote 12931 bytes, file present on NFS.
2026-05-22 14:16:47 +00:00
root
1177a82452 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:47 +00:00
a98b00324d recruiter-responder: pin image tag + run plugin installer init as root
- stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3
  (300s LLM timeouts + IMAP BODY.PEEK[] fix).
- stacks/openclaw/main.tf: install-recruiter-plugin init container now
  runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the
  recruiter-responder image otherwise drops to uid 10001 which can't
  write or chown.

Smoke-tested end-to-end 2026-05-15 ~23:15:
  Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage
  (12.1s, JSON output complete with company/role/salary/location/tech)
  -> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK.
Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from
the n8n Postgres workflow_entity table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
a72590db7d recruiter-responder: vault DB role + switch proactive push to Telegram
- stacks/vault/main.tf: register pg-recruiter-responder static role on
  the postgresql connection (7d password rotation). Adds the role to
  allowed_roles and creates vault_database_secret_backend_static_role
  for `recruiter_responder` user.
- stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap
  TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID.
  Updated header doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:46 +00:00
Viktor Barzin
89e9471e87 state(vault): update encrypted state 2026-05-22 14:16:46 +00:00
7e1580ba8c recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount
Three coupled changes for the new recruiter-responder pipeline:

1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses
   unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the
   download Job script + cmd renderer to handle text_only=true (skip
   mmproj download + --mmproj flag). The 3 existing vision models stay
   on text_only=false; no behaviour change for them.

2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets
   (app secrets from secret/recruiter-responder, DB creds from Vault DB
   engine static-creds/pg-recruiter-responder), Deployment (replicas=1,
   Recreate -- IMAP IDLE + APScheduler want single leader), Service
   ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder.

3. stacks/openclaw/: add init container `install-recruiter-plugin` that
   uses the recruiter-responder image to copy the .mjs plugin into
   /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin
   version to the recruiter-responder image tag. Also injects
   RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token
   from openclaw-secrets.recruiter_responder_bearer_token, optional).

Pre-apply checklist for recruiter-responder stack:
  - Vault: seed secret/recruiter-responder with webhook_bearer_token,
    imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token,
    task_webhook_token.
  - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as
    above webhook_bearer_token).
  - dbaas: create DB recruiter_responder + role recruiter_responder,
    and Vault DB-engine role static-creds/pg-recruiter-responder.
  - Build + push image via Woodpecker (recruiter-responder repo CI).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:46 +00:00
Viktor Barzin
95b9f7bc89 aiostreams: 1h stream cache + canary stream-count probe + 3 alerts
Hardening pass following the empty-stream-list incident:

1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 /
   disabled). Default behaviour hit all 5 upstream addons on every
   Stremio request; with a 1h TTL repeat requests for the same title
   are instant, while RD cache invalidations still propagate quickly.

2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's
   encryptedPassword via the internal ClusterIP, runs a canary stream
   search for Breaking Bad S01E01, pushes streams_count + probe_success
   to Pushgateway. Uses an ExternalSecret pulling UUID + password from
   Vault secret/viktor. Same pattern as email-roundtrip-monitor.

3. Three alerts in monitoring's prometheus_chart_values.tpl:
   - AIOStreamsStreamCountLow  (< 50 streams for 30m)
   - AIOStreamsProbeFailing    (probe_success == 0 for 30m)
   - AIOStreamsProbeStale      (last_run_timestamp > 30min for 10m)

Verified: probe returned streams=411 success=1 on first run; all 3
alerts loaded into Prometheus with state=inactive health=ok.
2026-05-22 14:16:46 +00:00
root
fba5ee2df4 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:46 +00:00
Viktor Barzin
c73234982f aiostreams: pin nightly + switch to auth=app
- Pin viren070/aiostreams:nightly → :2026.05.14.1326-nightly (avoid
  stale-pull cache, matches 8-char SHA convention for rolling tags)
- Switch ingress auth tier required → app: Authentik forward-auth
  blocks Stremio clients (cannot follow OAuth 302), and AIOStreams
  already enforces UUID + password on /configure and /api/*, with
  Stremio addon URLs using encryptedPassword as a bearer token.
  Result: empty-stream-list issue fixed for public Stremio clients.

Verified: 410 streams returned via public URL for Breaking Bad S01E01
with no cookies, vs 0 before (502→Authentik OIDC redirect).
2026-05-22 14:16:46 +00:00
2903ab9778 monitoring(wealth): move Positions table under contrib/growth row
Positions panel now sits at y=32 (immediately below the
contrib-vs-market + growth row at y=22..32), and everything from
the per-account stack down shifts 8 rows lower.
2026-05-22 14:16:46 +00:00
8461275308 wealth: positions table panel (shares + cost basis + unrealised return)
pg-sync sidecar now mirrors three extra views from the wealthfolio
SQLite: assets (id/symbol/name/currency), quote_latest (one row per
asset, preferring YAHOO over MANUAL on same-day collisions), and
positions_latest (currently-held positions extracted from the TOTAL
aggregate row of holdings_snapshots — quantity, average cost,
total cost basis).

Wealth dashboard gets a new bottom Positions table joining the three:
symbol, name, shares, avg cost, last price, market value, cost,
gain, return %. Gain and return % are color-text with red<0, green>=0
thresholds.
2026-05-22 14:16:46 +00:00
d6049ff7a0 terminal: extract app code to viktor/terminal-lobby on Forgejo
The lobby has grown enough (frontend, two Go services, devvm units +
scripts + config) that it earns its own repo. Code now lives at
https://forgejo.viktorbarzin.me/viktor/terminal-lobby with
scripts/deploy.sh covering the manual deploy until CI activation
lands (Woodpecker forge_id=2 activation still 500s; Forgejo Actions
not yet enabled).

This stack now owns only the K8s side — Services, Endpoints,
IngressRoutes, middlewares. main.tf comment block updated to point
at the new repo and the full DevVM port map.

Removed:
- stacks/terminal/files/        (index.html + DevVM artefacts)
- stacks/terminal/tmux-api/     (Go service)
- stacks/terminal/clipboard-upload/ (Go service)
2026-05-22 14:16:46 +00:00
c135c04c79 terminal: make slate the default theme 2026-05-22 14:16:46 +00:00
a44aa52e1a terminal: theme picker (carbon/slate/mono/ink) replacing violet
Drops the hardcoded violet/indigo palette. Four themes are defined as
CSS variables on body.theme-{carbon,slate,mono,ink}:

- Carbon (default): warm dark, ivory text, restrained amber accent.
- Slate: cool dark, GitHub/Linear-ish charcoal with electric blue.
- Mono: strict greyscale, off-white accent.
- Ink: warm paper light, deep ink, terracotta accent.

The lobby reads the choice from localStorage and applies the class
before render. The picker lives at the bottom of the sidebar
(margin-top: auto pins it). On change, the iframe is bounced through
about:blank so the inner xterm picks up the new computed CSS vars
(--terminal-bg/fg/cursor/selection) on the next mount.

Picker UI uses native buttons, current theme highlighted with the
accent border + color. No gradients, hairline borders only.
2026-05-22 14:16:45 +00:00
cbe83597c0 terminal: rename sessions + drag-and-drop reorder
Backend: POST /sessions/<name>/rename in tmux-api runs tmux
rename-session as the mapped OS user. 400 on bad name, 404 on missing
source, 409 on duplicate target, 401 on missing auth header.

Frontend:
- Rename button per card → prompt() dialog, validates against the
  shared regex. Updates currentActive + hash + iframe.src if the
  renamed session was active.
- Session order is now user-driven, persisted in localStorage
  keyed per osUser. New sessions append at the bottom. The previous
  sort-by-lastActivity is gone.
- HTML5 drag-and-drop reorders cards live during dragover; dragend
  captures the DOM order into localStorage.
- Polling renderLobby is suppressed while a drag is in flight so the
  5s tick doesn't yank the list out from under the user.
2026-05-22 14:16:45 +00:00
Viktor Barzin
04fd241679 terminal: inline session switching via sidebar + iframe
Replace full-page navigation with a two-pane lobby. Sidebar holds the
session list as clickable cards; an iframe in the content pane swaps
its src on click so switching sessions takes one click instead of two
navigations.

- #lobby-shell grid (260px sidebar + iframe pane)
- Cards become role=button, kill button stops propagation
- activateSession/deactivateSession with hash routing
  (location.hash <-> active session, replaceState so back stack stays
  clean)
- Killed active session deactivates the iframe before re-render
- 5s session poll preserves currentActive; deactivates if gone
- Mobile media query collapses to one column

CSP frame-ancestors already permits same-origin embedding
(*.viktorbarzin.me), no infra changes needed. Direct-link
?arg=<name> path is unchanged.
2026-05-22 14:16:45 +00:00
root
7663b5c36e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
43affc3cdc actualbudget: add enabled flag to factory, disable emo
Emo isn't using the instance and the daily bank-sync CronJob has been
failing because the budget has zero accounts (deleted from the UI),
triggering BankSyncStale. Adds an `enabled` toggle that gates the core
Deployment + Service + Ingress + http-api + CronJob behind a single
plan-time bool while preserving the PVC, so we can flip back to true
later to restore the instance as-was.

Also fixes a latent bug where the http-api Service was always created
even when `enable_http_api=false`.

Apply: 7 resources destroyed (emo deploy/svc/ingress/cf dns/http-api
deploy+svc/cronjob), 0 changes for viktor/anca (moved blocks
migrated their state cleanly to the new [0] addresses). Pushgateway
job bank-sync-emo cleared manually; orphaned external-monitor
synced out by external-monitor-sync.
2026-05-22 14:16:45 +00:00
9fce3c7b09 terminal: per-Authentik-user OS-user isolation; deny unmapped users
Restores the kernel-level isolation the pre-cutover ttyd-session.sh had,
but keeps the multi-session lobby UX:

- ttyd.service gets `-H X-authentik-username` back. `tmux-attach.sh` reads
  $TTYD_USER, looks up the local part in /etc/ttyd-user-map, denies the
  connection (no fallback to wizard) if there's no mapping, otherwise
  `sudo -n -H -u <os_user> tmux …`. Each Authentik identity → its own
  Unix user → its own `/tmp/tmux-<uid>/default` socket.
- tmux-api scopes every request to the same OS user via the same header.
  Adds /whoami so the lobby HTML can preflight access and render
  "logged in as <os_user> (<authentik>)" instead of leaving the user to
  discover the deny via a reconnect loop.
- Commits /etc/ttyd-user-map and the matching /etc/sudoers.d/ttyd-users
  fragment under files/devvm/ so future operators see one canonical
  source of truth. Current mappings: vbarzin → wizard, emil.barzin → emo.

Adding a user is now: append a line to ttyd-user-map + a NOPASSWD
sudoers line + `useradd -m`. README walks through it.

No Terraform changes — this is all DevVM-side + lobby JS.
2026-05-22 14:16:45 +00:00
aff4f67671 terminal: cut over to multi-session lobby on terminal.viktorbarzin.me
Promotes the staged multi-session UX from term.viktorbarzin.me to the
primary terminal.viktorbarzin.me hostname. `ttyd.service` on the DevVM
moves to the same ExecStart that `ttyd-multi.service` was running:
`/usr/local/bin/ttyd -W -a -t enableClipboard=true -I
/usr/local/share/ttyd/index.html -p 7681 /usr/local/bin/tmux-attach.sh`.
The lobby HTML supersedes the old per-user-attach index.html
(ttyd-session.sh wrapper retired alongside).

Terraform: retires the `terminal-multi` Service+Endpoints and the
term.viktorbarzin.me ingress (Cloudflare DNS record for `term` is
released by module deletion). The tmux-api Service+Endpoints stay, but
its IngressRoute now matches terminal.viktorbarzin.me — same path-prefix
specificity wins against the catch-all ingress.

DevVM follow-up (applied manually as before — see files/devvm/README.md):
restart ttyd to pick up the new unit, stop+disable ttyd-multi.service.
2026-05-22 14:16:45 +00:00
root
86a2c66c8e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
b1b2cb1974 terminal: add multi-tmux-session lobby on term.viktorbarzin.me (additive)
New hostname term.viktorbarzin.me serves a session-picker UI that lists,
creates, and kills tmux sessions. Visiting ?arg=<name> attaches to that
session (auto-creates via tmux -A). Builds on a fresh ttyd instance
(7685) plus a tmux-api Go binary (7684) on the DevVM, both running as
User=wizard alongside (not replacing) the existing ttyd.service (7681),
ttyd-ro.service (7682), and clipboard-upload (7683). Cutover of
terminal.viktorbarzin.me to the multi-session setup is deferred.

Terraform diff is purely additive — terminal-multi/tmux-api Service +
Endpoints + ingress_multi (term.viktorbarzin.me, Authentik-gated) + an
IngressRoute that path-prefixes /api/sessions/* to tmux-api with the
matching strip-prefix Middleware.

DevVM-side units ship under files/devvm/ with a README — manual scp +
systemctl install (see files/devvm/README.md). ttyd 1.7.7 already
deployed there (≥1.7 needed for -a).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
726fb25182 monitoring(wealth): paint declining segments red on growth chart
Mirror the panel 5 treatment on panel 7 (Growth = market value −
contribution). Second SQL column emits the growth value only when
the point is part of a declining segment; field override paints it
red with no fill, spanNulls=false.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cc47da87b0 payslip-ingest, instagram-poster: suspend two chronic-failure cronjobs
Identified during alert-noise review as steady sources of JobFailed.
Suspending them stops the noise; unsuspend after the per-job blocker is
cleared.

* payslip-ingest/actualbudget-payroll-sync — blocked on Vault
  `secret/payslip-ingest` missing `actualbudget_encryption_password`.
  `actualbudget_api_key` and `actualbudget_budget_sync_id` were added
  (copied from `secret/fire-planner`) in the same session; the
  encryption password is not stored anywhere in Vault and needs to be
  populated separately. ExternalSecret sync has been failing since
  2026-04-25.

* instagram-poster/ig-refresh-token — the deployed image (:da5b4191)
  does not contain the `POST /ig-refresh-token` route; the route is
  defined in uncommitted working-copy changes at
  `instagram-poster/instagram_poster/app.py:695`. Unsuspend after the
  new image rolls.

Each `suspend = true` line carries an inline comment with the unsuspend
trigger.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cbd0f71a3b monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h
Three improvements identified in the 7d alert-noise review:

A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures
   node-level pull error rate, which doesn't catch a single pod stuck
   in ImagePullBackOff — council-complaints sat broken for ~10h on
   2026-05-12 without paging. The new rule fires per-pod after 30m.

B. Two new inhibit_rules:
   - PVFillingUp (95% used, critical) suppresses PVPredictedFull
     (linear projection, warning) on the same PVC. Pair was producing
     ~24h of redundant firing per 7d.
   - EmailRoundtripFailing (active probe failure) suppresses
     EmailRoundtripStale (derivative >60min no-success). Same outage
     windows, ~14.5h of duplicate firing per 7d.

C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old
   30-minute window paged on the first failed iteration before the
   next run could recover. 2h means "still failing across at least
   two cron iterations" — much more actionable.

Verified live: rules loaded, inhibitors in alertmanager config,
PodImagePullBackOff is currently inactive (council-complaints
ImagePullBackOff actively detected — see separate fix).
2026-05-22 14:16:45 +00:00
Viktor Barzin
70292b9e23 monitoring: TraefikReplicaConfigStale — drop false-positive on stale series
The initial formulation used clamp_min(min(rate[2h]), 0.0001), which
made a recently-deleted pod's lingering rate=0 drive the ratio toward
infinity for up to 2h until the stale series aged out of the rate
window. With for: 2h, this was a near-miss for spurious firing in the
immediate aftermath of restarting the bad replica (our remediation
path).

Tighter formulation:
* 30m rate window — stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod
  ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12
  incident) sits well above it, so true positives still trip
* for: 1h — fast enough to catch the next incident, long enough that
  short rate dips don't flap

Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0
results with the live cluster's tight rate spread (~0.00065–0.0007/s
across all three Traefik replicas).
2026-05-22 14:16:45 +00:00
Viktor Barzin
165bb7258e monitoring: detect stale Traefik replicas + reduce alert-storm cascading
Two new alertmanager inhibit rules and one new Prometheus alert,
informed by the 2026-05-12 incident where Traefik pod
traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers
vs 119 on healthy peers (stale K8s informer cache) and served 404 for
~1/3 of viktorbarzin.me traffic.

* New alert TraefikReplicaConfigStale: fires when max/min reload-rate
  ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h
  for-clause tolerates legitimate post-restart ramp-up; the bug
  pattern persists indefinitely.
* New inhibit: TraefikReplicaConfigStale suppresses the symptom
  alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical},
  IngressErrorRate5xxHigh, TraefikHighOpenConnections,
  ForwardAuthFallbackActive, AnubisChallengeStoreErrors,
  ExternalAccessDivergence) so only the actionable root cause pages.
* New inhibit: HomeAssistantDown suppresses
  HomeAssistantCriticalSensorUnavailable and
  HomeAssistantMetricsMissing — when HA itself is down, every sensor
  going unavailable is noise (10x firings observed in the last 12h).
* Extend NodeDown and NFSServerUnresponsive target lists to also
  suppress HomeAssistantCriticalSensorUnavailable.
2026-05-22 14:16:45 +00:00
Viktor Barzin
448bc0c0f6 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
root
8e13f1528e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
e8854f9230 wealthfolio, paperless-ngx: drop migration-leftover -proxmox PVCs
The 2026-04-13 encrypted-PVC migration replaced the wealthfolio and
paperless-ngx data volumes with -encrypted variants but never removed
the original -proxmox PVC blocks from TF — both were sitting orphaned
with no pod mounting them, occupying 1Gi each of LVM thin pool. The
autoresizer also logged repeated "failed to get volume stats" for them
(no kubelet stats without a mounted pod), masking real signal.

  * wealthfolio: removed kubernetes_persistent_volume_claim.data_proxmox
  * paperless-ngx: removed kubernetes_persistent_volume_claim.data_proxmox
  (the paperless PVC turned out to be out-of-TF-state, so deleted via
   kubectl after the TF block removal.)
2026-05-22 14:16:45 +00:00
Viktor Barzin
701b0e3c57 claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent
The 10Gi proxmox-lvm-encrypted PVC `claude-agent-workspace-encrypted` was
declared in TF but never wired into the deployment — the `workspace`
volume_mount pointed at an emptyDir, so the PVC sat allocated and idle
from 2026-04-15 to 2026-05-11.

Restructured per the design intent:
  * `workspace` (emptyDir) — fast per-pod ephemeral scratch for git clones.
    Each agent job clones the infra repo fresh, so persistence doesn't
    buy anything and emptyDir avoids RWO contention if the deployment
    is ever scaled past 1 replica.
  * `persistent` (5Gi NFS-backed RWX) — mounted at /persistent for cases
    where the agent needs to write state that should survive pod
    restarts (caches, ad-hoc outputs). RWX so all replicas share it;
    the service's sequential-mutex lock prevents concurrent writes.

Also fixed `fix-perms` init container: the Dockerfile's `WORKDIR
/workspace/infra` causes kubelet to create that path inside the
emptyDir as root:fsGroup with the setgid bit, which uid 1000 can't
write to. Pre-create the path + chmod 0775 to make it writable.

NFS export already exists on the PVE host
(/srv/nfs/claude-agent-persistent, owned 1000:1000).

Verified: pod runs 1/1; `/persistent` writable as agent uid 1000;
git-init successfully clones infra into /workspace/infra.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cd13b9d062 monitoring: drop PVAutoExpanding alert — info-only noise, not actionable
PVAutoExpanding fired at >80% used (info severity), but pvc-autoresizer's
threshold is 10% free (= 90% used) — the alert always fired ~10 points
before any action would have been taken, and there was nothing for an
operator to do during that window either. It was a "heads up" that
didn't surface a problem.

Real failure modes are already covered:
  * PVFillingUp (critical, >95% for 10m) — autoresizer didn't keep up
  * PVPredictedFull (warning, predict_linear 24h) — trend toward exhaustion

Sharpened PVFillingUp's annotation to spell out the likely causes
(storage_limit reached, expansion failing, or missing autoresizer
annotations) so the responder doesn't have to recall the runbook.
2026-05-22 14:16:44 +00:00
396cce82cf monitoring(wealth): paint declining segments red on portfolio chart
Add a second SQL column on panel 5 that returns net_worth only when the
current point's previous or next neighbor is lower — i.e. the point is
part of a declining segment (including the peak and trough endpoints).
A field override draws this 'decline' series in red with no fill and
spanNulls=false, overlaying the green base line so down periods show
up as red on top of the climb.
2026-05-22 14:16:44 +00:00
Viktor Barzin
30eff178e9 healthcheck: probe uptime-kuma via internal Service (port-forward), not public URL
The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which
sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO
handshake the uptime-kuma-api library uses, and the library can't
complete the OAuth flow, so every healthcheck reported "Connection
failed" even though the pod was healthy and serving 225 monitors.

Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the
uptime-kuma namespace for the duration of the check, connect the
library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the
port-forward on the way out. The disown is to suppress bash's "Killed"
job notification on stderr, which corrupted stdout when stderr was
merged for JSON parsing.

Verified end-to-end: healthcheck now reports the real signal —
"external down(3): www, xray-vless, hermes-agent" — the same 3
Cloudflare-facing endpoints flagging in the uptime-kuma logs.
2026-05-22 14:16:44 +00:00
Viktor Barzin
a699d5bedf vault: move audit-PVC autoresizer annotations to kubernetes_annotations
Background: 2026-05-10 someone added `server.auditStorage.annotations`
to vault/main.tf attempting to enable pvc-autoresizer on audit-vault-N
PVCs. The vault helm chart maps that block into the StatefulSet's
volumeClaimTemplates, which is immutable post-creation on existing
StatefulSets. Result: 4 consecutive helm upgrade attempts (rev 16-19)
all rejected with "StatefulSet spec: Forbidden", leaving the release
stuck in failed state since 22:47 UTC that day. Live PVCs were
hand-annotated via `kubectl annotate` as a workaround, but the IaC
declared a path that couldn't be applied — every subsequent tg apply
on the vault stack would re-fail.

Fix:
  * Remove `annotations` block from `server.auditStorage` values
    (with a comment recording why it can't live there).
  * Add `kubernetes_annotations` resources for audit-vault-{0,1,2}
    with `force = true`, so Terraform adopts the existing annotations
    and tracks the desired-state in IaC going forward. The autoresizer
    cares about PVC annotations, not StatefulSet template annotations,
    so this is functionally equivalent.

Done out-of-band before commit (helm state was already corrupted):
  `helm rollback vault 15 -n vault` → revision 20 deployed (clean).

Verified: helm status vault = deployed; audit-vault-0 still has
threshold=10% storage_limit=10Gi annotations; cluster healthcheck
no longer reports vault/vault=failed.
2026-05-22 14:16:44 +00:00
Viktor Barzin
18a17891c4 state(vault): update encrypted state 2026-05-22 14:16:44 +00:00
Viktor Barzin
bc5c10b38d ci: retrigger image rebuild — prior pipeline aborted during PG outage 2026-05-22 14:16:44 +00:00
Viktor Barzin
b278a8f158 docs/auth: sync to current auth enum (required/app/public/none)
Replace the legacy `protected = true` reference with the four-tier
`auth` enum that's been live for weeks. Document the anti-exposure
guard (`scripts/check-ingress-auth-comments.py` + `scripts/tg`)
that enforces the inline-comment convention. Fix two stale paths:

  - `stacks/platform/modules/ingress_factory/` → `modules/kubernetes/ingress_factory/`
  - `stacks/platform/modules/traefik/middleware.tf` → `stacks/traefik/modules/traefik/middleware.tf`

Replace the single `protected = true` example with three: a
default Authentik-gated admin UI, an app-managed backend, and an
intentionally-public webhook receiver. Each example shows the
required comment line above the auth assignment.

[ci skip]
2026-05-22 14:16:44 +00:00
Viktor Barzin
2ba36436c8 real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed)
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:

- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
  £1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
  £400k-1.2M

Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.

Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.

Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:44 +00:00
Viktor Barzin
b3c1631597 ci: add python3 to infra-ci image — unblocks scripts/tg auth-comment check
Commit 0712a1b6 added a Python-based ingress_factory auth-comment check
that runs from scripts/tg on every plan/apply. The CI image
(forgejo.viktorbarzin.me/viktor/infra-ci) doesn't ship python3, so every
CI apply has been failing since with:

  env: can't execute 'python3': No such file or directory

Adding python3 to the apk install line restores CI applies for all stacks.
The build-ci-image.yml pipeline auto-fires on this commit (path filter
on ci/Dockerfile), so the rebuild + retag happens without manual action.
2026-05-22 14:16:44 +00:00
Viktor Barzin
f10784ddb6 infra: document auth = "app|none" tier on every legacy ingress
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.

Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
  - navidrome   (Subsonic user/password)
  - ntfy        (deny-all default + user.db tokens)
  - nextcloud   (WebDAV/CalDAV/CardDAV app passwords)
  - vaultwarden (Bitwarden-compatible token auth)
  - headscale   (OIDC + preauth keys for Tailscale nodes)
  - paperless-ngx (app-layer login + API tokens)

Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).

real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.

No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
2026-05-22 14:16:44 +00:00
Viktor Barzin
20774f794d dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.

dbaas (Tier 0):
  * max_connections: 100 → 200
  * shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
  * effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
  * pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
    + ~16MB work_mem * concurrent sorts + OS cache + overhead)
  * Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
    which rolls the cluster (standby first, then primary failover).

monitoring:
  * New scrape job 'cnpg' on dbaas namespace pods labeled
    cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
    cnpg_cluster + cnpg_role labels for alert grouping.
  * PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
  * PGConnectionsCritical (critical, >95% for 3m) — last call before
    refusing connections.

Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
2026-05-22 14:16:44 +00:00
Viktor Barzin
9e5a5fb0c7 infra/scripts/tg: enforce ingress_factory auth-comment convention
Every `tg plan/apply/destroy/refresh` now runs
`scripts/check-ingress-auth-comments.py` against the current stack
before invoking terragrunt. The check fails closed if any
`auth = "app"` or `auth = "none"` line in the stack's .tf files lacks
an immediately-preceding `# auth = "<tier>": ...` comment documenting
what gates the app (for "app") or why the endpoint is intentionally
public (for "none").

Why tg-level (not git pre-commit): tg is the universal entry point
for all infra changes. CI runs it, headless agents run it, humans
run it. A pre-commit hook only catches the human path. Wiring the
check into tg means the anti-exposure guard fires regardless of who
or what is invoking terragrunt.

Stack-scoped: each stack documents itself the next time it's edited.
The 30+ existing `auth = "none"` stacks that predate this guard are
not blocked from operating today; they'll need the comment added the
next time someone runs `tg plan` on them — at which point the gate
forces a conscious "yes, this is intentional" moment before any
state change can land.

Skipped on: init, fmt, validate, output, etc. — anything that doesn't
read or write infra state.
2026-05-22 14:16:44 +00:00
Viktor Barzin
e826e36658 state(dbaas): update encrypted state 2026-05-22 14:16:44 +00:00
Viktor Barzin
eb529d60e4 infra/ingress_factory: add auth = "app" mode for self-authed backends
Adds a fourth auth tier alongside required/public/none. "app" is
functionally identical to "none" — no Authentik middleware attached —
but the distinct name records intent at the call site: this backend
has its own user login (NextAuth, Django, OAuth, bearer-token API,
etc.) and Authentik would only break it.

Why the new tier: with only required/none, every "the app has its
own auth so drop Authentik" decision looked identical at the call
site to "this is an OAuth callback / webhook receiver / native-client
API". Future readers couldn't tell whether a stack was intentionally
unauthenticated or relying on backend auth. Now they can.

Migrates the 8 stacks flipped earlier this session (novelapp, immich,
linkwarden, tandoor, freshrss, affine, actualbudget, ebooks/audiobookshelf)
from "none" to "app". Confirmed no-op: `tg plan` on novelapp showed
"No changes" — same middleware chain, same live state.

The variable description and the .claude/CLAUDE.md Auth section now
spell out the anti-exposure rule: only pick "app" or "none" AFTER
verifying the app has its own user auth ("app") or the endpoint is
intentionally public ("none"). Default stays "required" so accidental
omission fails closed.

[ci skip]
2026-05-22 14:16:44 +00:00
root
6b9f5e8027 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:44 +00:00
Viktor Barzin
665b6b2934 actualbudget+monitoring: per-account bank-sync metrics, drop noisy alert
The bank-sync CronJob was posting to /accounts/banksync which fans out to
ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls
per-account per-24h quota, a single rate-limited account would 500 the
whole call, and `bank_sync_success` would flip to 0 even though the data
itself was still flowing through manual UI syncs. Result: BankSyncFailing
fired routinely whenever the user had been active in the UI that day —
a structural false positive.

Fix:
  * CronJob: enumerate accounts via GET /accounts, POST per-account
    /accounts/{id}/banksync, emit bank_sync_account_success and
    bank_sync_account_last_success_timestamp labelled by account name.
    Roll up bank_sync_success = 1 iff any account succeeded.
  * Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale
    at 48h (global drought). Add BankSyncAccountStale at 72h (catches
    single-account auth expiry — the real signal we wanted).

Verified: manual run on bank-sync-viktor pushes 6 per-account success +
timestamp series; roll-up bank_sync_success=1; no firing alerts.
2026-05-22 14:16:44 +00:00
Viktor Barzin
7b6eee49c4 infra: drop Authentik forward-auth from 7 self-authed apps (auth = "none")
Apps with their own user auth + bearer-token APIs were being broken by
Traefik → Authentik forward-auth: every iOS/Android/native client got a
302 to authentik.viktorbarzin.me instead of the JSON they expected.
Authentik's 302+cookie dance can only be followed by a real browser.

Changed:
  - immich         (Immich mobile app + bearer-token /api)
  - linkwarden     (NextAuth + Linkwarden mobile clients)
  - tandoor        (Django auth + Tandoor mobile clients)
  - freshrss       (Fever/GReader API used by Reeder/FeedMe/etc.)
  - affine         (workspace auth + AFFiNE desktop/mobile sync)
  - actualbudget   (server password + Actual mobile/sync clients)
  - ebooks/abs     (Audiobookshelf iOS/Android app)

Each app's own auth is the gate now. CrowdSec + rate-limit + anti-AI
UA filter still front the ingresses. Same pattern as the novelapp
change earlier this session.

[ci skip]
2026-05-22 14:16:44 +00:00
Viktor Barzin
f98c3f2049 infra/novelapp: drop Authentik forward-auth (auth = "none")
novelapp handles its own user auth via NextAuth + Google OAuth, so the
ingress-level Authentik forward-auth was double-gating. Mobile webviews
(iOS/Android) can't follow the Authentik 302/cookie dance — they saw
HTML challenges where they expected JSON. CrowdSec + rate-limit +
anti-AI UA filter remain in front; novelapp's own login handles users.

[ci skip]
2026-05-22 14:16:44 +00:00
root
77492b3131 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:44 +00:00
Viktor Barzin
9be0672aa3 claude-memory / resume: unblock terragrunt apply (var defaults + psql -d postgres)
Two pre-existing apply failures uncovered during the Phase 4 mass apply,
unrelated to the auth refactor but blocking 100% rollout.

claude-memory:
- `var.claude_memory_db_password` had no default and wasn't passed by
  terragrunt → fall back to Vault `secret/claude-memory.db_password` via
  `coalesce(var.x, data.vault.data["db_password"])`.
- db-init Job was failing with `database "root" does not exist` because
  psql defaults the database name to the user when -d is omitted. Added
  `-d postgres` to all five psql invocations.

resume:
- `var.resume_database_url` had no default and wasn't passed → default to
  empty string. Vault carries the real value at `secret/resume.database_url`
  consumed at the deployment env-var level; the variable here just needs
  a value to satisfy the apply.

Also: priority-pass had lost most of its TF state (only 3 of 8 resources
tracked); imported namespace/service/pvc/deployment/ingress/tls-secret to
re-bind state with live K8s resources. No code change needed there.

Verified after re-apply:
- claude-memory.viktorbarzin.me → 200 (auth=none, native MCP responses)
- priority-pass.viktorbarzin.me → 302 → authentik (auth=required)
- resume.viktorbarzin.me → 302 → authentik public outpost (auth=public)
- 6 of 7 previously-failing applies now green; only vault remains, blocked
  by an unrelated helm chart immutable-StatefulSet-field issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:44 +00:00
Viktor Barzin
a168277213 healthcheck: tune noise filters + nvidia-exporter auth=none
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":

1. prometheus_alerts: only count severity=warning|critical. Info-level
   alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
   alert rule itself sets severity; the script should respect it.

2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
   auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
   Lets Encrypt wildcard renews weekly; <14d is the only window where
   human attention is genuinely useful.

3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
   event/image/update domains (transient by design), skip friendly
   names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
   only count entities whose last_changed > 24h. Was 431/1470,
   most of which were "phone in standby" noise.

4. ha_automations: only flag DISABLED automations as abandoned if
   they've also been untouched (last_changed) for >180 days; raise
   stale threshold 30d → 180d. Was flagging seasonal/holiday-only
   automations as broken.

5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
   CronJob retry leftovers (Error/Failed phase pods that K8s keeps
   around for log inspection) aren't problematic at the cluster level.

6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
   shot failures were a recurring false-positive even though the
   service was healthy.

Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
root
8483ca59ba Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:43 +00:00
Viktor Barzin
dc7c19d88e frigate: lan ingress auth=none for HA Sofia integration
The frigate-lan.viktorbarzin.lan ingress had Authentik forward-auth in
front. HA Sofia's frigate integration polls /api/config and only knows
how to use Frigate's own API key (not browser SSO), so every poll got
a 302 to authentik.viktorbarzin.me and the integration entered the
errors-state. Same pattern as idrac-redfish-exporter (5c594291).

allow_local_access_only IP allowlist + Frigate's API key are enough.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dc134011eb fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.

Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.

Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dd2b7de291 fix: HA Sofia REST sensors + PVC drift safety
Two real issues found while triaging HomeAssistantCriticalSensorUnavailable
alerts and the prometheus + technitium PVC Terminating-but-in-use
state from the earlier session.

1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required →
   auth=none. HA Sofia REST sensors scrape these endpoints
   programmatically; with Authentik forward-auth in front, every
   request got a 302 to authentik.viktorbarzin.me and the REST
   sensors parsed the HTML login page instead of metrics — leaving
   the R730, UPS, and ~20 other sensors permanently unavailable.
   The allow_local_access_only IP allowlist (192.168.0.0/16 +
   10.0.0.0/8) already gates external access, so authentik on top
   was breaking machine-to-machine traffic for no security gain.

2. prometheus_server_pvc + technitium primary_config_encrypted:
   add lifecycle.ignore_changes = [spec[0].resources[0].requests].
   The autoresizer expands these PVCs; PVCs can't shrink. Without
   the ignore, every TF apply tried to revert the live size back
   to the TF spec value, hit K8s's shrink-forbidden rule, and
   force-replaced the PVC. Because the pod still mounted it, the
   PVC went into Terminating-but-protected limbo — fine until a
   pod restart would have orphaned the volume. Root cause of the
   2026-05-10 PVC Terminating incident.

Bonus: prometheus_server_pvc threshold was the inverted "90%" (the
same bug the bulk fecfa211 sweep fixed elsewhere; my regex only
matched "80%" so this one slipped through). Now "10%".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
7e69951cb9 state(dbaas): update encrypted state 2026-05-22 14:16:43 +00:00
Viktor Barzin
ee47197f3b vault: enroll audit-vault-0 in pvc-autoresizer (10Gi limit)
audit-vault-0 fills steadily with raft audit logs; without autoresizer
annotations it hits the 2Gi ceiling and Vault stalls on writes
(PVAutoExpanding alert was firing at 81% used). The Vault Helm chart
copies server.auditStorage.annotations onto the PVC at create time.

Live PVC already has the annotations applied via kubectl annotate;
this just keeps TF in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
0fdadcc3dd dbaas: pg-cluster threshold 80%→10% in CNPG inheritedMetadata
Same misconfig as the bulk fecfa211 sweep, but the pg-cluster YAML
is buried inside a null_resource local-exec heredoc so the regex
didn't catch it. CNPG operator inherits these annotations onto each
member PVC (pg-cluster-1, pg-cluster-2), and reapplies them on every
reconcile — patching the live PVCs alone bounces back within seconds.

Live state already patched via kubectl patch cluster, this just keeps
TF in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
3f2b2f9d32 fix: pvc-autoresizer threshold should be 10%, not 80%
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.

Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).

The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dc4ce46411 k8s-version-upgrade: detection script refresh apt before madison + DRY_RUN_OVERRIDE
Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while
apt-cache madison (without prior apt-get update) was reporting v1.34.5
— so the CronJob would have dispatched the agent against a stale
target. Now do `sudo apt-get update -qq` for just the kubernetes repo
before querying madison.

Also add a DRY_RUN_OVERRIDE env precedence so future test invocations
can override DRY_RUN without an apply cycle — but Job spec env is
immutable post-create, so this is only useful for CronJob spec edits
(suspend, then add env, then resume). Documented in the runbook.
2026-05-22 14:16:43 +00:00
Viktor Barzin
ae6dde45c2 k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC
Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed
PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to
ssh into master and run etcdctl against a non-existent /mnt/main mount.
The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to
10 min, then parses the backup-manage container log for "Backup done"
line + byte count.

Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works
end-to-end at the planning level.

Expanded the claude-agent ServiceAccount's privileges via a sibling
ClusterRole (claude-agent-upgrade-ops):
  - patch namespaces/k8s-upgrade (in-flight annotation)
  - create batch/jobs (trigger etcd snapshot Job)
  - patch nodes (cordon/uncordon)
  - create pods/eviction (drain)
  - delete pods (drain fallback)
2026-05-22 14:16:43 +00:00
Viktor Barzin
e75bcaf394 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-22 14:16:42 +00:00
Viktor Barzin
09f83b4e83 fire-planner / k8s-portal / insta2spotify: revert auth=public to auth=none
The Phase 4 audit promoted three "smoke-test candidates" from `protected = false`
to `auth = "public"`, but all three are XHR / curl-driven endpoints (fetch()
calls, automation scripts) that don't survive the 302+cookie redirect dance
that the public-auto-login flow requires on first visit. fire-planner's SPA
broke immediately — every fetch() to /api/* hit a cross-origin redirect and
CORS preflight rejected it.

Important learning for the `auth = "public"` design:

  `auth = "public"` is functionally equivalent to a normal Authentik forward-auth
  for the FIRST request — it issues a 302 to authentik to set a guest session
  cookie, then 302s back. This is invisible for top-level browser navigation
  but BREAKS:
    - XHR/fetch() under CORS preflight (preflight rejects redirects)
    - curl/automation scripts that don't preserve cookies across requests
    - Mobile / native clients that can't follow OAuth-style redirects

  Use `auth = "public"` only for top-level HTML pages where the user navigates
  via the browser address bar (or links). For XHR APIs, native-client surfaces,
  webhooks, OAuth callbacks — use `auth = "none"`.

The plan's "smoke test 3 candidates" were misjudged on this front. Reverting
all three to `auth = "none"` (their previous behaviour). The end-to-end public
flow IS verified working via curl + flow API — the design is sound, just the
test targets were wrong.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
root
faad99cff3 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:42 +00:00
Viktor Barzin
143413dc0b owntracks: explicit auth = "none" — Phase 5 audit completion
The Phase 4 audit pass missed this site because the previous agent scoped
out owntracks (it overrides the factory's middleware list via
extra_annotations to use its own basic-auth middleware). Adding the explicit
auth = "none" satisfies Phase 5's "every ingress has an explicit decision"
goal and makes the intent visible — mobile OwnTracks clients post location
data via HTTP basic-auth and can't follow Authentik forward-auth 302s.

Closes the loop on Phase 5: 122/122 active ingress_factory call sites now
carry an explicit auth = "..." decision (zero callers rely on the default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
ff5538a667 ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.

ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
  `protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
  middleware → dedicated public outpost → guest auto-bind. Logged-in users
  keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
  client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
  itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
  ingresses don't need anti-AI noise; the auth flow already discourages bots).

Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true`     → `auth = "required"`
- 8 explicit `protected = false`     → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
  reviewed individually:
  * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
    homepage, wrongmove UI, privatebin) → `auth = "none"`
  * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
    handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
    xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
    location ingestion, immich frame kiosk, headscale CP, send anonymous
    drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
    `auth = "none"`
  * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
    UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
  k8s-portal API, insta2spotify callback.

Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.

Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.

Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
88e57fdddb instagram-poster: disable ig-ingest-stories CronJob until /ig-ingest ships
The endpoint exists in the working copy of instagram_poster/app.py
but isn't committed/built/deployed, so every cron fire returned 404
and triggered JobFailed alerts every 30 min.

Set count = 0 to leave the resource declaration in place — re-enable
by removing that line once the endpoint is in a built image.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
d2be0921e8 scripts: timeout rsync + sqlite calls in daily-backup
Per-PVC rsync had no timeout, so any single hung PVC (e.g. on a
corrupted snapshot or a sqlite held open by a writer) blocked the
whole script until systemd's 4h TimeoutStartSec kicked in, leaving
every later PVC silently unbacked. Today's run hung on
mailserver/roundcubemail-enigma-encrypted at 05:09 and didn't recover
— hence WeeklyBackupFailing alert.

Now:
- rsync per PVC: timeout 30 min, exit 124 logged separately
- sqlite3 per database: timeout 5 min
- /etc/pve rsync: timeout 5 min

Each timed-out PVC bumps PVC_FAIL but the loop keeps moving.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
fddf168ecb cloudflare: disable AI bot edge-block so x402 can issue payment offers
CF zone was returning 403 to declared AI-bot UAs at the edge
(`ai_bots_protection: "block"`). That meant the in-cluster x402
gateway never saw the request and could never issue an HTTP 402 with
the wallet payment requirements — the bot just bounced.

Adopt `cloudflare_bot_management.zone` via root-module import block,
flip ai_bots_protection to "disabled". Bot Fight Mode (`fight_mode`),
crawler challenge (`crawler_protection`), and managed robots.txt are
unaffected — generic automated traffic still gets the bot fight gate.

End-to-end verified: `User-Agent: Mozilla/5.0 (compatible; ClaudeBot/
1.0;...)` on viktorbarzin.me now returns HTTP 402 (was 403 CF block)
with `payTo=0xCc33...659f`, `amount=10000` micro-USDC, `network=base`.

Trade-off: bots that don't pay still hit origin (instead of CF
blackholing them), so a small bandwidth uptick. Negligible at our
traffic level.
2026-05-22 14:16:42 +00:00
Viktor Barzin
4103ea2ba0 monitoring(prometheus): keep all 4 kubelet_volume_stats_inodes metrics
pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if
all four kubelet_volume_stats metrics (available_bytes, capacity_bytes,
inodes_free, inodes) are retrieved. The keep-list in the
kubernetes-nodes scrape job had available_bytes and capacity_bytes
(post 9d5da4d8) but was missing the two inode metrics, so the
autoresizer's reconcile logged "failed to get volume stats" for every
PVC and never resized anything.

Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free
to the regex.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
6d3308c848 authentik: add public guest auto-login flow + dedicated outpost + traefik public middleware
Phase 1+2 of default-deny ingress plan. Adds the infrastructure for an `auth = "public"`
ingress tier that auto-binds anonymous requests to a `guest` Authentik user (no UI
prompt), so public sites are still recorded as authenticated by Authentik for audit
purposes — but as `guest`, not by leaking the standard catchall flow.

- guest user in `Public Guests` group (NOT `Allow Login Users`).
- `public-auto-login` flow: stage_binding policy sets `pending_user = guest`,
  `evaluate_on_plan = false` + `re_evaluate_policies = true` so flow_plan is
  populated when the policy mutates it; `authentication = none` lets anonymous
  requests enter.
- `Provider for Public` proxy provider (forward_domain, cookie_domain
  viktorbarzin.me) with `authentication_flow = public-auto-login`.
- Dedicated `public` outpost: only the public provider bound, deployed as
  `ak-outpost-public` Deployment+Service in the `authentik` namespace by
  Authentik's K8s controller.
- `public-auth.viktorbarzin.me` ingress exposes the public outpost's
  `/outpost.goauthentik.io/*` so OAuth callbacks land on it (the embedded
  outpost doesn't know about the public provider, so `authentik.viktorbarzin.me`
  callbacks would fail).
- `authentik-forward-auth-public` traefik middleware points at the public
  outpost service (not via the auth-proxy nginx fallback). The plan's
  `?app=public` dispatch idea was tested and rejected — the embedded outpost
  dispatches purely by Host header, so a dedicated outpost was the only way
  to isolate the public flow without conflicts.

No ingresses use the new middleware yet — Phase 3+4 (the ingress_factory
`auth` variable refactor + audit pass) wires it up. This commit is additive
and behaviour-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
ff5416ff40 proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true)
Without this annotation on the StorageClass, pvc-autoresizer's controller
filters the SC out at the index lookup stage and never patches any of its
PVCs, regardless of utilization or per-PVC threshold/increase/storage_limit
annotations. Internal metric pvcautoresizer_loop_seconds_total ticked but
no PVCs were ever evaluated — visible cluster-wide as PVAutoExpanding alerts
firing for forgejo-data-encrypted (82%) and audit-vault-0 (81%) without any
ResizeStarted events ever following.

The Prometheus scrape-config fix in 9d5da4d8 was a prerequisite (autoresizer
reads kubelet_volume_stats_available_bytes) but not sufficient on its own.

Also pinning chart version to 0.5.6 so the next apply doesn't incidentally
bump to 0.5.7.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
ea9b5542d1 x402: flip gateway live with Viktor's wallet + Slack payment notifications
Wires the traefik stack to read two new fields from secret/viktor:
  * x402_wallet_address     -> 0xCc33BD250d39752e0ceaB616f8a05F72274a659f
  * alertmanager_slack_api_url (existing) -> reused as the per-payment
    notification webhook so payment events arrive in the same Slack
    channel as other infra alerts.

Gateway now runs `wallet_set:true, dry_run:false`. Verified end-to-end:
  - Browser UA on all 9 sites -> 200 (passes through to Anubis)
  - python-requests/2.31 + scrapy + ClaudeBot UA -> 402 with
    PaymentRequiredResponse, payTo == Viktor's wallet, amount=10000
    micro-USDC, network=base, asset=Base USDC contract
  - Direct Slack-webhook test from inside cluster -> HTTP 200

Image bumped to forgejo.../x402-gateway:d9b83125 with Slack-format
notification payload (text=..., username=x402-gateway,
icon_emoji=💰; auxiliary fields preserved for richer receivers).

Notifications fire on every successful X-PAYMENT validation; failures
on Slack webhook are logged at WARN, never block the request, never
double-charge the bot.
2026-05-22 14:16:41 +00:00
Viktor Barzin
58789cde8b kured(sentinel-gate): fix auth + write-perm so safety checks actually run
Test 3 validation surfaced two latent bugs in the sentinel-gate
DaemonSet that have been masked since 2026-04-18 (when uu was off,
nothing wrote /var/run/reboot-required, so the gate never had to
fire):

1. automount_service_account_token=false on both the SA and the
   pod spec → kubectl in the script falls back to localhost:8080
   on every call. Each check (`kubectl get nodes`, `kubectl get
   pods -n calico-system`, transition-time read) errors to stderr
   and emits empty stdout. `wc -l` reports 0 → checks "pass" with
   no real data.

2. bitnami/kubectl:latest runs as uid=1001 by default. The hostPath
   /var/run is root:root 0755 → final
   `touch /host/var-run/gated-reboot-required` failed with EACCES.
   Fail-safe by accident — but if anything had ever loosened those
   perms, the broken checks above would have green-lit the gate
   with no real validation.

Fix: enable token mount on the SA + pod, set
securityContext.run_as_user=0 on the container.

Verified post-fix: kubectl returns all 5 nodes, touch succeeds,
sentinel-gate now reports the correct
`BLOCKED: A node transitioned Ready within the last 24 hours
(soak window)` when triggered with k8s-node1's recent reboot
within the cool-down period.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
a2377a38df scripts: cluster_healthcheck defaults to ~/.kube/config
The previous default of $(pwd)/config required running the script from
the infra/ directory or always passing --kubeconfig. From a parent
shell or any other working directory, the lookup hit a non-existent
file and kubectl returned a stale-token error, masking real check
results.

Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to
$(pwd)/config for backwards compatibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
f5b1fb179a docs: add k8s node auto-upgrade runbook + architecture section
The OS-side counterpart to the service-upgrade pipeline. Covers
the unattended-upgrades + kured + sentinel-gate + Prometheus
halt-on-alert design landed in c0991f7f8.

Runbook: ops procedures (verify health, halt rollout, restore
config to a re-imaged node, roll back a bad upgrade, investigate
which alert is blocking).

Architecture doc: extends the existing service-upgrade flow with
a "K8s Node OS Upgrades" section (stack, sources of truth, day-2
mechanism, why-this-design rationale tied to the March 2026
post-mortem).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
278ef5f19b monitoring(grafana): swap python3 for jq in folder-ACL local-exec
CI image (ci/Dockerfile) is alpine + jq, no python3. The
grafana_admin_only_folder_acl null_resource was parsing /api/folders
with a python3 oneliner, which crashed every CI apply with
"python3: command not found" and made every monitoring stack apply
fail in CI (worked locally because the dev VM has python3).

jq is already in the CI image and produces the same output.
2026-05-22 14:16:41 +00:00
Viktor Barzin
b99e30e798 docs/plans: 2026-04-20 infra audit design (post-research, post-challenge)
Adds the infra audit plan: 5 parallel research agents (Reliability,
Declarative, Maintenance, Scalability, Security) → 91 raw findings →
2 independent challengers → filtered/corrected/ranked backlog.

Already incorporates the challenger corrections (drops bad metric
pulls, reframes intentional-by-design items). Source for several
follow-ups already shipped this week (kured-prometheus gating, NFS
fsid post-mortem fixes, Authentik outpost postgres-backend).
2026-05-22 14:16:41 +00:00
Viktor Barzin
5c0ea96a91 infra: re-enable unattended-upgrades with kured prometheus-gating
Reverses the March 2026 outage mitigation that disabled unattended-
upgrades cluster-wide. Now re-enables it on the k8s template VM with:

  - Allowed-Origins limited to security/updates pockets
  - Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark
    hold on the cluster-critical components)
  - Automatic-Reboot disabled — kured drives the actual reboots
  - Compatible with the existing kured + sentinel-gate flow

kured side:
  - rebootDelay 30s, concurrency 1
  - Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak
    window from the post-mortem)
  - prometheusUrl + alertFilterRegexp wired so any firing non-ignored
    alert halts the rollout. Ignore-list excludes self-referential
    alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/
    InfoInhibitor) that would otherwise deadlock kured.

Prometheus side (already partly landed in 6c4e0966 — the "Upgrade
Gates" rule group):
  - Refine `KubeQuotaAlmostFull` to include the resourcequota label in
    both the on-clause and the summary, so multi-quota namespaces
    (authentik, beads-server, frigate) report the quota name correctly.

grafana.tf: terraform fmt whitespace only.

Together with the post-mortem 2026-03-22 (memory id=390) the loop is
closed: unattended-upgrades runs again, kernel-class updates can land,
but only when cluster health is green and the reboot window is open.
2026-05-22 14:16:41 +00:00
Viktor Barzin
fe75fad467 monitoring: protect grafana ingress with authentik + disable anonymous
- add traefik-authentik-forward-auth to grafana ingress middleware list
- disable auth.anonymous (was Viewer-by-default for the public)
- enable auth.proxy with X-authentik-username so Authentik users get
  signed in seamlessly (no double-login UX)

Prometheus and Alertmanager already had forward-auth — no change.
2026-05-22 14:16:41 +00:00
Viktor Barzin
6c294d4bb0 authentik: zero-endpoints alert + upgrade-validation checklist
Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on
sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which
is the symptom of the auth-proxy Emergency-Access fallback firing —
in turn caused by zero ready endpoints on the outpost service.

Why this rule and not `kube_endpoint_address_available == 0`:
kube-state-metrics endpoint metrics exist as series names but never
have current values in this Prometheus pipeline (something is dropping
them silently). Detecting the failure at the edge via Traefik is more
reliable than instrumenting the broken middle.

Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex
— the service label is `authentik-ak-outpost-...`, not
`authentik-authentik-outpost-...`, so the alert never matched any
series and never could have fired. Verified in Prometheus before/after
the fix.

Add an "Upgrade Validation Checklist" section to
`.claude/reference/authentik-state.md` with the seven-step smoke test
to run after Authentik chart bumps, provider bumps, or outpost pod
recreation. Covers the brittle surfaces (Service selector, JSON
patches, postgres backend wiring, access_token_validity TTL, edge
auth flow, plan-to-zero).
2026-05-22 14:16:41 +00:00
Viktor Barzin
dc87a9bffe infra/instagram-poster: shared CNPG-backed benchmark DB, no PVC for scores
The instagram_poster.benchmark CLI was writing scores to a sqlite file
on the pod's data PVC. Moving it to the shared CNPG cluster so the
benchmark scoring path is stateless on the pod, scores survive pod
recreation, and the rotation/backup pipeline applies automatically.

- dbaas: null_resource.pg_instagram_poster_db creates role + DB
  (idempotent CREATE IF NOT EXISTS, password placeholder) — same
  shape as pg_postiz_dbs / pg_wealthfolio_sync_db.
- vault: vault_database_secret_backend_static_role.pg_instagram_poster
  + add to allowed_roles. 7d rotation_period.
- instagram-poster: second ExternalSecret (vault-database store) →
  K8s Secret instagram-poster-benchmark-db with BENCHMARK_PG_HOST/
  PORT/USER/PASSWORD/DATABASE. env_from on the deployment.
  reloader.stakater.com/match=true bounces the pod on rotation.

Code-side: instagram_poster/benchmark.py now resolves the DB URL from
BENCHMARK_DB_URL or BENCHMARK_PG_* env vars; falls back to sqlite for
local DevVM scratch runs. Schema bootstraps via Base.metadata.create_all,
no alembic step needed for the benchmark-only side-DB.

Verified end-to-end via DevVM port-forward: ESO synced, K8s Secret has
all 5 fields, pod env shows BENCHMARK_PG_*, smoke-test scoring 3 photos
landed in the new PG table with subject_category populated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
93ee45bd25 docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items
Update `.claude/reference/authentik-state.md`:
  - Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session
    Duration table with the gotcha that the gorilla session store binds
    the value once at outpost startup (rollout restart needed).
  - Replace the "session storage moved to Postgres in 2025.10" note that
    falsely implied the migration was automatic — explain that the
    `Outpost.managed` field gates the postgres path and our outpost
    silently stayed on `FilesystemStore` until 2026-05-10.
  - Document the goauthentik 2026.2.2 service-selector bug
    (service.py:52) and the JSON-patch workaround.
  - Document that the standalone embedded-outpost deployment needs
    `AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the
    `app.kubernetes.io/component=server` pod label.
  - Note the "Terraform doesn't expose `Outpost.managed`" assumption
    that holds the `managed=embedded` value in place across applies.

Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`:
  - P2 codify-in-Terraform: DONE.
  - P3 access_token_validity reduce: DONE-alt (we did the opposite —
    bumped to 4 weeks — because postgres backend mooted the storage
    concern).
  - P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses
    the loss-of-state class on the embedded outpost itself).
2026-05-22 14:16:41 +00:00
Viktor Barzin
94dfbb9a9c state(vault): update encrypted state 2026-05-22 14:16:41 +00:00
Viktor Barzin
fbf97dfc5c state(dbaas): update encrypted state 2026-05-22 14:16:41 +00:00
Viktor Barzin
1fcf911269 authentik/pgbouncer: image_pull_policy IfNotPresent -> Always (match live)
The HCL declared `IfNotPresent` since module creation but the live
deployment reconciled to `Always` somewhere along the way (likely a
Helm/operator default). Since the image is `:latest`, `Always` is the
correct value — `IfNotPresent` would skip pulling updated images on
pod restart, defeating the point of the floating tag.

Drops the lone remaining drift in the authentik stack so plan-to-zero
holds across the whole stack, not just the resources I just adopted.
2026-05-22 14:16:41 +00:00
Viktor Barzin
24795ec203 authentik: codify proxy provider TTL + adopt embedded outpost
Bump access_token_validity to weeks=4 (was hours=168, UI-managed in
ignore_changes). Drives the cookie Max-Age and the proxysession.expires
TTL — keeps users logged in for 28d instead of 7d.

Adopt the embedded outpost into Terraform so the postgres-session-backend
fix from earlier today (2026-05-10) is described as code:
  - kubernetes_json_patches.deployment carries dshm 2Gi tmpfs, resource
    requests/limits, the app.kubernetes.io/component=server pod label
    (workaround for goauthentik 2026.2.2 service.py:52 selector mismatch
    on standalone embedded outposts), and AUTHENTIK_POSTGRESQL__* envFrom
    the shared `goauthentik` Secret so the postgres session backend can
    connect to the dbaas cluster.
  - kubernetes_json_patches.service replaces the controller-set selector
    (which targets app.kubernetes.io/name=authentik / the goauthentik-server
    pods) with the outpost's own labels — without this, endpoints are
    empty and auth-proxy falls back to Basic-Auth realm "Emergency Access".

The `managed` field ("goauthentik.io/outposts/embedded") is server-set
and not in the Terraform provider's schema, so TF preserves it across
applies (writes only fields it knows about). Plan-to-zero verified.
2026-05-22 14:16:41 +00:00
Viktor Barzin
63fc1e00de infra/compute: bump k8s-node1 RAM 32 -> 48 GiB
Reason: GPU multi-tenancy (frigate + ytdlp-highlights + llama-swap +
immich-ml) was hitting 94% memory-request saturation on the old size.
The benchmark on 2026-05-10 surfaced this when llama-swap stayed
Pending despite GPU time-slicing being on (nvidia.com/gpu replicas=100)
- the actual constraint was node1 RAM, not GPU.

Procedure: drained node1, qm shutdown 201, qm set 201 --memory 49152,
qm start 201, kubelet picked up new capacity (47 GiB / 45.5 GiB
allocatable), uncordon, restored llama-swap + immich-ml.

Out-of-band qm set is the path here (not Terraform) because VMID 201
is intentionally not managed by TF yet - the telmate/proxmox provider
trips on iSCSI-disked VMs (see infra/stacks/infra/main.tf line 442).
Adopt this VM into TF once we migrate to bpg/proxmox.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
6e7fe96a40 infra/llama-cpp: benchmark report + -fa flag fix
Phase 7 of the vision-LLM benchmark plan. Adds:

- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
  per-model analysis, top-N agreement, cost vs cloud APIs, sample
  captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
  100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
  for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
  of the flash-attention flag; without the value llama-server exits
  before serving any request).
- llama-cpp.md architecture doc links the report so future operators
  land on the deployed-and-evaluated model from one entry point.

300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
3da01e6e1e anubis: only challenge GET requests; allow everything else
PrivateBin's XHR `POST /` (paste creation) was the trigger — Anubis's
catch-all CHALLENGE rule served an HTML challenge page where the JS
expected JSON, breaking paste creation entirely. Same shape will hit
any SPA XHR or CORS preflight on the other 8 Anubis-fronted sites
(homepage actions, kms upload-then-poll, wrongmove search refresh,
jsoncrack share, etc.) the moment it gets exercised.

Add an `ALLOW` rule keyed on `method != "GET"` between the AI/UA-block
imports and the catch-all CHALLENGE. Rationale:

  * AI scrapers consume GET response bodies — they don't POST.
  * State-mutating XHRs and OPTIONS preflight need to bypass the
    challenge or the app breaks.
  * CrowdSec + per-route rate-limit + app-level auth already cover
    abuse on mutating methods, so this gives up nothing.
  * Hard-deny rules for known-bad bots run first, so a declared bad
    bot can't sneak through by sending a POST.

Also added a `checksum/policy` annotation on the Anubis pod template
sourced from `sha256(coalesce(var.policy_yaml, default_policy_yaml))`
so future policy changes auto-roll the deployment instead of needing
a manual `kubectl rollout restart`.

f1-stream had its own policy override (path carve-outs for SvelteKit
asset hashes and JSON data routes); mirrored the new rule there too.

Applied to all 8 Anubis-fronted stacks: blog, kms, f1-stream,
travel_blog, real-estate-crawler, homepage, cyberchef, jsoncrack.
Verified per stack: GET / returns the Anubis challenge page; POST,
PUT, DELETE, OPTIONS pass through to the backend (HTTP 301/405/502
from the upstream app, never the Anubis "not a bot" HTML).
2026-05-22 14:16:40 +00:00
root
ff3d64159a Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:40 +00:00
Viktor Barzin
1f0bd11d3f privatebin: drop Anubis — broke XHR paste creation
PrivateBin's UI POSTs the encrypted blob to `/` via XHR. With Anubis in
front, the catch-all CHALLENGE rule returned an HTML challenge page
where the JS expected JSON, so paste creation failed silently for every
user. The challenge cookie didn't bypass it — Anubis appears to issue a
fresh challenge on POST regardless of cookie state.

Pastes are client-side encrypted; AI scrapers gain nothing from
indexing them, so the default `anti_ai_scraping` middleware is enough
protection. Restoring the ingress to point straight at the privatebin
service. CSP `wasm-unsafe-eval` retained — PrivateBin's zlib.wasm
needs it independent of Anubis.

This matches the rule already documented in infra/.claude/CLAUDE.md:
"DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints — clients
without JS can't solve PoW." A SPA's XHR is the same shape.

Verified: GET / returns PrivateBin HTML (not the Anubis challenge),
POST / returns PrivateBin's own JSON error envelope.
2026-05-22 14:16:40 +00:00
Viktor Barzin
9c617e6d38 infra/llama-cpp: add stack — llama-swap fronting Qwen3-VL + MiniCPM-V
Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three
GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one
OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc.
Idle TTL 10min so models unload between benchmark batches.

Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot
download Job pulls Q4_K_M GGUF + mmproj per model, creates stable
model.gguf / mmproj.gguf symlinks so the llama-swap config is
filename-agnostic, then warms the kernel page cache.

GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml
to 0 during benchmark windows. wait_for_rollout=false so apply
doesn't block on GPU availability.

Initial use case: vision-LLM benchmark for instagram-poster
candidate scoring; future consumers (HA, agentic tooling) hit
the same endpoint via LiteLLM at the gateway.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:40 +00:00
Viktor Barzin
0752bd49c8 kms: document native DNS auto-discovery (no client config needed)
LAN clients with DNS suffix viktorbarzin.lan now activate with zero
configuration — Windows queries _vlmcs._tcp.viktorbarzin.lan SRV by
default and the chain resolves through vlmcs.viktorbarzin.lan to the
new 10.0.20.202 KMS IP.

DNS state (Technitium primary, replicated to secondary+tertiary by the
existing technitium-zone-sync CronJob every 30 min):
- _vlmcs._tcp.viktorbarzin.lan SRV 0 0 1688 vlmcs.viktorbarzin.lan
  (was: target=kms.viktorbarzin.lan)
- vlmcs.viktorbarzin.lan A 10.0.20.202   (added)
- kms.viktorbarzin.lan A 10.0.20.200      (unchanged — still the
  Traefik LB for the user-facing website at kms.viktorbarzin.lan/)

vlmcs.viktorbarzin.lan was added as a dedicated KMS-server hostname
rather than retargeting kms.viktorbarzin.lan so the LAN-direct website
keeps working without depending on hairpin NAT through pfSense.

Verified end-to-end on WIN10Pro-DS32 (192.168.1.230):
slmgr /ckms → slmgr /ato → "Product activated successfully" with
"KMS machine name from DNS: vlmcs.viktorbarzin.lan:1688" and
"KMS machine IP address: 10.0.20.202". Real client IP 192.168.1.230
appears in vlmcsd log and in the slack-notifier sent line; second
activation within the dedup window correctly increments
kms_activations_dedup_skipped_total.
2026-05-22 14:16:40 +00:00
Viktor Barzin
d85b54d89d kms: per-connection state in notifier (vlmcsd is multi-threaded)
Bug found via E2E test against the Windows VM (VMID 300). The single
shared `state` dict in slack-notifier.py worked when vlmcsd processed
one connection at a time, but real Windows KMS activations hold the
connection open ~30 seconds (handshake + keep-alive). During that
window vlmcsd accepts other concurrent connections — most relevantly
the new kubelet TCP readiness probe every 5s — and each new OPEN line
reset the shared state, wiping the in-flight activation's
app/product/host before its CLOSE arrived. Result: real activations
were misclassified as probes (no Slack post, no metric increment).

Fix: state is now a dict keyed by `ip:port` with one sub-dict per
in-flight connection. A `__current` pointer tracks the most recent
OPEN so unkeyed detail lines (Application ID, Workstation name, etc.)
can be attributed correctly — vlmcsd writes detail lines immediately
after the OPEN and before any subsequent OPEN, so the heuristic holds.
Orphan CLOSEs (notifier started mid-conn) are now silently dropped
instead of emitting an empty probe event.

Two new regression tests:
- test_kubelet_probe_during_long_activation: 5s probe interleaved into
  a 31s activation block — exact production failure mode.
- test_orphan_close_no_event: bare CLOSE without prior OPEN.

Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato
on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier
posted to Slack with ip=192.168.1.230 source=external
product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan'
and kms_activations_total{product=Windows 10 Professional,
status=Licensed} 1 — real WAN client IP preserved through the
ETP=Local + dedicated MetalLB IP chain end to end.
2026-05-22 14:16:40 +00:00
Viktor Barzin
4a3ca572e8 fire-planner: imagePullPolicy=Always on alembic-migrate init container
After a rollout-restart, the main container (default Always for :latest)
pulled the new image with alembic 0003, but the init container
defaulted to IfNotPresent and reused a cached old image lacking 0003 →
"Can't locate revision identified by '0003'" → CrashLoopBackOff.

Setting Always on the init container so both containers stay in lockstep
across rollouts. Longer term we should switch the deployment to 8-char
git-SHA tags per the cluster policy in .claude/CLAUDE.md, but this
unblocks the Wave 1 deploy in the meantime.
2026-05-22 14:16:40 +00:00
Viktor Barzin
67b11a964a kms: dedicate MetalLB IP 10.0.20.202 + filter probe noise
Two coupled fixes for the hourly Slack noise + missing client IPs:

1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP
   10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real
   WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips
   kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19.
   Sharing 10.0.20.200 is blocked because all 10 services there are
   ETP=Cluster and MetalLB requires consistent ETP per shared IP.

2. Slack notifier now suppresses Slack posts for bare TCP open/close
   pairs (no Application/Activation block) — these are Uptime Kuma's
   port monitor and the new kubelet readiness/liveness probes. Probe
   counts go to a new metric kms_connection_probes_total{source} where
   source classifies the IP as internal_pod / cluster_node / external.
   Real activations are unaffected.

Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod
Ready on the listener actually being up — required for ETP=Local so
MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving.

pfSense side (applied separately, not codified):
- New alias k8s_kms_lb = 10.0.20.202 (KMS-only)
- WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb
- All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks,
  smtps, etc.) untouched

Runbook updated. Tests added for classify_source / is_probe / process_line.
2026-05-22 14:16:40 +00:00
283 changed files with 30513 additions and 5339 deletions

View file

@ -28,9 +28,16 @@ Violations cause state drift, which causes future applies to break or silently r
- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma)
- **New service**: Use `setup-project` skill for full workflow
- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
- `auth = "required"` — Authentik forward-auth gates every request. Use when the backend has **no built-in user auth** and Authentik is the only thing standing between strangers and the app (prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, foolery, any admin UI shipped without its own login).
- `auth = "app"` — the backend handles its own user authentication (NextAuth, Django, OAuth, bearer-token API, etc.); Authentik would only break it. No middleware attached; the app's own login is the gate. Examples: immich, linkwarden, tandoor, freshrss, affine, actualbudget, audiobookshelf, novelapp. **Functionally identical to `"none"`** — the distinct name exists to record intent at the call site.
- `auth = "public"` — Authentik anonymous binding via the dedicated `public` outpost (routes via `traefik-authentik-forward-auth-public``ak-outpost-public.authentik.svc:9000`). Strangers auto-bound to `guest`; logged-in users keep their identity in `X-authentik-username`. **Only works for top-level browser navigation** — CORS preflight rejects XHR/fetch and automation can't replay the cookie dance. Audit trail, not a gate.
- `auth = "none"` — no Authentik, no own-auth claim. Use for Anubis-fronted content (Anubis is the gate), native-client APIs (Git, `/v2/`, WebDAV/CalDAV, CardDAV), webhook receivers, OAuth callbacks, and Authentik outposts themselves.
- **Anti-exposure rule** (the reason `"app"` exists): only pick `"app"` or `"none"` AFTER you've verified the app has its own user auth (`"app"`) OR the endpoint is intentionally public (`"none"`). Default is `"required"` so accidental omission fails closed. **Convention**: when using `"app"` or `"none"`, add a comment line above the `auth = "..."` line stating what gates the app or why it's public. **Enforced by `scripts/tg`**: every `tg plan/apply/destroy/refresh` runs `scripts/check-ingress-auth-comments.py` against the current stack and aborts if any `auth = "app|none"` line lacks the preceding `# auth = "<tier>": ...` comment. Stack-scoped — untouched stacks aren't blocked until they're next edited.
- **Anti-AI**: on by default when `auth = "none"` or `auth = "app"` (no Authentik to discourage bots); redundant on `"required"` and `"public"`.
- **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
- **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<backend>.<ns>.svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct), declare a second `ingress_factory` with `ingress_path = ["/api"]` pointing at the bare backend service. Active on: blog, www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
- **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.200` to avoid hairpin NAT. Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
@ -129,7 +136,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` with `mysql:8.4` (migrated from InnoDB Cluster 2026-04-16). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (15Gi, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Old InnoDB Cluster + operator still in TF (Phase 4 cleanup pending). Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
## Monitoring & Alerting
@ -140,6 +147,17 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
## Security Posture (Wave 1 — locked 2026-05-18)
Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`.
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale.
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
## Storage & Backup Architecture
### Storage Class Decision Rule (for new services)
@ -177,7 +195,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "<service>-data-proxmox"
namespace = kubernetes_namespace.<ns>.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -213,7 +231,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "<service>-data-encrypted"
namespace = kubernetes_namespace.<ns>.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -269,7 +287,8 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
## Known Issues
- **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec <rev> -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`.
- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`. **`mcp.servers` baked into the ConfigMap-loaded openclaw.json gets stripped by `doctor --fix`** — register MCP servers via `openclaw mcp set <name> <json>` in the container startup command instead (CLI-written entries persist across doctor runs). Current servers wired this way: `ha`, `context7`, `playwright` (sidecar at `localhost:3000/mcp`).
- **OpenClaw memory-core indexes `/workspace/memory/`, not `/home/node/.openclaw/memory/`**: `/home/node/.openclaw/memory/main.sqlite` is the index store, NOT a content source. Files written under `/home/node/.openclaw/memory/projects/<x>/*.md` will NOT be indexed. To populate memory-core, write Markdown under `/workspace/memory/projects/<source>/` and run `openclaw memory index --force`. This is what the daily `memory-sync` CronJob in `stacks/openclaw/` does for claude-memory → OpenClaw sync.
- **Goldilocks VPA sets limits**: When increasing memory requests, always set explicit `limits` too — Goldilocks may have added a limit that blocks the change.
## User Preferences

View file

@ -0,0 +1,543 @@
---
name: k8s-version-upgrade-DEPRECATED
description: "DEPRECATED 2026-05-11 — replaced by the Job-chain in stacks/k8s-version-upgrade. See header below."
tools: Read, Write, Edit, Bash, Grep, Glob
model: opus
---
# DEPRECATED — Do NOT invoke this agent
Retired **2026-05-11** after a self-preemption incident: this agent ran inside
the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was
scheduled onto k8s-node4. When the agent tried to `kubectl drain k8s-node4`
(Stage 6, first worker), it evicted itself. The bash process died mid-SSH,
leaving node4 cordoned and the cluster half-upgraded (master at v1.34.7,
workers at v1.34.2).
## Replaced by
A chain of small Kubernetes Jobs, each pinned (via `nodeSelector` +
`kubernetes.io/hostname`) to a node that is NOT its drain target. No pod can
preempt itself because each Job's pod and its target node are always
different.
| Old | New |
|-----|-----|
| Single agent run in claude-agent-service pod | Chain of 7 phase Jobs (preflight → master → worker × 4 → postflight) |
| Whole pipeline in one prompt | Phase body in `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`, dispatched per-phase via `case $PHASE` |
| Detection CronJob POSTs to `claude-agent-service` | Detection CronJob renders Job 0 from `job-template.yaml` via `envsubst` + `kubectl apply` |
| Drain blocks indefinitely on PDB=0 (e.g. single-replica Anubis) | New `predrain_unstick` deletes PDB-blocked pods so drain proceeds |
| `K8sVersionSkew` + `EtcdPreUpgradeSnapshotMissing` alerts | Above + `K8sUpgradeStalled` (in_flight=1 and time()-started_timestamp > 5400s) |
## Where the logic lives now
- **`infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`** — universal
phase body. Dispatches on `$PHASE`. Each phase spawns the next Job.
- **`infra/stacks/k8s-version-upgrade/job-template.yaml`** — Job template
rendered by `envsubst` at runtime. ConfigMap-mounted at `/template` in
every Job pod.
- **`infra/stacks/k8s-version-upgrade/main.tf`** — Terraform stack: ConfigMaps,
unified `k8s-upgrade-job` ServiceAccount + RBAC, detection CronJob.
- **`infra/docs/runbooks/k8s-version-upgrade.md`** — operator runbook (kill a
stuck Job, skip a phase, manually re-trigger from a specific phase).
## Why kept (not deleted)
Documents the prompted-agent design and is useful as historical reference when
reading post-mortem discussions or comparing approaches. The `name` field has
been suffixed with `-DEPRECATED` so the agent cannot be invoked by name from
`claude-agent-service`.
---
# Original prompt — DO NOT EXECUTE (reference only)
You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA).
## Your Job
Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked.
The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die.
## Inputs
The user prompt contains a JSON object with these fields:
```json
{
"target_version": "1.34.5",
"kind": "patch",
"dry_run": false,
"stages": "all"
}
```
| Field | Required | Description |
|---|---|---|
| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. |
| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). |
| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. |
| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. |
Parse the prompt's first JSON block to extract these. If anything is missing, abort with a Slack notification ("malformed payload").
## Environment
- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var)
- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call)
- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth)
- **Etcd snapshot**: triggered as a one-shot Job from the existing `default/backup-etcd` CronJob (defined in `stacks/infra-maintenance/`). The Job runs on `k8s-master` with hostNetwork (so etcdctl reaches etcd at 127.0.0.1:2379), mounts the PV-backed NFS export `192.168.1.127:/srv/nfs/etcd-backup`, and writes `etcd-snapshot-<TIMESTAMP>.db` there. Do NOT shell into master with etcdctl directly — the cert paths + NFS mount are already wired into the CronJob.
- **Library script**: `/workspace/infra/scripts/update_k8s.sh` — pipe via SSH to each node, do NOT modify on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh --role <role> --release <X.Y.Z>`.
### Credentials — fetched at startup
The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run:
```bash
KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config"
# SSH private key — mode 0400 required by openssh
$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
-o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
chmod 400 /tmp/k8s-upgrade-ssh-key
# Slack webhook (URL string)
SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
-o jsonpath='{.data.slack_webhook}' | base64 -d)
```
The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template:
```bash
SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts"
```
Every SSH call below uses `$SSH wizard@<host> '<cmd>'`. `accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry.
## NEVER do
- Never bypass the halt-on-alert check — even if a single alert "looks unrelated"
- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed
- Never skip the etcd snapshot — even for patch
- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` only
- Never `apt-mark hold` something without unholding it first, and vice versa — the script handles this; don't do it manually
- Never run two stages in parallel — sequential only
- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing
- Never push to git, never modify Terraform, never invoke claude-agent-service recursively
## Slack + Pushgateway helpers
Every transition posts to Slack:
```bash
slack() {
local msg="$1"
local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}"
curl -sS -X POST -H 'Content-Type: application/json' \
--data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \
"$hook"
}
```
Start every message with `[k8s-upgrade]` so it's grep-able.
Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics:
```bash
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
push_metric() {
# push_metric <name> <value>
local name="$1" val="$2"
printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \
| curl -sS --data-binary @- "$PG"
}
```
Pushes you must make at specific stages (skipped in dry_run):
| When | Metric | Value |
|---|---|---|
| Stage 0 start | `k8s_upgrade_in_flight` | `1` |
| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` |
| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` |
| Stage 7 clean | `k8s_upgrade_in_flight` | `0` |
| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` |
If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state.
## Stage 0: Parse inputs + announce
1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON.
2. Derive `target_minor` from `target_version` (split on `.`).
3. Mark the in-flight annotation on the namespace AND push Pushgateway in-flight gauge:
```bash
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \
viktorbarzin.me/k8s-upgrade-target="$target_version" \
--overwrite
push_metric k8s_upgrade_in_flight 1
push_metric k8s_upgrade_snapshot_taken 0
fi
```
4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`.
## Stage 1: Pre-flight (`stages` includes `preflight`)
Skip if `stages` excludes `preflight`.
### Check 1.1 — All nodes Ready, no pressure
```bash
kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \
| jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'
```
Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True.
### Check 1.2 — Halt-on-alert (same query kured uses)
```bash
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
if [ -n "$ALERTS" ]; then
slack "ABORT preflight — firing alerts:\n$ALERTS"
exit 1
fi
```
### Check 1.3 — 24h-quiet baseline
Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout.
```bash
RECENT_REBOOT=0
while IFS= read -r ts; do
[ -z "$ts" ] && continue
diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
[ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break
done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
if [ "$RECENT_REBOOT" -eq 1 ]; then
slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
exit 1
fi
```
### Check 1.4 — kubeadm upgrade plan reports our target
```bash
PLAN_TARGET=$($SSH \
wizard@k8s-master 'sudo kubeadm upgrade plan' \
| grep -oE 'You can now apply the upgrade by executing the following command:.*v[0-9]+\.[0-9]+\.[0-9]+' \
| grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
```
If `$PLAN_TARGET` does not start with the requested `target_version`, slack-abort:
"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting."
Slack: `Pre-flight clean. Proceeding to etcd snapshot.`
## Stage 2: Etcd snapshot (`stages` includes `snapshot`)
Always run — patch OR minor. Triggers a one-shot Job from the existing `default/backup-etcd` CronJob and waits for it to complete.
```bash
JOB_NAME="pre-upgrade-etcd-${target_version}-$(date +%s)"
if [ "$dry_run" = "false" ]; then
$KUBECTL -n default create job --from=cronjob/backup-etcd "$JOB_NAME"
# Wait up to 10 min for snapshot Job to complete
$KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$JOB_NAME" || {
slack "ABORT Stage 2 — etcd snapshot Job did not complete in 10 min"
$KUBECTL -n default describe "job/$JOB_NAME" | tail -30
exit 1
}
# Parse the Job's pod log for "Backup done: <file> (<bytes> bytes)"
LOG=$($KUBECTL -n default logs "job/$JOB_NAME" -c backup-manage --tail=20)
echo "$LOG"
SNAPSHOT_LINE=$(echo "$LOG" | grep -E '^Backup done:')
SIZE=$(echo "$SNAPSHOT_LINE" | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+')
SNAPSHOT_FILE=$(echo "$SNAPSHOT_LINE" | awk '{print $3}')
if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then
slack "ABORT Stage 2 — etcd snapshot empty or missing (size='$SIZE' line='$SNAPSHOT_LINE')"
exit 1
fi
TARGET_PATH="nfs://192.168.1.127:/srv/nfs/etcd-backup/$SNAPSHOT_FILE"
$KUBECTL annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite
push_metric k8s_upgrade_snapshot_taken 1
else
TARGET_PATH="WOULD: trigger default/backup-etcd Job, wait, verify size"
SIZE="dry-run"
fi
slack "Etcd snapshot saved at $TARGET_PATH (size=$SIZE)"
```
## Stage 3: Master containerd skew fix (`stages` includes `containerd`)
Only run if master containerd version < highest worker containerd version.
```bash
get_ctr_version() {
$SSH \
"wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v'
}
MASTER_CTR=$(get_ctr_version k8s-master)
WORKER_MAX="0.0.0"
for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
v=$(get_ctr_version "$n")
# Compare semver-ish
if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then
WORKER_MAX="$v"
fi
done
if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \
&& [ "$MASTER_CTR" != "$WORKER_MAX" ]; then
# Master is behind — bump
slack "Master containerd $MASTER_CTR < workers $WORKER_MAX bumping master"
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master "sudo apt-mark unhold containerd.io \
&& sudo apt-get install -y containerd.io='$WORKER_MAX-1' \
&& sudo apt-mark hold containerd.io \
&& sudo systemctl restart containerd"
# Wait until kubelet on master is Ready again
for i in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
[ "$STATUS" = "True" ] && break
sleep 10
done
[ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
fi
slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready."
else
echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix"
fi
```
## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`)
Only run if `kind=minor`.
For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`:
```bash
target_minor="$(echo "$target_version" | awk -F. '{print $1"."$2}')"
if [ "$dry_run" = "false" ]; then
$SSH \
"wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
&& curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
&& sudo apt-get update"
fi
```
Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.`
## Stage 5: Master upgrade (`stages` includes `master`)
```bash
# 5.1 Drain
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
fi
# 5.2 Run the library script via SSH pipe
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master 'bash -s' \
< $WORKSPACE_DIR/scripts/update_k8s.sh \
-- --role master --release "$target_version"
fi
# 5.3 Uncordon + wait Ready
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master
fi
for i in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
sleep 15
done
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|| { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; }
# 5.4 All control-plane pods Running
NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \
-l 'tier=control-plane' --no-headers | grep -v Running | wc -l)
[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; }
# 5.5 Re-check halt-on-alert
# (re-run the Check 1.2 query, abort if anything new fires)
slack "Master upgrade complete. Cluster on v$target_version. Healthy."
```
## Stage 6: Workers sequentially (`stages` includes `workers`)
Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 last because it hosts GPU + Immich and benefits from the longest soak before any other worker is touched (ref: post-mortem-2026-03-16, memory id=570).
For each worker `$node`:
1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort.
2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
3. SSH pipe `update_k8s.sh --role worker --release $target_version`
4. `kubectl uncordon $node`
5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running.
6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed.
7. Slack: `Worker $node complete ($i/4)`.
```bash
WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
i=0
for node in $WORKERS; do
i=$((i+1))
# Halt-on-alert recheck with retry
for attempt in $(seq 1 30); do
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
[ -z "$ALERTS" ] && break
echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS"
sleep 60
done
[ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; }
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
$SSH \
"wizard@$node" 'bash -s' \
< $WORKSPACE_DIR/scripts/update_k8s.sh \
-- --role worker --release "$target_version"
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node"
fi
# Wait Ready + version match
for w in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
sleep 15
done
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|| { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; }
# 10-min soak with halt-on-alert
echo "Soaking $node for 10 min..."
for sec in $(seq 1 10); do
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \
| sort -u)
[ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; }
sleep 60
done
slack "Worker $node upgrade complete ($i/4). Soaked clean."
done
```
Note: during the soak we add `RecentNodeReboot` to the ignore-list because we KNOW we just rebooted-as-it-were that node (kubelet restart counts).
## Stage 7: Post-flight (`stages` includes `postflight`)
```bash
# All 5 nodes at target
VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \
-o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
echo "$VERSIONS"
WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l)
[ "$WRONG" -ne 0 ] && { slack "ABORT post-flight — $WRONG node(s) not on v$target_version:\n$VERSIONS"; exit 1; }
# Upgrade Gates all inactive
FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check):\n$FIRING"
# pod-ready ratio >= 0.9
RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \
--data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
| jq -r '.data.result[0].value[1] // "0"')
slack "Pod-ready ratio: $RATIO (target ≥ 0.9)"
# Clear the in-flight annotation + Pushgateway gauges
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path- || true
push_metric k8s_upgrade_in_flight 0
push_metric k8s_upgrade_snapshot_taken 0
fi
slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version."
```
## Rollback
This agent does NOT auto-rollback. If anything aborts mid-flight:
1. Slack the failure with the last known stage + node.
2. Leave the in-flight annotation in place (the operator clears it manually after triage).
3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section.
The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery.
## Notes for tests
- **Test 1 (CronJob dry-run)**: The CronJob has its own `--dry-run` env var that short-circuits before POST. This agent is not invoked.
- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: <cmd>" for each skipped mutation.
- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that.
- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence.
- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation.
## Edge cases
- **Slack down**: Don't block the upgrade — continue, log to stderr.
- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry.
- **kubectl drain hangs on a PDB-violating pod**: 5-min grace-period is hard. If drain fails, `kubectl drain --disable-eviction --force` is NOT a valid escalation here — slack-abort and let the operator investigate.
- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort.
- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook.
## Verification claims you must make
When you `slack` a SUCCESS message, you must have actually verified:
- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath`
- No alerts firing outside the ignore-list
- pod-ready ratio computed from Prometheus
Do not declare success without those three confirmations.

View file

@ -127,10 +127,65 @@ Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
Notes:
- There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
- Embedded outpost session storage moved from `/dev/shm` → Postgres table `authentik_providers_proxy_proxysession` in authentik 2025.10. The 2026-04-18 `/dev/shm`-fill outage class is no longer load-bearing in 2026.2.2; the `unauthenticated_age` cap is still the right lever for anonymous-session bloat from external monitors.
- `ProxyProvider.access_token_validity` and `remember_me_offset` stay UI-managed via `ignore_changes`.
- Embedded outpost session storage: PostgreSQL table `authentik_providers_proxy_proxysession` in authentik 2025.10+ (PR #16628), but **only when `IsEmbedded()` returns true** (i.e. `Outpost.managed == "goauthentik.io/outposts/embedded"`). Our outpost record had `managed=null` until 2026-05-10, which silently kept it on the gorilla `FilesystemStore` at `/dev/shm` (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see `authentik_outpost.embedded` in `authentik_provider.tf` and post-mortem `2026-04-18-authentik-outpost-shm-full.md`.
- The proxy outpost service has a known goauthentik 2026.2.2 bug (`internal/outpost/controllers/k8s/service.py:52`): for embedded outposts the controller sets the Service selector to `app.kubernetes.io/name=authentik` (the server pods), not `authentik-outpost-proxy`. We work around it via a `kubernetes_json_patches.service` patch on the outpost record (replaces `/spec/selector` with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm `Emergency Access`.
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
## Upgrade Validation Checklist
Run after **any** of these:
- Authentik chart version bump in `stacks/authentik/modules/authentik/main.tf` (the `version = "..."` line on `helm_release.authentik`).
- `goauthentik/authentik` Terraform provider version bump.
- Outpost pod recreation (kured reboot, eviction, manual `rollout restart`, scheduler move).
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
```bash
# 1. Service routes to the outpost pod (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes
# `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
# `name: authentik`, the goauthentik upstream bug came back or our
# JSON patch was unset.
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
# 3. Outpost mode + session backend. Expected log lines on startup:
# {"embedded":true,"event":"Outpost mode",...}
# {"event":"using PostgreSQL session backend",...}
# If embedded=false or `using filesystem session backend`, the postgres
# fix is broken — likely `Outpost.managed` got cleared, or the upstream
# schema started exposing `managed` and TF reset it.
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3
# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
# A row count > a few dozen indicates filesystem fallback is firing.
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
# 5. Postgres session table is growing with traffic. Expected: rows with
# `expires` ~28 days out (matches access_token_validity = weeks=4).
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
from django.db import connection; c = connection.cursor()
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
print(c.fetchone())"
# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
| grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
# 7. Terraform plan-to-zero on the whole authentik stack.
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
```
Steps 1, 3, 6 cover the failure modes the Prometheus alerts trigger on (`AuthentikForwardAuthFallbackActive`, `AuthentikOutpostForwardAuth400Spike`). Steps 4 and 5 cover the silent-regression case (filesystem fallback) where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.
If step 2 shows the controller restored `app.kubernetes.io/name=authentik`, watch goauthentik/authentik issue tracker for fixes around `internal/outpost/controllers/k8s/service.py:52` — the upstream patch might let us drop our `kubernetes_json_patches.service` workaround.

View file

@ -53,6 +53,7 @@
| insta2spotify | Instagram reel song ID to Spotify playlist | insta2spotify |
| trading-bot | Event-driven trading with sentiment analysis | trading-bot |
| claude-memory | Persistent memory MCP server | claude-memory |
| paperless-mcp | Paperless-ngx document search MCP (barryw/PaperlessMCP). Traefik bearer auth via Aetherinox api-token-middleware. `auth=none` at ingress; gateway-level bearer enforced by `paperless-mcp/bearer-auth` Middleware CRD. Tokens + paperless API token in Vault `secret/paperless-mcp`. | paperless-mcp |
| council-complaints | Islington civic reporting pilot | council-complaints |
## Optional
@ -78,6 +79,7 @@
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| aiostreams | Stremio stream aggregator (Real-Debrid + Torrentio/Comet/MediaFusion/StremThru/Knaben). `auth=app` (own UUID+password); canary stream-probe + 3 alerts; weekly NFS config + Stremio-account-collection backups to `/srv/nfs/aiostreams-backup/`. PG-backed user config. | servarr/aiostreams |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier — detects new versions, fires webhook to n8n upgrade agent | diun |

View file

@ -7,8 +7,9 @@ description: |
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 42 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability) with safe auto-fix for evicted pods.
Runs 44 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability, PVE host thermals + load) with safe
auto-fix for evicted pods.
author: Claude Code
version: 2.0.0
date: 2026-04-19
@ -66,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
## What It Checks (42 checks)
## What It Checks (44 checks)
| # | Check | Notes |
|---|-------|-------|
@ -112,6 +113,8 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL 83 °C (TjMax) |
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL 38 of 44 threads |
## Safe Auto-Fix Rules
@ -256,9 +259,9 @@ kubectl logs -n external-secrets deploy/external-secrets --tail=100
kubectl get pods -n cloudflared
kubectl logs -n cloudflared -l app=cloudflared --tail=100
# Authentik
kubectl get pods -n authentik -l app=authentik-server
kubectl logs -n authentik -l app=authentik-server --tail=100
# Authentik (Helm chart names the deployment goauthentik-server)
kubectl get deployment -n authentik goauthentik-server
kubectl logs -n authentik deploy/goauthentik-server --tail=100
# ExternalAccessDivergence alert
kubectl exec -n monitoring deploy/prometheus-server -- \
@ -295,6 +298,133 @@ kubectl exec -n monitoring deploy/prometheus-server -- \
- Exit code 143 → SIGTERM / graceful shutdown failed
3. Cross-check dbaas + NFS + secrets are healthy.
## Performance forensics — top consumers + optimization hints
When the cluster is healthy (script returns 0) but the host is hot or load
is elevated, switch from "what broke?" to "what's expensive?". Run these
in order; stop as soon as the root cause is obvious.
### Step 1 — Snapshot top consumers cluster-wide
```bash
# Top 15 pods by current CPU
kubectl top pods --all-namespaces --sort-by=cpu --no-headers | head -15
# Top 5 nodes by CPU + memory pressure
kubectl top nodes
# Top 15 by 5-min rolling rate (smoothed — kills noise from one-off spikes)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(namespace,pod)%20(rate(container_cpu_usage_seconds_total%7Bcontainer!%3D''%7D%5B5m%5D)))" \
| python3 -m json.tool | head -80
```
### Step 2 — For each suspect pod, get the WHY
For every pod in the top-N, gather these BEFORE proposing a fix:
```bash
NS=<namespace>; POD=<pod>; CONT=$(kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].name}')
# What it does (image + command)
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].args}{"\n"}'
# Resource limits + current usage
kubectl -n $NS top pod $POD --containers
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'
# Recent logs filtered for reconcile loops, watch storms, slow queries
kubectl -n $NS logs $POD -c $CONT --tail=200 --since=5m 2>&1 \
| grep -iE 'reconcil|watch|scrape|index|loop|retry|slow|timeout' | tail -20
# Restart count + recent OOM
kubectl -n $NS describe pod $POD | grep -E 'Restart Count|Last State|Reason'
# Self-exported metrics (for apps that publish on /metrics)
kubectl -n $NS exec $POD -c $CONT -- wget -qO- localhost:<port>/metrics 2>/dev/null | head -50
```
### Step 3 — apiserver / etcd specific deep-dive (when control-plane is hot)
```bash
# Top request producers by verb+resource (last 30 min)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(resource,verb)%20(rate(apiserver_request_total%5B30m%5D)))" \
| python3 -m json.tool
# Top user agents (which clients are hammering)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(user_agent)%20(rate(apiserver_request_total%5B30m%5D)))" \
| python3 -m json.tool
# Long-running requests (WATCH / CONNECT — log streams, pod-watchers)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=apiserver_longrunning_requests" \
| python3 -m json.tool
# etcd write rate + DB size
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=rate(etcd_disk_wal_fsync_duration_seconds_count%5B5m%5D)" \
| python3 -m json.tool
```
### Step 4 — PVE host specific deep-dive (when temp / load is high)
Checks 43 + 44 capture package temp + 5-min load avg with PASS/WARN/FAIL
thresholds — that's the first stop. When those WARN or FAIL, the
follow-up commands below trace which VM / process is the source:
```bash
# Per-core temps (broader than the package summary in check 43)
ssh root@192.168.1.127 'for f in /sys/class/hwmon/hwmon0/temp*_input; do
base=${f%_input}; label=$(cat ${base}_label 2>/dev/null || echo "${base##*/}")
val=$(cat "$f"); echo " $label: $((val/1000))°C"
done'
# Per-VM CPU (each VM = one kvm process)
ssh root@192.168.1.127 'top -bn1 -o %CPU | grep kvm | head -10'
# pvestatd anomaly check — bursts > 50% usually mean LV count > 1000
ssh root@192.168.1.127 'lvs --noheadings 2>/dev/null | wc -l'
# Stale snapshots (any '_pre-*' that survived past their rollback window)
ssh root@192.168.1.127 'lvs --noheadings -o lv_name 2>/dev/null | awk "/_pre-/" | head -20'
```
### Step 5 — Optimization decision
For each consumer in the top-N, fill in a row:
| Pod / Process | CPU (m) | Why busy | Tunable | Est saving | Trade-off | Effort |
|---|---|---|---|---|---|---|
Then rank by ROI (saving / effort) and surface the top 3-5. **Hold back the ones where saving < 50m unless effort is also < 5 min.**
### Common causes + tunables (catalogue)
| Symptom | Likely cause | Tunable |
|---|---|---|
| **`kube-apiserver` > 1 core sustained** | `CONNECT pods/log` streams from `alloy`/`promtail` using apiserver-tail; OR Kyverno PolicyReport churn (background+enforce mode); OR VPA fanout (309 VPAs cause ~7 req/s) | Switch alloy/promtail to `loki.source.file`; raise Kyverno `backgroundScanInterval`; reduce VPA count |
| **`pvestatd` 70-100% bursts** | LV metadata scan over > 1000 LVs (typically stale `_pre-*` snapshots from ad-hoc node ops) | Delete stale snapshots; `/usr/local/bin/lvm-pvc-snapshot prune` |
| **Frigate > 2 cores** | Birdseye `mode: continuous` (16% on frigate.output); LPR debug; debug logging; too many active cameras × detect.fps | `birdseye.mode: motion`; `lpr.debug_save_plates: false`; remove debug loggers |
| **`vault-0` looping ERRORs every ~10s** | DB static-role not in connection's `allowed_roles` list (drift between role and connection) | Add role to `vault_database_secret_backend_connection.*.allowed_roles` in TF |
| **Alloy DS > 100m/pod** | `loki.source.kubernetes` (apiserver-tail) instead of `loki.source.file` | Switch to file-tail (~5× drop per pod) |
| **Prometheus default 1m scrape** | Chart default; new sample every minute | Raise `server.global.scrape_interval` to 2m; pin critical jobs (snmp-ups) to 30s; bump `for: 1m` alerts to `for: 3m` |
| **`kube-controller-manager` periodic ERROR loop** | Aggregated APIService discovery fails (calico/metrics-server unreachable, OR stuck Terminating pod still in endpoints) | Force-delete stuck pod; verify APIService Available; check pod runc bug on k8s-master |
| **etcd write > 1 MB/s** | PolicyReport thrash, too-frequent secret rotation, or audit log mode = RequestResponse | Trim Kyverno reports config; raise rotation_period; downgrade audit policy to Metadata for noisy resources |
### What NOT to touch
- **calico-node, etcd write rate, kube-controller-manager core work, pg-cluster replication** — structural cost, touching them risks correctness.
- **Pods doing legitimate request-serving work** (web servers, databases under load) — optimize the workload, not the runtime.
- **Anything where Goldilocks VPA upperBound is already close to current request** — no headroom to cut.
### Source-of-truth notes
- **All infra mutations go via Terraform** (`scripts/tg plan/apply`). The recipes above are diagnostic; the FIX lives in `infra/stacks/<name>/main.tf` or chart values.
- **Pod-internal config files** (e.g., Frigate's `/config/config.yml` on a PVC) are not TF-managed — edit in-pod and document in `infra/docs/runbooks/`.
- **PVE host-level state** (LVM snapshots, pvestatd) — SSH + manual ops; record in memory if the pattern recurs.
## Notes on the canonical / hardlink setup
The authoritative copy of this SKILL.md lives at

View file

@ -0,0 +1,199 @@
---
name: upgrade-state
description: |
Audit the three autonomous-upgrade pipelines (apps via Keel, OS via
unattended-upgrades+kured, K8s components via the version-check chain).
Use when:
(1) User asks "/upgrade-state" or "are we current",
(2) User asks "what's pending upgrade" or "what's the upgrade state",
(3) User asks if Keel / kured / k8s-version-check is healthy,
(4) User asks about kept-back / held packages or pending reboots,
(5) Periodic survey before the next `k8s-version-check` daily run.
Read-only — no `--fix`. Exits 0 healthy / 1 attention / 2 stalled.
author: Claude Code
version: 1.0.0
date: 2026-05-18
---
# Upgrade-state
## MANDATORY: Run the script first
When this skill is invoked, your **first action** must be to run
`upgrade_state.sh` and reason over its output before doing anything
else. Do NOT improvise individual `kubectl` / `ssh` calls — the script
is the authoritative surface.
```bash
bash /home/wizard/code/infra/scripts/upgrade_state.sh
```
For programmatic use:
```bash
bash /home/wizard/code/infra/scripts/upgrade_state.sh --json | tee /tmp/upgrade-state.json
```
Then:
1. Report the rendered table verbatim — it answers the user's
"are we current" question in three lines.
2. For every `⚠` or `✗` row, surface the relevant drill-down lines
underneath and propose a next action (links in the table below).
3. Only reach for ad-hoc commands when investigating beyond what the
script reported.
Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
## What it covers (3 pipelines)
| Layer | What runs | Cadence | Data sources |
|---|---|---|---|
| **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
| **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
The K8s pipeline pushes a small set of gauges to the Prometheus
Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
- `k8s_upgrade_available{kind="patch"|"minor",target=…}` — 1 if newer release detected
- `k8s_version_check_last_run_timestamp` — when detection last ran
- `k8s_upgrade_in_flight` — 0/1
- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has
been running >90 minutes. The script raises `✗` in the same window.
## Status-icon legend
| Icon | Meaning |
|---|---|
| `✓` | Healthy, fully current |
| `→` | Update available, not yet applied (K8s patch/minor) |
| `…` | In flight — chain currently running |
| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
| `✗` | Broken: pod down, alert firing, chain stalled |
## Drill-down — when a row trips, what to do
### Apps `⚠` — pending approvals or errors
```bash
# Read recent Keel log lines
kubectl -n keel logs deploy/keel --since=24h --tail=200
# What is Keel currently tracking?
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=count by (image) (registries_scanned_total)'
# Is the scrape live?
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kubernetes-pods",app="keel"}'
```
Common Keel errors:
- `failed to add image watch job` — image annotation mistyped (rare; Kyverno auto-injects)
- `registry authentication required` — bad imagePullSecret on the watched Deployment
- `bad tag pattern` — Keel can't parse the watched image's tag against its policy
### OS `⚠` — held packages with bumps
The script flags any package held via `apt-mark hold` that ALSO appears
in `apt list --upgradable` — excluding k8s components (the K8s pipeline
owns those) and the kernel (kured handles the reboot half).
Typical cause: a major-version bump (e.g. containerd 1.7 → 2.2,
runc 1.1 → 1.4). These are held because they need cluster-wide
coordination, not silent in-release patching.
```bash
# Inspect the situation on the flagged node
ssh wizard@10.0.20.10X 'apt-mark showhold; apt list --upgradable 2>/dev/null'
# Unhold + upgrade a specific package
ssh wizard@10.0.20.10X 'sudo apt-mark unhold containerd && sudo apt-get install -y containerd'
```
Node IPs: master=`100`, node1=`101`, node2=`102`, node3=`103`, node4=`104`.
### OS `⚠` — pending reboot
A node has `/var/run/reboot-required`. Kured will reboot it inside the
next 02:00-06:00 London window (any day of the week).
```bash
# Force a manual reboot inside the window (rare)
kubectl drain k8s-nodeX --delete-emptydir-data --ignore-daemonsets
ssh wizard@10.0.20.10X sudo systemctl reboot
```
### OS `✗` — kured not Running
```bash
kubectl -n kured get pods
kubectl -n kured logs daemonset/kured --tail=100
# Verify sentinel gate (kured-sentinel-gate DaemonSet writes /var/run/gated-reboot-required)
kubectl -n kured get pods -l name=kured-sentinel-gate
```
### K8s `→` — patch/minor available
Detection ran, target identified, chain NOT started. The chain spawns
on the same daily detection cycle — typically within ~24h of the
target first being detected.
```bash
# Inspect Pushgateway state
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep ^k8s_upgrade
# Trigger a manual run of the detection CronJob
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
```
### K8s `…` — in flight
The Job chain is running. Watch its progress:
```bash
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 --prefix
```
### K8s `✗ stalled``K8sUpgradeStalled` would fire
Chain in-flight >90m. The Job is most likely stuck on drain or a
pre-flight check.
```bash
kubectl -n k8s-upgrade get jobs
kubectl -n k8s-upgrade describe job <stuck-job>
kubectl -n k8s-upgrade logs job/<stuck-job> --tail=300
# If you need to clear the in-flight flag (after diagnosing):
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -c \
"printf 'k8s_upgrade_in_flight 0\nk8s_upgrade_started_timestamp 0\n' | \
wget -qO- --post-file=- 'http://prometheus-prometheus-pushgateway:9091/metrics/job/k8s-version-upgrade' \
--header='Content-Type: text/plain'"
```
### K8s `✗ detection stale` — last detection >9 days
```bash
kubectl -n k8s-upgrade get cronjob k8s-version-check
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp | tail -5
```
If the CronJob hasn't fired on time, suspect:
- `suspend=true` on the CronJob (`var.enabled=false` in the
`k8s-version-upgrade` Terraform stack)
- Image-pull failure on the version-check pod
- Pushgateway scrape gone stale
## Companion command-line flags
```bash
bash infra/scripts/upgrade_state.sh # rendered table (default)
bash infra/scripts/upgrade_state.sh --json # machine output
bash infra/scripts/upgrade_state.sh --kubeconfig X # override kubeconfig
```

4
.gitleaksignore Normal file
View file

@ -0,0 +1,4 @@
# git-crypt encrypts these at rest; the working-tree plaintext is local-only.
# gitleaks scans the staged working-tree copy and can't see that they're
# encrypted on disk in git, so allowlist by fingerprint.
stacks/recruiter-responder/secrets/privkey.pem:private-key:1

View file

@ -154,6 +154,37 @@ lifecycle {
**Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
### `# KYVERNO_LIFECYCLE_V2` — Keel auto-update annotations
When a namespace is labeled `keel.sh/enrolled=true`, the `inject-keel-annotations` ClusterPolicy (`stacks/kyverno/modules/kyverno/keel-annotations.tf`) injects three annotations on every Deployment / StatefulSet / DaemonSet:
```
keel.sh/policy: force
keel.sh/trigger: poll
keel.sh/pollSchedule: "@every 1h"
```
To suppress the resulting Terraform drift, **enrolled workloads** must extend their `ignore_changes` block:
```hcl
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
```
The V2 snippet is added **per workload** as namespaces are phase-enrolled — not as a mass sweep. Workloads in un-enrolled namespaces do not receive the annotation and don't need the V2 block.
Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment metadata (not pod template); the policy's `exclude` clause respects it, no annotation gets injected, no `ignore_changes` needed.
**Audit**: `rg "KYVERNO_LIFECYCLE_V2" stacks/` — count should equal the number of enrolled workloads.
**Design context**: `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`.
## Tier System
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)

150
CONTEXT.md Normal file
View file

@ -0,0 +1,150 @@
# Infra
Terragrunt-managed homelab declaring a 5-node Kubernetes cluster on a single Proxmox host. Vault is the secrets source of truth; everything else flows from this repo via `scripts/tg apply`.
## Language
### Code organization
**Service**:
The deployed app as a domain concept — one logical thing that runs in the cluster (e.g. immich, technitium, freshrss). Defined by exactly one **Stack**.
_Avoid_: bare "app" without the Service definition; "deployment" (collides with K8s `Deployment`).
**Stack**:
The HCL directory under `stacks/<name>/` that defines a Service, applied independently with `scripts/tg apply`. A Stack is the unit of Terraform organisation; a Service is the running thing. They are 1:1 but not synonyms.
_Avoid_: using "Stack" when you mean the running Service.
**Module**:
A reusable HCL primitive under `modules/`, consumed by Stacks via `source =`.
_Avoid_: "library", "package".
**Factory module**:
A Module that hides convention (defaults, drift handling, secret wiring) behind a small input surface. Canonical examples: `ingress_factory`, `nfs_volume`, `k8s_app`, `helm_app`, `postgres_app`.
_Avoid_: "wrapper".
**State tier**:
Terraform state-backend partition. **Tier 0** = bootstrap Stacks (`infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`) on local SOPS-encrypted state. **Tier 1** = every other Stack, on PG-backed state.
_Avoid_: "phase", "bootstrap stack" — say Tier 0 explicitly.
### Cluster
**Node**:
A K8s worker VM (`k8s-master`, `k8s-node1..4`). Default reading of the bare word "node" in this repo.
_Avoid_: "k8s node" (redundant), "host" (ambiguous).
**PVE node** / **PVE host**:
The single physical Dell R730 running Proxmox; sole hypervisor and sole NFS server. There is exactly one.
_Avoid_: "server", "hypervisor", "Proxmox" alone when you mean the host.
**Namespace tier**:
A namespace-prefix partition (`0-core-*`, `1-cluster-*`, `2-gpu-*`, `3-edge-*`, `4-aux-*`) driving PriorityClass, default resources, and ResourceQuota — generated by **Kyverno policy** from the namespace name. Orthogonal to **State tier**.
_Avoid_: "Service tier" (the partition is on the namespace, not the Service); collapsing Namespace tier with State tier — they are different axes.
**Kyverno policy**:
The convention engine of the cluster — a ClusterPolicy or Policy resource that mutates/generates/validates on admission. Owns Namespace tier limits/quotas, `dns_config` injection on every pod-owning workload, Forgejo pull-credential sync across namespaces, TLS-secret replication. When the repo says "this happens automatically", a Kyverno policy is usually the actor.
_Avoid_: bare "policy" (overloaded with Vault, RBAC, NetworkPolicy).
**Critical-path Service**:
One of {Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared} — replicas ≥3, PDB enforced, monitored independently.
_Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
**Namespace-owner**:
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains.
_Avoid_: bare "user", "tenant".
### Networking
**Public domain**:
`viktorbarzin.me`, served through Cloudflare. DNS records are either **proxied** (Cloudflare CDN/WAF in front) or **non-proxied** (direct A/AAAA reachable via Cloudflared Tunnel).
_Avoid_: "external", "outside".
**Internal domain**:
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
_Avoid_: bare "lan", "private", "intranet".
**Ingress auth tier**:
The `auth = "..."` parameter on `ingress_factory`, one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API).
_Avoid_: "auth mode" — the canonical key is `auth`.
**Authentik outpost**:
A standalone Authentik deployment that terminates the proxy/auth flow for a specific binding model. The repo runs two distinct ones: the default outpost (used by `auth = "required"`) and the `public` outpost (anonymous binding, used by `auth = "public"`).
_Avoid_: conflating outpost with Authentik core; "Authentik instance".
**Cloudflared Tunnel**:
The channel by which non-proxied **public domain** traffic reaches the cluster, terminating at Traefik. Backs every `dns_type = "non-proxied"` record and is the fallback path for the wildcard `*.viktorbarzin.me`.
_Avoid_: "the tunnel" without "Cloudflared" (could mean Headscale).
**Ingress chain**:
The opinionated stack of Traefik middlewares that `ingress_factory` layers onto every Ingress. Slots, in order: forward-auth (per **Ingress auth tier**) → anti-AI scraping (default-on when no Authentik is in the path) → CrowdSec bouncer (fail-open) → retry (2× / 100ms) → rate-limit (429, not 503). Adding or removing a middleware is a Stack-level choice, but the chain order is convention.
_Avoid_: "middleware list", "Traefik chain". The Anubis PoW gate is upstream of this chain, not inside it.
### Storage
**proxmox-lvm-encrypted**:
Default StorageClass for any workload holding sensitive data (databases, auth, password managers, email, financial data). LUKS2 over a Proxmox LVM-thin LV.
_Avoid_: bare "encrypted PVC" — name the StorageClass.
**proxmox-lvm**:
Block StorageClass for non-sensitive workloads (caches, monitoring data, indexes, app state without secrets).
**NFS volume**:
RWX file storage for shared media libraries, large datasets, or anything that needs to be inspected from outside K8s. Provisioned via the `nfs_volume` Module.
_Avoid_: "shared storage" (ambiguous).
**nfs-truenas StorageClass**:
A historical SC name retained only because StorageClass strings are immutable on bound PVs. The underlying server is the **PVE host**, not TrueNAS; TrueNAS is decommissioned.
_Avoid_: assuming this means TrueNAS.
**3-2-1 backup**:
The named posture of where data lives: **Copy 1** = live on the PVE thin pool (sdc), **Copy 2** = sda backup disk (`/mnt/backup`), **Copy 3** = offsite Synology NAS. Per-PVC file-level rsync from LVM thin snapshots; databases additionally dump to NFS for per-DB restore.
_Avoid_: bare "backup" without saying which copy you mean (a service is "backed up" only once it's on Copy 2; Copy 3 is the disaster floor).
### Secrets
**Vault path**:
Convention: `secret/<service>` for Service-owned secrets, `secret/viktor` for personal/global, `secret/platform` for cluster-wide maps (`k8s_users`, `homepage_credentials`).
_Avoid_: conflating Vault path (e.g. `secret/viktor`) with Vault field (e.g. `forgejo_pull_token`).
**ExternalSecret** / **ESO**:
A K8s manifest that materialises a Vault KV value as a K8s Secret. Two ClusterSecretStores: `vault-kv` (KV engine) and `vault-database` (rotating DB creds).
**Plan-time secret**:
A secret value read in Terraform via `data "kubernetes_secret"` (i.e. via the ESO-created K8s Secret) at plan time, with no Vault provider call. Distinct from a **vault data source** read (`data "vault_kv_secret_v2"`), which still goes through the Vault provider. A few Stacks remain hybrid (plan-time for env vars, vault data source for module inputs).
**Sealed Secret**:
A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinct from ExternalSecret — Sealed Secrets carry their own bytes, ExternalSecrets reference Vault.
### CI/CD
**GHA build + Woodpecker deploy**:
The split where Docker images are built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline. Repos that can't fit GHA limits stay on Woodpecker for build too.
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy".
**Anubis**:
A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
## Relationships
- A **Service** is defined by exactly one **Stack**, which declares zero or more **Modules** and resolves to one or more K8s workloads.
- A **Namespace-owner** owns one or more namespaces and one or more public subdomains.
- A **Service** owns its **Vault path** at `secret/<service>`, surfaces values through **ExternalSecrets**, and reads them at plan time via **plan-time secrets**.
- An **Ingress** picks exactly one **Ingress auth tier**; the choice defines how strangers reach the backend.
- A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync.
- **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa.
## Example dialogue
> **Dev:** "I'm adding a new **Service** — FastAPI backend with its own JWT login. Do I need Authentik?"
> **Domain expert:** "If the FastAPI login is the gate, set `auth = "app"` on the ingress. That records the intent that you _chose_ not to layer Authentik — leave a one-line comment above stating what gates the Service, or `scripts/tg` will refuse the apply."
> **Dev:** "And storage?"
> **Domain expert:** "Does it hold user data? If yes, `proxmox-lvm-encrypted` — that's the default for anything sensitive. Add a backup CronJob writing to `/mnt/main/<service>-backup/`. If the data is just caches, plain `proxmox-lvm` is fine."
> **Dev:** "What about a Secret with the JWT signing key?"
> **Domain expert:** "Put the key in `secret/<service>` in Vault, then declare an **ExternalSecret** to materialise it as a K8s Secret. Read it at plan time with `data "kubernetes_secret"` — that keeps Vault out of the plan path."
## Flagged ambiguities
- **"tier"** is overloaded — *Namespace tier* (`0-core`..`4-aux`, scheduling priority) is distinct from *State tier* (Tier 0 / Tier 1, Terraform backend partition). Always qualify which axis.
- **"node"** can mean a K8s Node (default) or a PVE node. For Proxmox-level statements, say **PVE node** explicitly.
- **"service"** spans two distinct concepts: the deployed app (capitalised **Service**, this repo's domain noun) and the K8s `Service` object (in backticks or qualified "K8s Service"). Lowercase "service" in prose is fine when context disambiguates; flag it when it doesn't.
- **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which.
- **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering.

View file

@ -7,9 +7,11 @@ ARG SOPS_VERSION=3.9.4
ARG KUBECTL_VERSION=1.34.0
ARG VAULT_VERSION=1.18.1
# Install system packages (single layer)
# Install system packages (single layer).
# python3: required by scripts/check-ingress-auth-comments.py, invoked
# by scripts/tg before every plan/apply.
RUN apk add --no-cache \
bash curl git git-crypt jq openssh-client openssl unzip \
bash curl git git-crypt jq openssh-client openssl python3 unzip \
&& rm -rf /var/cache/apk/*
# Terraform

View file

@ -44,7 +44,7 @@ graph TB
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
| Traefik ForwardAuth | - | `ingress_factory` module | Middleware for protected ingresses |
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
@ -52,7 +52,16 @@ graph TB
### Forward Authentication Flow
Services configured with `protected = true` in the `ingress_factory` module automatically get Traefik ForwardAuth middleware configured. When an unauthenticated user accesses a protected service:
Services pick an auth tier via the `auth` enum on the `ingress_factory` module (default `"required"`, fail-closed):
| Tier | Effect | When to use |
|------|--------|-------------|
| `"required"` | Authentik forward-auth gates every request | Backend has no own user auth — Authentik is the only gate |
| `"app"` | No Authentik middleware; backend's own login is the gate | Backend handles its own user auth (NextAuth, Django, OAuth, bearer-token API) |
| `"public"` | Authentik anonymous binding via `public` outpost | Audit trail without gating; only works for top-level browser navigation |
| `"none"` | No Authentik middleware at all | Anubis-fronted content, webhooks, OAuth callbacks, native-client APIs (CalDAV, WebDAV, Git) |
When `auth = "required"`, an unauthenticated request flows:
1. Request hits Traefik ingress
2. ForwardAuth middleware calls Authentik embedded outpost
@ -64,6 +73,8 @@ Services configured with `protected = true` in the `ingress_factory` module auto
Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.
**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
### Social Login & Invitation Flow
All new users must use an invitation link to register. The invitation-enrollment flow:
@ -144,8 +155,9 @@ The public client flow:
| Path | Purpose |
|------|---------|
| `stacks/authentik/` | Authentik deployment (servers, workers, PgBouncer) |
| `stacks/platform/modules/ingress_factory/` | Traefik ForwardAuth middleware config |
| `stacks/platform/modules/traefik/middleware.tf` | ForwardAuth middleware definition |
| `modules/kubernetes/ingress_factory/` | Auth-tier enum + per-ingress middleware composition |
| `stacks/traefik/modules/traefik/middleware.tf` | ForwardAuth middleware definitions (required + public outposts) |
| `scripts/check-ingress-auth-comments.py` | Comment-convention guard wired into `scripts/tg` |
| `stacks/vault/auth.tf` | Vault OIDC and K8s auth methods |
### Vault Paths
@ -160,17 +172,40 @@ The public client flow:
- `stacks/platform/` - Traefik ingress with ForwardAuth
- `stacks/vault/` - Vault auth methods
### Ingress Protection Example
### Ingress Protection Examples
Authentik-gated admin UI (default):
```hcl
module "myapp_ingress" {
source = "./modules/ingress_factory"
source = "../../modules/kubernetes/ingress_factory"
name = "myapp"
namespace = "myapp"
tls_secret_name = var.tls_secret_name
# auth = "required" is the default — Authentik forward-auth is the gate.
}
```
name = "myapp"
host = "myapp.viktorbarzin.me"
protected = true # Enables ForwardAuth middleware
Backend with its own user auth (no Authentik in the way):
```hcl
module "myapp_ingress" {
source = "../../modules/kubernetes/ingress_factory"
name = "myapp"
namespace = "myapp"
tls_secret_name = var.tls_secret_name
# auth = "app": myapp uses NextAuth + Google OAuth; mobile clients can't follow Authentik 302.
auth = "app"
}
```
# ... other config
Intentionally public webhook receiver:
```hcl
module "myapp_ingress" {
source = "../../modules/kubernetes/ingress_factory"
name = "webhook"
namespace = "webhooks"
tls_secret_name = var.tls_secret_name
# auth = "none": upstream signs payloads with HMAC; no user identity expected.
auth = "none"
}
```

View file

@ -1,4 +1,10 @@
# Automated Service Upgrades
# Automated Upgrades
This doc covers three independent automation paths:
1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
2. **OS-level upgrades on K8s nodes**`unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
## Overview
@ -205,3 +211,145 @@ The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **
- **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
- **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
- **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
## K8s Node OS Upgrades
Independent of the service-upgrade pipeline above. Drives apt package updates + reboots on the 5 K8s VMs (master + 4 workers).
### Stack
- **In-guest**: `unattended-upgrades` runs apt upgrades within Allowed-Origins (`-security`, `-updates`, ESM). Package-Blacklist excludes runtime components (`containerd`, `containerd.io`, `runc`, `cri-tools`, `kubernetes-cni`, `calico-*`, `cni-plugins-*`, `docker-ce`). `apt-mark hold` on `kubelet`, `kubeadm`, `kubectl` (and runtime pkgs as belt-and-braces). `Automatic-Reboot=false` — kured handles reboots.
- **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window 02:00-06:00 Europe/London every day of the week (Mon-Fri-only restriction dropped 2026-05-16 — see PM), period=1h, concurrency=1, reboot-delay=30s, drainTimeout=30m.
- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window).
- **Reboot gate (Prometheus)**: kured `--prometheus-url` polls `prometheus-server.monitoring.svc:80` before each drain. ANY firing alert blocks unless it matches the ignore-regex `^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$`.
- **Health alert library**: 10 alerts in the `Upgrade Gates` group (`prometheus_chart_values.tpl`): `KubeAPIServerDown`, `KubeStateMetricsDown`, `PrometheusRuleEvaluationFailing`, `PVCStuckPending`, `RecentNodeReboot` (the explicit 24h soak signal), `MysqlStandaloneDown`, `ClusterPodReadyRatioDropped`, `NodeMemoryPressure`, `NodeDiskPressure`, `KubeQuotaAlmostFull`. Plus the existing 200+ alerts in the cluster-wide library (anything firing blocks kured).
- **Notifications**: kured `notifyUrl` posts drain-start/drain-finish to Slack via Vault `secret/kured.slack_kured_webhook`. Alertmanager separately routes critical alerts to `#alerts`.
### Source of truth
| Concern | Location |
|---|---|
| Package config (uu, holds, blacklist) | `modules/create-template-vm/cloud_init.yaml` (within `is_k8s_template`) |
| kured Helm release + sentinel-gate DS | `stacks/kured/main.tf` |
| Upgrade Gates alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
### Day-2 changes
Cloud-init only runs on first boot. Existing nodes are brought into compliance with a one-shot SSH push — see the runbook section "Restore / re-apply unattended-upgrades config to existing nodes" in `docs/runbooks/k8s-node-auto-upgrades.md`.
### Why this design
The 26h cluster outage on 2026-03-16 was triggered by an unattended-upgrades kernel push that corrupted containerd's overlayfs snapshotter cluster-wide. The remediations:
- 24h soak (sentinel-gate Check 4) gives a full day of observation between consecutive node reboots — broken updates show up as Prometheus alerts before any other node restarts.
- Prometheus halt-on-alert turns ANY firing alert into a hard block — including the 6 Node Runtime Health alerts and the 10 Upgrade Gates alerts that explicitly model "the cluster is in a bad state."
- Package-Blacklist on runtime components prevents the exact failure mode (containerd/runc auto-bumps).
- `Automatic-Reboot=false` keeps reboot policy in kured (window, ordering, gating), not in apt.
### Operational reference
See `docs/runbooks/k8s-node-auto-upgrades.md` for: verifying health, halting rollout, restoring config to a re-imaged node, rolling back a bad upgrade, and the past-incident timeline.
## K8s Version Upgrades
Independent of the OS-upgrade and service-upgrade pipelines. Drives
kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
### Architecture
```
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns)
│ probe apt-cache madison kubeadm (master) → latest available patch
│ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
│ push k8s_upgrade_available metric to Pushgateway
▼ if a target is detected
envsubst on /template/job-template.yaml | kubectl apply -f -
│ spawns Job 0 = k8s-upgrade-preflight-<target_version>
Job 0 — preflight (pinned: k8s-node1)
Job 1 — master upgrade (pinned: k8s-node1) drains k8s-master
Job 2 — worker (pinned: k8s-node1) drains k8s-node4
Job 3 — worker (pinned: k8s-node1) drains k8s-node3
Job 4 — worker (pinned: k8s-node1) drains k8s-node2
Job 5 — worker (pinned: k8s-master) drains k8s-node1 ← control-plane toleration
Job 6 — postflight (no pinning)
```
Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl
apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`)
so `apply` reconciles to a single Job per run — re-running a failed Job
won't duplicate downstream Jobs.
### Self-preemption history (the reason for the Job-chain rewrite)
The v1 design ran the whole upgrade inside the `claude-agent-service`
Deployment (1 replica, no nodeSelector). On 2026-05-11 the agent's pod was
scheduled to k8s-node4. When the agent ran `kubectl drain k8s-node4` during
Stage 6, it evicted itself — the bash process died after the drain but
before the SSH-pipe to install kubeadm on node4. The cluster ended up
half-upgraded (master at v1.34.7, workers at v1.34.2). The rewrite to a
chain of `nodeSelector`-pinned Jobs eliminates this failure mode because
each Job's pod and its drain target are always different nodes.
### Components
- **Detection CronJob + ConfigMaps + RBAC**: `infra/stacks/k8s-version-upgrade/main.tf`.
- Image is the claude-agent-service image (kubectl + ssh-client + curl + jq + envsubst).
- One unified ServiceAccount `k8s-upgrade-job` serves both the detection CronJob and every chain Job.
- **Phase body**: `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`.
Dispatches on `$PHASE` (preflight | master | worker | postflight). Computes
`NEXT_PHASE` / `NEXT_TARGET_NODE` / `NEXT_RUN_ON` and spawns the next Job.
Includes a `predrain_unstick` helper that pre-deletes pods on the target
node whose PDB has `disruptionsAllowed=0` (otherwise drain loops forever on
single-replica deployments like Anubis instances).
- **Job template**: `infra/stacks/k8s-version-upgrade/job-template.yaml`.
envsubst-rendered at runtime. Mounts a `creds` Secret, a `scripts`
ConfigMap, and a `template` ConfigMap into each Job pod.
- **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes
`--role master|worker --release X.Y.Z`. Piped via SSH into each node by
upgrade-step.sh.
- **Three Upgrade Gates alerts**:
- `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing``k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
- `K8sUpgradeStalled``k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
- `k8s_upgrade_started_timestamp` (set in preflight; used by `K8sUpgradeStalled`)
- `k8s_upgrade_available{kind,running,target}` (pushed by detection CronJob)
- `k8s_version_check_last_run_timestamp` (staleness watchdog)
### Source of truth
| Concern | Location |
|---|---|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `stacks/k8s-version-upgrade/main.tf` |
| Phase orchestration | `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
| Job template | `stacks/k8s-version-upgrade/job-template.yaml` |
| Per-node upgrade script | `scripts/update_k8s.sh` |
| Alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
| Deprecated agent prompt (reference) | `.claude/agents/k8s-version-upgrade.deprecated.md` |
### Why this design
The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply` is an outage. Mitigations:
- **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain).
- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration).
- **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
- **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
- **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
- **PDB-blocked pods don't stall the chain**. `predrain_unstick` deletes PDB=0 pods on the target node directly (bypassing the eviction API), so the parent Deployment recreates them elsewhere. This was the workaround applied manually during the 2026-05-11 recovery for Anubis single-replica instances.
### Secrets
| Secret | Vault Path | Purpose |
|--------|-----------|---------|
| SSH private key | `secret/k8s-upgrade.ssh_key` | Jobs SSH `wizard@<node>` |
| SSH public key | `secret/k8s-upgrade.ssh_key_pub` | Deployed to nodes' `~/.ssh/authorized_keys` |
| Slack webhook | `secret/k8s-upgrade.slack_webhook` | Pipeline notifications (separate channel from kured) |
The previous `api_bearer_token` entry is gone — the chain does not POST to `claude-agent-service`.
### Operational reference
See `docs/runbooks/k8s-version-upgrade.md` for: verifying health, manually triggering detection, killing a stuck Job, skipping a phase, rollback paths (master / worker / mid-flight abort), and SSH key rotation.

View file

@ -18,7 +18,7 @@ graph TB
subgraph Proxmox["Proxmox VE"]
direction TB
MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
NODE1["VM 201: k8s-node1<br/>16c / 48GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
@ -62,7 +62,7 @@ graph TB
| Model | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~160GB total (5 K8s VMs x 32GB) |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |
@ -72,12 +72,20 @@ graph TB
| VM | VMID | vCPUs | RAM | Network | Role | Taints |
|----|------|-------|-----|---------|------|--------|
| k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
| k8s-node1 | 201 | 16 | 48GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
**Total Cluster Resources**: 48 vCPUs, ~160GB RAM (5 nodes x 32GB)
**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
> **node1 RAM (2026-05-10)**: bumped from 32 → 48 GiB out-of-band via
> `qm set 201 --memory 49152` because VMID 201 is intentionally not
> managed by Terraform yet (telmate/proxmox provider bug with iSCSI
> PVCs — see `infra/stacks/infra/main.tf` line 442). Driver: GPU
> multi-tenancy (frigate + ytdlp + llama-swap + immich-ml) was
> hitting 94% memory-request saturation on the old size. Adopt this
> VM into TF (`module "k8s-node1"`) once we've migrated to bpg/proxmox.
### GPU Passthrough

View file

@ -0,0 +1,118 @@
# llama-cpp / llama-swap
## Overview
In-cluster, OpenAI-compatible vision-LLM endpoint. A single
`mostlygeek/llama-swap:cuda` Deployment fronts three GGUF models
served by `llama.cpp`'s `llama-server` subprocesses, hot-swapped on
demand by `llama-swap`. One Service, one `/v1` endpoint, model
selected by the request body `model` field.
Initial use case: vision-LLM benchmark on a curated Immich album,
choosing between **Qwen3-VL-8B**, **MiniCPM-V-4.5**, and
**Qwen3-VL-4B** for instagram-poster's candidate-scoring path.
Future consumers (Home Assistant, agentic tooling) can hit the same
endpoint via LiteLLM at the cluster gateway.
First benchmark run (2026-05-10): see
`infra/docs/benchmarks/2026-05-10-vision-llm.md`. Verdict: **qwen3vl-4b**
for the request path (3.55 s p50, 100% parse, decisive top-N
distribution). qwen3vl-8b for caption polish on top picks.
## Why llama.cpp + llama-swap (not Ollama)
Verified across 7+7 research/challenger subagents (2026-05-10):
- **Broader OpenAI-compat surface**`tool_choice`, `image_url`
remote URLs, native bearer auth via `--api-key`, `/reranking`,
Anthropic `/v1/messages` shim.
- **Native observability**`/metrics`, `/health` returns 503 during
model load (proper K8s startup-probe semantics), `/slots` per-slot
tracking. Ollama still has the `/metrics` issue
[#3144](https://github.com/ollama/ollama/issues/3144) open.
- **Stricter structured output** — native GBNF on `/completion`,
JSON-schema-to-GBNF converter, optional `LLAMA_LLGUIDANCE=ON`.
- **Vision coverage for our targets** — llama.cpp ≥ b9095 supports
Qwen3-VL and MiniCPM-V-4.5 natively; Ollama needs the official
`qwen3-vl` tag (community GGUFs broken — split-mmproj
[#14575](https://github.com/ollama/ollama/issues/14575)) and the
`openbmb/minicpm-v4.5` Ollama tag is 8 months stale.
Ollama still wins for Llama-3.2-Vision (`mllama` cross-attention) and
ecosystem polish (Go/JS SDKs, langchain-ollama, n8n nodes, HA built-in)
— the latter is mooted by fronting llama.cpp with **LiteLLM** at the
gateway.
## Components
| Component | Resource | Purpose |
|-----------|----------|---------|
| llama-swap Deployment | `kubernetes_deployment.llama_swap` | One pod, one OpenAI-compat endpoint, hot-swaps model subprocesses |
| llama-swap ConfigMap | `kubernetes_config_map.llama_swap_config` | YAML model entries (cmd, ttl, checkEndpoint) |
| llama-swap Service | `kubernetes_service.llama_swap` | ClusterIP `:8080``llama-swap.llama-cpp.svc.cluster.local` |
| Models PVC | `module.nfs_models` (NFS-RWX `/srv/nfs-ssd/llamacpp`) | Shared GGUF store, 30Gi |
| Download Job | `kubernetes_job_v1.download_models` | Pulls Q4_K_M GGUF + mmproj per model, creates stable `model.gguf` / `mmproj.gguf` symlinks, warms page cache |
## Storage
NFS-SSD on the Proxmox host (`192.168.1.127:/srv/nfs-ssd/llamacpp`).
Cold model load is ~40s × 3 startups ≈ 2 min in a 25-30 min benchmark
run (<10%). The download Job warms the kernel page cache after pulling
GGUFs so first inference reads from warm cache.
If steady-state cold-load latency becomes a problem, **Path B**: carve
~50Gi from a Proxmox SSD as an LV, attach as a vdisk to k8s-node1,
mount on-host, expose via a static `kubernetes_persistent_volume` with
`local` source + node1 affinity. NVMe-class load times. Out of scope
for the initial deployment.
## GPU allocation
The llama-swap pod requests `nvidia.com/gpu: 1` (whole-T4
allocation). The shared T4 is also used by Immich's ML pod
(`immich.immich-machine-learning`); only one of the two can hold the
GPU at a time. Operator must scale immich-ml to 0 before running a
benchmark and restore it after:
```bash
kubectl scale -n immich deploy/immich-machine-learning --replicas=0
# ... benchmark ...
kubectl scale -n immich deploy/immich-machine-learning --replicas=1
```
## Models served
| ID | HF repo | Quant | Ctx | mmproj |
|----|---------|-------|-----|--------|
| `qwen3vl-8b` | `Qwen/Qwen3-VL-8B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
| `minicpm-v-4-5` | `openbmb/MiniCPM-V-4_5-gguf` | Q4_K_M | 3072 | yes |
| `qwen3vl-4b` | `Qwen/Qwen3-VL-4B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
llama.cpp build pinned via the `llama-swap:cuda` image (ships a
recent llama.cpp ≥ b9095, which includes Qwen3-VL projection fix
[#20899](https://github.com/ggml-org/llama.cpp/issues/20899) and
mtmd Flash-Attention regression fix
[#16962](https://github.com/ggml-org/llama.cpp/issues/16962)).
## Endpoints
- `GET /v1/models` — list configured models
- `POST /v1/chat/completions` — standard OpenAI chat (vision via
`image_url` content parts, base64 or remote URL)
- `POST /completion` — llama.cpp native completion (preferred for
GBNF-constrained structured output to avoid 2026 regression magnet
on `/v1/chat/completions`)
- `GET /metrics` — Prometheus
- `GET /health` — 200 once a model is fully loaded; 503 during load
## Known issues / decisions
- **Cluster-wide GPU contention** — only one of llama-swap or
immich-ml can hold the T4. No GPU sharing solution wired in
(MPS/MIG would help but T4 has no MIG and MPS is overkill for two
workloads).
- **Filename-agnostic config** — the download Job creates stable
`model.gguf` / `mmproj.gguf` symlinks per model dir so the
llama-swap config doesn't need to track exact HF filenames (which
change between releases).
- **TF schema**`llama-cpp` (PG backend on dbaas).

View file

@ -57,7 +57,7 @@ graph TB
|-----------|---------|----------|---------|
| Prometheus | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Metrics collection and storage, scrape configs for all services |
| Grafana | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) |
| Loki | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
| Loki | **DEPLOYED 2026-05-18** (SingleBinary mode, 30d retention, 50Gi PVC on `proxmox-lvm`, ruler enabled → Alertmanager). Re-enabled from previous "operational overhead" disable. Ships logs via Alloy DaemonSet (now on all nodes including master after 2026-05-19 toleration add). | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
| Alertmanager | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Alert routing with cascade inhibitions |
| Uptime Kuma | Latest (Diun monitored) | `stacks/uptime-kuma/` | Internal + external HTTP monitors, status page |
| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
@ -176,6 +176,35 @@ The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10
Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (port 993) on `10.0.20.202`, and Dovecot exporter metrics on port 9166.
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
| # | Source | Event | Severity |
|---|---|---|---|
| K2 | kube-audit | SA token used from outside cluster | critical |
| K3 | kube-audit | Secret read in vault/sealed-secrets/external-secrets by non-allowlisted SA | critical |
| K4 | kube-audit | Exec into vault/kube-system/dbaas/cnpg-system pod by non-allowlisted user | warning |
| K5 | kube-audit | Mass delete (>5 Pod/Secret/CM in 60s) | critical |
| K6 | kube-audit | Audit policy itself modified | critical |
| K7 | kube-audit | New `*,*` ClusterRole created | warning |
| K8 | kube-audit | Anonymous binding granted | critical |
| K9 | kube-audit | `me@viktorbarzin.me` request from non-allowlist sourceIP | critical |
| V1 | vault-audit | Root token created | critical |
| V2 | vault-audit | Audit device disabled/modified | critical |
| V3 | vault-audit | Seal status changed | critical |
| V4 | vault-audit | Policy written/modified (allowlist Terraform actor) | warning |
| V5 | vault-audit | Auth failure spike >10/min | warning |
| V6 | vault-audit | Token with policies different from parent created | critical |
| V7 | vault-audit | Viktor's entity_id from non-allowlist remote_addr (requires `x_forwarded_for_authorized_addrs`) | critical |
| S1 | sshd-pve | sshd auth success from non-allowlist IP | critical |
K1 (cluster-admin grant) intentionally skipped — see security.md.
Allowlist source-IP CIDRs (used by K2, K9, V7, S1): `10.0.20.0/22`, `192.168.1.0/24`, K8s pod CIDR, K8s service CIDR, Headscale tailnet. Policy: no public-IP access; all admin paths transit LAN or Headscale.
IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-policy tuning. Retention: 90d for security streams.
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup

View file

@ -111,16 +111,20 @@ Namespaces are labeled with a tier (`tier: 0` through `tier: 4`). Kyverno auto-g
This prevents resource exhaustion and enforces governance without manual quota management.
#### Security Policies (ALL in Audit Mode)
#### Security Policies
**Why audit mode?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
**Why audit mode first?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
| Policy | Purpose | Enforcement |
|--------|---------|-------------|
| `deny-privileged-containers` | Block privileged pods | Audit |
| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit |
| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit |
| `require-trusted-registries` | Only allow approved image registries | Audit |
**Wave 1 plan (locked 2026-05-18, see beads `code-8ywc`):** all four below flip from Audit → Enforce with `failurePolicy: Ignore` preserved and an exclude list covering the 31 critical namespaces (keel, calico-system, authentik, vault, cnpg-system, dbaas, monitoring, traefik, technitium, mailserver, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, nvidia, kube-system, cloudflared, crowdsec, reverse-proxy, reloader, descheduler, vpa, redis, sealed-secrets, headscale, wireguard, xray, infra-maintenance, metrics-server, tigera-operator). Phased: one policy per day with PolicyReport observation.
| Policy | Purpose | Current | Planned (wave 1) |
|--------|---------|---------|------------------|
| `deny-privileged-containers` | Block privileged pods | Audit | **Enforce** |
| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit | **Enforce** |
| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit | **Enforce** |
| `require-trusted-registries` | Only allow approved image registries (forgejo.viktorbarzin.me, docker.io, ghcr.io, quay.io, registry.k8s.io, gcr.io, oci://ghcr.io/sergelogvinov) | Audit | **Enforce** |
Cosign `verify-images` is **deferred** beyond wave 1 — needs image-signing infrastructure (Sigstore / cosign + KMS) before it can enforce meaningfully.
#### Operational Policies
@ -163,6 +167,112 @@ Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap l
**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
### Audit Logging & Anomaly Detection (Wave 1)
Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
| Item | State |
|---|---|
| W1.2 Vault `file` audit device | **LIVE**`vault_audit.file` in `stacks/vault/main.tf:287`, writing to `/vault/audit/vault-audit.log` on `proxmox-lvm-encrypted` PVC |
| W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
| W1.2 Vault audit log shipping to Loki | **LIVE**`audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
| W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
| W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
| W1.7 NetworkPolicy phased enforce | **PENDING** — needs ~1 week of W1.6 observation, then build empirical allowlist from Loki queries, flip GNP rules from `[Log, Allow]` to `[Allow specific dests, Deny rest]`. |
The block below documents the locked design.
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
#### Detection sources
| Source | Mechanism | Ships via | Loki job label |
|---|---|---|---|
| K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
| Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
| Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
#### Alert rules (16 total)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
| # | Event | Severity |
|---|---|---|
| K2 | ServiceAccount token used from outside cluster (sourceIPs not in pod CIDR or trusted LAN) | critical |
| K3 | Secret READ in `vault`, `sealed-secrets`, `external-secrets` namespaces by a non-allowlisted ServiceAccount | critical |
| K4 | Exec into a pod in `vault`, `kube-system`, `dbaas`, `cnpg-system` (excluding `me@viktorbarzin.me` + 1 break-glass SA) | warning |
| K5 | >5 deletes of `Pod`, `Secret`, or `ConfigMap` in 60s by any single actor | critical |
| K6 | `audit-log-path` flag or audit policy modified on kube-apiserver | critical |
| K7 | New ClusterRole created with `verbs: ["*"]` and `resources: ["*"]` | warning |
| K8 | Anonymous binding granted (any RoleBinding/CRB referencing `system:anonymous` or `system:unauthenticated`) | critical |
| K9 | Authenticated request where `user.username == "me@viktorbarzin.me"` AND `sourceIPs[0]` NOT in allowlist CIDRs | critical |
**Vault audit (V1-V7):**
| # | Event | Severity |
|---|---|---|
| V1 | Root token created | critical |
| V2 | Audit device disabled or modified | critical |
| V3 | Seal status changed (`sys/seal` write) | critical |
| V4 | Policy written or modified (allowlist Terraform-driven writes by source IP / token role) | warning |
| V5 | Authentication failure spike >10/min on any auth method | warning |
| V6 | Token created with policies different from parent (privilege escalation) | critical |
| V7 | Vault audit event where `auth.entity_id == <viktor-entity-id>` AND `remote_addr` NOT in allowlist CIDRs | critical |
**Host (S1):**
| # | Event | Severity |
|---|---|---|
| S1 | PVE sshd auth success from source IP NOT in allowlist | critical |
#### Allowlist — "expected source IPs" for K2, K9, V7, S1
| CIDR | Source |
|---|---|
| `10.0.20.0/22` | VLAN 20 (K8s cluster + main LAN) |
| `192.168.1.0/24` | Proxmox host LAN + Sofia LAN (same RFC1918 block in both physical locations; cross-site traffic transits Headscale so the CIDR matches only on-LAN clients in either location) |
| K8s pod CIDR (verify at implementation time) | In-cluster pods talking to apiserver |
| K8s service CIDR | Service-to-apiserver traffic |
| Headscale tailnet | VPN-connected devices |
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
#### Why no canary tokens
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
#### Why no K1 (cluster-admin grant detection)
Viktor opted out. Gap covered indirectly by K7 (new `*,*` ClusterRole created), K8 (anonymous binding), and K3 (secret read on Vault namespace) — most attacker progressions toward cluster-admin trigger one of these.
#### IOPS / disk-wear
Custom audit policy reduces volume ~80-90% vs default Metadata-everywhere. Loki tuned for fewer larger chunks: `chunk_target_size: 1.5MB`, `chunk_idle_period: 30m`, snappy compression. Retention 90d for security streams (matches Technitium DNS query log precedent). Net estimate: ~1-2 GB/day additional disk writes after tuning.
### NetworkPolicy Default-Deny Egress (Wave 1 — observe-then-enforce, tier 3+4)
Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
**Approach (γ): cluster-wide observe-then-enforce.**
1. **Week 0:** Enable Calico flow logs cluster-wide. Apply a GlobalNetworkPolicy with selector `tier in {tier-3, tier-4}`, `action: Log` (no Deny). Ship flow logs to Loki.
2. **Week 1:** Build per-namespace egress allowlist from observed traffic. Common allowlist module `tier3_egress_baseline` covers DNS, NTP, internal Vault/ESO/Authentik, Brevo SMTP, Cloudflare API, OAuth providers. Per-namespace add-ons for service-specific external destinations.
3. **Week 2-3:** Apply default-deny + allowlist per-namespace, starting `recruiter-responder` (smallest egress footprint — local llama-cpp). Watch 24-48h per namespace, iterate. Roll out 3-5 namespaces/day.
**Scope exclusions:** tier 0/1/2 namespaces (defer to wave 2), 31 critical infra namespaces (same exclude list as Kyverno).
**DNS handling:** Calico GlobalNetworkPolicy supports domain-based rules via the `domains:` selector which queries CoreDNS internally. Static IPs reserved for fixed-IP services (Brevo SMTP relay).
**Known risks:**
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
### TLS & HTTP/3
**Traefik** handles TLS termination:

View file

@ -0,0 +1,253 @@
# Vision-LLM benchmark — Malaga / Seville album
**Run ID:** `2026-05-10-1424` · **Date:** 2026-05-10 · **Operator:** wizard
100 photos randomly sampled (seed=42) from the Immich album `🇪🇸 Malaga
Seville` (`46565b85-7580-4ac1-91a6-1ece2cf8634d`, 1556 image assets +
9 videos), scored by three local vision-LLMs served by `llama-swap`
on a single Tesla T4. Goal: pick a model to wire into
`instagram-poster`'s `/candidates` ranking path.
## TL;DR
**Recommendation: `qwen3vl-4b`.**
- **Fastest** by a wide margin (3.55 s p50, 60% of qwen3vl-8b),
important once this is in the request path of `/candidates`.
- **100% structured-output success** — same as the other two; GBNF
grammar enforcement worked across the board.
- **Captions are competitive** with the 8B model in qualitative review
(tied or close on 8/10 sampled photos; 8B wins on Flair, 4B wins on
Latency).
- **Most decisive scorer** — 47/100 photos got IG-fit=9 vs 17 for
qwen3vl-8b and 9 for minicpm. We get more signal at the top end
for ranking.
Use qwen3vl-8b for *manual* caption refinement (top-1 of the day) if
caption polish matters. Use minicpm-v-4-5 for nothing immediate — it's
the most conservative scorer and the slowest at high quantiles, with
no offsetting wins in this dataset.
## Setup
- Hardware: 1× Tesla T4 (16 GiB VRAM), `nvidia.com/gpu` time-slicing
enabled (replicas=100), pod scheduled on `k8s-node1`.
- Server: `mostlygeek/llama-swap:cuda` (ships llama.cpp `b9085-046e28443`)
on `llama-swap.llama-cpp.svc.cluster.local:8080`.
- Models: GGUF Q4_K_M, mmproj F16 except qwen3vl-4b which used the
Q8_0 mmproj (alphabetically first matching the glob).
- Image prep: EXIF-transposed, long-edge resized to 1024 px, JPEG q=90,
base64-embedded as `image_url` data URLs.
- Generation: `temperature=0`, `top_k=1`, `enable_thinking=false`,
GBNF grammar pinning the JSON schema (6 fields, 110 ints, ≤8 tags).
- Run isolation: `immich-machine-learning` scaled to 0 for the
duration to avoid noisy GPU contention. *(Diagnostic note: the
scheduling failure that triggered this was actually node1 RAM —
not GPU — at 94% allocated. Time-slicing was already on. Bumping
node1 RAM is tracked as a follow-up.)*
## Headline numbers
| model | n | parse_ok | p50 latency | p95 latency | median IG-fit | median aesthetic |
|-------|---|----------|-------------|-------------|---------------|------------------|
| **qwen3vl-4b** | 100 | 100% | **3.55 s** | 4.06 s | 8.0 | 8.0 |
| minicpm-v-4-5 | 100 | 100% | 5.62 s | 6.00 s | 7.0 | 8.0 |
| qwen3vl-8b | 100 | 100% | 5.98 s | 6.64 s | 7.0 | 8.0 |
Total wall time for the run: **33 m 32 s** (300 calls + 3 cold loads
of ~30 s each).
## What each model is good at
### qwen3vl-4b — fast and decisive
- p50 3.55 s — comfortable for adding to `/candidates` request path.
- IG-fit distribution skews right (47 nines), spreading 6 → 9 fairly
evenly, which is what you want from a *ranker*.
- Captions are emoji-friendly, hashtag-friendly, sometimes
hallucinatory (e.g. labelled a Seville street as "Barcelona's
colourful streets" once).
- Failure mode to watch: occasional double-down on the same caption
template ("Lost in the tiles. 🌿" repeated across two unrelated
blue-dress photos).
### minicpm-v-4-5 — conservative, terse
- Most conservative scorer: 65% of photos got IG-fit=7. Only 9 nines.
Less useful as a top-N ranker because the top is squashed.
- Fastest p95 of the three (6.0 s) but slower p50 than qwen3vl-4b.
- Captions are short and lower-case ("azulejo dreams.",
"sunshine & secrets") — distinct voice but less Instagram-native.
### qwen3vl-8b — most polished captions
- Best subject identification (specifically named "Metropol Parasol"
and "Plaza de España" by name where the others said "modern
architecture" / "plaza").
- Captions read well: "Coffee & calm vibes ☕️", "where modern meets
historic under a brilliant sky".
- Slowest p50 (5.98 s) and tightest score distribution (median 7,
17 nines) — middle of the pack as a ranker.
## Top-10 agreement (Kendall-tau-style overlap)
How many of each model's top-10 IG-fit picks appear in another
model's top-10:
| pair | overlap |
|------|---------|
| qwen3vl-4b ↔ qwen3vl-8b | 5/10 |
| minicpm-v-4-5 ↔ qwen3vl-4b | 4/10 |
| minicpm-v-4-5 ↔ qwen3vl-8b | 4/10 |
Read: there's moderate but not strong agreement. The models pick
roughly half the same "best" photos and half different ones. For
ranking, that's a healthy sign — they're not collapsing to a single
notion of "good", so combining their scores would add real signal.
## Cost-equivalent context
Approximate cost to score the same 100 photos via cloud APIs
(prompt ≈ 1100 tokens incl. image, completion ≈ 100 tokens):
| backend | input | output | per-100 photos |
|---------|-------|--------|----------------|
| Local llama-swap on T4 | — | — | ≈ $0.04 (electricity, ~70 W × 7 min) |
| Anthropic Haiku 4.5 | $1.00/M | $5.00/M | ≈ $0.15 |
| Anthropic Sonnet 4.6 | $3.00/M | $15.00/M | ≈ $0.45 |
| Google Gemini 2.5 Flash | $0.30/M | $2.50/M | ≈ $0.05 |
Local is competitive with Gemini Flash on marginal cost. The case
for keeping it local is privacy (Immich originals never leave the
LAN), no rate-limits, and no per-call quota planning. The case
against is the GPU is finite — adding this to a request path means
sharing T4 time with frigate, ytdlp, and (when we restore it)
immich-ml. With time-slicing on, that's tractable but each tenant
gets fewer full-GPU kernels.
## Sample captions (10 representative photos, all 3 models)
`9f9c9faf-...` — man at ornate arch in Seville
- **minicpm-v-4-5** (7/8): sunshine & secrets
- **qwen3vl-4b** (9/8): Found the perfect spot to pause. 🌞 #travel #architecture
- **qwen3vl-8b** (9/8): Standing in Seville's magic, framed by history and color.
`0d53158b-...` — Metropol Parasol
- **minicpm-v-4-5** (9/8): Seville's skyward statement.
- **qwen3vl-4b** (9/8): Seville's iconic modern structure against a vibrant blue sky.
- **qwen3vl-8b** (9/8): The Metropol Parasol in Seville, where modern meets historic under a brilliant sky.
`23137c1c-...` — coffee shot
- **minicpm-v-4-5** (8/7): a morning pick-me-up and a smile.
- **qwen3vl-4b** (8/7): Coffee & chill. Perfectly imperfect moment.
- **qwen3vl-8b** (9/8): Coffee & calm vibes ☕️
`3376ee87-...` — woman against tiled wall
- **minicpm-v-4-5** (7/8): azulejo dreams.
- **qwen3vl-4b** (9/8): Lost in the tiles. 🌿
- **qwen3vl-8b** (9/8): Blue tiles, golden light, and a moment of quiet grace.
`4f2fea45-...` — courtyard
- **minicpm-v-4-5** (7/8): hidden gems of seville
- **qwen3vl-4b** (7/8): Timeless beauty in a Spanish courtyard. 🌿
- **qwen3vl-8b** (7/8): A serene courtyard in Seville, where palm trees sway under the sun.
`ea713729-...` — flower-market street (qwen3vl-4b confused location)
- **minicpm-v-4-5** (7/8): Seville's hidden gems.
- **qwen3vl-4b** (7/8): Walking through *Barcelona's* colorful streets, backlit by golden hour.
- **qwen3vl-8b** (7/8): Walking through Seville's vibrant streets, lavender in hand.
The full list of 10 sample sets is in the auto-generated section
below; the raw 300-row JSON is at `benchmark-2026-05-10-1424.json`
in this directory.
## Operational cost during the run
- llama-swap pod (1× T4 wholly allocated for the duration): ~33 min.
- Immich-ML downtime: ~33 min. New uploads weren't auto-tagged or
CLIP-embedded during this window. No user-visible impact (Immich
search against already-indexed assets still worked via pgvector).
- Network egress: zero — Immich originals stayed on the LAN, all
scoring traffic was in-cluster.
## Reproducibility
```bash
DATA_DIR=/tmp/benchmark \
IMMICH_API_KEY=… \
LLAMA_SWAP_URL=http://localhost:18080 \
poetry run python -m instagram_poster.benchmark run \
--album-id 46565b85-7580-4ac1-91a6-1ece2cf8634d \
--models qwen3vl-8b,minicpm-v-4-5,qwen3vl-4b \
--limit 100 --random-seed 42 --run-id 2026-05-10-1424
```
The same `--random-seed` reproduces the photo sample exactly. Prompt
version `4bbb7e7721da24d9` is the SHA-256 of the system prompt + user
prompt + GBNF grammar; rerunning under the same prompt version against
the same seed should produce within-noise identical scores (the models
themselves are temperature=0, top_k=1).
## Next steps
- **Wire `qwen3vl-4b` into `instagram-poster`** as an additional ranking
signal alongside CLIP-based recency in `/candidates`. Cache the score
per asset_id so we don't re-pay 4 s on every list refresh.
- **Bump k8s-node1 RAM** so immich-ml + llama-swap can co-exist (drain
→ resize → uncordon, with kubelet `systemReserved` adjusted in
`stacks/infra/main.tf`).
- **Re-benchmark with shared GPU** once node1 RAM is bumped, to get
realistic latency numbers when the T4 is also under load from
immich-ml and frigate.
- **Front llama-swap with LiteLLM** so Home Assistant and any other
consumer can hit one OpenAI-compat gateway. Track separately.
---
## Auto-generated report
Below is the unedited output of `python -m instagram_poster.benchmark
report --run-id 2026-05-10-1424`, kept for diff-checking against
future runs.
### Per-model summary
| model | n | parse_ok % | error % | p50 latency | p95 latency | median IG-fit | median aesthetic |
|-------|---|-----------|--------|------------|-------------|--------------|------------------|
| minicpm-v-4-5 | 100 | 100.0 | 0.0 | 5617 ms | 5998 ms | 7.0 | 8.0 |
| qwen3vl-4b | 100 | 100.0 | 0.0 | 3552 ms | 4063 ms | 8.0 | 8.0 |
| qwen3vl-8b | 100 | 100.0 | 0.0 | 5981 ms | 6637 ms | 7.0 | 8.0 |
### Score histograms (instagram_fit_score 110)
#### minicpm-v-4-5
```
1: (0) 2: (0) 3: (0) 4: (0) 5: (0)
6: ███████ (7)
7: █████████████████████████████████████████████████████████████████ (65)
8: ███████████████████ (19)
9: █████████ (9)
10: (0)
```
#### qwen3vl-4b
```
1: (0) 2: (0) 3: (0) 4: (0) 5: (0)
6: █████ (5)
7: ████████████████ (16)
8: ████████████████████████████████ (32)
9: ███████████████████████████████████████████████ (47)
10: (0)
```
#### qwen3vl-8b
```
1: (0) 2: (0) 3: (0) 4: (0) 5: (0)
6: ███████████ (11)
7: ███████████████████████████████████████████████████████ (55)
8: █████████████████ (17)
9: █████████████████ (17)
10: (0)
```
### Top-10 by IG-fit per model — see `benchmark-2026-05-10-1424.json`
(Tables omitted from the curated report; available in the JSON dump
alongside this file.)

File diff suppressed because it is too large Load diff

72
docs/known-issues.md Normal file
View file

@ -0,0 +1,72 @@
# Known Issues
Catalog of recurring or upstream-blocked failure modes with their
mitigations. Anything that requires a manual workaround should be
documented here — if a future session can hit the same issue, it
deserves an entry. Each entry should have: symptom, root cause, current
mitigation, and the trigger that lets us un-mitigate.
---
## 2026-05-17 — NVIDIA GPU driver fails on Ubuntu 26.04 (kernel 7.0.x)
**Symptom.** `nvidia-driver-daemonset-*` in `nvidia` namespace
CrashLoopBackOff on the GPU node. Logs say:
Could not resolve Linux kernel version
… or, post chart-upgrade, ImagePullBackOff on a `*-ubuntu26.04` tag.
**Root cause.** NVIDIA has not published any `nvcr.io/nvidia/driver:*-ubuntu26.04`
images (0 tags as of 2026-05-17; verified with skopeo). When a k8s node
running the GPU operator gets `do-release-upgrade`'d to Ubuntu 26.04
Resolute Raccoon, NFD relabels the node with
`feature.node.kubernetes.io/system-os_release.VERSION_ID=26.04` and the
operator computes the driver image tag `<version>-ubuntu26.04` — which
404s on pull. Both gpu-operator chart v25.10.1 and v26.3.1 exhibit the
same behaviour once NFD has detected 26.04.
**Current mitigation (active on k8s-node1 since 2026-05-17).**
1. Host kernel rolled back to `6.8.0-117-generic` (Ubuntu 24.04 HWE
kernel — still installed at `/lib/modules/6.8.0-117-generic`).
2. `apt-mark hold` on: `linux-image-6.8.0-117-generic`,
`linux-headers-6.8.0-117-generic`, `linux-modules-6.8.0-117-generic`,
`linux-image-generic`, `linux-headers-generic`, `linux-generic`.
3. `/etc/os-release` on k8s-node1 replaced with the Ubuntu 24.04 Noble
content (was a symlink to `/usr/lib/os-release`; now a regular file
under `/etc`). Backup at `/etc/os-release.bak-pre-spoof-2026-05-17`.
NFD-worker reads `/etc/os-release` and now reports
`system-os_release.VERSION_ID=24.04`, so the operator picks the
matching ubuntu24.04 driver image which DOES exist.
4. gpu-operator chart pinned to v25.10.1 in
`stacks/nvidia/modules/nvidia/main.tf`; driver pinned to 570.195.03
in `stacks/nvidia/modules/nvidia/values.yaml`.
**This is gross but stable.** The kernel matches what 24.04 ships, and
the `apt-mark hold` keeps it that way. /etc/os-release lying about the
OS only affects userland callers that key off it — none of our
deployed services do (we verified by grepping the cluster).
**Trigger to un-mitigate.** Periodically check for ubuntu26.04 driver
tags. Once they appear:
docker run --rm quay.io/skopeo/stable list-tags \
docker://nvcr.io/nvidia/driver \
| python3 -c "import json,sys; d=json.load(sys.stdin); \
print(len([t for t in d['Tags'] if 'ubuntu26.04' in t]))"
When that returns a non-zero count:
1. Restore `/etc/os-release` from backup
(`/etc/os-release.bak-pre-spoof-2026-05-17`) on k8s-node1.
2. Remove apt-mark holds for the kernel packages.
3. `apt full-upgrade` to land the latest 26.04 kernel + reboot.
4. Bump the gpu-operator chart pin to the matching version that ships
ubuntu26.04 driver images. Bump `driver.version` in values.yaml to
the current chart default.
**See also.** `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
for full incident timeline + the recovery procedure.
**Beads.** `code-8vr0` (P1, OPEN).

View file

@ -0,0 +1,265 @@
# Infra Audit — 2026-04-20
**Status**: Design (post-research, post-challenge)
**Author**: Viktor Barzin (audit run by Claude)
**Scope**: `infra/` Terragrunt stacks + platform services (`claude-agent-service`, `claude-memory-mcp`, `beadboard`, `broker-sync`)
**Goals**: Reliability · Declarative-first · Reduced maintenance overhead · Maintained scalability
**Method**: 5 parallel research agents (R1 Reliability, R2 Declarative, R3 Maintenance, R4 Scalability, R5 Security) → 91 raw findings → 2 independent challengers → filtered/corrected/ranked backlog below.
## Context
The home-lab has grown into a mature stack (105 Tier-1 Terragrunt stacks + 6 Tier-0 SOPS, CNPG, Vault+ESO, Kyverno, Traefik, Authentik, CrowdSec, Woodpecker CI, Redis-Sentinel, MySQL-standalone, Proxmox-NFS). Recent work has been consolidation: MySQL InnoDB-Cluster → standalone (2026-04-16), Redis Phase 7 refactor (2026-04-19), NFS fsid=0 SEV1 post-mortem (2026-04-14), Authentik outpost /dev/shm fix (2026-04-18). This audit surveys everywhere that remains — what's brittle, what's manual, what's dark, what hasn't caught up to recent decisions — and ranks fixes by impact and by operator fatigue.
## Corrections up-front (challenger round)
Before reading the backlog, these findings from the research phase are **dropped, corrected, or reframed** — challengers spot-checked live state and proved them wrong, already-solved, or intentional-by-design. Being honest about this is the point of the challenge round:
| Finding as stated | Actual state | Action |
|---|---|---|
| R4#1: Worker nodes 86-91% memory saturation | Live `kubectl top nodes`: 44-51% across k8s-node{1-4} | **DROPPED** — bad metric pull |
| R4#2: Frigate CPU unbounded (1.5 CPU request, no limit) | Cluster policy is **all CPU limits removed** to avoid CFS throttling (`infra/.claude/CLAUDE.md` → Resource Management) | **DROPPED** — by design |
| R4#7: Redis no `maxmemory-policy` | `infra/stacks/redis/modules/redis/main.tf:254` sets `maxmemory-policy allkeys-lru` (Phase 7, 2026-04-19) | **DROPPED** — already solved |
| R2#1: 307 Kyverno lifecycle markers is a drift risk | Markers are the **canonical discoverability tag**`ignore_changes` only accepts static attribute paths, snippet convention is the only viable path; reframe as *"markers are fine, missing markers are the risk"* | **REFRAMED** |
| R2#3: 140 `ignore_changes` blocks | Actual: **310** across `.tf` files (2.2× off) | **CORRECTED** |
| R3#10: 65 CronJobs | Actual: 59 (10% off) | **CORRECTED** |
| R1#1: 47 deployments missing probes | Actual: **115 missing at least one probe; 103 missing both** | **CORRECTED (much worse than reported)** |
| R1#9: MySQL standalone no HA/PDB | Intentional post-2026-04-16 migration from InnoDB Cluster. Backup + restore matter; HA is explicit deferred. | **REFRAMED** — split into HA (deferred) / backup-restore (open) / connection pool (open) |
| R1#10: PDB gaps include Traefik, Authentik | Traefik & Authentik PDBs `minAvailable=2` exist (CLAUDE.md). The real gaps are **CrowdSec LAPI, Calico-apiserver, ESO webhook, Woodpecker-server** | **CORRECTED (list pruned)** |
| R5#2: 4 Kyverno security policies in Audit | **All 16 ClusterPolicies are in Audit** — zero in Enforce. | **CORRECTED (worse)** |
---
## Executive summary — top 5 cross-cutting themes
These are the themes that survive the challenge round and hit ≥2 concerns. Each headline is a 1-line hook; deep-dives below.
1. **Declarative escape hatches (NFS exports, master-node file provisioners, null_resource initializers)**`/etc/exports` is not in Terraform, which is the **root cause of the 2026-04-14 SEV1**; 6 null_resources + 3 SSH file provisioners still orchestrate critical state. *Hits R2 + R1 + R3.*
2. **Observability has blind spots where pain would actually come from** — no OOMKill alert routing, no NFS capacity monitor, no GPU utilization dashboard, no ESO refresh-lag alert, no CronJob success-rate summary. Alerts exist but they don't cover the operator's real failure modes. *Hits R1 + R3 + R4.*
3. **Supply-chain hygiene: image pinning + Renovate + admission signing** — 84 `:latest` tags in production TF, zero Renovate/Dependabot across 18 repos (~15 hr/mo toil by estimate), no cosign/trivy on push. Single theme unifies security posture, maintenance toil, and determinism. *Hits R3 + R5.*
4. **Reliability-probes & graceful shutdown are genuinely uneven** — 115 deployments missing at least one probe (incl. 103 missing both), 50+ Recreate deployments with no `terminationGracePeriodSeconds`/`preStop`. This is the quietly-largest reliability debt. *Hits R1 + R3 (pager toil).*
5. **Backup coverage is uneven: 30+ PVCs lack app-level CronJobs** — Proxmox host snapshots cover the disk, but Forgejo (!), Affine, Paperless, Hackmd, Matrix, Owntracks have no app-aware dumps. Restore granularity is file-level, not entity-level. *Hits R1 + R5 (compliance) + R3 (restore rehearsal toil).*
Honourable mentions that didn't make top 5 but sit just below: Kyverno audit→enforce transition (security), ESO refresh-lag alert (secrets reliability), Vault hardening (audit log offsite, root-token K8s-secret scope), Cloudflared tunnel-token SPOF (not replica SPOF — those are 3), Dolt PVC sizing + backup.
---
## Scoring method
Two parallel rankings — scan both.
**Rank A — Impact × Reversibility (the original formula)**
`score = Impact × (6 - Effort) × (6 - Risk)` — each dimension 1-5.
**Rank B — Operator fatigue weight**
`score = Impact × (6 - Effort) × FatigueWeight` where `FatigueWeight = 3` if the finding introduces *daily/weekly manual toil* and `1` otherwise. This re-ranks by how much pain the unfixed state causes per month.
Both rankings below. When they agree, that's the clear signal. When they diverge, that's where Rank B (fatigue) wins — Viktor has stated operator fatigue dominates abstract risk for a solo-operator lab.
---
## Ranked backlog (filtered, deduplicated, corrected)
Counts below reflect **post-challenge corrected numbers**. Every row has a reference verified either by a spot-check (file:line) or a live cluster command.
| ID | Title | Concerns | Impact | Effort | Risk | Rank A | Rank B | Refs |
|---|---|---|---:|---:|---:|---:|---:|---|
| F01 | NFS `/etc/exports` not in Terraform (SEV1 root cause) | R2+R1 | 5 | 3 | 2 | **60** | **45** | `infra/scripts/pve-nfs-exports`, PM 2026-04-14 |
| F02 | 115 deployments missing probes (103 missing both) | R1+R3 | 5 | 3 | 2 | **60** | **45** | `kubectl get deploy -A -o json` |
| F03 | Zero Renovate/Dependabot across 18 repos | R3+R5 | 4 | 2 | 1 | **80** | **48** | `find /home/wizard/code -name ".renovaterc*"` → 0 results |
| F04 | 84 `:latest` image tags in production TF | R3+R5+R4 | 4 | 2 | 2 | **64** | **48** | `grep -rn ':latest' infra/stacks` |
| F05 | No OOMKill / unschedulable / node-CPU alert | R1+R4+R3 | 5 | 3 | 1 | **75** | **45** | Grep Prometheus rules — no `OOMKilling` rule present |
| F06 | 6 `null_resource` DB initializers in `dbaas` stack | R2 | 4 | 3 | 3 | **36** | **36** | `grep -n null_resource infra/stacks/dbaas` |
| F07 | 3 SSH+file provisioners on k8s-master (audit, OIDC, etcd) | R2 | 4 | 3 | 3 | **36** | **36** | `stacks/platform/modules/rbac/apiserver-oidc.tf` |
| F08 | ESO refresh-lag alert missing (52 ExternalSecrets) | R1+R5+R3 | 4 | 2 | 1 | **80** | **48** | `stacks/external-secrets/` — no PrometheusRule for refresh lag |
| F09 | 30+ PVCs without app-level backup CronJobs | R1+R5 | 4 | 3 | 2 | **48** | **36** | Affine, Forgejo, Hackmd, Matrix, Owntracks, Paperless (no `*-backup` CJ) |
| F10 | Cloudflared tunnel-token SPOF (replicas OK, token shared) | R1+R5 | 3 | 4 | 2 | **24** | **8** | `stacks/cloudflared/` single tunnel credential |
| F11 | MySQL restore never rehearsed end-to-end | R1+R4+R3 | 4 | 2 | 2 | **64** | **48** | No `mysql-restore-drill` CJ; runbook untested post-migration |
| F12 | Kyverno policies all 16 in Audit — **sequence carefully** | R2+R5 | 4 | 3 | **4** | **24** | **24** | `kubectl get clusterpolicy` |
| F13 | 97 RollingUpdate deployments lack explicit surge bounds | R1 | 2 | 2 | 2 | **32** | **12** | TF defaults inherit from Helm/k8s (25%/25%) |
| F14 | CronJob success-rate dashboard + alert rollup missing | R3+R4 | 3 | 2 | 1 | **60** | **36** | `CronJobTooOld` rule — partial; no 24h rollup |
| F15 | Authentik outpost /dev/shm fix applied via Helm API only | R1+R5 | 3 | 2 | 2 | **48** | **48** | Not in TF — upgrade-reversion risk |
| F16 | Dolt (beads DB) no backup CronJob — 2Gi PVC near full | R1+R4 | 4 | 2 | 2 | **64** | **32** | `stacks/beads/` — no `dolt-backup` CJ |
| F17 | Vault StatefulSet `updateStrategy=OnDelete` (manual roll) | R1+R3 | 2 | 2 | 3 | **24** | **24** | `kubectl get sts -n vault -o yaml` |
| F18 | No NetworkPolicies cluster-wide | R4+R5 | 4 | **5** | **4** | **8** | **8** | `kubectl get netpol -A` → 0-2 |
| F19 | RBAC `oidc-power-user` has cluster-wide secrets r/w | R5 | 4 | 3 | 3 | **36** | **12** | `stacks/platform/modules/rbac/` |
| F20 | No image supply-chain verification (cosign, trivy on push) | R5 | 4 | 4 | 3 | **24** | **8** | No admission controller for signatures |
| F21 | Vault audit log offsite backup not configured | R5+R1 | 3 | 2 | 1 | **60** | **36** | `stacks/vault/` — no `audit-log-sync` CJ |
| F22 | Claude-agent, beadboard, broker-sync singletons | R1 | 2 | 2 | 2 | **32** | **12** | `kubectl get deploy -n claude-agent,beadboard,broker-sync` |
| F23 | 50+ Recreate deployments lack graceful-shutdown hooks | R1+R3 | 3 | 3 | 2 | **36** | **36** | `grep -L terminationGracePeriodSeconds stacks/**` |
| F24 | CoreDNS scaled via `kubectl scale` not TF | R2 | 3 | 2 | 2 | **48** | **32** | Command in runbook; no TF resource for replicas |
| F25 | GPU / inference-latency SLO unmonitored | R4+R5 | 3 | 3 | 2 | **36** | **36** | No dcgm dashboard; Frigate liveness checks only |
| F26 | Prometheus TSDB 200Gi — retention untracked | R4 | 2 | 2 | 1 | **40** | **20** | `stacks/monitoring/` |
| F27 | Pod Security Standards labels unset on all namespaces | R5 | 3 | 2 | 3 | **36** | **12** | `kubectl get ns -o json \| jq '.items[].metadata.labels'` |
| F28 | Authentik worker VPA upperBound 2.3× actual request | R4 | 2 | 2 | 2 | **32** | **20** | Goldilocks dashboard |
| F29 | 9 DB rotation targets, no post-rotation verification loop | R5+R3 | 3 | 2 | 2 | **48** | **36** | Vault DB engine every 7d; no auto-verify |
| F30 | Tier-0 SOPS workflow 7-step vs 3-step Tier-1 | R3 | 2 | 2 | 1 | **40** | **20** | `scripts/state-sync` — manual decrypt/encrypt/commit |
**Rank A leaders (top 8)**: F03, F08, F05, F11, F04, F16, F01, F02 — "big cluster wins, cheap to try"
**Rank B leaders (top 8)**: F03, F04, F08, F11, F15, F01, F02, F05 — "what's paining you weekly"
F03 (Renovate), F08 (ESO refresh alert), F11 (MySQL restore drill) and F01 (NFS in TF) lead in **both** rankings → these are the clear "do first" candidates.
---
## Per-concern deep dives
### R1 — Reliability (18 raw → 11 real after challenge)
Filtered: dropped R1#1/9/10 (incorrect numbers, intentional choices). What actually matters:
- **Probes (F02)** — 115 deployments missing at least one probe; 103 missing both. The corrected count is 2.4× the original claim. Worst offenders are batch workloads (CronJob-spawned) that legitimately skip probes — but long-lived ones (Affine, Hackmd, mailserver sidecars) genuinely need them. Triage: filter by `spec.replicas ≥ 1` and `containers[].command != ["/bin/sh","-c"]`-style short-runners, then add readiness+liveness one-by-one.
- **Cloudflared tunnel token SPOF (F10)** — Replicas are 3 (per CLAUDE.md), so the agent finding "SPOF" framed as replicas is wrong. The real SPOF is the *tunnel credential*. Secondary tunnel with weighted Cloudflare DNS records is the honest fix — medium effort, low urgency unless tunnel CA rolls keys.
- **PDB gaps (F13-like, excluded from table)** — After challenger correction, gaps are: CrowdSec LAPI (3 replicas, no PDB), ESO webhook+controller, Woodpecker-server. Not urgent — drain-test with `kubectl drain --dry-run` shows no current issue.
- **App-level backups (F09)** — Proxmox host captures the PVC contents nightly via LVM snapshot + rsync with `--link-dest` weekly versioning, so file-level recovery is covered. But for databases inside PVCs (e.g. Affine's Postgres in-pod, Paperless' SQLite), app-aware dumps give transactional consistency. Audit pass: enumerate every PVC without a sibling `*-backup` CronJob, add one for the ones that host embedded DBs.
- **MySQL restore drill (F11)** — Migrated 4 days ago. Runbook exists. End-to-end restore (dump → new DB → connect an app → verify) hasn't been rehearsed. SEV1 risk if a dump has been silently broken since migration.
- **Vault update strategy (F17)**`OnDelete` means helm upgrade leaves pods untouched; must manually `kubectl delete pod` to restart. Low impact (infrequent) but procedural toil.
- **Dolt PVC near-full + no backup (F16)**`bd list --status in_progress` runs against this DB; it's load-bearing for cross-session task state. Grow the PVC (resize annotation) + add dolt dump CronJob.
### R2 — Declarative Coverage & Drift (16 raw → 8 real)
Filtered: dropped R2#1 (Kyverno markers are by-design), corrected R2#3 to 310.
- **NFS exports (F01)** — The file is git-managed at `infra/scripts/pve-nfs-exports` but deployed via `scp + exportfs -ra`, not Terraform. This is the exact path that caused the 2026-04-14 SEV1 (fsid=0 on wrong exports line). Options: (a) `null_resource` with `local-exec scp + remote-exec exportfs -ra` triggered on hash of content (partial — SSH dep); (b) new module `pve_host_config` that templates and SCPs multiple PVE-host artifacts with checksum verification. (b) is the cleaner long-term fix.
- **Null-resource initializers (F06)** — 6 in `dbaas` (MySQL users, CNPG cluster, TF-state role, payslip DB, job-hunter DB). Some are genuinely unavoidable (bootstrapping DB before the DB exists); others could use `postgresql_grant` / `mysql_user` providers.
- **SSH file provisioners on k8s-master (F07)**`apiserver-oidc.tf`, `audit-policy.tf`, `etcd tuning`. One-way sync, no drift detection. Proposed quick wins (per `2026-02-22-node-drift-quick-wins-design.md` already exists). Continue/finish the plan.
- **CoreDNS scaling manual (F24)** — Current runbook uses `kubectl scale`/`set env`/`set affinity`. Drift-prone; convert to `kubernetes_deployment` TF resource overriding the Helm chart's scale/affinity fields.
- **MySQL InnoDB Cluster + operator TF resources still present** — Phase 4 cleanup. Low urgency, but removing reduces cognitive load on anyone reading `stacks/dbaas/`.
- **Technitium readiness-gate null_resource with `timestamp()` trigger** — Runs every apply, 3-6 min wall time. Replace with a real health-check on `terraform_data` with `triggers_replace = { checksum = sha256(config) }`.
- **GPU node taints + Proxmox CSI labels via null_resource kubectl** — No drift detection. Fix is in the `2026-02-22-node-drift-quick-wins-design.md` plan.
### R3 — Maintenance overhead (18 raw → 10 real)
- **Renovate (F03)** — The single highest-leverage maintenance fix. 18 repos × ~0.8 hrs/month manual version sweep = real time. Add `.github/renovate.json` (grouping rules for Terraform providers, K8s provider, Docker images) + auto-merge patch-level. Start with `infra/` only; expand after 2 weeks.
- **Image pinning (F04)** — 84 `:latest` tags in production TF. Root CLAUDE.md still says "use 8-char git SHA tags" but that's not enforced. Admission control via Kyverno `require-trusted-registries` is in Audit today — add a sibling policy `forbid-latest-tag` also in Audit. Separate from F03 because pin-to-SHA + Renovate is a synergistic pair.
- **MySQL restore drill (F11)** — tracked under R1 for impact; also a maintenance item because the restore *procedure* has not been test-updated since migration.
- **CronJob alert rollup (F14)** — 59 CronJobs; "which were healthy last 24h" takes ad-hoc `kubectl get jobs --sort-by` scrolling. Add a Grafana panel with `kube_cronjob_status_last_successful_time < now - 2×schedule` summary.
- **Graceful-shutdown toil (F23)** — 50+ Recreate deployments without `terminationGracePeriodSeconds` or `preStop`. Noisy pager hits after node drain. One-off sweep: add a 30s `terminationGracePeriodSeconds` default via Kyverno mutation rule.
- **Tier-0 SOPS workflow (F30)** — 7-step decrypt/edit/encrypt/commit vs Tier-1's 3-step. Combined `tg` wrapper flag `--edit <stack>` that auto-decrypts → EDITOR → auto-encrypts → commit in one command. Moderate win; low risk.
- **Stale `in_progress` beads** — 7 stale tasks in `bd list --status in_progress` at audit start. Session-end hook checks this; 3-5 days without notes is the signal. CLAUDE.md covers the rule — it's followed-sometimes, not enforced.
- **Runbook staleness** — no `last_reviewed` frontmatter on runbook MDs; trivial to add. One-off sweep then keep it honest.
- **CI/CD template unification** — "GHA build → Woodpecker deploy" is the documented pattern for 10 repos; rest still on Woodpecker-only. Track as follow-ups per repo in `bd`.
- **Kyverno DNS-config boilerplate 307 markers** — Not a problem (see correction at top). Do add a lint rule in CI that flags any `kubernetes_deployment` without `# KYVERNO_LIFECYCLE_V1` marker; that's the real drift risk.
### R4 — Scalability (18 raw → 9 real)
Filtered: dropped R4#1 (metric mispull), R4#2 (CPU-limit policy), R4#7 (Phase 7 solved).
- **CNPG memory headroom** — Currently 2Gi limit. Top-line metric at quiet time; add a `ContainerNearOOM > 85%` rule that watches CNPG specifically (general rule exists; CNPG is Tier 0 so deserves explicit binding).
- **HPA cluster-wide: zero** — Every stateless service is 1:1. Not urgent at current node-CPU 8-31%, but one big feature (Immich re-index, Authentik load spike) tips the balance. Pilot: HPA on Traefik (CPU-driven), observe, expand.
- **Redis no HPA + HAProxy singleton** — Wire Sentinel into direct client access (Phase 8 of Redis refactor, per R1#11 of raw findings). Currently all 17 consumers go via HAProxy — the single-point bypass was deliberate (simpler client config), but the HAProxy is now the SPOF Sentinel was meant to prevent. Worth a plan doc (`plans/2026-MM-DD-redis-phase8-sentinel-clients.md`).
- **PgBouncer pool sizing unknown** — Authentik has 3 pods, each opening N connections. At load spikes (big org sync), pool exhaustion. Short-term: `pgbouncer_show_pools` metric + alert at 80% util. Longer-term: pool-size tuning based on observed wait times.
- **Prometheus TSDB (F26)** — 200Gi retention unquantified. Risk: disk fills → scrape gaps → audit blind. Add `kubelet_volume_stats_used_bytes{persistentvolumeclaim="prometheus-server"} > 0.85 * capacity` alert.
- **NFS capacity not monitored** — PVE host has 1TB HDD LV. No `node_filesystem_avail_bytes` scrape from PVE host (it's outside the cluster). Install node_exporter on PVE host; scrape via Prometheus federation or remote_write.
- **VPA quarterly review unscheduled** — Goldilocks is in `Initial` mode (not Auto, by design). Review is manual per quarter. Calendar event + runbook link.
- **Registry single instance** — Registry outage = no pod restarts. Post-mortem 2026-04-19 documented a container-engine pin; replica count still 1. Consider HA registry backed by S3-compat store (MinIO in-cluster) for the second replica — but low urgency given probe CJ monitors integrity every 15m.
- **No ResourceQuota utilization alert** — Quota exhaustion invisible until a pod refuses to schedule. `kube_resourcequota{type="used"} / kube_resourcequota{type="hard"} > 0.85` rule.
### R5 — Security & Secrets (21 raw → 13 real)
- **Vault `vault-unseal-key` K8s Secret (F21-related)** — Challenger A said it wasn't present; it is (`kubectl get secret -n vault`). Used by auto-unseal. RBAC on the secret should restrict to `vault-server` SA only. Audit the `role` + `rolebinding` in `stacks/vault/`.
- **Vault audit log offsite (F21)** — Rotated logs not synced to NFS backup. Add a `vault-audit-log-sync` CronJob or append the audit log path to `nfs-change-tracker` inotify list (zero-Terraform change if the latter).
- **Kyverno audit → enforce (F12) — sequence carefully** — All 16 policies are in Audit today. Naive switch to Enforce will block legitimate workloads (Loki, Frigate, nvidia-device-plugin, wireguard have privileged/host-ns requirements — all documented). Plan: (a) generate `Kyverno PolicyException` CRs for known-good workloads first; (b) enforce one policy at a time, 1-week observation; (c) start with `require-trusted-registries` (least breakage risk). **DANGEROUS TO EXECUTE NAIVELY — don't batch.**
- **No NetworkPolicies (F18)** — Challenger correctly flagged the effort (5) and risk (4): wrong NetworkPolicy stops Authentik from reaching its DB in minutes. Approach: allow-list namespace-wide first (e.g. `authentik` ns can reach `dbaas` on 5432), expand over a month. Single biggest latent security improvement but needs runway.
- **RBAC oidc-power-user secrets r/w cluster-wide (F19)** — Scope down: list which Authentik groups get this binding, remove `secrets:*` from the cluster role, add namespace-scoped RoleBindings where needed. Medium effort, high leverage.
- **Image supply chain (F20)** — cosign verification + admission controller is the mature path. Trivy-on-push fits in GHA workflows. Both unblocked after F04 (pinning).
- **`:latest` tags (overlap F04)** — Security aspect: signed-image admission requires stable refs.
- **Privileged containers** — Loki, WireGuard, NVIDIA, Frigate known-exceptions. Document the exceptions inline (comment block on the TF resource) so future maintainers don't accidentally "fix" them.
- **Git history plaintext secrets** — Challenger B flagged unverified. One way to verify cheaply: `git secrets --scan-history`. Add it as a pre-audit one-off.
- **CrowdSec Metabase disabled, no Prometheus exporter** — R5#18. Enable the Prometheus exporter (no Metabase) for attack-pattern visibility; very cheap.
- **cert-manager evaluation paused** — Documented pause; TLS rotation relies on Cloudflare wildcard. Confirm no local `Ingress` uses a self-managed cert that could expire silently. `kubectl get cert -A` → expect 0.
- **Pod Security Standards (F27)** — Label every namespace `pod-security.kubernetes.io/enforce=restricted` (or baseline). Known-exception namespaces get explicit downgrades. Medium effort, paid back by making future admission decisions uniform.
- **CrowdSec LAPI quorum** — 3 replicas but quorum/consensus behavior undocumented. One-page runbook: what happens if 1, 2, or 3 LAPI pods die.
- **Authentik outpost fix (F15)** — Applied via API, not TF. Next Helm upgrade reverts. Add the `/dev/shm` emptyDir to `stacks/authentik/values.yaml` templatefile.
---
## Dangerous-to-execute (handle with care)
Flagged by challengers; each needs a gradual rollout plan, not a single commit.
1. **F12 — Kyverno Audit → Enforce en masse**. Write `PolicyException` CRs for known-safe workloads first. One policy per week. Observe.
2. **F18 — NetworkPolicies cluster-wide**. Default-deny breaks inter-namespace lookups silently. Namespace-by-namespace rollout, with `kubectl logs -f` tailing the policy-engine events.
3. **PDB additions without drain-test**. New PDB + tight `minAvailable` can deadlock during node cordons. `kubectl drain --dry-run` every new PDB on every node first.
4. **F20 — Signed-image admission**. Must follow F04 (pinning). Un-pinned admission = half the cluster fails to pull.
## Gaps the agents missed
From challenger "GAPS" analyses, collated:
- **Disaster-recovery drill coverage** — backup docs are comprehensive (CLAUDE.md is extensive). End-to-end *restore* rehearsal frequency = never documented. Track per-component: MySQL, PostgreSQL/CNPG, Vault, etcd, NFS, registry blobs.
- **Service mesh evaluation** — Never formally evaluated (Istio, Linkerd, Cilium-in-mesh-mode). Could subsume NetworkPolicy effort + mTLS + observability. Worth a design doc even if answer is "no, too much complexity for the gain."
- **Chaos engineering coverage** — Zero. No pod-kill cron, no node-failure drill. Low urgency given maturity, but would validate F02 probe quality and F23 graceful-shutdown coverage cheaply.
- **Operator onboarding friction** — Nobody else in the "lab team" but Emo exists in `claude-agent-service`. If Emo needs to take over a component for a week, what's the runbook?
- **Alert noise / fatigue rate** — No finding measured how many alerts actually page vs. auto-resolve. `alertmanager_notifications_total` by receiver is the metric; needs a Grafana panel.
- **Secrets-in-image-layers** — Docker images built locally may contain secrets from build env. `trivy image --scanners secret` on registry images is a one-off audit.
- **Runbook → post-mortem → runbook-update loop** — Post-mortem 2026-04-14 produced runbook updates; no general tracker that every incident produces a runbook change.
## Alternative framings (from challengers, preserved for future reference)
- **Split "MySQL singleton" into 3 items** (HA / backup / pool). Accepted — see R1 and R4 treatment.
- **6th concern: Observability & Pager Fatigue** — Considered; the themes already hit R1+R3+R4 under Theme 2 of the executive summary. Keeping 5 concerns but carving "Observability gaps" as a theme, not a new research axis.
- **One-thing-this-weekend**: Challenger B nominated *NFS in Terraform*, Challenger A nominated *`:latest` tag sweep*. F01 wins on SEV1 prevention; F04 wins on toil. Both valid. Pick by energy level: F01 is 1 deliberate session; F04 is low-cognition grep-replace.
- **Re-rank by operator fatigue (Rank B) always**. Partially accepted — presented side-by-side in the table.
---
## Recommended next moves
Ordered for a solo operator balancing SEV-prevention, fatigue reduction, and preserved energy for larger work:
**Week 1 (SEV-prevention + quick-wins, low cognitive load):**
- F01: NFS exports into a `pve_host_config` Terraform module (one deliberate session)
- F04: Sweep `:latest` tags, add Kyverno `forbid-latest-tag` in Audit
- F08: ESO refresh-lag PrometheusRule
- F05: OOMKill / Unschedulable / Node-CPU PrometheusRule
**Week 2 (fatigue reduction):**
- F03: Renovate in `infra/` only (narrow pilot)
- F14: CronJob success-rate Grafana panel + alert rollup
- F16: Dolt backup CronJob + PVC grow
- F11: First MySQL restore drill (scheduled, documented)
**Month 2 (durable fixes, gradual):**
- F06/F07: Replace null_resources + SSH provisioners with native TF resources, one at a time
- F02: Probe sweep — add readiness+liveness to the 20 long-lived deployments first
- F12: Kyverno Enforce transition, one policy per week
- F15: Authentik outpost /dev/shm into values.yaml
**Month 3+ (structural):**
- F18: NetworkPolicies — namespace-by-namespace
- F19: RBAC scope-down
- F20: Signed-image admission
- Service-mesh evaluation (design doc)
- Restore-drill calendar for every backup target
No beads tasks auto-filed by this audit — user decides which findings merit `bd create`.
---
## Appendix — verification references (spot-checked)
Every numeric claim in the backlog was confirmed by one of these commands at audit time (2026-04-20):
| Claim | Command | Result |
|---|---|---|
| Node memory 44-51% | `kubectl top nodes --no-headers` | k8s-node1: 45%, node2: 51%, node3: 49%, node4: 44%, master: 17% |
| 115 deploys missing ≥1 probe | `kubectl get deploy -A -o json \| jq '[.items[] \| select(.spec.template.spec.containers[0].readinessProbe == null or .spec.template.spec.containers[0].livenessProbe == null)] \| length'` | 115 |
| 103 deploys missing BOTH probes | same, with `and` | 103 |
| 310 ignore_changes blocks | `grep -r "ignore_changes" infra --include=*.tf --include=*.hcl \| wc -l` | 310 |
| 59 CronJobs | `kubectl get cronjobs -A --no-headers \| wc -l` | 59 |
| All 16 Kyverno ClusterPolicies in Audit | `kubectl get clusterpolicy -o jsonpath='...validationFailureAction...'` | 16/16 Audit, 0 Enforce |
| Redis `maxmemory-policy allkeys-lru` | `grep -n maxmemory-policy infra/stacks/redis` | `modules/redis/main.tf:254` |
| Zero Renovate configs | `find /home/wizard/code -name '.renovaterc*' -o -name 'renovate.json' \| grep -v node_modules` | 0 |
| Vault `vault-unseal-key` Secret exists | `kubectl get secret -n vault` | present (37d old) |
| NFS `/etc/exports` not in TF | `grep -rn 'fsid=' infra/stacks` | 0 matches; only `infra/scripts/pve-nfs-exports` |
| Frigate CPU limit by policy | `infra/.claude/CLAUDE.md` → "All CPU limits removed cluster-wide" | confirmed |
| MySQL standalone intentional | `infra/.claude/CLAUDE.md` → "migrated from InnoDB Cluster 2026-04-16" | confirmed |
Other claims (84 `:latest` tags, 52 ExternalSecrets, 30+ PVCs without backup CJs) were surfaced by research agents; challengers spot-checked a subset and agreed the order-of-magnitude holds. Full list in `/home/wizard/.claude/plans/let-s-run-a-thorough-floating-pnueli.md` research digest.
## Deliverable disposition
- This document is the audit output.
- No `bd` tasks were created by the audit. Pick findings to ticket after reading.
- When filing: use `F##` as a tag, title with the finding's headline, acceptance criteria from the deep-dive paragraph, priority from Rank B.
- Plan file at `~/.claude/plans/let-s-run-a-thorough-floating-pnueli.md` retains the full 91-finding digest + challenger reports for reference; can be deleted after any follow-up tickets are filed.

View file

@ -0,0 +1,165 @@
# Auto-Upgrade Apps Design
**Date**: 2026-05-16
**Status**: Approved (brainstorm + grill complete; implementation pending)
## Problem
Three constraints in tension across the cluster's ~70 services:
1. **Keep apps at latest.** Most services drift behind upstream; manual bumps don't scale.
2. **Stay Terraform-compatible.** Image refs live in `.tf`; we want declarative source of truth.
3. **Don't let the pull-through cache serve stale `:latest`.** Cache layer must not lie about what `:latest` means today.
The previous `Diun → n8n → Service Upgrade Agent` flow handled (1) via changelog-reviewed PR bumps for third-party. Self-hosted services have inconsistent CI: 1 of 11 fully wired (CI builds + pushes + rolls out), 6 partially wired (build but no rollout trigger), 4 with no CI at all. Self-hosted services typically pull `forgejo.viktorbarzin.me/viktor/<name>:<8-char-sha>` with Terraform tracking each SHA in `var.image_tag`.
The user wants to simplify by retiring the changelog-review agent and moving to a pure "latest, always" model, with the cache freshness concern handled at the cache layer (already done — see Architecture §1).
## Decisions
| # | Decision | Notes |
|---|----------|-------|
| 1 | **Auto-roll for everything** (no PR-bump gate) | Retires the Service Upgrade Agent; Diun's role narrows to notification only |
| 2 | **Actuator: Keel** ([keel.sh](https://keel.sh)) | Annotation-driven Deployment/StatefulSet/DaemonSet auto-update operator |
| 3 | **Tag scheme: `:latest` where it exists, `:major` where it doesn't, glob+`ignore_changes` last resort** | `keel.sh/policy: force` for `:latest` / `:major`; tag string stays in Terraform |
| 4 | **Opt-out-pure (no skip-list)** | Every workload auto-rolls, including Vault, CNPG, operators, CNI, CSI. User accepts recoverability risk |
| 5 | **Phased rollout (9 phases)** | Low-risk → bootstrap. Catch up to latest as we phase in. Each phase soaks ~1 week |
| 6 | **Per-phase: single combined PR** | Switch image refs to floating tag + add to Kyverno mutate allowlist in same commit |
| 7 | **Diun is the audit source for catch-up** | Existing 6h-poll already reports outdated images; export as worklist per phase |
| 8 | **Polling, hourly** (`@every 1h`) | Not webhooks — single mechanism, all registries supported |
| 9 | **Rollback: `kubectl rollout undo` → pin in Terraform → add `keel.sh/policy: never`** | (c) from grill: immediate undo, durable Terraform pin within ≤1h before next Keel poll |
| 10 | **Implementation: Kyverno cluster-wide mutate** | One `ClusterPolicy` injects Keel annotations; phase boundary = `NamespaceSelector` allowlist |
| 11 | **Keel exempt from its own mutate** | One-line `NamespaceSelector` exclusion. Supervisor self-update has uniquely bad failure mode |
| 12 | **Uniform CI model for all self-hosted** | CI builds + pushes `:latest`, Keel polls and rolls. No per-repo `kubectl set image` step. Retires the GHA-migrated SHA-tag flow (memory id=388) |
## Architecture
### 1. Cache freshness — already correct
Pull-through cache at `10.0.20.10` already splits caching by URL at the nginx layer:
- `location ~ /v2/.*/blobs/``proxy_cache_valid 200 24h` — blobs cached (content-addressed, immutable)
- `location /v2/` (manifests) → pass through, no cache
Combined with `registry.proxy.ttl: 0` at the docker-registry layer, mutable manifests revalidate against upstream on every pull. **No cache changes needed for this design.** The CLAUDE.md note "Use 8-char git SHA tags — `:latest` causes stale pull-through cache" predates the nginx URL-split fix and should be updated as part of this work.
### 2. Detection — Keel polls upstream
Keel runs as a Deployment in its own namespace. Every annotated workload polls its registry hourly (Keel-managed; configurable per workload). On detection of a new digest under the watched tag:
- `keel.sh/policy: force` (for mutable tags `:latest`, `:16`, `:7`, etc.) → trigger Deployment update (pod template hash changes → restart)
- `keel.sh/policy: minor` / `major` / `glob` (only for images that publish neither `:latest` nor a stable floating tag) → rewrite tag string on the Deployment; requires `lifecycle { ignore_changes = [...image] }`
### 3. Application — kubelet pull through the cache
When Keel triggers restart:
1. kubelet asks the cache (via containerd hosts.toml) for `image:tag` manifest.
2. nginx passes the manifest request through to the docker-registry layer.
3. docker-registry (with `proxy.ttl: 0`) passes through to upstream.
4. Upstream returns current digest.
5. kubelet pulls blobs (mostly cached at nginx layer; new blobs from upstream).
6. New pod runs new image.
### 4. Annotation injection — Kyverno mutate
Single `ClusterPolicy` adds these annotations to every Deployment / StatefulSet / DaemonSet in opted-in namespaces:
```yaml
metadata:
annotations:
keel.sh/policy: force
keel.sh/trigger: poll
keel.sh/pollSchedule: "@every 1h"
```
Phase = a `match.any[].resources.namespaces` list. Phase advance = append namespaces. Keel namespace is excluded.
### 5. Terraform drift handling
Existing convention (`# KYVERNO_LIFECYCLE_V1` marker) handles `dns_config` injection. We extend with a new marker:
```hcl
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
```
This is added per workload as we phase in. Mechanical, grep-able.
## Phase ordering
| Phase | Set | Rationale |
|-------|-----|-----------|
| 0 | Foundation (Keel install, Kyverno ClusterPolicy with empty allowlist) | Build infra without enrolling anything |
| 1 | Self-hosted (forgejo-hosted: ~11 services) | We own the code; failures are easy to diagnose |
| 2 | Stateless third-party web apps (linkwarden, postiz, affine, etc.) | No migrations |
| 3 | Exporters, sidecars, utilities | Stateless |
| 4 | Stateful-but-tolerant (Grafana, Prometheus, etc.) | Restart-safe state |
| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk |
| 6 | Authentik | Auth outage |
| 7 | Operators (cnpg-operator, ESO, kured, descheduler) | Operator skew |
| 8 | Critical infra (Calico, proxmox-csi, nfs-csi, traefik, metallb) | Node-level outage potential (memory id=390: 26h Calico cascade) |
| 9 | Bootstrap (Vault, CNPG PG cluster, mysql-standalone) | Lose recoverability if broken |
Per-phase: combined PR → apply (catch-up rolls happen) → soak 1 week → next phase. If a service breaks repeatedly, apply rollback runbook (decision #9) and proceed; re-enroll later or leave pinned.
## Risk register
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Bad upstream image rolls into prod | High | Service-level outage | Existing alerts (`KubePodCrashLooping`, `KubeletImagePullErrors`, `PodsStuckContainerCreating`); rollback runbook (decision #9) |
| Catch-up rollout overwhelms cache | Medium | ImagePullBackOff cascade (memory id=603) | Rate-limit catch-up to ~5 rollouts/6h via `-target=` per phase; same pacing as retired Service Upgrade Agent (memory id=612) |
| Calico / CSI auto-roll cascades (memory id=390: 26h outage) | Low-Medium | Cluster-level outage | Phase 8 is intentionally late; user opted into the risk; rollback to pinned chart version via Terraform |
| Vault auto-rolls to broken image | Low | Loss of secrets sync; 43 ExternalSecrets stop reconciling | Phase 9 last; Tier 0 SOPS state allows manual recovery |
| CNPG PG cluster auto-rolls to broken image | Low | Tier 1 Terraform state inaccessible; 105 stacks can't apply | Phase 9 last; Tier 0 stack `cnpg` is bootstrap-capable |
| Helm-atomic-trap services (memory id=981) | Medium | `terraform apply` hangs in pending-rollback | Identify `helm_release` services with `atomic = true`; either remove atomic or skip from Keel |
| Keel itself rolls to broken version | Low | Supervisor down; no auto-rolls until manual pin | Decision #11: exempt Keel from mutate |
| Terraform drift after Kyverno injects annotation | High at first | Spurious diffs on every plan | KYVERNO_LIFECYCLE_V2 marker (Architecture §5); applied incrementally per phase |
## What we give up
- **Terraform no longer tracks deployed version.** Image refs in `.tf` say `:latest` or `:16`, but the running digest is whatever Keel pulled. To know what's running: `kubectl describe pod`. This is a deliberate trade — the previous SHA-pinned flow tracked version in TF but required N stack edits per deploy.
- **No changelog review before rollout.** The Service Upgrade Agent's risk classification is gone. We rely on alerts to catch breakage post-deploy, not prevent it.
- **CLAUDE.md SHA-tag rule is reversed for this design.** The "use 8-char git SHA tags" rule predates the nginx URL-split fix. New rule (post-rollout): "use floating tags + Keel annotation" — to be updated in both `infra/.claude/CLAUDE.md` and the repo-root `CLAUDE.md` once Phase 1 is stable.
## Decisions resolved post-grill
### Q1 — Uniform CI model for ALL self-hosted (resolved 2026-05-16)
Every self-hosted service moves to the same shape:
```
CI (GHA or Woodpecker) → build → push :latest (optionally also :<SHA> for traceability) → done
Keel → poll registry → detect new digest → trigger rollout
```
The 10 GHA-migrated repos (memory id=388: Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints) drop the `Woodpecker API → kubectl set image` step. Their `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` files become obsolete; remove during Phase 1.
Terraform image refs for all self-hosted: `<registry>/<repo>:latest` (with `${var.image_tag}` defaulting to `"latest"` where the variable exists).
### Q2 — No-CI self-hosted services (resolution: uniform participation)
| Service | Action |
|---------|--------|
| `wealthfolio` | Switch Terraform to upstream `wealthfolio/wealthfolio:latest` (DockerHub). No CI needed. |
| `chrome-service` | Verify whether `:v4` is a deliberate pin. If yes → tag stays, add `keel.sh/policy: never` label. If no → switch to `:latest` or `:major`. Investigate during Phase 1 prep. |
| `beadboard` (used by `beads-server`) | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
| `freedify` | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
## Open questions (still need resolution before Phase 1)
1. **`helm_release atomic = true` services**: count and identify before Phase 1. Either remove `atomic` (preferred — eliminates the memory id=981 trap), or skip from Kyverno mutate via per-namespace exclusion. Survey command: `grep -rn 'atomic.*true' infra/stacks/ infra/modules/`.
## Out of scope
- Cache TTL changes — current config is already correct (nginx URL-split).
- Webhook-based Keel triggers — polling is sufficient for this cadence.
- Replacing Diun — kept for notification visibility into new tags not yet under Keel annotation (during phase rollout).
- Keel approval gate (`keel.sh/approvals: N`) — user wants unattended auto-roll.
- Keel auto-rollback on health-check failure — out of scope for v1; revisit if breakage rate is high.

View file

@ -0,0 +1,322 @@
# Auto-Upgrade Apps Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Move the cluster from a mix of pinned-SHA / pinned-semver / ad-hoc `:latest` references to a Keel-driven auto-update model where every workload tracks `:latest` (or a chosen `:major` floating tag) and rolls automatically when upstream advances.
**Architecture:** Kyverno cluster-wide `ClusterPolicy` mutates Deployments / StatefulSets / DaemonSets in opted-in namespaces with Keel annotations (`keel.sh/policy: force`, `keel.sh/trigger: poll`, `keel.sh/pollSchedule: @every 1h`). Keel polls registries, triggers rollout on new digest. kubelet pulls fresh manifest via the nginx URL-split cache (manifests passthrough, blobs cached). Phase advance = expand the `NamespaceSelector` allowlist.
**Tech Stack:** Keel, Kyverno, Terraform / Terragrunt, Helm, Diun (notification only), nginx, docker/distribution
**Design doc:** `docs/plans/2026-05-16-auto-upgrade-apps-design.md`
**Key context:**
- Cache is already correctly configured (nginx URL-split + `proxy.ttl: 0`). No cache changes needed.
- Per-stack `lifecycle.ignore_changes` is already required for the existing `dns_config` Kyverno mutation (KYVERNO_LIFECYCLE_V1 convention). This plan extends it with a V2 marker for Keel annotations.
- Service Upgrade Agent (Diun → n8n → claude bumps tfvars) is retired by this design. n8n workflow + supporting scripts are removed once Phase 9 completes.
- CLAUDE.md "use 8-char git SHA tags" rule is reversed by this design (see Open Q1 in design doc).
---
## Phase 0 — Foundation
### Task 0.1: Resolve remaining open question
Q1 and Q2 from the design doc are resolved (uniform `:latest` + Keel model for all self-hosted; per-service plan for no-CI services).
Remaining open question:
**Helm-atomic services.** Survey:
```bash
grep -rn 'atomic.*true' /home/wizard/code/infra/stacks/ /home/wizard/code/infra/modules/
```
For each match: either remove `atomic = true` (preferred) or add the namespace to a Kyverno exclusion list. Document inline before Phase 1 proceeds.
---
### Task 0.2: Create the Keel stack
**Files:**
- Create: `stacks/keel/terragrunt.hcl`
- Create: `stacks/keel/main.tf`
- Create: `stacks/keel/variables.tf`
- Create: `stacks/keel/modules/keel/main.tf`
**Step 1:** Add `keel` to `terragrunt.hcl` `locals.tier0_stacks`**NO**. Keel is Tier 1 (depends on Kyverno + Keel image registry access). Keep it in Tier 1.
**Step 2:** Deploy via Helm chart `keel-hq/keel` (verify current version via context7 before pinning).
Key Helm values:
- `polling.enabled: true`
- `helmProvider.enabled: false` (we use annotations, not Helm hooks)
- `notifications.slack.enabled: true` with channel `#deployments` (verify channel exists)
- Registry credentials: mount Forgejo PAT from Vault via ExternalSecret (`secret/viktor/forgejo_pull_token`).
**Step 3:** Verify Keel can authenticate to all five registries (Docker Hub, ghcr, quay, k8s.io, kyverno via the local cache; Forgejo direct).
**Acceptance:**
- `kubectl -n keel get pod` shows Keel Ready.
- `kubectl -n keel logs deploy/keel | grep registry` shows successful manifest queries.
---
### Task 0.3: Author the Kyverno ClusterPolicy
**Files:**
- Create: `stacks/kyverno/modules/kyverno/keel-annotations.tf` (or extend `security-policies.tf`)
ClusterPolicy `inject-keel-annotations`:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: inject-keel-annotations
spec:
background: true
rules:
- name: add-keel-annotation
match:
any:
- resources:
kinds: [Deployment, StatefulSet, DaemonSet]
namespaces: [] # populated per phase
exclude:
any:
- resources:
namespaces: ["keel"] # decision #11
- resources:
# Workloads can opt out by setting this label
selector:
matchLabels:
keel.sh/policy: never
mutate:
patchStrategicMerge:
metadata:
annotations:
+(keel.sh/policy): force
+(keel.sh/trigger): poll
+(keel.sh/pollSchedule): "@every 1h"
```
- `+()` syntax adds only if not present (preserves per-workload overrides).
- `exclude.selector.matchLabels[keel.sh/policy=never]` is the per-workload escape hatch (used during rollback per decision #9).
**Step 2:** Initially deploy with `namespaces: []` — policy exists but matches nothing.
**Acceptance:**
- `kubectl get clusterpolicy inject-keel-annotations` shows Ready.
- `kubectl get deploy -A -o yaml | grep keel.sh/policy` shows no matches yet (empty allowlist).
---
### Task 0.4: Define the KYVERNO_LIFECYCLE_V2 marker convention
**Files:**
- Modify: `AGENTS.md` — add the V2 snippet to the "Kyverno Drift Suppression" section
- Modify: `.claude/CLAUDE.md` — reference the V2 marker
Snippet to copy-paste:
```hcl
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
```
Backfill order: per-phase, only on workloads about to be enrolled. Not a mass sweep.
---
## Phase 1 — Self-hosted (uniform model)
**Set:** all self-hosted services. Three sub-categories:
- **Woodpecker-build-only (6):** `claude-agent-service`, `fire-planner`, `job-hunter`, `payslip-ingest`, `recruiter-responder`, `claude-memory-mcp`.
- **GHA-migrated (10, per memory id=388):** Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints. (Note: claude-memory-mcp appears in both lists — verify.)
- **No-CI (4, per design Q2):** `wealthfolio` (→ upstream), `chrome-service` (verify pin intent), `beadboard` (add CI), `freedify` (add CI).
- **Already-uniform (1):** `kms-website` — already pushes `:latest` AND SHA; just needs Keel annotation.
### Task 1.1: Audit current image refs
```bash
grep -rE 'image\s*=\s*"(forgejo\.viktorbarzin\.me|viktorbarzin)' /home/wizard/code/infra/stacks/ | sort
```
Tabulate per service: current tag, CI type (GHA / Woodpecker / none), action needed.
### Task 1.2: Per-service uniform conversion
For each Woodpecker-build-only service:
1. Edit Terraform: `local.image_tag` / `var.image_tag``"latest"`.
2. Add the KYVERNO_LIFECYCLE_V2 snippet (annotations ignore_changes).
3. Verify `.woodpecker.yml` pushes `:latest` on every build (most do via `auto_tag: true`).
For each GHA-migrated service:
1. Edit Terraform: switch `image_tag` from SHA reference to `"latest"`.
2. Add the KYVERNO_LIFECYCLE_V2 snippet.
3. Edit `.github/workflows/build-and-deploy.yml`: push `:latest` (in addition to `:<8-char-sha>` for traceability). Remove the Woodpecker API POST step.
4. Delete `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` from each repo (no longer needed).
5. Remove the Woodpecker repo config for these repos from Terraform if applicable.
For each no-CI service:
- `wealthfolio`: change Terraform image to `wealthfolio/wealthfolio:latest` (upstream DockerHub). Validate the image starts cleanly.
- `chrome-service`: check git blame on the `:v4` pin. If deliberate → label `keel.sh/policy: never`. If accidental → bump to upstream `:latest`.
- `beadboard`, `freedify`: write a minimal `.woodpecker.yml` (single build step pushing to Forgejo `:latest`). Trigger an initial build to populate `:latest`.
For `kms-website`: only add the Keel annotation; CI changes optional.
### Task 1.3: Add Phase 1 namespaces to Kyverno allowlist
Edit `stacks/kyverno/modules/kyverno/keel-annotations.tf`:
```yaml
namespaces:
- claude-agent-service
- fire-planner
- job-hunter
- payslip-ingest
- recruiter-responder
- claude-memory-mcp
- kms-website
# GHA-migrated set:
- website # or whatever the namespace is named per repo
- k8s-portal
- f1-stream
- apple-health-data
- audiblez-web
- plotting-book
- insta2spotify
- audiobook-search
- council-complaints
# No-CI set:
- beads-server
- chrome-service
- freedify
- wealthfolio
```
Verify each namespace name from `kubectl get ns` before locking in (some may differ from the repo name).
Apply. Watch `kubectl get deploy -n <ns> -o yaml | grep keel.sh` confirm annotations injected. Watch Keel logs for first poll cycle picking up the workloads.
### Task 1.4: Soak
1 week. Monitor:
- Slack `#deployments` for Keel rollout notifications.
- `KubePodCrashLooping` alerts.
- Manual `kubectl rollout status` on each service after a Keel-triggered rollout.
If any service breaks repeatedly: apply rollback runbook (decision #9), record the service in a "pin list" with reason, proceed.
**Acceptance:**
- All 7 services running latest digests within 24h of Phase 1 apply.
- No CrashLooping persisting >1h.
- No more than 2 services pinned-out during the soak week.
---
## Phase 2 — Stateless third-party web apps
**Set:** linkwarden, postiz, affine, isponsorblocktv, audiobookshelf, freshrss, tandoor, immich (verify it qualifies — has external DB so app-restart is safe), excalidraw, hackmd, send, jsoncrack, sparkyfitness, etc. (~15-20 services — full list from `kubectl get deploy -A` filtered against the phase-1 set + skip-bucket).
### Task 2.1: Audit current tags via Diun
```bash
# Diun's REST API or UI exports a "new tags available" report
# Use as the per-service decision source
```
For each service, pick floating tag:
- `:latest` if upstream publishes it and it's stable.
- `:<major>` (e.g. `:2`, `:v3`) if `:latest` is unreliable.
- `glob` + `ignore_changes` as last resort.
### Task 2.2: Catch-up PR
Single combined PR:
- Per-stack: switch image tag from pinned semver to chosen floating tag (Diun-informed).
- Per-stack: add KYVERNO_LIFECYCLE_V2 snippet.
- Append Phase 2 namespaces to Kyverno allowlist.
Apply with `-target=` per stack to pace rollouts (≤5 per hour to avoid cache burst — memory id=603).
### Task 2.3: Soak — 1 week, same monitoring as Phase 1.
---
## Phases 39 — same template
For each phase, repeat:
1. Define the set (precise namespace list).
2. Audit current tags (Diun + grep).
3. Pick floating tag per service.
4. Combined PR: image-ref change + lifecycle snippet + Kyverno allowlist update.
5. Apply paced (≤5/hr).
6. Soak 1 week. Pin-out any service that breaks repeatedly.
Set definitions per phase: see design doc Phase Ordering table.
**Special-handling phases:**
- **Phase 7 (Operators).** Restart of an operator can confuse its managed CRD reconciles. Use `imagePullPolicy: Always` + readiness check before declaring stable. Investigate cnpg-operator and ESO restart behavior in advance.
- **Phase 8 (Critical infra).** Calico/CSI DaemonSet rollouts impact each node briefly. Verify `updateStrategy.rollingUpdate.maxUnavailable: 1` on every DaemonSet before enrollment. Memory id=390 (26h Calico-cascade outage) is the cautionary tale.
- **Phase 9 (Bootstrap).** Vault, CNPG, mysql-standalone. Coordinate with backup window. Take a fresh snapshot of `/srv/nfs/<db>-backup/` before applying the phase enrollment.
---
## Cleanup tasks (after Phase 9 stable)
### Task C.1: Retire Service Upgrade Agent
**Files:**
- Modify: `stacks/n8n/` — remove the Service Upgrade Agent workflow
- Delete: any supporting scripts (`infra/scripts/service-upgrade-*.sh` if they exist)
- Modify: `stacks/diun/` — disable webhook notification to n8n (keep Slack notification for visibility)
### Task C.2: Update CLAUDE.md files
- Reverse the "use 8-char git SHA tags" rule in `infra/.claude/CLAUDE.md` "Docker images" line.
- Reverse same in root `/CLAUDE.md` if duplicated.
- Add a new section documenting the Keel model + KYVERNO_LIFECYCLE_V2 snippet.
- Update memory via `mcp__claude_memory__memory_update` on entries 388, 612, 604 (CI/CD architecture, Service Upgrade Agent retirement, cache TTL clarification).
### Task C.3: Add a runbook
**Files:**
- Create: `docs/runbooks/keel-rollback.md`
Document the rollback flow (decision #9): `kubectl rollout undo` → Terraform pin → annotation `keel.sh/policy: never`.
### Task C.4: Tidy Diun
Drop image-pin overrides for MySQL, PostgreSQL, Redis from Diun config (no longer needed since they're Keel-managed; the previous skip was for the retired changelog-agent path).
---
## Rollback (whole project)
If the auto-roll experiment goes badly cluster-wide (multiple cascading failures, repeated outages), revert:
1. Set Kyverno ClusterPolicy `inject-keel-annotations` to empty `namespaces: []`.
2. Existing annotations remain on workloads, but Keel continues to act on them — so also disable Keel: scale `keel` Deployment to 0.
3. Pin every workload's Terraform image_tag back to its current running digest (use `kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.name}:{.spec.template.spec.containers[0].image}{"\n"}{end}'`).
4. Document failure modes in `post-mortems/2026-XX-XX-keel-rollback.md`.
5. Reconsider opt-in approach for next iteration.
---
## Success criteria
- All ~70 services running latest within 8 weeks of Phase 0 completion.
- Zero unrolled-back outages caused by Keel.
- ≤5 services on the "pin list" (i.e. ≥93% auto-roll success rate).
- `terragrunt plan` shows no spurious diffs from Kyverno-injected annotations (KYVERNO_LIFECYCLE_V2 working as intended).
- Service Upgrade Agent + supporting infra retired.

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,112 @@
# MySQL 8.4.8 → 8.4.9 Upgrade — Design
**Date**: 2026-05-19
**Status**: Drafted, **NOT scheduled**. Execute only inside a planned maintenance window with user sign-off.
**Beads**: (filed alongside this doc)
**Related**: `docs/runbooks/restore-mysql.md`, beads `code-eme8` / `code-k40p` (closed in `ea475c3d`)
## Background
On 2026-05-18, Keel auto-bumped the `mysql:8.4` floating tag on the
`mysql-standalone` StatefulSet from 8.4.8 to 8.4.9. The in-server data
dictionary upgrade (80408 → 80409) stalled reliably: ~24 s of writes to
`mysql.ibd` + redo log after "Server upgrade started", then complete
silence — no CPU, no flushes, no errors, no completion. The `boot`
thread sat in user-space sleep (`State: S`, `wchan: 0`) for 10+
minutes; the MySQLX socket appeared but `mysqld.sock` never did. Even
with `liveness_probe.initial_delay_seconds = 600`, the upgrade never
completed.
Recovery (commit `ea475c3d`): pinned image to `mysql:8.4.8` exactly,
wiped the corrupted PVC, restored from the 00:30 UTC mysqldump. Total
downtime: ~25 min. Forgejo + 7 dependent apps offline during that
window.
## Root cause — best evidence
We never proved this definitively because we couldn't connect to MySQL
during the stall, but the strongest hypothesis is **flush starvation
during the DD upgrade's mandatory checkpoint**:
1. Upgrade rewrites `mysql.st_spatial_reference_systems` (5103 SRS
defs) + dirties pages across the system tablespace.
2. Reaches a point where it must checkpoint before continuing.
3. The page-cleaner thread can't drain dirty pages fast enough because
`innodb_io_capacity=100` (1.6 MB/s effective flush rate, default is
200, recommended for SSDs is 2000+) combined with
`innodb_page_cleaners=1`.
4. The `boot` thread waits on a pthread condvar that the flush
coordinator should signal but never does within probe timeout.
Why we're not 100 % certain:
- LUKS2-encrypted block storage (`proxmox-lvm-encrypted`) may
contribute its own flush latency.
- We didn't capture a stack trace from the stalled `boot` thread
(`/proc/1/task/118/stack` was `permission denied`).
- A genuine MySQL 8.4.9 bug in the SRS-update path is possible (worth
checking the MySQL bug tracker before retry).
**Organizational root cause** (definitive): the `mysql:8.4` floating
tag let Keel auto-bump without testing. Already fixed — image pinned
to `mysql:8.4.8` exactly.
## Decisions
| # | Decision | Notes |
|---|----------|-------|
| 1 | **Approach: wipe + re-init on 8.4.9** (logical migration via fresh init + dump-restore) | The DD upgrade is the broken path. A fresh 8.4.9 init starts at version 80409 directly — no upgrade ever runs. We've executed wipe+restore once in ~25 min; the path is now well-trodden. |
| 2 | **Pre-flight: bump InnoDB IO config** | `innodb_io_capacity=2000`, `innodb_io_capacity_max=4000`, `innodb_page_cleaners=4`. These are the long-term-correct values regardless of the upgrade — current settings are ~10× too conservative for the workload. |
| 3 | **Restore strategy: per-database dumps, NOT the full `--all-databases` dump** | Per-db dumps at `/srv/nfs/mysql-backup/per-db/<db>/` skip the `mysql` system schema entirely. Avoids the question of "will 8.4.8 mysql-schema rows confuse 8.4.9". User accounts get recreated via Vault + null_resource. |
| 4 | **Fresh dump immediately before cutover, not yesterday's** | The daily dump runs at 00:30 UTC. The cutover dump must come from < 60 s before scale-to-0 to minimize data loss. Kick `mysql-backup-per-db` CronJob manually. |
| 5 | **Maintenance window required** | All MySQL-dependent apps offline ~25 min: Forgejo (+ registry → ImagePullBackOff cascade), Nextcloud, HackMD, Grafana, Paperless, Uptime-Kuma, Shlink, realestate-crawler, phpipam, technitium, vikunja, freshrss, finance, resume. Pick a low-traffic window (suggest Sunday 03:00 UK). |
| 6 | **Single rollback path: re-pin to 8.4.8 + same wipe/restore flow** | If 8.4.9 fresh init misbehaves post-restore, rollback IS the same procedure, just with image=8.4.8. The pinned 8.4.8 dump survives. No new failure modes. |
| 7 | **Out of scope for this upgrade**: tuning that doesn't gate the upgrade | Right-sizing buffer pool, switching to async commits, changing storage class, replication — all separate decisions. |
## Verification gates
Before declaring done:
1. `kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$PW" -e "SELECT VERSION();"` returns `8.4.9`.
2. `SHOW DATABASES;` lists all 20 user databases.
3. Table count per schema matches the pre-upgrade snapshot (recorded
in step 1 of the plan).
4. `forgejo` logs show successful DB ping; `kubectl -n forgejo get pod` is 1/1 Running.
5. `kubectl get deploy,sts -A` shows no unready workloads.
6. `bash infra/scripts/cluster_healthcheck.sh --quiet` returns same or
better PASS/WARN/FAIL ratio as pre-upgrade.
7. Forgejo integrity probe reports 0 failures (manual trigger).
8. `RegistryCatalogInaccessible` not firing in Prometheus.
## Risks + mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| 8.4.9 fresh init has *some other* unobserved bug | Low | Smoke-test on a parallel PVC in dbaas before touching the real one (optional but cheap — adds 30 min). See plan Phase 1. |
| Per-db dump-restore misses a database the user added recently | Low | Compare `SHOW DATABASES` against the per-db dump directory listing pre-cutover. If a DB exists in MySQL but not in `/srv/nfs/mysql-backup/per-db/`, dump it manually first. |
| Forgejo/roundcubemail static-user passwords drift again after restore | Certain | Already documented in runbook — DROP USER + CREATE USER from Vault values immediately after restore. |
| The cutover dump itself is corrupt | Very low | mysqldump exits non-zero on failure. CronJob already pushes `backup_last_success_timestamp` to Pushgateway. Verify timestamp is fresh before proceeding. |
| Apps fail to reconnect after MySQL restart | Low | Already-proven recipe: `kubectl rollout restart` on the affected deployments. Listed exhaustively in runbook §B.8. |
| 8.4.9 fresh init *also* stalls (root cause was NOT flush starvation) | Medium-low | Pre-flight test on parallel PVC catches this before maintenance window. If real prod init stalls, immediately revert TF pin to 8.4.8, redo same dump-restore flow. Same 25 min downtime as the original recovery. |
## Why not alternatives
- **In-place DD upgrade with bumped IO config**: simpler, but if it
still stalls we lose 3060 min waiting + still fall back to
wipe+restore. Same data risk; worse expected time. We *would* learn
whether the bumped IO settings fix the upgrade, but the fresh init
approach makes that knowledge unnecessary.
- **Parallel migration (new mysql-standalone-new pod alongside)**:
cleanest rollback (instant via service-selector flip), but needs TF
surgery to declare two StatefulSets temporarily and isn't worth the
complexity when the wipe+restore approach is now proven.
- **Wait for 8.4.10 / 8.5 LTS**: leaves us stuck on 8.4.8 indefinitely.
Acceptable for now (we're pinned), but not a permanent answer.
## Out of scope
- A standby/replica MySQL for zero-downtime upgrades (separate
initiative — see future planning around CNPG-style HA for MySQL).
- Removing `proxmox-lvm-encrypted` LUKS2 from the equation (the
encryption is a security requirement; debugging its flush latency is
separate).
- Replacing MySQL with PostgreSQL (long-term goal for some apps; not
this upgrade).

View file

@ -0,0 +1,349 @@
# MySQL 8.4.8 → 8.4.9 Upgrade — Plan
**Date**: 2026-05-19
**Status**: Drafted, **NOT scheduled**
**Design**: `2026-05-19-mysql-8.4.9-upgrade-design.md`
**Estimated downtime**: 2530 min (all MySQL-dependent apps offline)
**Window**: Suggest Sunday 03:00 UK (low traffic, kured window doesn't fight us)
## Pre-flight (before the maintenance window)
### P.1 Optional smoke test on a parallel PVC (recommended, +30 min)
In a non-production session, before scheduling the real cutover:
```bash
# 1. Create a temporary StatefulSet `mysql-smoketest` in dbaas with the
# same image (mysql:8.4.9), same configmap, brand-new PVC.
# Use a one-off kubectl apply -f /tmp/smoketest.yaml — NOT Terraform —
# so it doesn't pollute the real stack.
# 2. Verify it inits to 8.4.9 cleanly (mysqld.sock appears, "ready for connections").
# 3. Restore one of the smaller per-db dumps (e.g. resume, freshrss) into it.
# 4. Delete the smoketest StatefulSet + PVC.
```
Outcome:
- ✅ Init succeeds → proceed with the real upgrade with high confidence.
- ❌ Init stalls → root cause was not flush starvation. Halt and re-investigate. The real upgrade is unsafe.
### P.2 Read the MySQL 8.4.9 release notes + bug tracker
Specifically look for issues filed since 8.4.9 GA against the DD upgrade
path or `st_spatial_reference_systems`. If a known fix landed in 8.4.10
or 8.5.x, consider waiting.
### P.3 Confirm backup pipeline is healthy
```bash
# Latest per-db dumps exist for all 20 databases
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
'for d in $(ls /backup/per-db/); do echo -n "$d: "; ls -t /backup/per-db/$d/ | head -1; done'
# Pushgateway shows recent success
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep mysql-backup-per-db
```
### P.4 Pin maintenance window and notify
Brief the user. Confirm window. Disable any background scrapers /
schedulers / bots that would create noise during the cutover.
## Execution (inside the maintenance window)
### Step 1 — Pre-flight snapshot
```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Record current state for verification later
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
GROUP BY table_schema;" > /tmp/mysql-pre-upgrade-table-counts.txt
cat /tmp/mysql-pre-upgrade-table-counts.txt
```
### Step 2 — Trigger a fresh per-db dump
```bash
kubectl -n dbaas create job --from=cronjob/mysql-backup-per-db pre-upgrade-$(date +%s)
# Wait for completion (typically <2 min)
kubectl -n dbaas wait --for=condition=complete --timeout=300s job/pre-upgrade-<timestamp>
```
Verify all 20 databases dumped:
```bash
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
'for d in $(ls /backup/per-db/); do
newest=$(ls -t /backup/per-db/$d/ | head -1)
echo "$d: $newest"
done'
```
Every entry should have a `dump_<today>_*.sql.gz` listed.
### Step 3 — Bump InnoDB IO config + image pin in Terraform
In `stacks/dbaas/modules/dbaas/main.tf`:
```diff
- innodb_io_capacity=100
- innodb_io_capacity_max=200
- innodb_page_cleaners=1
+ innodb_io_capacity=2000
+ innodb_io_capacity_max=4000
+ innodb_page_cleaners=4
```
```diff
- # Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU)
- # repeatedly across multiple attempts. ...
- image = "mysql:8.4.8"
+ # Re-pinned to 8.4.9 on 2026-MM-DD after the wipe+reinit upgrade
+ # path (see docs/plans/2026-05-19-mysql-8.4.9-upgrade-*).
+ image = "mysql:8.4.9"
```
Commit but **do not apply yet**.
### Step 4 — Stop MySQL
```bash
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
# Wait for pod deletion
kubectl -n dbaas wait --for=delete pod/mysql-standalone-0 --timeout=120s
```
### Step 5 — Wipe the PVC
```bash
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
kubectl -n dbaas delete pvc data-mysql-standalone-0
# Confirm PV vanishes (CSI cleans up the LV)
kubectl get pv | grep -q "$PV" && echo "WARNING: PV still present" || echo "PV cleaned up"
```
### Step 6 — Apply Terraform (8.4.9 + bumped IO)
```bash
cd stacks/dbaas
/home/wizard/code/infra/scripts/tg apply
```
This creates a fresh 5 Gi PVC + new pod on `mysql:8.4.9`. Initial-init
takes ~30 s. Verify:
```bash
kubectl -n dbaas wait --for=condition=ready pod/mysql-standalone-0 --timeout=300s
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
# expect: 8.4.9
```
**If the pod fails to become Ready within 5 min**: this is the
"root cause was not flush starvation" failure mode. Abort the upgrade,
revert the image pin to 8.4.8 in TF, re-run from Step 4 (wipe + apply
8.4.8 + restore). Total extra downtime ~25 min.
### Step 7 — Restore per-db dumps (NOT the full --all-databases dump)
```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
cat <<YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: mysql-restore-per-db-$(date +%Y-%m-%d)
namespace: dbaas
spec:
ttlSecondsAfterFinished: 3600
template:
spec:
restartPolicy: Never
containers:
- name: restore
image: mysql:8.4.9
command: ["bash","-c"]
args:
- |
set -euo pipefail
for db in \$(ls /backup/per-db/); do
newest=\$(ls -t /backup/per-db/\$db/ | head -1)
echo "=== Restoring \$db from \$newest ==="
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" \
-e "CREATE DATABASE IF NOT EXISTS \\\`\$db\\\`;"
gunzip -c "/backup/per-db/\$db/\$newest" | \
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" "\$db"
done
echo "=== All databases restored ==="
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
env:
- name: MYSQL_ROOT_PASSWORD
valueFrom: { secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD } }
volumeMounts:
- { name: backup, mountPath: /backup, readOnly: true }
volumes:
- name: backup
persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
```
Watch: `kubectl -n dbaas logs -f job/mysql-restore-per-db-<date>`.
Expected time: ~3 min for all 20 databases.
### Step 8 — Recreate Vault-rotated + static users
The per-db restore did NOT touch `mysql.user`. Recreate all app users
fresh:
```bash
# Static users (forgejo, roundcubemail) from Vault
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
CREATE USER IF NOT EXISTS 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
FLUSH PRIVILEGES;
SQL
# Vault-DB-engine-rotated users: force re-rotation so Vault rewrites the
# user with the current password held in K8s secrets
for role in $(vault list -format=json database/roles | jq -r '.[]' | grep '^mysql-'); do
echo "Rotating $role"
vault write -f "database/rotate-role/$role"
done
# Technitium has a separate password-sync job — kick it
kubectl -n technitium create job --from=cronjob/technitium-password-sync \
technitium-postupgrade-$(date +%s)
```
### Step 9 — Restart MySQL-dependent apps
```bash
for ns_app in \
"forgejo:deploy/forgejo" \
"nextcloud:deploy/nextcloud" \
"hackmd:deploy/hackmd" \
"monitoring:deploy/grafana" \
"paperless-ngx:deploy/paperless-ngx" \
"uptime-kuma:deploy/uptime-kuma" \
"url:deploy/shlink" \
"phpipam:deploy/phpipam" \
"technitium:sts/technitium" \
"vikunja:deploy/vikunja" \
"freshrss:deploy/freshrss" \
"finance:deploy/finance" \
"resume:deploy/resume" \
"realestate-crawler:deploy/realestate-crawler-api" \
"realestate-crawler:deploy/realestate-crawler-celery" \
"realestate-crawler:deploy/realestate-crawler-celery-beat" \
"realestate-crawler:deploy/realestate-crawler-ui"; do
ns=${ns_app%%:*}; app=${ns_app##*:}
kubectl -n "$ns" rollout restart "$app" &
done
wait
```
Wait for all to become ready:
```bash
until [ "$(kubectl get deploy,sts -A -o json | \
jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | .metadata.name' | \
wc -l)" -eq 0 ]; do
sleep 5
done
echo "All workloads ready"
```
### Step 10 — Force ImagePullBackOff pods to retry (Forgejo registry was offline)
```bash
for ns in chrome-service fire-planner freedify; do
kubectl -n "$ns" delete pod --all 2>/dev/null || true
done
```
### Step 11 — Clean up failed CronJob pods from the outage window
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```
### Step 12 — Verify (matches design §Verification gates)
```bash
# 1. Version
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
# expect: 8.4.9
# 2-3. Databases + table counts
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
GROUP BY table_schema;" > /tmp/mysql-post-upgrade-table-counts.txt
diff /tmp/mysql-pre-upgrade-table-counts.txt /tmp/mysql-post-upgrade-table-counts.txt
# expect: no diff (or only counts that grew between snapshots)
# 4. Forgejo
kubectl -n forgejo get pod
kubectl -n forgejo logs deploy/forgejo --tail=20 | grep -iE "ORM engine|ready"
# expect: 1/1 Running, "ORM engine initialized"
# 5. Cluster health
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
# 6. Registry integrity probe
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe \
postupgrade-$(date +%s)
kubectl -n monitoring logs job/postupgrade-<timestamp> --tail=5
# expect: "Probe complete: 0 failures"
# 7. RegistryCatalogInaccessible not firing
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
python3 -c "import json,sys; d=json.load(sys.stdin); [print(a['labels']['alertname']) for a in d['data']['alerts'] if a['state']=='firing']"
# expect: empty / no RegistryCatalogInaccessible
```
### Step 13 — Commit + push the Terraform change
```bash
git add stacks/dbaas/modules/dbaas/main.tf
git commit -m "dbaas: pin MySQL to 8.4.9 after successful wipe+reinit upgrade
Executed per docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.md.
The full upgrade ran clean — fresh init on 8.4.9 sidestepped the DD
upgrade stall. IO config bumped to 2000/4 (was 100/1) for the workload.
"
git push
```
## Rollback path (if Step 6 or Step 7 fails catastrophically)
The wipe at Step 5 is destructive — once executed, the original disk
is gone. Rollback is **same procedure, image=8.4.8**:
1. Edit TF: `image = "mysql:8.4.8"`
2. `kubectl -n dbaas scale sts mysql-standalone --replicas=0`
3. Re-wipe (already wiped; just `tg apply`)
4. Run the Step 7 restore Job again (now on 8.4.8)
5. Run Step 8-11
6. Update Terraform comment to reflect retained 8.4.8 pin.
Extra downtime: ~25 min on top of the existing window.
## Post-upgrade follow-ups
- Update `infra/.claude/CLAUDE.md` MySQL row to reflect 8.4.9 pin.
- Update `docs/runbooks/restore-mysql.md` to reflect 8.4.9.
- Re-evaluate whether the new IO config (2000/4) is overkill for the
workload after 1-2 weeks — could drop to 1000/2 if needed.
- Optional: file a follow-up task to investigate MySQL HA/replication
so the next upgrade isn't blocking.

View file

@ -0,0 +1,135 @@
# HA Control Plane (3 masters) — Design
**Date**: 2026-05-21
**Status**: Drafted, NOT scheduled
**Beads**: code-n0ow
**Trigger**: today's k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
## Problem statement
The autonomous k8s upgrade pipeline (`stacks/k8s-version-upgrade/`) is
correct end-to-end but **cannot push through the cluster's
single-master architecture**. Each attempted upgrade today rolled
back via the same cascade:
1. Chain drains master → `kubeadm upgrade apply` swaps a static-pod
manifest (etcd → apiserver → controller-manager → scheduler).
2. While a manifest swap is in flight, the affected control-plane
component is briefly down — for apiserver, that means ~1060s of
"connection refused" to `10.96.0.1:443` from every kubelet and
operator pod in the cluster.
3. **Several operators die during that window** instead of waiting:
- **tigera-operator**: logs `[ERROR] Get "https://10.96.0.1:443/api?timeout=32s": connect: connection refused` then exits 1 immediately
- gpu-operator, cnpg-cloudnative-pg, kube-controller-manager: similar leader-lease failures
4. Kubelet restarts those pods → image pulls + initial reads → storm
of disk I/O on master (we observed 563 MB/s from tigera alone).
5. **The storm slows apiserver-to-kubelet status sync** past kubeadm's
hardcoded 5-min watch on the pod's `kubernetes.io/config.hash`
annotation.
6. kubeadm declares the upgrade "did not change after 5m0s",
**rolls back to the previous manifest**, exits non-zero.
7. Chain Job retries (backoffLimit=1) → same storm → same failure.
Chain dead.
The container runtime, the script logic, the RBAC permissions are all
fine after today's fixes. The **single master is the bottleneck**.
## Why HA control plane fixes this
With 3 masters running etcd quorum + apiserver behind an LB:
| Failure mode | Single master | 3-master HA |
|---|---|---|
| Master reboot / kubeadm upgrade | Apiserver completely down 1060s | Other 2 masters serve clients; LB transparently fails over |
| etcd quorum during one master being down | Total outage (1/1 broken) | Quorum maintained (2/3 healthy) |
| Tigera/operators see apiserver as "down" | Yes → crashloop storm | No → keep running through |
| kubeadm `static-pod hash` watch | Times out under load (today's bug) | Never under load; sync stays fast |
| Pipeline upgrade success rate | Brittle / needs manual nursing | Truly autonomous |
The k8s upgrade chain doesn't need to be aware of *any* of this — the
underlying availability of apiserver makes the chain's gates
naturally pass on each iteration.
## Decisions (proposed — to be confirmed)
| # | Decision | Notes |
|---|----------|-------|
| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. |
| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). |
| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. |
| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. |
| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | |
| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). |
## Out of scope
- HA pfSense itself (separate, much bigger initiative)
- Multi-DC failover
- External etcd cluster (we're sticking with kubeadm-managed stacked etcd)
- Rebuilding cluster from scratch — we'll join into the existing one
## Risk register
| Risk | Mitigation |
|---|---|
| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. |
| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. |
| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. |
| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. |
| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. |
## Verification
After all 3 masters joined + LB up:
```bash
# All 3 masters listed
kubectl get nodes -l node-role.kubernetes.io/control-plane=
# etcd quorum healthy
kubectl -n kube-system exec etcd-k8s-master -- etcdctl \
--endpoints=https://10.0.20.100:2379,https://10.0.20.X:2379,https://10.0.20.Y:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --cluster
# Failover test: cordon master-1, reboot it, observe kubectl still works through LB
kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets
ssh wizard@k8s-master.viktorbarzin.lan sudo reboot
# Pipeline test: re-trigger k8s upgrade chain (e.g. for whatever the next patch is)
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)
# Expect: full chain succeeds end-to-end without manual intervention
```
## Cost estimate
- 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host
- ~+128GB disk usage (2× 64GB master disks)
- ~2-4 hours of operator time end-to-end (VM provisioning + kubeadm join + LB config + smoke)
## What's already in place from today's work
(All these are prerequisites that were fixed during today's
investigation — they stay relevant when HA lands.)
- Master containerd 1.6.22 → 2.2.2, runc 1.1.8 → 1.4.0 (fixed
`runc: unable to signal init: permission denied` on Ubuntu 26.04)
- Pipeline script bugs: 3× `grep -vE` pipefail, 1× RBAC missing
`get daemonsets`, 1× `RecentNodeReboot` not ignored in master phase
- Kill-switch ConfigMap mechanism (`k8s-upgrade-killswitch`)
- Kubeadm-apply retry wrapper in `update_k8s.sh` (helps but doesn't
fully fix the storm cascade)
- Quiet-baseline threshold 3600s → 600s
## Reference
Commits from today's session:
- `10b261d2` — first `grep -vE` pipefail
- `0c8b46df` — 2 more pipefail sites
- `fc0510aa` — kill-switch + RecentNodeReboot ignore + 600s threshold
- `2dc7e001` — kubeadm apply 3-attempt retry

View file

@ -0,0 +1,269 @@
# OpenClaw devvm access + async task pattern — design
**Date:** 2026-05-22
**Stack:** `infra/stacks/openclaw`
**Status:** Approved (in-session, see chat history 2026-05-22)
## Goal
Give the OpenClaw pod (running in K8s) two new capabilities:
1. **Host-tools bundle** — common Linux CLIs the upstream OpenClaw image
doesn't ship (`ssh`, `scp`, `vault`, `dig`, `jq`, `yq`, `ripgrep`, `fd`,
`gnupg`, `tmux`, etc.). OpenClaw can't `apt install` because the
container runs as non-root `node` (uid 1000).
2. **devvm async task pattern** — OpenClaw spawns long-running work as
`tmux` sessions on devvm, sends prompts via `tmux send-keys`, captures
progress via `tmux capture-pane`. Sessions live on devvm, so they
survive OpenClaw pod restarts.
OpenClaw uses this combination as a **trusted fallback** for tasks too
expensive, sensitive, or stateful for in-pod execution: Vault lookups,
multi-step `claude-code` work, anything needing wizard's full home-lab
access.
## Why now
- The in-pod sandbox is `security=full` but the container is minimal —
no `ssh`, no `vault`, no `dig`, no `tmux`.
- The user wants OpenClaw to be a first-line agent that delegates heavy
work to the dev VM rather than duplicate that work in a constrained pod.
- Long-running work (multi-minute `claude-code` sessions) shouldn't be
tied to a single synchronous `claude -p` invocation — needs persistence
and pollability.
## Architecture decision: stay on K8s
Discussed migrating OpenClaw to run directly on devvm (would obviate the
host-tools bundle + most of the SSH setup). Decision: **stay on K8s**.
Reasons:
- Keeps HA (5-node cluster vs single devvm reboot)
- Keeps ingress/Authentik/Telegram entry chain intact
- Keeps Prometheus scrape + exporter sidecar
- Keeps PVC backup pipeline (LVM snapshots + Synology offsite)
- Resource isolation — a runaway LLM session can't stress wizard's daily-driver VM
- Migration cost is several days; this design is ~150 LoC + an 80-line wrapper
The mental model — "OpenClaw is sandboxed, delegates to wizard@devvm for
trusted heavy lifting" — is a clean security boundary. Worth preserving.
## Architecture
### Pod side (`infra/stacks/openclaw/main.tf`)
Two new init containers added to the OpenClaw Deployment, after the
existing four:
#### Init 5 — `install-host-tools`
- Image: `debian:bookworm-slim` (matches main container base for glibc compat)
- Idempotent: skips if `/tools/host-tools/.installed-v1` exists
- `apt-get install --download-only --no-install-recommends` for:
`openssh-client dnsutils iputils-ping wget gnupg jq ripgrep fd-find ncdu htop strace tcpdump tmux unzip`
- Iterates `.deb` files in `/var/cache/apt/archives/`, `dpkg-deb -x` each
into `/tools/host-tools/root/` (preserves `usr/bin`, `usr/sbin`,
`usr/lib` layout)
- Downloads static binaries to `/tools/host-tools/bin/`:
- `vault` (HashiCorp releases, pinned version)
- `yq` (mikefarah/yq GitHub releases, pinned version)
- Smoke test: invokes `--version` on each bundled binary; fails init if
any won't load (catches glibc / shared-lib drift at deploy time, not
runtime)
- Writes marker file with version
#### Init 6 — `setup-ssh-config`
- Image: uses the just-installed host-tools (debian:bookworm-slim base
with `/tools/host-tools/root/usr/bin` on PATH so `ssh-keyscan` works)
- Runs after `install-host-tools`
- Idempotent: skips if `/home/node/.openclaw/.ssh/.configured-v1` exists
- Creates `/home/node/.openclaw/.ssh/` (uid 1000)
- Copies `/ssh/id_rsa` (tmpfs secret mount) → `~/.ssh/id_rsa` with 0600
(the secret tmpfs mount has wider perms that openssh rejects)
- Writes `~/.ssh/config`:
```ssh-config
Host devvm
HostName 10.0.10.10
User wizard
IdentityFile ~/.ssh/id_rsa
UserKnownHostsFile ~/.ssh/known_hosts
StrictHostKeyChecking yes
```
PATH handling on the remote side: devvm's sshd uses the default
non-interactive PATH (`/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin`)
and does NOT load `~/.profile` or `~/.bashrc` (memory id=740). Client-side
`SetEnv PATH=…` doesn't help because sshd's `AcceptEnv` is `LANG LC_*` only.
Solution: install the binaries openclaw cares about into `/usr/local/bin/`
on devvm (see "Devvm side" below).
- Pre-seeds `~/.ssh/known_hosts` via `ssh-keyscan -H 10.0.10.10`
- Writes marker file
#### Main container
- `PATH` env updated: prepend
`/tools/host-tools/root/usr/bin:/tools/host-tools/root/usr/sbin:/tools/host-tools/bin`
- No other changes to the startup command
### Devvm side
#### `/usr/local/bin/openclaw-task` wrapper
Canonical source: `infra/stacks/openclaw/files/openclaw-task.sh`.
Installed to devvm at `/usr/local/bin/openclaw-task` (`sudo cp`, `sudo
chmod +x`) so non-interactive SSH finds it on the default PATH without
needing `~/.profile`. Updates: re-run the install steps from the
canonical source.
Also: `sudo ln -s /home/wizard/.local/bin/claude /usr/local/bin/claude`
so `ssh devvm claude …` works in non-interactive mode. `vault` and `tmux`
are already at `/usr/bin/` (system packages) so no symlink needed for
those.
POSIX shell script. Subcommands:
| Subcommand | Behavior |
|---|---|
| `new <id> <cmd...>` | Spawns detached tmux session `openclaw-task-<id>`, pipes pane output to `~/openclaw-tasks/<id>.log` |
| `claude <id> <prompt>` | Convenience: spawns interactive `claude` in a tmux session, send-keys the prompt + Enter |
| `send <id> <keys...>` | `tmux send-keys -t openclaw-task-<id> "$@"` — caller supplies `Enter` literal if needed |
| `capture <id> [lines]` | `tmux capture-pane -t … -p -S -<lines>` (default last 1000) |
| `log <id>` | `cat ~/openclaw-tasks/<id>.log` |
| `tail <id>` | `tail -n 100 -f ~/openclaw-tasks/<id>.log` (mainly for human ops) |
| `list` | tmux session list filtered to `openclaw-task-*`, one id per line |
| `status <id>` | `running` if tmux session alive, `ended` otherwise |
| `kill <id>` | `tmux kill-session -t openclaw-task-<id>` (log file is kept) |
| `purge <id>` | `kill` + `rm -f ~/openclaw-tasks/<id>.log` |
Task state lives entirely on devvm:
- tmux sessions persist across SSH disconnects and OpenClaw pod restarts
- `~/openclaw-tasks/<id>.log` is the durable transcript even after a
session is killed
- No central database — `tmux list-sessions` is the source of truth for
"what's running"
Naming convention: tmux sessions are prefixed `openclaw-task-` so they
don't collide with wizard's own tmux work (`0`, `Openclaw`, `read-only`).
### Memory note
File at `/workspace/memory/projects/openclaw-runtime/devvm-fallback.md`
teaching OpenClaw the pattern. Indexed by the existing daily
`memory-sync` CronJob (or via manual `node openclaw.mjs memory index
--force` for the initial seed).
Content (verbatim):
```markdown
# Using devvm as a fallback
When in-pod tools/permissions block you, SSH to devvm and use it. The
devvm runs as wizard with full home-lab access (Vault, kubectl, git
repos, Cloudflare, etc.) and has Claude Code v2+ installed.
## One-shot lookup
ssh devvm 'vault kv get -field=brave_api_key secret/openclaw'
ssh devvm 'claude -p "investigate why frigate is restarting"'
## Long-running async work — USE THIS for anything > ~2 min
Spawn in a tmux session on devvm. Sessions survive OpenClaw pod restarts.
# spawn
ssh devvm openclaw-task new my-task "claude -p --dangerously-skip-permissions 'do the thing'"
# poll progress (last 1000 lines of pane)
ssh devvm openclaw-task capture my-task
# interactive claude (send follow-up prompts)
ssh devvm openclaw-task claude my-task "initial prompt"
ssh devvm openclaw-task send my-task "follow-up prompt" Enter
# housekeeping
ssh devvm openclaw-task list
ssh devvm openclaw-task status my-task
ssh devvm openclaw-task kill my-task
Logs persist at ~/openclaw-tasks/<id>.log on devvm even after a session
is killed. Use `ssh devvm openclaw-task log <id>` to retrieve them.
```
## Devvm: no infra changes
Pre-existing state verified 2026-05-22:
- pubkey from `/ssh/id_rsa` (Vault `secret/openclaw → ssh_key`) matches the
`ssh-ed25519 AAAA…lug node@openclaw-58cd9f7987-884bv` line in
`~/.ssh/authorized_keys` (the comment is a stale pod name; the key
itself is stable from Vault)
- sshd listens on 0.0.0.0:22 ✓
- `claude` v2.1.126 at `/home/wizard/.local/bin/claude`
- `tmux` 3.4 installed, server already running with existing user sessions ✓
Only changes (one-time, done in the same session via `sudo`):
- Install `openclaw-task` wrapper to `/usr/local/bin/openclaw-task`
- Symlink `/home/wizard/.local/bin/claude``/usr/local/bin/claude`
## Tradeoffs / risks
- **Bundle size on NFS**: ~30MB extracted. Acceptable on
`/srv/nfs/openclaw/tools`.
- **Library version drift**: bundled binaries link against bookworm libs.
Smoke test in `install-host-tools` catches breakage on the next pod
restart if upstream OpenClaw image rebases.
- **Full-shell SSH**: explicit user choice. Blast radius if openclaw is
prompt-injected = full wizard access. Mitigation: keep OpenClaw's
plugin allowlist tight (current allow list: `memory-core, recruiter-api,
telegram, openrouter, brave, openai, codex`).
- **tmux server lifecycle on devvm**: if wizard's tmux server dies (rare —
usually only on devvm reboot), in-flight openclaw tasks are killed.
Acceptable for home lab. Task logs persist regardless.
- **Task log unbounded growth**: `~/openclaw-tasks/*.log` grows forever.
Out of scope here. User can add a `find -mtime +N -delete` cron later.
- **Init container order**: `setup-ssh-config` depends on
`install-host-tools` finishing first. K8s init containers run
sequentially in declaration order — natural ordering, no explicit
dependency mechanism needed.
## Testing — E2E flows required by user
1. **Tools present**:
`kubectl -n openclaw exec <pod> -c openclaw -- ssh -V` returns version,
same for `dig`, `vault`, `jq`, `yq`, `tmux`, `rg`.
2. **SSH happy path**:
`kubectl -n openclaw exec <pod> -c openclaw -- ssh devvm 'hostname'`
returns `devvm`.
3. **Claude one-shot**:
`kubectl -n openclaw exec <pod> -c openclaw -- ssh devvm 'claude -p "what is 1+1"'`
returns `2`.
4. **Async task lifecycle**:
- `ssh devvm openclaw-task new test-1 "sleep 30; echo done"`
- `ssh devvm openclaw-task list` contains `test-1`
- `ssh devvm openclaw-task status test-1` returns `running`
- wait 35s
- `ssh devvm openclaw-task log test-1` contains `done`
- `ssh devvm openclaw-task status test-1` returns `ended`
5. **Persistence test** (the key requirement):
- Spawn long task: `ssh devvm openclaw-task new persist-1 "sleep 120; echo survived > /tmp/persist-1.proof"`
- `kubectl -n openclaw delete pod <openclaw-pod>` — pod recreated
- Wait for new pod ready (init containers run, skip via marker, fast)
- `kubectl -n openclaw exec <new-pod> -c openclaw -- ssh devvm openclaw-task list`
contains `persist-1`
- Wait for original sleep to finish; verify `/tmp/persist-1.proof`
contains `survived` from new pod
6. **Memory note lookup**:
`kubectl -n openclaw exec <pod> -c openclaw -- node openclaw.mjs memory search 'devvm fallback'`
returns the note.
## Docs to update with the change
- `infra/docs/plans/2026-05-22-openclaw-devvm-access-design.md` (this doc)
- `infra/docs/plans/2026-05-22-openclaw-devvm-access-plan.md` (implementation plan)
- `infra/.claude/reference/service-catalog.md` (one-line addition under
OpenClaw: "Has SSH to devvm with host-tools bundle; long-running async
tasks via `openclaw-task` wrapper on devvm")
- `infra/.claude/CLAUDE.md` "Known Issues" section is left alone — none of
the existing OpenClaw caveats change.

View file

@ -117,7 +117,7 @@ Contributing distractions:
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. | TODO |
| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. **Done 2026-05-10**: `authentik_outpost.embedded` resource + `authentik_provider_proxy.catchall.access_token_validity` codified, plan-to-zero on the whole stack. The `Outpost.managed` field is server-set (not in provider schema) and preserved across applies because TF only writes known fields. Same-day work also flipped the outpost's session backend from filesystem (`/dev/shm`) to PostgreSQL — see `.claude/reference/authentik-state.md`. | **DONE** |
| P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |
### P3 — Upstream + architectural
@ -125,8 +125,8 @@ Contributing distractions:
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Reduces steady-state session file count from ~181k to ~26k (7× reduction). Trade-off: users re-auth daily. Viktor's call on UX tolerance. | TODO |
| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | The embedded outpost is a single replica Go binary with in-memory session state. An external, multi-replica outpost with Redis-backed sessions is the production-grade deployment. Probably overkill for a home-lab, but worth noting. | TODO (paused) |
| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Original idea: shrink steady-state session file count (~7× reduction) at the cost of daily re-auth. **Resolved differently 2026-05-10**: switched the outpost to the PostgreSQL session backend (`Outpost.managed = goauthentik.io/outposts/embedded` + `AUTHENTIK_POSTGRESQL__*` envFrom), which makes session count irrelevant for tmpfs sizing and lets us BUMP `access_token_validity` to `weeks=4` for better UX without cost. | **DONE (alt)** |
| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | Original framing: external, multi-replica outpost with Redis-backed sessions. **Resolved 2026-05-10** by enabling the postgres-backed session store on the embedded outpost itself (PR goauthentik/authentik#16628). Sessions now persist across pod restarts; the original "in-memory state" concern is moot. Multi-replica still requires a goauthentik upstream fix (PgBouncer-friendly session migration), but the loss-of-state class of failures is gone. | **DONE (alt)** |
## Lessons Learned

View file

@ -0,0 +1,164 @@
# Post-Mortem: kured Reboots Silently Stalled for 6 Days + Anubis HA Lift
| Field | Value |
|-------|-------|
| **Date** | 2026-05-16 |
| **Duration** | 6 days of unbooted pending-reboot packages (2026-05-10 → 2026-05-16) |
| **Severity** | SEV3 — no user-facing impact; latent risk (kernel/libc CVEs queued, not landing) |
| **Affected Services** | None directly; OS-reboot pipeline halted on all 5 K8s nodes |
| **Status** | Root cause fixed (kured Helm value), defensive defaults added (Anubis HA, kured drain-timeout, CNPG 3 instances) |
## Summary
After unattended-upgrades was re-enabled on the K8s nodes on 2026-05-10,
kured was supposed to drive rolling node reboots within the MonFri
02:0006:00 London window. Instead, kured logged "Reboot not required"
every hour for six straight days while the `kured-sentinel-gate`
DaemonSet on every host happily reported "ALL CHECKS PASSED — creating
/var/run/gated-reboot-required". The gate WAS open. kured was looking
in the wrong place.
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. The stack set
`rebootSentinel = "/sentinel/gated-reboot-required"` — which pointed
the chart at hostPath `/sentinel/` (an empty auto-created directory).
The sentinel-gate writes to `/var/run/gated-reboot-required` on the
host. Two different host directories. kured silently skipped reboots
for six days.
Found on 2026-05-16 while auditing why "automatic upgrades aren't
happening" alongside the K8s version-upgrade Job-chain (PM
2026-05-11). Fixed in one commit; took the opportunity to also
eliminate three latent drain-time hazards (Anubis single-replica PDB
deadlock, kured unbounded drain timeout, CNPG-only-2-instances).
## Impact
- **User-facing**: None. Existing kernels, libc, and userspace kept running. CVEs queued in `/var/run/reboot-required.pkgs` on every node but were never exploited.
- **Backlog**: All 5 nodes accumulated `linux-image-*` + `libc6` queued for reboot. Largest gap was master at ~6 days. Workers also 56 days.
- **Detection gap**: kured exposes no Prometheus signal for "I checked but said no". The hourly "Reboot not required" line in stdout is the only trace, and nobody was tailing it. The architecture had two layers (sentinel-gate gate + kured sentinel check) but no verification that the two layers were looking at the same path.
- **Side discovery**: 8 Anubis instances would have stalled drain anyway via single-replica + `PDB minAvailable=1` (the same trap that stalled the manual K8s upgrade on 2026-05-11). Even if the kured path bug were fixed in isolation, Monday's first reboot would have hit the Anubis trap and idled forever (kured default `--drain-timeout=0` = unlimited).
## Timeline (UTC)
| Time | Event |
|------|-------|
| **Mar 16 21:26** | kured-sentinel-gate DaemonSet introduced after the 26h overlayfs cascade outage. Original sentinel cool-down 30m. |
| **May 10 ~16:57** | Last successful kured pod restart picked up new Helm values. `rebootSentinel = "/sentinel/gated-reboot-required"`. Same commit re-enabled unattended-upgrades in cloud_init and stretched the sentinel cool-down 30m → 24h. |
| **May 10 ~17:00 → May 15 06:16** | unattended-upgrades on every node successfully installs kernel + libc patches, writes `/var/run/reboot-required`. |
| **May 1015** | sentinel-gate Check 14 all pass every 5 min on every host. Touches `/var/run/gated-reboot-required`. Logs "ALL CHECKS PASSED". |
| **May 1015** | kured polls `/sentinel/gated-reboot-required` (empty dir, file does not exist). Returns "Reboot not required" every hour. No reboots happen. |
| **May 11 20:4021:00** | Separate K8s-version-upgrade incident (master upgraded to v1.34.7, workers stalled mid-rollout because the upgrade agent drained its own host). Manual recovery 5/115/12. **kured stall noticed but not investigated**: cluster healthy, K8sVersionSkew firing was tracked as the urgent issue. |
| **May 11 22:47 → May 12 00:01** | Manual worker drains hit the Anubis single-replica PDB trap (drain loops). Resolved by direct-deleting Anubis pods to bypass eviction API. This was the first signal that single-replica `minAvailable=1` patterns deadlock drains. |
| **May 16 10:56 UTC** | While auditing "what runs the upgrades" for the user, the kured + sentinel-gate log/path mismatch became visible. |
| **May 16 11:13 UTC** | `stacks/kured/main.tf`: `rebootSentinel = "/sentinel/..."``"/var/run/gated-reboot-required"`. Re-init, plan, apply. |
| **May 16 11:14 UTC** | kured DaemonSet rolls out the new spec. Volume hostPath becomes `/var/run`. kured pod can now see `/sentinel/reboot-required` (32B, from uu) AND `/sentinel/gated-reboot-required` (0B, from gate). Confirmed via `kubectl exec` listing. |
| **May 16 11:44 UTC** | Anubis HA module change deployed: `shared_store_url` variable → `store: { backend: valkey }` block appended to policy YAML, default replicas 2, PDB `maxUnavailable=1`, topology `DoNotSchedule`. Cyberchef applied as canary. Confirmed: Redis DB 5 starts receiving challenge state. |
| **May 16 11:4811:53 UTC** | Remaining 7 Anubis stacks applied (DBs 612). 8/8 deployments at 2/2 Ready, replicas spread on different nodes. Smoke-tested 6 of 8 public URLs return 200. |
| **May 16 12:05 UTC** | kured `drainTimeout: "30m"` added + applied. pg-cluster bumped from 2 → 3 instances. |
| **May 16 12:11 UTC** | pg-cluster phase = "Cluster in healthy state", 3/3 ready. |
## Root Cause
The Helm chart `kured-5.11.0` computes:
```
{{- $sentinel_dir := dir .Values.configuration.rebootSentinel -}}
# template renders both volume mount and hostPath using $sentinel_dir
```
So `rebootSentinel` is doubly-purposed: it's both the **CLI arg path inside
the pod** AND the **hostPath on the node**. Setting it to `/sentinel/...`
caused:
- pod arg: `--reboot-sentinel=/sentinel/gated-reboot-required` (looks at `/sentinel/` inside the pod)
- hostPath: `/sentinel/` (auto-created empty directory by `type: Directory`)
- mountPath inside pod: `/sentinel/` (mapped from hostPath above)
Meanwhile the gate DaemonSet was configured with hostPath `/var/run`
mountPath `/host/var-run`, and wrote `gated-reboot-required` to its local
`/host/var-run/` which became the host's `/var/run/gated-reboot-required`.
The two daemons never touched the same directory.
**Why this was hard to spot**:
1. Both layers logged success: sentinel-gate said "ALL CHECKS PASSED", kured said "Reboot not required". Neither claimed an error.
2. No Prometheus alert exists for "kured polled, gate is open, kured still didn't act". The Upgrade Gates alert group catches firing-alert-during-rollout, not silently-skipped-rollout.
3. The Helm chart's auto-derivation of hostPath from a config value is undocumented surprising behavior. The mental model is "rebootSentinel is just the in-pod path"; the hostPath co-mutation is invisible.
## Remediation
### Primary fix
- `stacks/kured/main.tf`: `rebootSentinel = "/var/run/gated-reboot-required"`. Both the chart-derived hostPath and the kured CLI arg now align with where the gate writes.
### Defensive companion changes (same session)
| Change | Purpose | Stack |
|---|---|---|
| `drainTimeout = "30m"` on kured | Fail closed instead of looping forever if a future PDB or finalizer stalls drain. Node stays Schedulable (no silent capacity loss). | `stacks/kured/main.tf` |
| Anubis: shared-state Valkey/Redis backend | Eliminate the single-replica drain deadlock + provide real HA. PDB changed `minAvailable=1``maxUnavailable=1`. Replicas 1 → 2 with `topologySpreadConstraint: DoNotSchedule`. | `modules/kubernetes/anubis_instance/main.tf` + 8 callers |
| pg-cluster: 2 → 3 instances | Failover during primary's node drain no longer depends on the lone replica being caught up. CNPG always has a fully-current candidate. | `stacks/dbaas/modules/dbaas/main.tf` |
| Orphan `mysql-standalone` PDB deleted | Helm-stamped leftover (selector required 4 labels, pod has 3 → matched 0 pods). Was dead code; deletion is safe. | `kubectl` (not TF-managed) |
### Verified post-fix
- `kubectl -n kured exec deploy/kured -- ls /sentinel/` lists both `reboot-required` and `gated-reboot-required` on every node.
- 8 Anubis Deployments at 2/2 Ready; pods spread across different nodes (verified via `kubectl get pods -o wide`).
- Redis DBs 5, 7, 8, 10 receiving challenge state from real public traffic post-apply (Palo Alto Networks scanner hit blog).
- pg-cluster 3/3 healthy, phase = "Cluster in healthy state".
- kured args show `--drain-timeout=30m`.
## Lessons
1. **Auto-derivation in Helm charts is invisible drift surface.** The chart's
habit of deriving hostPath from a CLI-arg-shaped value is the kind of
"convenient default" that hides during normal review. Mitigation:
pin `hostFilePath` explicitly in `configuration` so the host path is
declared, not derived. (Did not do this in the fix because the
single-config approach is now correct; flagging as future improvement.)
2. **"Silently skipped" needs a Prometheus signal.** The Upgrade Gates
alerts cover "rollout in progress + something went wrong". They don't
cover "we haven't rolled in 7 days when we should have". Suggested:
add `KuredRebootBacklog` — fires when `kured_reboot_required ==
1` (kured exposes this) for more than 24h continuously. The kured
chart already serves `/metrics`; just needs a rule. (Deferred.)
3. **Single-replica `PDB: minAvailable=1` is a deadlock pattern.** It
reads as "protect this pod" but actually means "block all voluntary
disruption forever". Manifested in 9 places (8 Anubis + mysql-standalone
with broken selector). The Anubis fix is now in place via shared-store
replicas=2; the `mysql-standalone` selector was already broken so it
matched 0 pods (and was deleted as cruft). Worth auditing the cluster
periodically for any new pattern of the same shape.
4. **k8s-node1 containerd source drift** (Ubuntu archive's `containerd`
vs Docker's `containerd.io`) is benign but should be documented.
Audited during this session: not a blocker for kured because both
variants are in the Package-Blacklist and both are apt-held. The
version skew with master (1.6.22 vs 1.7.24/1.7.27) is what the
K8s version-upgrade Stage 3 "containerd bump" exists to fix.
5. **CNPG drain handling at 2 replicas is fragile.** Switchover works
but the lone replica must be caught up; in practice this means
on a busy cluster, a primary-node drain could stall for tens of
seconds while CNPG promotes. 3 instances eliminates this. Worth
considering for every long-running multi-instance stateful workload.
## Detection / Prevention Followups
- [ ] `KuredRebootBacklog` Prometheus alert. Spec: `kured_reboot_required == 1 and (time() - timestamp(kured_reboot_required)) > 86400`.
- [ ] Add a `hostFilePath` value to the kured Helm release for explicit declaration (current setup is correct but undocumented).
- [ ] Audit periodically for new single-replica + `minAvailable=1` PDB patterns (could be a Kyverno warn policy).
- [ ] Phase 4: clean up the InnoDB Cluster CR + remaining `mysql-cluster-pdb` once the bitnami legacy is fully decommissioned.
## File pointers
| What | Where | Commit |
|---|---|---|
| kured sentinel path fix | `infra/stacks/kured/main.tf` | c17d87e1 |
| Anubis HA (module + 8 callers) | `infra/modules/kubernetes/anubis_instance/` + 8 `stacks/<app>/main.tf` | 6e920f96 |
| kured drainTimeout + CNPG 3-replica | `infra/stacks/kured/main.tf` + `infra/stacks/dbaas/modules/dbaas/main.tf` | a726e963 |
| K8s version-upgrade Job-chain (related context) | `infra/stacks/k8s-version-upgrade/` | 01bc16d5 (5/11) |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` | (updated 5/11) |
| Runbook | `infra/docs/runbooks/k8s-version-upgrade.md` | (updated 5/11) |
| Deprecated agent prompt (self-preemption history) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` | 01bc16d5 |

View file

@ -0,0 +1,160 @@
# Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1
**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services)
**Beads:** `code-8vr0` (P1)
**Status:** Blocked on upstream — NVIDIA has not published Ubuntu 26.04 driver images yet
## Summary
`nvidia-driver-daemonset-sg22g` on k8s-node1 went into CrashLoopBackOff
with 76+ restarts. Root cause: k8s-node1 was upgraded to **Ubuntu 26.04
LTS (Resolute Raccoon)** at some point, putting the running kernel at
`7.0.0-15-generic`. The NVIDIA driver daemonset's installer container
runs `apt-get install linux-headers-<kernel>` against Ubuntu 24.04's
noble repositories (the container's base OS), which don't carry
`linux-headers-7.0.0-15-generic`, so the build aborts with:
Could not resolve Linux kernel version
Attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08
and `kernelModuleType: open`) succeeded at the chart level but produced
a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD
and constructs the image tag `<version>-ubuntu26.04`, which 404s on
pull. `skopeo list-tags docker://nvcr.io/nvidia/driver` confirms zero
ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).
Rolled the chart back to v25.10.1 (pinned in TF) to restore the closest-
to-working state pending an upstream fix or kernel rollback.
## Impact
- GPU resource `nvidia.com/gpu` = 0 on k8s-node1 (only GPU node)
- All GPU-bound workloads Pending or 0/N Ready:
- `frigate/frigate`
- `immich/immich-machine-learning`
- `llama-cpp/llama-swap`
- `nvidia/nvidia-exporter`
- `ytdlp/yt-highlights`
- Downstream alerts firing: `NvidiaExporterDown`, 5× Uptime Kuma monitors
(Frigate, Immich ML, nvidia-exporter, …), `GPUNodeUnschedulable` not
firing (node is schedulable, just no GPU advertised)
- No data loss; no user-facing service degradation outside the GPU stack
## Timeline (Europe/Sofia, UTC+3)
- pre-incident — `apt-get dist-upgrade` (or `do-release-upgrade`) bumped
k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture
the upgrade (rotated by `do-release-upgrade`).
- ~2026-05-11 — node rebooted into kernel `7.0.0-15-generic`. NFD
reports `system-os_release.VERSION_ID = 26.04`,
`kernel-version.full = 7.0.0-15-generic`.
- 2026-05-17 04:00 (approx) — driver daemonset enters CrashLoopBackOff
on every kubelet restart cycle. Error: "Could not resolve Linux kernel
version".
- 2026-05-17 13:35 — chart upgrade attempt v25.10.1 → v26.3.1, driver
570.195.03 → 580.105.08, `kernelModuleType: open`. Helm applies
cleanly but driver pod ImagePullBackOff on
`driver:580.105.08-ubuntu26.04`.
- 2026-05-17 ~13:45 — skopeo confirms zero ubuntu26.04 tags on
nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document
the gotcha, file the kernel rollback as the next step.
## Root Causes
1. **Host OS upgraded to Ubuntu 26.04** ahead of NVIDIA's driver image
support window. NVIDIA typically lags new Ubuntu LTS releases by
weeks-to-months on the driver-container front.
2. **gpu-operator chart was not pinned** prior to today. The TF
`helm_release` had `version` commented out, so any apply could
re-resolve to the latest chart and follow its OS-auto-detection
logic. With v25.10.1, the operator fell back to ubuntu24.04 image
suffix (which pulls successfully but fails to compile against kernel
7.0). With v26.3.1, the operator picks the correct (per-NFD)
ubuntu26.04 suffix — which doesn't exist.
3. **No alert for "GPU device count = 0 on a GPU node"** — the cluster
had 14+ hours of silent GPU outage before noticing. `NvidiaExporterDown`
fires only when the metrics exporter itself stops scraping, not when
the operator's driver pod is unhealthy.
## What We Changed in This Session
- `stacks/nvidia/modules/nvidia/main.tf` — pinned
`helm_release.nvidia-gpu-operator.version = "v25.10.1"` so future
applies don't surprise us with v26.3.1's stricter OS detection.
- `stacks/nvidia/modules/nvidia/values.yaml` — comment block explaining
the situation; driver version stays at `570.195.03` as the last-known
config that produced a pullable image.
- `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
this file.
## What We Did NOT Do (Pending User Decision)
- **Roll back the host kernel** on k8s-node1 from `7.0.0-15-generic`
to `6.8.0-117-generic`. The 6.8 kernel is still installed at
`/lib/modules/6.8.0-117-generic` and the matching headers at
`/usr/src/linux-headers-6.8.0-117-generic`, so GRUB can boot it and
the driver image's apt sources (Ubuntu 24.04 noble) carry
`linux-headers-6.8.0-117-generic`. This would require draining the
node, editing GRUB defaults, `apt-mark hold` to prevent future drift,
and rebooting — needs explicit user OK.
- **Add a probe + alert** for `nvidia.com/gpu` resource count on the
GPU node. Should fire within 10 minutes of the operator failing to
publish the resource, regardless of which sub-pod failed.
## Recovery Procedure (next time)
### If the driver-installer fails with "Could not resolve Linux kernel version"
1. Identify the running kernel: `uname -r` on the affected node.
2. Check whether NVIDIA ships an image for that kernel/distro combo:
docker run --rm quay.io/skopeo/stable list-tags \
docker://nvcr.io/nvidia/driver \
| python3 -c "import json,sys; d=json.load(sys.stdin); \
print([t for t in d['Tags'] if '<distro>' in t][:5])"
3. If yes, point the chart at the right version + ensure NFD reports
the matching OS.
4. If no (and a kernel rollback is acceptable):
- `kubectl cordon <node>` then `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`
- `nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub`
- `nsenter -t 1 -m -p -u update-grub`
- `nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic`
- Reboot: `nsenter -t 1 -m -p -u systemctl reboot`
- After boot: `kubectl uncordon <node>` and wait for the GPU
daemonset to come Ready
## Action Items
- [x] Pin gpu-operator chart to v25.10.1 in TF
- [x] Document situation in this post-mortem
- [ ] Roll back k8s-node1 host kernel to 6.8.0-117-generic + apt-mark
hold (needs user authorization for node reboot)
- [ ] Add Prometheus alert `GPUNodeNoGPUResource` — fires when a node
labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
of 0 for >10m
- [ ] Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver
tags — file as a quarterly checkup once we see the first 26.04
tag, unpin the chart and revert this post-mortem's mitigation
- [ ] Audit ALL host packages with `apt-mark hold` semantics. The
memory of the March 2026 outage says we disabled
`unattended-upgrades``do-release-upgrade` is a separate path
that should be gated too
## Lessons
- **Operator-style charts that auto-detect host OS can silently break
when the host fleet leapfrogs upstream image support.** Pin the chart
version + driver version, and treat upstream support gaps as a hard
blocker rather than a guaranteed-to-resolve race condition.
- **Drain-and-revert host kernel is the right escape hatch when
upstream image lags.** Make sure the previous kernel and its headers
stay installed (don't aggressively purge old kernels in apt
autoremove).
- **NFD labels are authoritative for the operator's image-tag
construction.** If you need to lie about OS version (e.g., to force a
24.04 image on a 26.04 host), edit the NFD label — but only as a last
resort; the chart upgrade made clear the operator will eventually
reconcile this.

View file

@ -0,0 +1,133 @@
# Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI
**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (1 of 5 CSI node DaemonSet pods stuck CrashLoopBackOff; controller pair flapping)
**Duration:** ~2 hours from first detection to all-green
## Summary
The Keel auto-update operator polled the `csi-driver-nfs` Helm chart and rolled
`v4.13.1 → v4.13.2`. The new chart's controller Deployment scheduled both
replicas onto `k8s-master` (no built-in control-plane exclusion). Both replicas
used `hostNetwork: true` and tried to bind the same host ports
(`19809` for `node-driver-registrar`, `29653` for `liveness-probe`), so one
controller pod CrashLoopBackOff'd with `bind: address already in use`. The
upgrade also left behind multiple orphan controller pods in containerd that
kubelet could no longer reconcile — they held the host ports even after the
helm rollback removed them from K8s state.
The `csi-nfs-node` DaemonSet pod on master then could not start either: its
own `node-driver-registrar` and `liveness-probe` containers tried to bind
the same host ports and lost to the zombies.
## Impact
- 1× `csi-nfs-node` pod on `k8s-master` stuck CrashLoopBackOff (16+ restarts)
- CSI plugin unregistered on master → no NFS volumes could be mounted on
master-hosted pods (calico-typha cert mount failed, etcd backup CronJob
failed)
- Controller flap (2 replicas fighting) → intermittent
`csi-resizer`/`csi-snapshotter` failure for the whole cluster
- Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter,
csi-node-driver (Calico) all bounced on master while kubelet thrashed
No data loss; no production-facing outages observed (CSI mounts on the four
worker nodes kept working).
## Timeline (Europe/Sofia, UTC+3)
- ~07:46 — Keel polls forgejo + DockerHub manifests, sees a new digest under
the `csi-driver-nfs` `4.13.x` channel, triggers Helm upgrade
- 07:46:16 — `helm upgrade csi-driver-nfs` runs; new controller Deployment
scheduled (no `affinity` block → both replicas land on `k8s-master`)
- ~07:50 — Controller replicas fight for ports `19809`, `29653`; one stays in
CrashLoopBackOff
- ~08:00 — User notices "CSI issue ... due to the upgrade"; investigation
begins
- 08:15 — `helm rollback csi-driver-nfs` to revision 8 (v4.13.1) — controllers
on master deleted via K8s, but containerd retains them as live sandboxes
- 08:30 — Live `podAntiAffinity` + `nodeAffinity: control-plane DoesNotExist`
added to the controller Deployment via patch (controllers now correctly
schedule on node1+node3)
- 08:40 — `csi-nfs-node` master pod still CrashLoopBackOff; ports 19809/29653
held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944,
plus 5× csi-provisioner from zombie controller pods)
- 09:00 — Privileged pkill via `hostPID: true` pod failed
(`permission denied` from runc — containerd refused to signal init in the
zombie containers)
- 09:03 — `nsenter -t 1 -m -p -u systemctl restart kubelet` on master cleared
the orphan containers via cgroup GC; ports freed
- 09:04 — `csi-nfs-node` master pod reaches 3/3 Ready; cluster green
- 09:09 — Terraform `apply`: pin `helm_release.version = "4.13.1"`, add
`controller.affinity` to values
## Root Causes
1. **`csi-driver-nfs` Helm chart in TF was unpinned.** The `helm_release` had
no `version = ...` field, so it floated to whatever the chart repo
advertised. Keel polled this and rolled forward.
2. **Chart v4.13.2 dropped the implicit control-plane exclusion** that v4.13.1
shipped with. Without it, the K8s scheduler chose master for both
controller replicas.
3. **Two controller replicas + hostNetwork = port conflict on the same node.**
The chart did not add `podAntiAffinity` between the replicas. Live state
has it now; TF now does too.
4. **Helm rollback does not always clean containerd sandboxes.** When the
prior revision's pods are abandoned mid-flight (image-pull-pending, etc.),
containerd can keep multiple sandbox instances for the same pod-UID.
Kubelet GC is the only thing that reliably reaps these — restarting it
forces a reconciliation pass that drops orphans.
## What We Fixed
- **`stacks/nfs-csi/modules/nfs-csi/main.tf`** (this commit):
- `version = "4.13.1"` pin on the `helm_release` (defense in depth — namespace
is already excluded from Kyverno-Keel injection, but the chart could still
drift on a `terraform apply` without a pin)
- `controller.affinity` block with `podAntiAffinity` (different hosts for
replicas) and `nodeAffinity` (exclude `node-role.kubernetes.io/control-plane`)
- Inline comments explaining both decisions
- **Kyverno keel-annotations**: `nfs-csi` was already in the namespace exclude
list (decision from authentik incident 2026-05-17). Verified still there
in `stacks/kyverno/modules/kyverno/keel-annotations.tf:91`.
## Recovery Procedure (next time)
If `csi-nfs-node` on a node CrashLoopBackOff with `bind: address already in use`:
1. **Find which host ports are bound**`lsof -i :19809`, `lsof -i :29653`
(from a privileged hostPID pod on the affected node).
2. **Try `crictl rmp -f <pod-id>`** on zombie pods (those K8s no longer
tracks). Will fail with `unable to signal init: permission denied` if
the containers are sufficiently stuck.
3. **Restart kubelet on the affected node** via `nsenter -t 1 -m -p -u
systemctl restart kubelet` (privileged hostPID pod). Kubelet's GC
reconciles containerd state and reaps the orphans.
4. **Force-delete the DaemonSet pod** to clear the back-off
(`kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0`).
DaemonSet recreates it; with the ports free, containers start cleanly.
## Action Items
- [x] Pin `csi-driver-nfs` chart version in TF
- [x] Add `controller.affinity` to TF (podAntiAffinity + control-plane exclude)
- [x] Document recovery procedure (this post-mortem)
- [ ] Audit other unpinned `helm_release` blocks — every chart used in
Kyverno-excluded namespaces should still be pinned to prevent
`terraform apply` drift. (Filed as follow-up — not blocking.)
- [ ] Consider adding a `kured` or daily script that detects orphan
containerd sandboxes whose pod-UID is unknown to the apiserver and
reaps them automatically. (Filed as follow-up — not blocking.)
## Lessons
- **Keel exclusion ≠ chart pin.** The namespace was already excluded from
Keel injection, but the helm_release was unpinned — so a `terraform apply`
alone could re-trigger the same break. Both layers needed locking down.
- **`crictl rmp -f` is not always sufficient.** When containerd refuses to
signal init, kubelet restart is the next escalation step before SSH/reboot.
- **The Keel rollout phase 2-6 design ASSUMED stateful operators were
excluded.** CSI was correctly excluded — but the chart version itself was
still a moving target via plain `terraform apply`. The exclude-list catches
Keel; the version pin catches everything else.

View file

@ -0,0 +1,207 @@
# K8s Node Auto-Upgrades
## Overview
OS-level package upgrades on the 5 K8s VMs (master + 4 workers) are driven by `unattended-upgrades` and rebooted by `kured`, with multiple safety gates layered on top to prevent the failure mode that caused the March 2026 26h cluster outage.
## Architecture
```
apt-daily.timer (random within window)
│ apt-get update
apt-daily-upgrade.timer (random within window)
│ unattended-upgrades runs
│ - Allowed-Origins: -security, -updates, ESM
│ - Package-Blacklist: containerd*, runc, calico-*, cni-plugins-*, docker-ce
│ - apt-mark hold on kubelet, kubeadm, kubectl, containerd*, runc
│ - Automatic-Reboot=false (kured handles reboots)
▼ if kernel/glibc/systemd updated
/var/run/reboot-required appears on the host
▼ (sentinel-gate DaemonSet polls every 5min)
kured-sentinel-gate checks:
├── 1. Host has /var/run/reboot-required
├── 2. ALL nodes Ready
├── 3. ALL calico-node pods Running
└── 4. NO node Ready-transition in last 24h (soak window)
▼ all pass
touch /var/run/gated-reboot-required
▼ (kured polls every 1h within 02:00-06:00 London, any day of the week)
kured checks Prometheus before draining:
│ http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts
│ ANY firing alert (except ignore-list) blocks the drain
│ Ignore-list: ^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$
▼ no blockers
kured drains the node (priority-ordered, 310s budget)
kured runs /bin/systemctl reboot
▼ node returns
kured uncordons + posts Slack notification (configuration.notifyUrl)
▼ 24h cool-down begins (sentinel-gate Check 4)
```
## Components
### unattended-upgrades (in-guest)
- **Config**: `/etc/apt/apt.conf.d/52unattended-upgrades-k8s` + `/etc/apt/apt.conf.d/20auto-upgrades`
- **Source of truth**: `infra/modules/create-template-vm/cloud_init.yaml` (lines for `is_k8s_template`)
- **Day-2 push**: SSH-based — see "Restore / re-apply config" below
### kured (Helm release)
- **Stack**: `infra/stacks/kured/main.tf`
- **Helm chart**: `kured-5.11.0` (image `ghcr.io/kubereboot/kured:1.21.0`)
- **Window**: 02:00-06:00 Europe/London, every day of the week (was Mon-Fri until 2026-05-16), period=1h, concurrency=1
- **Sentinel**: `/sentinel/gated-reboot-required` (created by sentinel-gate DaemonSet)
- **Slack hook**: Vault `secret/kured``slack_kured_webhook`
### kured-sentinel-gate (DaemonSet)
- **Source**: `kubernetes_daemon_set_v1.kured_sentinel_gate` in `infra/stacks/kured/main.tf` (lines ~120-260)
- **Image**: `bitnami/kubectl:latest`
- **Loop period**: every 300s
- **Gate logic**: 4 checks — see Architecture diagram
### Upgrade Gates Prometheus alerts
- **Source**: `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` group `Upgrade Gates`
- **10 alerts**: KubeAPIServerDown, KubeStateMetricsDown, PrometheusRuleEvaluationFailing, PVCStuckPending, RecentNodeReboot, MysqlStandaloneDown, ClusterPodReadyRatioDropped, NodeMemoryPressure, NodeDiskPressure, KubeQuotaAlmostFull
- **Effect**: kured `--prometheus-url` polls Prometheus before each drain — any non-ignored firing alert halts the rollout
## Common Operations
### Verify the system is healthy
```bash
# kured pods + sentinel-gate Running on all 5 nodes
kubectl -n kured get pods
# kured can reach Prometheus
kubectl -n kured exec ds/kured -- /usr/bin/kured --help | grep prometheus
# Upgrade Gates rules loaded + state
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
# Per-node unattended-upgrades status
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
echo "=== $n ==="
ssh $n "systemctl is-active unattended-upgrades; apt list --upgradable 2>/dev/null | wc -l"
done
```
### Halt rollout in an emergency
```bash
# Option 1: scale kured to 0 (most decisive)
kubectl -n kured scale ds kured --replicas=0
# When ready: kubectl -n kured scale ds kured --replicas=5
# Option 2: silence the gate via Alertmanager (allows kured to retry once silence expires)
# Use Alertmanager UI at https://prometheus.viktorbarzin.me/alertmanager/
```
### Force halt by adding a custom blocker alert
- Add a PrometheusRule expression that's always-1 (e.g. `vector(1)`) to the `Upgrade Gates` group temporarily.
- Apply, wait for sync (~120s), kured will block on the next poll.
- Remove when ready.
### Pause apt upgrades on a single node
```bash
ssh <node> sudo systemctl stop unattended-upgrades
ssh <node> sudo systemctl disable unattended-upgrades
# Re-enable when ready:
ssh <node> sudo systemctl enable --now unattended-upgrades
```
### Restore / re-apply unattended-upgrades config to existing nodes
Cloud-init only runs on first boot. To bring existing nodes into compliance with the IaC:
```bash
# Per node — installs uu, drops apt config, holds k8s/runtime packages, enables service
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
ssh $n sudo bash -s <<'EOF'
set -e
systemctl unmask unattended-upgrades 2>/dev/null || true
DEBIAN_FRONTEND=noninteractive apt-get install -y unattended-upgrades update-notifier-common
cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'CONF'
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}";
"${distro_id}:${distro_codename}-security";
"${distro_id}:${distro_codename}-updates";
"${distro_id}ESMApps:${distro_codename}-apps-security";
"${distro_id}ESM:${distro_codename}-infra-security";
};
Unattended-Upgrade::Package-Blacklist {
"^containerd(\.io)?$";
"^runc$";
"^cri-tools$";
"^kubernetes-cni$";
"^calico-.*";
"^cni-plugins-.*";
"^docker-ce$";
};
Unattended-Upgrade::DevRelease "false";
Unattended-Upgrade::Automatic-Reboot "false";
CONF
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'CONF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
CONF
apt-mark hold kubelet kubeadm kubectl
apt-mark hold containerd containerd.io runc 2>/dev/null || true
systemctl enable --now unattended-upgrades
EOF
done
```
### Roll back a bad apt upgrade
1. Identify the package(s) that broke things from `/var/log/apt/history.log` on the affected node.
2. Hold them: `sudo apt-mark hold <pkg>`.
3. Downgrade: `sudo apt-get install -y --allow-downgrades <pkg>=<previous-version>` (find versions via `apt-cache madison <pkg>`).
4. Reboot the node manually if the package needs it.
5. Add the package to the `Unattended-Upgrade::Package-Blacklist` in `cloud_init.yaml` AND drop the holds via the SSH push above so future apt runs skip it.
### kured halted — investigate which alert is blocking
```bash
# Show kured logs — it logs "blocking alerts" when halting
kubectl -n kured logs ds/kured --tail=100 | grep -i alert
# List currently firing alerts (any of these blocks kured):
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/alerts' | \
jq -r '.data.alerts[] | select(.state == "firing") | " \(.labels.alertname) (\(.labels.severity // "info"))"' | sort -u
```
The alert is either:
- One of the 10 `Upgrade Gates` (genuine cluster-health issue — fix it),
- A pre-existing alert (any of the ~211 in the library — investigate),
- Or `RecentNodeReboot` — expected for 24h after each node reboot. This is the soak window.
### Verify the 24h soak is enforcing
```bash
# Sentinel-gate logs Check 4 outcome
kubectl -n kured logs ds/kured-sentinel-gate --tail=20 | grep -E "soak|cool-down|24"
# kured won't drain another node until the most recent Ready-transition is >24h ago.
# If you need to override (e.g. emergency security patch), shorten the cool-down by
# editing infra/stacks/kured/main.tf (sentinel script: 86400 → smaller) and applying.
```
## Past Incidents
- **2026-03-16 SEV-1**: Kured + Containerd Cascade Outage (26h). See `docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html`. Root cause: unattended-upgrades pushed a kernel update → kured rebooted nodes → containerd's overlayfs snapshotter corrupted → image pulls failed → calico broke → cascading outage. Remediations now baked into this system: 24h soak, Prometheus halt-on-alert, Package-Blacklist for runtime components, sentinel-gate health checks.
## File Pointers
| What | Where |
|------|-------|
| kured Helm + sentinel-gate | `infra/stacks/kured/main.tf` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Cloud-init for new nodes | `infra/modules/create-template-vm/cloud_init.yaml` |
| Slack webhook | Vault `secret/kured``slack_kured_webhook` |
| Post-mortem | `infra/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (OS section) |

View file

@ -0,0 +1,323 @@
# K8s Version Upgrade Pipeline
## Overview
Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
VMs are upgraded automatically by a weekly detection CronJob that seeds a
chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
drain target** — so no pod in the chain can preempt itself.
The chain (Sun 12:00 UTC weekly):
```
detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
```
This is **independent** of the OS-side `unattended-upgrades + kured`
pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
Schedules can overlap (kured runs daily 02:00-06:00 London; detection
here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
group blocks the version-upgrade preflight, so the chain self-defers
to the next Sunday rather than rolling on top of a half-fresh node.
## Architecture
```
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
│ kubectl get nodes → running version
│ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
│ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
│ push k8s_upgrade_available{kind,running,target} → Pushgateway
▼ if a target is detected
envsubst on /template/job-template.yaml | kubectl apply -f -
│ creates k8s-upgrade-preflight-<target_version>
Job 0 — preflight (pinned: k8s-node1)
├── All nodes Ready + no Mem/Disk pressure
├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers)
├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor)
└── spawn_next → k8s-upgrade-master-<target_version>
Job 1 — master upgrade (pinned: k8s-node1)
├── halt-on-alert recheck (no firing alerts)
├── drain k8s-master (predrain_unstick deletes PDB-blocked pods)
├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z
├── kubectl uncordon k8s-master; wait Ready + version match
├── verify control-plane pods Running
├── halt-on-alert recheck (allows RecentNodeReboot)
└── spawn_next → k8s-upgrade-worker-<v>-k8s-node4
Job 2 — worker k8s-node4 (pinned: k8s-node1)
Job 3 — worker k8s-node3 (pinned: k8s-node1)
Job 4 — worker k8s-node2 (pinned: k8s-node1)
(identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next)
Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration)
└── spawn_next → k8s-upgrade-postflight-<target_version>
Job 6 — postflight (no pinning)
├── Verify all 5 nodes at target version
├── Verify no firing Upgrade Gates alerts
├── Compute pod-ready ratio (should be ≥ 0.9)
├── Clear k8s-upgrade-* annotations on namespace
├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0
└── Slack: ✅ K8s upgrade complete
```
**Pin choices summarised:**
- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
itself is upgraded **last**.
- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
- If anyone reorders the worker sequence, the pin for Job 5 needs to track
whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
→ the `case "${PHASE}:${TARGET_NODE:-}"` block.
## Components
### Shared resources (one-time, Terraform-managed)
| Resource | Purpose |
|---|---|
| **ConfigMap `k8s-upgrade-scripts`** | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body — same script the old manual loop used) in every Job pod. |
| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
### Pushgateway metrics
Pushed by upgrade-step.sh during phase execution; observed by the
`Upgrade Gates` alert group in `stacks/monitoring/.../prometheus_chart_values.tpl`:
| Metric | Pushed by | Cleared by |
|---|---|---|
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
### Upgrade Gates alerts (`Upgrade Gates` group in prometheus_chart_values.tpl)
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
### Vault secrets
- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys`
- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL
Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE — the chain does not POST to `claude-agent-service`.
## Common Operations
### Verify the pipeline is healthy
```bash
# CronJob present + not suspended
kubectl -n k8s-upgrade get cronjob k8s-version-check
# Latest detection run output
kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200
# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished)
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
# Pushgateway — running detection metric
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \
grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)'
# Upgrade Gates rules loaded
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
```
### Manually trigger detection (no upgrade)
Use `detection_dry_run=true` to short-circuit before spawning Job 0:
```bash
# Toggle var in TF, apply, and trigger
# (in stacks/k8s-version-upgrade/main.tf)
# variable "detection_dry_run" { default = true }
# scripts/tg apply
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
# When done, flip back to false.
```
### Manually trigger the chain (skip detection)
Useful for testing or to force a specific target. Render Job 0 directly:
```bash
TARGET=1.34.7
KIND=patch
IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \
-o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}')
cat <<EOF | envsubst | kubectl apply -f -
$(kubectl -n k8s-upgrade get cm k8s-upgrade-job-template -o jsonpath='{.data.job-template\.yaml}')
EOF
# Note: export JOB_NAME, PHASE_NEXT, etc. first — see the CronJob's command for
# the full env block. Easier: just trigger detection with the right inputs.
```
### Kill a stuck Job (chain halted mid-flight)
The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled`
fires after 90 min. Recovery:
```bash
# 1. Identify the failed Job
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
kubectl -n k8s-upgrade describe job/<failed-job-name> | tail -50
kubectl -n k8s-upgrade logs job/<failed-job-name>
# 2. Diagnose. Common causes:
# - drain stuck on PDB-violating pod (predrain_unstick should handle this;
# but a brand-new PDB pattern could escape it — manually delete the pod)
# - SSH from Job pod failing (node restarted? known_hosts mismatch?)
# - kubeadm upgrade failed on a node (check journalctl + apt history on that node)
# 3. Fix the root cause first.
# 4. Delete the failed Job + re-spawn it. Naming is deterministic so
# `kubectl apply` of the same name reconciles to a single Job.
kubectl -n k8s-upgrade delete job/<failed-job-name>
# 5. Manually render + apply the same Job. Pull the template + spec from the
# next-Job-creation block in upgrade-step.sh — easiest is to copy from a
# sibling Job's YAML:
kubectl -n k8s-upgrade get job/<sibling-job-name> -o yaml \
| yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \
| yq '.metadata.name = "<failed-job-name>"' \
| yq '.spec.template.spec.containers[0].env[] | select(.name=="PHASE") .value = "<right-phase>"' \
| kubectl apply -f -
# The chain will continue from there. The next-Job-creation step in upgrade-step.sh
# is idempotent (deterministic name) so re-running won't duplicate downstream.
```
### Skip a phase (advanced; use sparingly)
If you've already done the work for a phase manually and want the chain to
jump past it, manually create the NEXT phase's Job with the deterministic
name. The previous phase's spawn-next will see the Job already exists and
short-circuit. Example: master already on target; jump straight to worker:
```bash
TARGET=1.34.7
TGT_LBL=${TARGET//./-}
# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1)
```
### Halt the pipeline in an emergency
```bash
# Option 1: suspend the detection CronJob (won't stop an in-flight chain)
kubectl -n k8s-upgrade patch cronjob k8s-version-check \
-p '{"spec":{"suspend":true}}' --type=merge
# Re-enable: -p '{"spec":{"suspend":false}}'
# Option 2: delete all in-flight chain Jobs
kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain
# This leaves the in-flight annotation + Pushgateway gauge intact —
# K8sUpgradeStalled will fire to surface the halt.
# Option 3: force a blocker alert (same regex kured uses)
# — see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
```
### Clear orphaned in-flight state
After deciding NOT to retry a halted chain:
```bash
kubectl annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path-
# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear:
kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 &
printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \
| curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade
kill %1
```
### Rollback paths
`kubeadm` does **not** support in-place downgrade. If a run fails:
#### Master broke during/after kubeadm upgrade
1. Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
2. Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
3. Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
```bash
ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
# Pre-upgrade versions are in the most recent "Commandline: apt-get install"
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get install --allow-downgrades -y \
kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet
```
#### Worker broke
1. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
2. Downgrade apt packages on that node only (see above)
3. `kubectl uncordon <node>`
4. The cluster continues running on the master + remaining workers throughout
### One-shot SSH key rotation
1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
2. Update Vault:
```bash
vault kv patch secret/k8s-upgrade \
ssh_key=@/tmp/k8s-upgrade \
ssh_key_pub=@/tmp/k8s-upgrade.pub
```
3. Push the new pubkey to every node:
```bash
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
ssh wizard@$n 'echo "$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key" >> ~/.ssh/authorized_keys'
done
```
4. ESO refreshes within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`
## Past Incidents
### 2026-05-11 — Self-preemption (agent → Job-chain rewrite)
- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself.
- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4.
- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery.
- **Mitigation**: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min).
## File Pointers
| What | Where |
|------|-------|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (K8s Version Upgrades section) |
| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |
| Deprecated agent prompt (reference) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` |

View file

@ -9,15 +9,36 @@ how to tune the rate limit, how to revoke if abused.
## Architecture
- **K8s service**: `windows-kms` in namespace `kms`, MetalLB shared LB IP
`10.0.20.200:1688`. ETP=Cluster, so client IPs in vlmcsd logs are SNAT'd
k8s node IPs (not real-world client IPs). Trade-off accepted —
preserving real client IPs would require a dedicated MetalLB IP with
ETP=Local or a PROXY-protocol bounce; vlmcsd doesn't speak PROXY-v2.
- **pfSense WAN forward**: `WAN TCP/1688 → k8s_shared_lb:1688`
(alias = `10.0.20.200`). Description: `KMS public — kms.viktorbarzin.me`.
- **Filter rule** on the WAN interface, TCP/1688, with state-table
per-source caps:
- **K8s service**: `windows-kms` in namespace `kms`, MetalLB **dedicated**
LB IP `10.0.20.202:1688`. ETP=Local, so vlmcsd sees real WAN client IPs
in its log (pfSense WAN forwards do DNAT-only, no SNAT; ETP=Local skips
the kube-proxy SNAT too). Same pattern mailserver used pre-2026-04-19.
Sharing `10.0.20.200` isn't an option — all 10 services there are
ETP=Cluster and MetalLB requires a single ETP per shared IP.
- **Native DNS auto-discovery for LAN clients**: any Windows client with
DNS suffix `viktorbarzin.lan` activates with zero config — Windows
queries `_vlmcs._tcp.viktorbarzin.lan` SRV by default, the SRV target
resolves to `vlmcs.viktorbarzin.lan``10.0.20.202`, and `slmgr /ato`
succeeds. Records:
- `_vlmcs._tcp.viktorbarzin.lan` SRV 0 0 1688 vlmcs.viktorbarzin.lan
- `vlmcs.viktorbarzin.lan` A `10.0.20.202`
- `kms.viktorbarzin.lan` A `10.0.20.200` (Traefik — for the user-facing
website at `https://kms.viktorbarzin.lan/`; **not** the KMS server)
Manual override (e.g., for clients without the suffix or for clients
on the public internet): `slmgr /skms kms.viktorbarzin.me:1688` (WAN
path via pfSense forward) or `slmgr /skms 10.0.20.202:1688` (direct).
To revert a manually-overridden client back to auto-discovery:
`slmgr /ckms`.
- **Pod fluidity**: deployment has `replicas=1` (notifier dedup state is
per-pod) with no node affinity. TCP readiness/liveness probes on 1688
gate Pod Ready on the listener actually being up, so MetalLB only
advertises `10.0.20.202` from a node where vlmcsd is serving.
- **pfSense WAN forward**: `WAN TCP/1688 → k8s_kms_lb:1688`
(alias = `10.0.20.202`, dedicated to KMS). Description: `KMS public —
kms.viktorbarzin.me`. Other forwards using `k8s_shared_lb` (WireGuard,
HTTPS, shadowsocks, smtps, etc.) are unaffected.
- **Filter rule** on the WAN interface, TCP/1688 destination
`<k8s_kms_lb>`, with state-table per-source caps:
- `max-src-conn 50` — concurrent connections per source IP
- `max-src-conn-rate 10/60` — 10 new connections per 60 seconds per
source
@ -26,6 +47,13 @@ how to tune the rate limit, how to revoke if abused.
flushed. (`virusprot` is the only table pfSense's filter generator
targets for `overload`; see `/etc/inc/filter.inc`. Don't try to point
it at a custom table — the schema doesn't expose that knob.)
- **Probe filter in slack-notifier**: a bare TCP open/close (no
Application/Activation block from vlmcsd) is treated as a probe — Uptime
Kuma's port-type monitor on `windows-kms.kms.svc:1688` and the kubelet
readiness/liveness probes both hit this path. Probes increment
`kms_connection_probes_total{source}` (`source``internal_pod`,
`cluster_node`, `external`) and log to stdout, but never post to Slack.
Real activations still post.
## Where the logs are
@ -39,8 +67,11 @@ kubectl logs -n kms -l app=kms-service -c windows-kms --tail=50 -f
kubectl logs -n kms -l app=kms-service -c windows-kms | grep "Incoming KMS request"
```
Source IPs in this log are the SNAT'd node IPs because the LB Service uses
ETP=Cluster on a shared MetalLB IP. Don't expect real WAN client IPs here.
Source IPs from the WAN are real client IPs (pfSense DNAT-only + ETP=Local
preserve them through the chain). LAN clients hitting the LB IP directly
appear as their own IP. Pod-source probes (Uptime Kuma) appear as a Calico
pod IP in `10.10.0.0/16`. Kubelet readiness/liveness probes appear as the
hosting node IP in `10.0.20.0/24`.
### Slack notifier (kms namespace, k8s)
@ -53,6 +84,17 @@ also increment the Prometheus counter `kms_activations_total{product,status}`
exposed on the same pod at `:9101/metrics` (scraped by the cluster-wide
`kubernetes-pods` job; query via Prometheus or Grafana directly).
Probe-only TCP connections (open+close, no KMS RPC) are silently filtered
out of Slack and counted in `kms_connection_probes_total{source}`. Useful
queries:
```promql
# Probe rate by source
rate(kms_connection_probes_total[5m])
# Probes from the public WAN (a non-zero rate here means real port-scans
# are reaching us, not just internal monitoring)
rate(kms_connection_probes_total{source="external"}[5m])
```
### pfSense — virusprot table and filter hits
```bash
@ -93,18 +135,19 @@ The `overload` table entry survives pf reloads. Running
If the activation surface needs to come down (abuse, legal, audit):
1. **pfSense web UI**`Firewall → NAT → Port Forward` → find
`WAN TCP/1688 → k8s_shared_lb` → **delete** (or disable). Apply.
`WAN TCP/1688 → k8s_kms_lb` → **delete** (or disable). Apply.
2. **pfSense web UI**`Firewall → Rules → WAN` → find
`KMS public — kms.viktorbarzin.me`**delete** (or disable). Apply.
3. Verify externally: from a phone tether, `nc -zw3 kms.viktorbarzin.me 1688`
should now fail.
The k8s service stays reachable on the LAN
(`10.0.20.200:1688` and the internal `kms.viktorbarzin.lan` ingress for
the webpage) — only the WAN port-forward is removed.
(`10.0.20.202:1688` directly, and the website at `kms.viktorbarzin.lan`
via Traefik on `10.0.20.200:443`) — only the WAN port-forward is removed.
To put it back, recreate the NAT rule (target alias `k8s_shared_lb`,
port `1688`) and the filter rule with the same per-source caps.
To put it back, recreate the NAT rule (target alias `k8s_kms_lb`,
port `1688`) and the filter rule with the same per-source caps. The alias
itself is independent of any forward and persists across delete/restore.
## Related

View file

@ -1,166 +1,256 @@
# Restore MySQL (InnoDB Cluster)
# Restore MySQL (Standalone)
Last updated: 2026-04-06
Last updated: 2026-05-18 (after the 8.4.9 DD-upgrade disaster recovery)
Applies to the `mysql-standalone` StatefulSet in the `dbaas` namespace
(raw `kubernetes_stateful_set_v1`, migrated from InnoDB Cluster on
2026-04-16). The historic InnoDB-Cluster recovery flow is gone.
## Prerequisites
- `kubectl` access to the cluster
- MySQL root password (from `cluster-secret` in `dbaas` namespace, key `ROOT_PASSWORD`)
- Backup dump available on NFS at `/mnt/main/mysql-backup/`
- `kubectl` against the cluster
- Root password: `kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d`
- A backup dump on NFS at `/srv/nfs/mysql-backup/` (exported via
`dbaas-mysql-backup-host` PVC inside the cluster)
## Backup Location
- NFS: `/mnt/main/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
- Mirrored to sda: `/mnt/backup/nfs-mirror/mysql-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/`
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
- Size: ~11MB per dump
## Backup Locations
## Restore Procedure
| Location | Purpose | Retention |
|---|---|---|
| `/srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz` | Full daily dump (CronJob `mysql-backup`, daily 00:30 UTC) | 14 days |
| `/srv/nfs/mysql-backup/per-db/<dbname>/dump_*.sql.gz` | Per-DB dumps (CronJob `mysql-backup-per-db`, daily 00:45 UTC) | 14 days |
| Synology `Backup/Viki/nfs/mysql-backup/` | Offsite mirror via inotify-tracked rsync | unlimited |
Latest full dump is ~230MB compressed (~3GB uncompressed). Restore
of a full dump into a fresh MySQL pod takes ~3 minutes.
## Scenario A — Single database restored alongside the others
When one DB is corrupted but MySQL is otherwise fine.
### 1. Identify the backup to restore
```bash
# List available backups
kubectl run mysql-ls --rm -it --image=mysql \
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-mysql-backup"}}],"containers":[{"name":"mysql-ls","image":"mysql","volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["ls","-lt","/backup/"]}]}}' \
-n dbaas
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# List per-db dumps for the affected database
kubectl -n dbaas exec mysql-standalone-0 -- ls -lt /backup/per-db/<dbname>/
# Pipe a chosen dump into MySQL (REPLACE existing data in <dbname>):
kubectl -n dbaas exec -i mysql-standalone-0 -- \
sh -c "zcat /backup/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -uroot -p\"$ROOT_PWD\" <dbname>"
# Restart consumers
kubectl -n <ns> rollout restart deployment
```
### 2. Get the root password
## Scenario B — Full disaster: data dictionary corrupt or PVC unsalvageable
This is the path executed on 2026-05-18 when a Keel-driven bump to
`mysql:8.4.9` left the data dictionary half-upgraded and 8.4.8 refused
to start (`Server upgrade of version 80408 is still pending`
MY-013379). Wipes the PVC and rehydrates from the daily dump.
**Estimated downtime: 25 minutes.** Plan accordingly — Forgejo +
registry + every MySQL app go offline during this.
### B.1 Stop the failing MySQL pod
```bash
kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
```
### 3. Option A: Restore via port-forward (from outside cluster)
### B.2 Verify the dump you intend to restore is healthy
```bash
# Port-forward to MySQL primary
kubectl port-forward svc/mysql -n dbaas 3307:3306 &
# Get root password
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Restore (decompress and pipe to mysql, use --host to avoid unix socket, specify non-default port)
zcat /path/to/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307
ssh root@192.168.1.127 'ls -la /srv/nfs/mysql-backup/dump_*.sql.gz | tail -5'
# Sanity-check the header
ssh root@192.168.1.127 'zcat /srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz | head -20'
# Should show "MySQL dump 10.13 ... Server version 8.4.X"
```
### 3. Option B: Restore via in-cluster pod
```bash
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
### B.3 Pin MySQL image in Terraform (if it auto-bumped)
kubectl run mysql-restore --rm -it --image=mysql \
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}]}}' \
-n dbaas
If the upgrade was triggered by a Keel bump on a floating tag
(`mysql:8.4`), edit `stacks/dbaas/modules/dbaas/main.tf` to pin to a
known-good exact version (`mysql:8.4.8`). Commit but don't apply yet.
### B.4 Wipe the corrupted PVC
The PV reclaim policy defaults to **Retain** on
`proxmox-lvm-encrypted``kubectl delete pvc` alone leaves the PV
attached to the (corrupted) disk. Flip to `Delete` first so the CSI
driver actually cleans up the underlying LV.
```bash
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
kubectl -n dbaas delete pvc data-mysql-standalone-0
```
### 4. Verify restoration
The PV transitions to `Released` then gets cleaned up by the CSI
controller; confirm with `kubectl get pv | grep <PV>` (eventually
disappears).
### B.5 Scale MySQL back up via Terraform
```bash
# Check databases exist
mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SHOW DATABASES;"
cd stacks/dbaas && /home/wizard/code/infra/scripts/tg apply
```
# Check InnoDB Cluster status
mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SELECT * FROM performance_schema.replication_group_members;"
This recreates the PVC fresh (5Gi initial; pvc-autoresizer grows it
on demand) and starts a brand-new MySQL pod. The pod initializes an
empty datadir using `MYSQL_ROOT_PASSWORD` from the `cluster-secret`
K8s Secret — ~30s to ready.
# Check table counts for key databases
for db in speedtest wrongmove codimd nextcloud shlink grafana technitium; do
echo "=== $db ==="
mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES WHERE TABLE_SCHEMA='$db' ORDER BY TABLE_ROWS DESC LIMIT 5;"
### B.6 Restore the full dump via a one-shot Job
```bash
cat <<'YAML' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: mysql-restore-$(date +%Y-%m-%d)
namespace: dbaas
spec:
ttlSecondsAfterFinished: 3600
template:
spec:
restartPolicy: Never
containers:
- name: restore
image: mysql:8.4.8
command: ["bash","-c"]
args:
- |
set -euo pipefail
gunzip -c /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | \
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD"
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
env:
- name: MYSQL_ROOT_PASSWORD
valueFrom:
secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD }
volumeMounts:
- { name: backup, mountPath: /backup, readOnly: true }
volumes:
- name: backup
persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
```
Watch progress: `kubectl -n dbaas logs -f job/<name>`. Takes ~3 min
for a 230MB compressed dump.
### B.7 Reset static MySQL users with passwords from Vault
**This step is mandatory.** `mysqldump` restores rows in `mysql.user`
verbatim, including password hashes. But `null_resource.mysql_static_user`
in Terraform writes the **current Vault password** to `forgejo` and
`roundcubemail` — and that current password rarely matches the dump's
hash. The apps will fail auth (forgejo logs `Error 1045 (28000): Access
denied for user 'forgejo'@'...'`) until you reset them.
```bash
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
DROP USER IF EXISTS 'forgejo'@'%';
DROP USER IF EXISTS 'roundcubemail'@'%';
CREATE USER 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
CREATE USER 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
FLUSH PRIVILEGES;
SQL
```
`ALTER USER` sometimes hits `ERROR 1396 Operation ALTER USER failed`
on freshly-restored DBs (stale grant-table cache); `DROP USER` +
`CREATE USER` is the reliable form.
Vault-rotated app users (nextcloud, codimd, grafana, paperless,
phpipam, etc.) are managed by Vault DB engine and their dump password
already matches the live K8s secret, so they need no manual fixup.
### B.8 Restart MySQL-dependent apps
The dump restore brings MySQL up, but app pods still hold stale
connections (and forgejo has been crash-looping). Roll the
deployments to force fresh connections:
```bash
for ns_app in \
"forgejo:deploy/forgejo" \
"nextcloud:deploy/nextcloud" \
"hackmd:deploy/hackmd" \
"monitoring:deploy/grafana" \
"paperless-ngx:deploy/paperless-ngx" \
"uptime-kuma:deploy/uptime-kuma" \
"url:deploy/shlink" \
"realestate-crawler:deploy/realestate-crawler-api" \
"realestate-crawler:deploy/realestate-crawler-celery" \
"realestate-crawler:deploy/realestate-crawler-celery-beat" \
"realestate-crawler:deploy/realestate-crawler-ui"; do
ns=${ns_app%%:*}; app=${ns_app##*:}
kubectl -n "$ns" rollout restart "$app" &
done
wait
```
### 5. Verify application MySQL users exist
After any cluster rebuild or PVC recreation, the MySQL operator only recreates its own system users. Application users may be lost.
If any deployments stay stuck in `ImagePullBackOff` (e.g.
`chrome-service`, `fire-planner`, `freedify`), those rely on the
Forgejo registry — once forgejo is back, just delete their pods to
force a fresh pull:
```bash
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Check all expected application users exist
kubectl exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"$ROOT_PWD" \
-e "SELECT user, host FROM mysql.user WHERE user IN ('nextcloud','forgejo','crowdsec','grafana','speedtest','wrongmove','codimd','shlink','technitium','uptimekuma');"
# If users are missing, force Vault to re-rotate their credentials:
# vault write -f database/rotate-role/mysql-<app>
# This will recreate the user with the correct password.
#
# For technitium specifically, also run the password sync CronJob:
# kubectl create job --from=cronjob/technitium-password-sync technitium-pw-resync -n technitium
#
# Note: forgejo and uptimekuma may be legacy users not managed by Vault rotation.
kubectl -n chrome-service delete pod --all
kubectl -n fire-planner delete pod --all
kubectl -n freedify delete pod --all
```
### 6. InnoDB Cluster Recovery
If the InnoDB Cluster itself is broken (not just data loss):
```bash
# Check cluster status via MySQL Shell
kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster status
# Force rejoin a member
kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster rejoinInstance root@mysql-cluster-1:3306
```
## Restore Single Database (from per-db backup)
Per-database backups are stored at `/mnt/main/mysql-backup/per-db/<dbname>/` as gzipped SQL dumps.
### 1. List available per-db backups
```bash
ls -lt /mnt/main/mysql-backup/per-db/<dbname>/
```
### 2. Restore a single database
```bash
# Port-forward to MySQL
kubectl port-forward svc/mysql -n dbaas 3307:3306 &
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Restore single database (this replaces only the target database)
zcat /path/to/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 <dbname>
```
### 3. Verify
```bash
mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e \
"SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES WHERE TABLE_SCHEMA='<dbname>' ORDER BY TABLE_ROWS DESC LIMIT 10;"
```
### 4. Restart the affected service only
```bash
kubectl rollout restart deployment -n <namespace>
```
**Advantages over full restore**: Only the target database is affected. All other databases continue running with their current data.
## Alternative: Restore from sda Backup
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
### B.9 Verify recovery
```bash
# 1. SSH to PVE host
ssh root@192.168.1.127
# All workloads ready
kubectl get deploy,sts -A -o json | jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | "\(.metadata.namespace)/\(.metadata.name)"'
# (empty output = healthy)
# 2. Find the latest backup
ls -lt /mnt/backup/nfs-mirror/mysql-backup/
# Database integrity — table counts per schema
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys') \
GROUP BY table_schema;"
# 3. Copy backup to a location accessible from cluster (e.g., via kubectl cp)
# Or mount sda backup on a pod:
kubectl run mysql-restore --rm -it --image=mysql \
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}],"nodeName":"k8s-master"}}' \
-n dbaas
# Forgejo's registry catalog (catches the cascade alert)
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe manual-postrestore-$(date +%s)
kubectl -n monitoring logs job/manual-postrestore-<timestamp> --tail=10
# Expect "Probe complete: 0 failures across N repos / M tags / K indexes"
# Cluster-health re-run
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
```
## Alternative: Restore from Synology (if PVE host is down)
If the PVE host itself is unavailable:
### B.10 Clean up failed CronJob pods from the outage window
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/nfs/mysql-backup/
# 3. Copy dump to a temporary location accessible from cluster
# (e.g., via rsync to a surviving node, or restore PVE host first)
kubectl delete pods -A --field-selector=status.phase=Failed
```
## Estimated Time
- Data restore: ~5 minutes (11MB dump)
- InnoDB Cluster recovery: ~15-20 minutes (init containers are slow)
## Why the 8.4.9 upgrade got us — and the version pin
The MySQL 8.4.9 data-dictionary upgrade from 80408 → 80409 stalls
reliably on this hardware. ~24s of writes to `mysql.ibd` and the redo
log, then no further progress, no CPU, no completion. We bumped the
liveness probe to 600s (`initial_delay_seconds`) and still no
progress. Hypothesised root cause: `innodb_io_capacity=100` combined
with `innodb_page_cleaners=1` — the upgrade's spatial-reference-system
flush phase is IO-starved. **Don't retry 8.4.9 without first bumping
IO capacity and pinning a proper maintenance window.**
Until then, the StatefulSet pins to `mysql:8.4.8` exactly, not the
floating `mysql:8.4` tag. Keel will not silently bump it.
## See also
- `docs/runbooks/forgejo-registry-breakglass.md` — companion runbook
for when the cascade has reached the registry layer.
- Beads `code-eme8` / `code-k40p` — incident tracker entries (closed
in commit ea475c3d).

View file

@ -0,0 +1,191 @@
# Security Incident Response
What to do when a wave-1 security alert fires. Each alert links to a Loki query for investigation and concrete remediation steps.
**Status: planned, not yet implemented.** Beads epic: `code-8ywc`. This runbook is the response playbook for when wave 1 ships.
## General workflow
1. **Acknowledge in Alertmanager.** Silence only after triage starts.
2. **Pull context from Loki** (queries below). Get the actor, source IP, timestamp.
3. **Decide: real or false-positive?** Use the "false-positive cases" notes below.
4. **If real:** revoke credentials (Vault token revoke, K8s SA token rotate, SSH key remove, OIDC session invalidate), then post-mortem.
5. **If false-positive:** tune the alert (extend allowlist, refine LogQL query).
## Allowlist CIDRs
All source-IP-based alerts (K2, K9, V7, S1) reference this list. Update in one place: Terraform variable `security_source_ip_allowlist` in `stacks/monitoring`.
- `10.0.20.0/22` — VLAN 20 (cluster + main LAN)
- `192.168.1.0/24` — Proxmox + Sofia LAN
- K8s pod CIDR (verify at implementation time)
- K8s service CIDR
- Headscale tailnet
**Anything outside = alert.** No public-IP exceptions.
## Viktor's identity
`me@viktorbarzin.me` is the ONLY allowlisted human identity. NOT `viktor@viktorbarzin.me`. NOT `emo@viktorbarzin.me`. emo's identity scheme is separate and must be added explicitly if/when needed.
---
## K-alerts (K8s API audit)
### K2 — ServiceAccount token used from outside cluster
**Meaning:** A K8s ServiceAccount token authenticated a request whose `sourceIPs[0]` is not in the pod CIDR or trusted LAN. Stolen SA token used externally.
```logql
{job="kube-audit"} | json | user_username =~ "system:serviceaccount:.*" | sourceIPs_0 !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*"
```
**Action:** Identify the SA. Rotate its token (`kubectl delete secret <sa-token-name>` if old-style, or recreate the SA if projected token). Audit the SA's permissions and tighten.
**False positives:** Pod-to-apiserver traffic that egresses and re-enters via NodePort/LB (rare). Investigate the originating workload.
### K3 — Secret read in sensitive namespace by unexpected actor
**Meaning:** A Secret in `vault`, `sealed-secrets`, or `external-secrets` namespace was read by an SA NOT in the allowlist (ESO controller, sealed-secrets controller, Vault SA, `me@viktorbarzin.me`).
```logql
{job="kube-audit"} | json | verb =~ "get|list" | objectRef_resource = "secrets" | objectRef_namespace =~ "vault|sealed-secrets|external-secrets" | user_username !~ "(me@viktorbarzin.me|system:serviceaccount:external-secrets:.*|system:serviceaccount:sealed-secrets:.*|system:serviceaccount:vault:.*)"
```
**Action:** Identify the actor. If a service account, audit its bindings — it shouldn't have RBAC to read those secrets. Revoke the binding. Rotate any secrets that were read.
### K4 — Exec into sensitive pod
**Meaning:** Someone `kubectl exec`'d into a pod in `vault`, `kube-system`, `dbaas`, or `cnpg-system`.
```logql
{job="kube-audit"} | json | verb = "create" | objectRef_resource = "pods" | objectRef_subresource = "exec" | objectRef_namespace =~ "vault|kube-system|dbaas|cnpg-system" | user_username != "me@viktorbarzin.me"
```
**Action:** Determine if Viktor authorized the exec. If unrecognized actor, revoke their access and rotate any credentials they could have read inside the pod.
**False positives:** Break-glass SAs used during incident response — extend the allowlist to include them by SA name.
### K5 — Mass delete
**Meaning:** Single actor deleted >5 Pods, Secrets, or ConfigMaps in 60 seconds. Either a script gone wrong or destructive intrusion.
```logql
sum by (user_username) (count_over_time({job="kube-audit"} | json | verb = "delete" | objectRef_resource =~ "pods|secrets|configmaps" [1m])) > 5
```
**Action:** Identify actor. If a Terraform apply or known cleanup job, false positive. If unrecognized, suspend the actor's credentials immediately and audit what was deleted.
### K6 — Audit policy modified
**Meaning:** Someone changed the kube-apiserver audit policy. Should only happen via Terraform.
**Action:** Verify the change came from a planned Terraform apply (check recent commits to `stacks/infra`). If not, treat as critical compromise — attacker disabling visibility.
### K7 — New ClusterRole with full wildcards
**Meaning:** A new ClusterRole was created with `verbs: ["*"]` and `resources: ["*"]`. Privilege escalation primitive.
```logql
{job="kube-audit"} | json | verb = "create" | objectRef_resource = "clusterroles" | requestObject_rules_0_verbs_0 = "*" | requestObject_rules_0_resources_0 = "*"
```
**Action:** Verify the change is intentional (some operators install such roles — calico, kyverno). If unrecognized, delete the ClusterRole and audit the creator.
### K8 — Anonymous binding
**Meaning:** A RoleBinding or ClusterRoleBinding was created referencing `system:anonymous` or `system:unauthenticated`. Catastrophic — allows unauthenticated cluster access.
**Action:** Delete the binding immediately. Audit who created it. Treat as full cluster compromise — rotate all secrets, force kubeconfig re-issue.
### K9 — Viktor's identity from unexpected source IP
**Meaning:** A request authenticated as `me@viktorbarzin.me` arrived from a source IP outside the allowlist. Stolen OIDC token / kubeconfig.
```logql
{job="kube-audit"} | json | user_username = "me@viktorbarzin.me" | sourceIPs_0 !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<pod-cidr>|<headscale-cidr>"
```
**Action:** Revoke Viktor's OIDC session in Authentik. Rotate Vault OIDC tokens. Audit recent activity from that IP. Verify Viktor's devices for compromise.
**False positives:** Viktor's machine on a new network without VPN — should not happen per the "no public IP access" policy. If it does, the policy needs revisiting, not the alert.
---
## V-alerts (Vault audit)
### V1 — Root token created
```logql
{job="vault-audit"} | json | request_path = "auth/token/create" | response_auth_policies = "root"
```
**Action:** Verify against Terraform / planned operation. Root tokens should ONLY be created during initial Vault setup or break-glass.
### V2 — Audit device disabled/modified
**Action:** Attacker silencing visibility. Re-enable immediately. Treat as critical compromise.
### V3 — Seal status changed
**Action:** Verify whether this is a planned operation (unseal during upgrade). If unplanned, treat as critical.
### V4 — Policy modified
**Action:** Confirm change came from a Terraform apply. Allowlist Terraform's source IP / token role. Otherwise: review the policy diff, revert if malicious.
### V5 — Auth failure spike
**Action:** Identify the auth method and source. If CI token rotation, false positive. If unknown source brute-forcing, block the source IP at pfSense.
### V6 — Token with policies different from parent
**Action:** Privilege escalation attempt. Revoke the new token. Audit the parent token's policies.
### V7 — Viktor's Vault identity from unexpected source IP
**Meaning:** A Vault operation authenticated as Viktor's entity_id arrived from an IP not in the allowlist. Requires `x_forwarded_for_authorized_addrs` to be configured (Vault sits behind Traefik so `remote_addr` is Traefik's pod IP without XFF trust).
**Action:** Revoke Viktor's Vault OIDC tokens. Force OIDC re-auth. Audit Vault access from that IP.
---
## S-alerts (Host)
### S1 — PVE sshd auth success from unexpected IP
```logql
{job="sshd-pve"} |= "Accepted" | regexp "Accepted (?P<method>\\S+) for (?P<user>\\S+) from (?P<ip>\\S+)" | ip !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<headscale-cidr>"
```
**Action:** Remove the user's SSH key from `/root/.ssh/authorized_keys` if it's still there. Audit recent sudo/login history (`last`, `sudo -i; journalctl _COMM=sudo`). Consider PVE as compromised — rotate root password, audit `/root/.luks-backup-key`, audit `/usr/local/bin/lvm-pvc-snapshot` and backup scripts for tampering.
---
## False-positive triage decision tree
```
Did the alert fire from a known operational event?
├─ Terraform apply at the same time? → likely V4 (policy modified)
├─ Keel auto-roll? → not a security path
├─ CI/CD pipeline running? → check V5 / K5
└─ Viktor doing recovery work? → K4, K9, S1 candidates
Extend allowlist if persistent
```
## Escalation
For SEV1 (multiple alerts, cluster-admin grants, anonymous bindings, mass deletes):
1. Cordon all nodes (`kubectl cordon`) to prevent further pod scheduling — but be aware this also stops legitimate recovery work
2. Revoke all OIDC sessions in Authentik
3. Rotate Vault root keys + reseal
4. Restore from a pre-incident backup if data integrity is questionable
5. Post-mortem per `incident-response.md`
## Related
- [Security architecture](../architecture/security.md)
- [Monitoring architecture](../architecture/monitoring.md)
- [Incident response (general)](../architecture/incident-response.md)
- Beads epic: `code-8ywc`

View file

@ -67,11 +67,44 @@ runcmd:
- sed -i 's/#Compress=yes/Compress=yes/' /etc/systemd/journald.conf
- systemctl restart systemd-journald
%{if is_k8s_template}
# Disable unattended-upgrades to prevent unexpected kernel updates that can break containerd/kubelet
# (Root cause of 26h cluster outage: unattended-upgrades → kernel update → containerd failure)
- systemctl disable --now unattended-upgrades || true
- apt-get remove -y unattended-upgrades || true
# Re-enabled 2026-05-10: unattended-upgrades is back on, but with a tight
# Allowed-Origins list, a Package-Blacklist for k8s/containerd/runc/calico,
# and Automatic-Reboot disabled (kured + sentinel-gate handles reboots in a
# 24h-soaked rolling window, gated by Prometheus alerts).
# Original outage (March 2026) was kernel update → containerd overlayfs corruption.
# Mitigations: 24h cool-down between node reboots, Prometheus halt-on-alert,
# apt-mark hold on k8s components, Package-Blacklist for runtime components.
- apt-get install -y unattended-upgrades update-notifier-common
- |
cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'EOF'
Unattended-Upgrade::Allowed-Origins {
"$${distro_id}:$${distro_codename}";
"$${distro_id}:$${distro_codename}-security";
"$${distro_id}:$${distro_codename}-updates";
"$${distro_id}ESMApps:$${distro_codename}-apps-security";
"$${distro_id}ESM:$${distro_codename}-infra-security";
};
Unattended-Upgrade::Package-Blacklist {
"^containerd(\.io)?$$";
"^runc$$";
"^cri-tools$$";
"^kubernetes-cni$$";
"^calico-.*";
"^cni-plugins-.*";
"^docker-ce$$";
};
Unattended-Upgrade::DevRelease "false";
Unattended-Upgrade::Automatic-Reboot "false";
EOF
- |
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
- systemctl unmask unattended-upgrades 2>/dev/null || true
- systemctl enable --now unattended-upgrades
- apt-mark hold kubelet kubeadm kubectl
- apt-mark hold containerd containerd.io runc 2>/dev/null || true
- systemctl stop kubelet
- containerd config default | sudo tee /etc/containerd/config.toml
- ${containerd_config_update_command}

View file

@ -192,9 +192,9 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
for_each = var.disk_slot == "scsi0" ? [1] : []
content {
disk {
storage = "local-lvm"
size = var.vm_disk_size
discard = true # Enable TRIM passthrough to LVM thin pool reduces CoW overhead
storage = "local-lvm"
size = var.vm_disk_size
discard = true # Enable TRIM passthrough to LVM thin pool reduces CoW overhead
}
}
}
@ -202,9 +202,9 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
for_each = var.disk_slot == "scsi1" ? [1] : []
content {
disk {
storage = "local-lvm"
size = var.vm_disk_size
discard = true
storage = "local-lvm"
size = var.vm_disk_size
discard = true
}
}
}

View file

@ -56,8 +56,24 @@ variable "image_tag" {
variable "replicas" {
type = number
default = 1
description = "Replica count. Default 1 because Anubis stores in-flight challenges in process memory — with N>1 a challenge issued by pod A and solved against pod B fails with `store: key not found` (HTTP 500). For HA, configure a shared store (Redis) and bump this. Per-pod 128Mi @ idle is cheap, single-pod restart is sub-second, so 1 is fine for content sites."
default = null
description = "Optional replica count override. When null, defaults to 1 if shared_store_url is null and 2 otherwise. Capped at 2 — Redis can handle more but anti-affinity assumes ≤2 replicas per Anubis instance on a 5-node cluster."
validation {
condition = var.replicas == null || (var.replicas >= 1 && var.replicas <= 2)
error_message = "replicas must be 1 or 2 (or null to auto-pick from shared_store_url presence)."
}
}
variable "shared_store_url" {
type = string
default = null
description = "If set, Anubis stores in-flight challenge state in this Valkey/Redis-protocol URL instead of in-process memory, enabling HA across replicas. Format: redis://host:port/<db-index>. The DB index MUST be unique per Anubis instance (this module assumes 16 DBs available, common in standalone Redis). Cluster Redis is redis-master.redis.svc.cluster.local:6379 with HA via Sentinel + haproxy. Without this, replicas>1 causes ~50% PoW failures (challenge issued by pod A, solved against pod B → 500)."
validation {
condition = var.shared_store_url == null || can(regex("^redis://[a-zA-Z0-9_.-]+:[0-9]+/[0-9]+$", var.shared_store_url))
error_message = "shared_store_url must look like redis://host:port/<db-index> (explicit DB index required)."
}
}
variable "memory" {
@ -88,6 +104,21 @@ locals {
"app.kubernetes.io/managed-by" = "terraform"
}
# Effective replicas: caller-override > shared-store-aware default.
effective_replicas = coalesce(var.replicas, var.shared_store_url == null ? 1 : 2)
# Anubis store config. With backend=valkey, multiple Anubis pods can share
# in-flight PoW state and a challenge issued by pod A is verifiable by pod
# B. Default backend is in-process memory which only works at replicas=1.
store_yaml_block = var.shared_store_url == null ? "" : <<-EOT
store:
backend: valkey
parameters:
url: "${var.shared_store_url}"
EOT
# Strict bot policy. Default Anubis policy only WEIGHs Mozilla|Opera UAs
# and lets unmatched UAs (curl, wget, Python-requests, scrapy, headless
# CLI scrapers) fall through to ALLOW. We import the same upstream
@ -95,7 +126,8 @@ locals {
# capability is filtered.
default_policy_yaml = <<-EOT
bots:
# Hard-deny known-bad bots first.
# Hard-deny known-bad bots first runs before the method bypass so
# a declared bad bot can't sneak through by sending a POST.
- import: (data)/bots/_deny-pathological.yaml
- import: (data)/bots/aggressive-brazilian-scrapers.yaml
# Hard-deny declared AI/LLM crawlers (ClaudeBot, GPTBot, Bytespider, ).
@ -107,13 +139,29 @@ locals {
# Allow /.well-known, /robots.txt, /favicon.*, /sitemap.xml keeps
# the internet working for benign crawlers and discovery clients.
- import: (data)/common/keep-internet-working.yaml
# Catch-all: every remaining request must solve the challenge. This
# closes the "unmatched UA falls through to ALLOW" gap that lets
# curl/wget/Python-requests scrape non-CDN-fronted hosts.
# Allow every non-GET request through. Rationale: AI scrapers steal
# the body of GETs (page content) they don't POST. State-mutating
# methods come from app XHRs (PrivateBin paste creation, Komga
# uploads, SPA actions) and CORS preflight (OPTIONS). Challenging
# those breaks the app, because the JS expects JSON and gets the
# Anubis HTML challenge page. CrowdSec + rate-limit + per-app auth
# already cover abuse on these methods.
- name: allow-non-get-methods
action: ALLOW
expression: method != "GET"
# Catch-all: every remaining (GET) request must solve the challenge.
# This closes the "unmatched UA falls through to ALLOW" gap that
# lets curl/wget/Python-requests scrape non-CDN-fronted hosts.
- name: catchall-challenge
path_regex: .*
action: CHALLENGE
EOT
# Final policy YAML: defaults (or caller override) plus an optional store
# block when shared_store_url is set. Store block is module-managed and
# appended universally callers passing a custom policy_yaml shouldn't
# include their own `store:` block (they would collide).
rendered_policy_yaml = "${coalesce(var.policy_yaml, local.default_policy_yaml)}${local.store_yaml_block}"
}
# Bot policy ConfigMap. Mounted into the pod and referenced by POLICY_FNAME.
@ -124,7 +172,7 @@ resource "kubernetes_config_map" "policy" {
labels = local.labels
}
data = {
"botPolicies.yaml" = coalesce(var.policy_yaml, local.default_policy_yaml)
"botPolicies.yaml" = local.rendered_policy_yaml
}
}
@ -168,7 +216,7 @@ resource "kubernetes_deployment" "anubis" {
}
spec {
replicas = var.replicas
replicas = local.effective_replicas
selector {
match_labels = { app = local.full_name }
@ -185,14 +233,26 @@ resource "kubernetes_deployment" "anubis" {
template {
metadata {
labels = local.labels
annotations = {
# Roll the deployment whenever the policy YAML changes Anubis
# reads the policy at startup, so a ConfigMap update alone
# doesn't take effect until pods restart.
"checksum/policy" = sha256(local.rendered_policy_yaml)
}
}
spec {
# Spread replicas across nodes to survive a single node failure.
# DoNotSchedule (not ScheduleAnyway) so 2 replicas are forced onto
# different hosts otherwise the scheduler may pile them on the
# same node and a single node reboot takes the whole Anubis instance
# down despite replicas=2. On a 5-node cluster the spread is always
# satisfiable; the worst case (4 nodes unavailable) leaves one
# replica Pending, but the other keeps serving.
topology_spread_constraint {
max_skew = 1
topology_key = "kubernetes.io/hostname"
when_unsatisfiable = "ScheduleAnyway"
when_unsatisfiable = "DoNotSchedule"
label_selector {
match_labels = { app = local.full_name }
}
@ -388,7 +448,15 @@ resource "kubernetes_pod_disruption_budget_v1" "anubis" {
namespace = var.namespace
}
spec {
min_available = "1"
# max_unavailable=1 means: at most one pod can be voluntarily disrupted
# at a time. With replicas=2 this allows clean rolling drains (one pod
# goes down other serves traffic first recreates elsewhere). With
# replicas=1 (no shared store) this is functionally equivalent to no
# PDB drain proceeds, brief outage, new pod schedules elsewhere.
# Was min_available=1 before 2026-05-16 which deadlocked drains on
# single-replica instances (eviction API can never satisfy the
# constraint at replicas=1). See PM-2026-05-11.
max_unavailable = "1"
selector {
match_labels = { app = local.full_name }
}

View file

@ -31,9 +31,53 @@ variable "tls_secret_name" {}
variable "backend_protocol" {
default = "HTTP"
}
variable "protected" {
type = bool
default = false
variable "auth" {
type = string
default = "required"
description = <<-EOT
Auth posture for this ingress. Pick by asking "what gates the app?":
* "required" (default, fail-closed): Authentik forward-auth gates every
request. Pick this when the backend has NO built-in user auth and
Authentik is the only thing standing between strangers and the app.
Examples: prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, any
admin UI shipped without its own login.
* "app": the backend handles its own user authentication (NextAuth,
Django sessions, OAuth, bearer-token API, etc.) and Authentik would
only get in the way. No Authentik middleware is attached; the app's
own login is the gate. Examples: immich, linkwarden, tandoor,
freshrss, affine, actualbudget, audiobookshelf, novelapp.
**Functionally identical to "none"** the distinct name exists to
record intent at the call site so future readers don't have to guess.
* "public": Authentik anonymous binding via the `public` outpost.
Strangers are auto-bound to the `guest` Authentik user; logged-in
users keep their identity in X-authentik-username. Only works for
top-level browser navigation CORS preflight rejects XHR/fetch and
automation can't replay the cookie dance. Audit trail, not a gate.
* "none": no Authentik middleware, no own-auth claim explicitly
public or unauthenticated-by-design. Use for: Anubis-fronted content
sites (where Anubis is the gate), native-client APIs that auth
themselves (Git, /v2/, WebDAV/CalDAV, CardDAV), webhook receivers,
OAuth callbacks, and Authentik outposts themselves.
**Anti-exposure rule** (the reason "app" exists as a distinct mode):
only pick "app" or "none" AFTER you have verified the app has its own
user auth (for "app") OR the endpoint is intentionally public (for
"none"). Picking either of these on a naked admin UI exposes it to the
internet. The default is "required" specifically so accidental omission
fails closed.
**Convention**: when using "app" or "none", add a comment line above
the `auth = "..."` line stating what gates the app or why it's public.
Future-you reads the call site, not the module description.
EOT
validation {
condition = contains(["required", "app", "public", "none"], var.auth)
error_message = "auth must be one of: required, app, public, none."
}
}
variable "ingress_path" {
type = list(string)
@ -142,8 +186,23 @@ variable "homepage_enabled" {
}
locals {
effective_host = var.full_host != null ? var.full_host : "${var.host != null ? var.host : var.name}.${var.root_domain}"
effective_anti_ai = var.anti_ai_scraping != null ? var.anti_ai_scraping : !var.protected
effective_host = var.full_host != null ? var.full_host : "${var.host != null ? var.host : var.name}.${var.root_domain}"
# Anti-AI default: ON when no Authentik auth fronts the ingress (auth =
# "none" or auth = "app" either the app gates users itself or the site
# is intentionally public). When Authentik gates the request
# (required/public), the auth flow already discourages bots.
effective_anti_ai = var.anti_ai_scraping != null ? var.anti_ai_scraping : (var.auth == "none" || var.auth == "app")
# Auth middleware selection. "app" and "none" both attach no Authentik
# middleware "app" signals "the backend has its own user auth", "none"
# signals "intentionally public / native-client API / webhook". The
# distinction lives at the call site for human readers; the runtime
# effect is identical.
auth_middleware = (
var.auth == "required" ? "traefik-authentik-forward-auth@kubernetescrd" :
var.auth == "public" ? "traefik-authentik-forward-auth-public@kubernetescrd" :
null
)
# External monitor enabled by default when the ingress has a public DNS
# record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
@ -254,7 +313,7 @@ resource "kubernetes_ingress_v1" "proxied-ingress" {
var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd",
local.effective_anti_ai ? "traefik-ai-bot-block@kubernetescrd" : null,
local.effective_anti_ai ? "traefik-anti-ai-headers@kubernetescrd" : null,
var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
local.auth_middleware,
var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
var.max_body_size != null ? "${var.namespace}-buffering-${var.name}@kubernetescrd" : null,

View file

@ -0,0 +1,124 @@
#!/usr/bin/env python3
"""Enforce the inline-comment convention for ingress_factory auth tiers.
Every `auth = "app"` or `auth = "none"` line under a stack must have an
immediately-preceding comment block containing `# auth = "<tier>":`
that documents what gates the app (for "app") or why the endpoint is
intentionally public (for "none").
This is the static guard for the anti-exposure rule documented in
`infra/.claude/CLAUDE.md` "Auth" section. It's invoked by `scripts/tg`
before every plan/apply/destroy/refresh, so it fires regardless of who
or what is running terragrunt local laptop, CI, headless agent.
Stack-scoped by design: only checks the .tf files under the stack
being acted on. Other stacks' historical violations don't block work
on the current stack; each stack documents itself the next time it's
edited.
Usage:
check-ingress-auth-comments.py <stack-path> # scan one stack
check-ingress-auth-comments.py --all # scan every stack
"""
import argparse
import os
import re
import sys
AUTH_LINE = re.compile(r'^\s*auth\s*=\s*"(app|none)"\s*$')
COMMENT_LINE = re.compile(r'^\s*#')
COMMENT_TIER = re.compile(r'auth\s*=\s*"(app|none)"')
def scan_dir(path):
violations = []
for root, _, files in os.walk(path):
for f in files:
if not f.endswith('.tf'):
continue
full = os.path.join(root, f)
try:
with open(full) as fh:
lines = fh.readlines()
except OSError:
continue
for i, line in enumerate(lines):
m = AUTH_LINE.match(line)
if not m:
continue
tier = m.group(1)
# Walk backwards through contiguous comment lines.
# Pass if ANY of them documents the matching tier.
ok = False
j = i - 1
while j >= 0 and COMMENT_LINE.match(lines[j]):
cm = COMMENT_TIER.search(lines[j])
if cm and cm.group(1) == tier:
ok = True
break
j -= 1
if not ok:
violations.append((full, i + 1, tier))
return violations
def main():
ap = argparse.ArgumentParser(description=__doc__.splitlines()[0])
g = ap.add_mutually_exclusive_group(required=True)
g.add_argument('path', nargs='?', help='Stack directory to scan')
g.add_argument('--all', action='store_true', help='Scan every stack under stacks/')
args = ap.parse_args()
if args.all:
scan_paths = ['stacks']
else:
if not os.path.isdir(args.path):
print(f"ERROR: {args.path} is not a directory", file=sys.stderr)
sys.exit(2)
scan_paths = [args.path]
violations = []
for p in scan_paths:
violations.extend(scan_dir(p))
if not violations:
return
print(
"\n"
"==============================================================\n"
"ingress_factory auth-comment convention violated\n"
"==============================================================\n"
"\n"
"Every `auth = \"app\"` or `auth = \"none\"` line must have a\n"
"preceding comment line documenting what gates the app (for\n"
"\"app\") or why the endpoint is intentionally public (for\n"
"\"none\"). This guard prevents accidentally exposing private\n"
"services. See infra/.claude/CLAUDE.md Auth section.\n"
"\n"
"Add a comment line directly above the auth line:\n"
"\n"
" # auth = \"app\": <what gates the app, e.g. NextAuth + OAuth>\n"
" auth = \"app\"\n"
"\n"
"or:\n"
"\n"
" # auth = \"none\": <why public, e.g. webhook receiver, CalDAV>\n"
" auth = \"none\"\n"
"\n"
"Violations:",
file=sys.stderr,
)
for path, line_no, tier in violations:
print(
f" {path}:{line_no}: auth = \"{tier}\" missing preceding "
f"`# auth = \"{tier}\":` comment",
file=sys.stderr,
)
print(file=sys.stderr)
sys.exit(1)
if __name__ == '__main__':
main()

View file

@ -23,10 +23,11 @@ FAIL_COUNT=0
FIX=false
QUIET=false
JSON=false
KUBECONFIG_PATH="$(pwd)/config"
KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
KUBECTL=""
JSON_RESULTS=()
TOTAL_CHECKS=42
TOTAL_CHECKS=44
# --- Helpers ---
info() { [[ "$JSON" == true ]] && return 0; echo -e "${BLUE}[INFO]${NC} $*"; }
@ -195,6 +196,19 @@ check_pods() {
section 4 "Problematic Pods"
local bad count detail="" status="PASS"
# Skip pods owned by Jobs (which are owned by CronJobs). A failed CronJob
# retry isn't a problematic pod — the next CronJob fire will replace it.
# Real problems are deployments / statefulsets / daemonsets in trouble.
local job_owned_pods
job_owned_pods=$($KUBECTL get pods -A -o json 2>/dev/null | python3 -c '
import json, sys
d = json.load(sys.stdin)
for p in d["items"]:
owners = p["metadata"].get("ownerReferences", [])
if any(o.get("kind") == "Job" for o in owners):
print(f"{p[\"metadata\"][\"namespace\"]} {p[\"metadata\"][\"name\"]}")
' 2>/dev/null || true)
bad=$( {
$KUBECTL get pods -A --no-headers --field-selector=status.phase!=Running,status.phase!=Succeeded 2>/dev/null \
| grep -E 'CrashLoopBackOff|Error|Pending|Init:|ImagePullBackOff|ErrImagePull' || true
@ -202,6 +216,14 @@ check_pods() {
| grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull' || true
} | awk '!seen[$1,$2]++' | sed '/^$/d') || true
# Filter out Job-owned pods
if [[ -n "$job_owned_pods" && -n "$bad" ]]; then
bad=$(echo "$bad" | awk -v jp="$job_owned_pods" '
BEGIN { n = split(jp, lines, "\n"); for (i=1;i<=n;i++) skip[lines[i]] = 1 }
{ key = $1 " " $2; if (!(key in skip)) print }
')
fi
count=$(count_lines "$bad")
if [[ "$count" -eq 0 ]]; then
@ -228,7 +250,21 @@ check_evicted() {
section 5 "Evicted/Failed Pods"
local evicted count detail="" status="PASS"
evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)
# Exclude pods owned by Jobs — those are CronJob retries that K8s leaves
# behind for log inspection. They're not "evicted" in the cluster-health
# sense and the next CronJob fire replaces them.
evicted=$($KUBECTL get pods -A -o json --field-selector=status.phase=Failed 2>/dev/null | python3 -c '
import json, sys
try:
d = json.load(sys.stdin)
except Exception:
sys.exit(0)
for p in d.get("items", []):
owners = p["metadata"].get("ownerReferences", [])
if any(o.get("kind") == "Job" for o in owners):
continue
print(f"{p[\"metadata\"][\"namespace\"]}\t{p[\"metadata\"][\"name\"]}\t{p.get(\"status\",{}).get(\"reason\",\"\")}")
' 2>/dev/null || true)
count=$(count_lines "$evicted")
if [[ "$count" -eq 0 ]]; then
@ -539,18 +575,25 @@ check_alerts() {
return 0
fi
# Only count warning + critical alerts. Info-level alerts (RecentNodeReboot,
# PVAutoExpanding, etc.) are informational by design and shouldn't be
# treated as a script-level WARN — the alert rules themselves already
# encode the severity.
firing_count=$(echo "$alerts" | python3 -c '
import json, sys
ACTIONABLE = {"warning", "critical"}
def actionable(labels):
return labels.get("severity", "info").lower() in ACTIONABLE
try:
data = json.load(sys.stdin)
if isinstance(data, list):
active = [a for a in data if a.get("status", {}).get("state") == "active"]
active = [a for a in data if a.get("status", {}).get("state") == "active" and actionable(a.get("labels", {}))]
count = len(active)
names = [a.get("labels", {}).get("alertname", "?") for a in active]
print(f"{count}:" + ",".join(names) if count > 0 else "0:")
elif isinstance(data, dict) and "data" in data:
alerts_list = data["data"].get("alerts", [])
firing = [a for a in alerts_list if a.get("state") == "firing"]
firing = [a for a in alerts_list if a.get("state") == "firing" and actionable(a.get("labels", {}))]
count = len(firing)
names = [a.get("labels", {}).get("alertname", "?") for a in firing]
print(f"{count}:" + ",".join(names) if count > 0 else "0:")
@ -598,17 +641,55 @@ check_uptime_kuma() {
return 0
fi
result=$(UPTIME_KUMA_PASSWORD="$uk_pass" ~/.venvs/claude/bin/python3 -c '
import sys, os
# Connect via kubectl port-forward to the internal Service. The public
# URL (uptime.viktorbarzin.me) is behind Authentik forward-auth, which
# 302-redirects the Socket.IO handshake the library uses — there's no
# way for an unauthenticated script to complete the OAuth dance.
# Port-forward gives us a direct path to the in-cluster ClusterIP
# service and works from any host with kubectl access.
local pf_port=18444 pf_pid
$KUBECTL port-forward -n uptime-kuma svc/uptime-kuma "$pf_port:80" >/dev/null 2>&1 &
pf_pid=$!
# Detach from job control so bash doesn't print "Killed" to stderr
# when we SIGKILL the port-forward at the end of this check — that
# message corrupts stdout when stderr is merged for JSON parsing.
disown "$pf_pid" 2>/dev/null || true
# Wait up to 5s for the local listener to come up.
local i
for i in 1 2 3 4 5; do
if (echo >"/dev/tcp/127.0.0.1/$pf_port") 2>/dev/null; then break; fi
sleep 1
done
result=$(UPTIME_KUMA_PASSWORD="$uk_pass" UK_URL="http://127.0.0.1:$pf_port" \
~/.venvs/claude/bin/python3 -c '
import sys, os, time
try:
from uptime_kuma_api import UptimeKumaApi
except ImportError:
print("ERROR:uptime-kuma-api not installed")
sys.exit(0)
# Retry up to 3 times — the Socket.IO handshake is occasionally flaky
# even against the internal service during cluster churn.
last_exc = None
api = None
for attempt in range(3):
try:
api = UptimeKumaApi(os.environ["UK_URL"], timeout=120, wait_events=0.2)
api.login("admin", os.environ["UPTIME_KUMA_PASSWORD"])
break
except Exception as e:
last_exc = e
try: api.disconnect()
except Exception: pass
api = None
time.sleep(2 * (attempt + 1))
if api is None:
print(f"CONN_ERROR:{last_exc}")
sys.exit(0)
try:
api = UptimeKumaApi("https://uptime.viktorbarzin.me", timeout=120, wait_events=0.2)
api.login("admin", os.environ["UPTIME_KUMA_PASSWORD"])
monitors = api.get_monitors()
heartbeats = api.get_heartbeats()
@ -663,6 +744,13 @@ except Exception as e:
print(f"CONN_ERROR:{e}")
' 2>/dev/null) || result="CONN_ERROR:python execution failed"
# Always tear down the port-forward. Use SIGKILL directly — kubectl
# port-forward sometimes ignores SIGTERM during teardown and we don't
# need a graceful exit for a localhost listener. Skip `wait` because
# in `set -m` mode the backgrounded child may not be reapable here,
# causing the script to hang indefinitely; the shell reaps it on exit.
kill -9 "$pf_pid" 2>/dev/null || true
if [[ "$result" == "ERROR:"* ]]; then
[[ "$QUIET" == true ]] && section_always 14 "Uptime Kuma Monitors"
warn "Uptime Kuma: ${result#ERROR:}"
@ -1074,9 +1162,14 @@ for item in data.get("items", []):
expiry = datetime.strptime(date_str.strip(), "%b %d %H:%M:%S %Y %Z")
expiry = expiry.replace(tzinfo=timezone.utc)
days_left = (expiry - datetime.now(timezone.utc)).days
# Threshold rationale (lowered from 30d):
# - cnpg-webhook-cert: CNPG operator auto-rotates at 7d before expiry
# - kyverno-*-tls-pair: Kyverno auto-rotates at 15d before expiry
# - viktorbarzin.me Lets Encrypt wildcard: renewed weekly via Woodpecker
# Anything still <14d at check time is genuinely worth surfacing.
if days_left <= 7:
print(f"FAIL:{ns}/{name}:{days_left}d")
elif days_left <= 30:
elif days_left <= 14:
print(f"WARN:{ns}/{name}:{days_left}d")
except ValueError:
pass
@ -1085,8 +1178,8 @@ for item in data.get("items", []):
' 2>/dev/null) || true
if [[ -z "$cert_issues" ]]; then
pass "All TLS certificates valid for >30 days"
json_add "tls_certs" "PASS" "All valid >30d"
pass "All TLS certificates valid for >14 days"
json_add "tls_certs" "PASS" "All valid >14d"
else
[[ "$QUIET" == true ]] && section_always 22 "TLS Certificate Expiry"
while IFS= read -r line; do
@ -1332,12 +1425,59 @@ check_ha_entities() {
local result
result=$(export HA_CACHE_DIR; python3 << 'PYEOF'
import os, json
from datetime import datetime, timezone, timedelta
# Noise filter rationale:
# * The HA "unavailable" state covers everything from "the iDRAC scrape failed
# 30 seconds ago" to "this iPhone hasn't checked in in 6 hours" to
# "this YAML rest sensor has been broken for a week". Counting all of them
# produces 400+ alerts that are mostly expected (phones in standby, lights
# off, TVs idle).
# * Three filters dramatically cut noise without hiding real outages:
# 1. SKIP_DOMAINS — domains that go unavailable transiently by design
# (mobile_app on backgrounded apps, notify per-device, button/scene/
# event are momentary).
# 2. STALE_HOURS — only count entities that have been unavailable for
# this long. A flapping integration that recovers in <24h is noise;
# one stuck for >24h is real.
# 3. SKIP_DEVICE_HINTS — friendly-name substrings for things that come
# and go (laptops, phones, TVs, vacuums, washers).
SKIP_DOMAINS = {"mobile_app", "device_tracker", "notify", "button", "scene",
"event", "image", "update"}
SKIP_DEVICE_HINTS = ("iphone", "ipad", "macbook", "mac mini", "tv", "bravia",
"playstation", "switch", "roomba", "vacuum", "rumi",
"ipad", "laptop", "phone", "перална", "сушилня",
"миялна", "laptop2")
STALE_HOURS = 24
cache = os.environ["HA_CACHE_DIR"]
with open(f"{cache}/states.json") as f:
states = json.load(f)
unavail = [s for s in states if s.get("state") in ("unavailable", "unknown")]
now = datetime.now(timezone.utc)
threshold = now - timedelta(hours=STALE_HOURS)
def is_stale(s):
if s.get("state") not in ("unavailable", "unknown"):
return False
domain = s["entity_id"].split(".")[0]
if domain in SKIP_DOMAINS:
return False
name = (s.get("attributes", {}).get("friendly_name") or "").lower()
if any(h in name for h in SKIP_DEVICE_HINTS):
return False
# last_changed = when the state last flipped. If it flipped to unavailable
# >24h ago and stayed there, the integration is genuinely broken.
lc = s.get("last_changed") or s.get("last_updated")
if not lc:
return True # no timestamp = treat as old
try:
dt = datetime.fromisoformat(lc.replace("Z", "+00:00"))
except ValueError:
return True
return dt < threshold
unavail = [s for s in states if is_stale(s)]
domains = {}
for s in unavail:
d = s["entity_id"].split(".")[0]
@ -1496,24 +1636,42 @@ with open(f"{cache}/states.json") as f:
autos = [s for s in states if s["entity_id"].startswith("automation.")]
total = len(autos)
disabled = [a["entity_id"] for a in autos if a["state"] == "off"]
disabled_count = len(disabled)
# Noise filter rationale (was: any disabled OR not-triggered-in-30d):
# * "Disabled" alone is fine — Viktor disables automations intentionally
# (seasonal, holiday-only, paused). Only flag when ABANDONED, i.e.
# disabled for >180 days AND never triggered recently.
# * "Stale" alone is fine for low-frequency automations (annual reminders,
# manual triggers). Raise the bar to 180d (was 30d).
DISABLED_STALE_DAYS = 180
STALE_DAYS = 180
now = datetime.now(timezone.utc)
def days_since(ts):
if not ts:
return None
try:
return (now - datetime.fromisoformat(ts.replace("Z", "+00:00"))).days
except Exception:
return None
disabled = []
stale = []
for a in autos:
lt_days = days_since(a.get("attributes", {}).get("last_triggered"))
changed_days = days_since(a.get("last_changed"))
if a["state"] == "off":
continue
lt = a.get("attributes", {}).get("last_triggered")
if lt:
try:
t = datetime.fromisoformat(lt.replace("Z", "+00:00"))
days = (now - t).days
if days > 30:
stale.append(a["entity_id"] + "=" + str(days) + "d")
except:
pass
# Only flag a disabled automation if it has ALSO been untouched for
# the threshold — i.e. genuinely abandoned, not "paused for now".
# Use last_changed as a proxy for "user-touched recently".
if changed_days is None or changed_days > DISABLED_STALE_DAYS:
disabled.append(a["entity_id"])
else:
if lt_days is not None and lt_days > STALE_DAYS:
stale.append(f"{a['entity_id']}={lt_days}d")
disabled_count = len(disabled)
stale_count = len(stale)
disabled_names = "; ".join(disabled)
stale_names = "; ".join(stale[:10])
@ -2307,6 +2465,107 @@ except Exception as e:
}
# --- 42. External Reachability: Traefik 5xx Rate ---
check_pve_thermals() {
section 43 "PVE Host Thermals — Xeon E5-2699v4 package + per-core temps"
local raw status="PASS"
# Read all hwmon temp inputs in one SSH round-trip. Output: one line per
# sensor, "<sensor_label> <celsius>". Falls back gracefully on missing
# labels (Xeon coretemp driver exposes both `Package id 0` and `Core N`).
raw=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
root@192.168.1.127 '
cd /sys/class/hwmon/hwmon0 2>/dev/null || exit 1
for tfile in temp*_input; do
[[ -e "$tfile" ]] || continue
base=${tfile%_input}
label=$(cat "${base}_label" 2>/dev/null || echo "$base")
val=$(cat "$tfile" 2>/dev/null)
[[ -n "$val" ]] && echo "$label $((val/1000))"
done
' 2>/dev/null || true)
if [[ -z "$raw" ]]; then
[[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
warn "Could not read hwmon temps from 192.168.1.127 (SSH BatchMode failed or path missing)"
json_add "pve_thermals" "WARN" "SSH failed or hwmon path missing"
return 0
fi
local pkg_temp max_core_temp max_core_label
pkg_temp=$(echo "$raw" | awk '/^Package id/{print $NF; exit}')
max_core_temp=$(echo "$raw" | awk '/^Core/{if($NF>m){m=$NF; lbl=$1" "$2}} END{print m}')
max_core_label=$(echo "$raw" | awk '/^Core/{if($NF>m){m=$NF; lbl=$1" "$2}} END{print lbl}')
# Healthy baseline for this R730 (verified Apr 20-May 8 2026 from
# Prometheus): peak 61-69°C, avg 51-55°C. Treat anything above 65°C
# as a signal that some VM/workload is using too much CPU and warrants
# investigation, even though the Xeon E5-2699v4 has TjMax=83°C /
# Tcrit=93°C. This catches load creep early, well before throttling.
# PASS < 65°C package (within baseline 55-65 °C band)
# WARN 65-82°C package (elevated — investigate top CPU consumer)
# FAIL >= 83°C package (at/above TjMax — throttling imminent)
local detail="package=${pkg_temp}°C max_core=${max_core_temp}°C (${max_core_label})"
if [[ -z "$pkg_temp" ]]; then
[[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
warn "Package temp not found in hwmon output"
json_add "pve_thermals" "WARN" "$detail"
elif [[ "$pkg_temp" -ge 83 ]]; then
[[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
fail "PVE package temp ${pkg_temp}°C >= TjMax (83°C) — throttling imminent. $detail"
json_add "pve_thermals" "FAIL" "$detail"
status="FAIL"
elif [[ "$pkg_temp" -ge 65 ]]; then
[[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
warn "PVE package temp ${pkg_temp}°C above baseline (>65°C) — some VM is using too much CPU; check top kvm processes. $detail"
json_add "pve_thermals" "WARN" "$detail"
else
pass "PVE package ${pkg_temp}°C, hottest core ${max_core_temp}°C (${max_core_label}) — within 55-65°C baseline"
json_add "pve_thermals" "PASS" "$detail"
fi
}
check_pve_load() {
section 44 "PVE Host Load — load avg vs 44-thread capacity"
local raw load_1 load_5 load_15
raw=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
root@192.168.1.127 'cat /proc/loadavg' 2>/dev/null || true)
if [[ -z "$raw" ]]; then
[[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
warn "Could not read /proc/loadavg from 192.168.1.127"
json_add "pve_load" "WARN" "SSH failed"
return 0
fi
load_1=$(echo "$raw" | awk '{print $1}')
load_5=$(echo "$raw" | awk '{print $2}')
load_15=$(echo "$raw" | awk '{print $3}')
# Round load_5 down for integer comparison (avoid bc dep)
local load_5_int
load_5_int=$(printf '%.0f' "$load_5")
# R730: 44 hw threads (22c × HT). Healthy avg ~ 15-22 (~30-50% utilisation
# of thread count). Warn when sustained 5-min above 30 (~70% threads
# busy). Fail when 5-min above 38 (~85% — close to scheduler saturation).
# PASS load_5 < 30
# WARN 30 <= load_5 < 38
# FAIL load_5 >= 38
local detail="1m=${load_1} 5m=${load_5} 15m=${load_15}"
if [[ "$load_5_int" -ge 38 ]]; then
[[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
fail "PVE 5-min load ${load_5} >= 38 of 44 threads — saturation. $detail"
json_add "pve_load" "FAIL" "$detail"
elif [[ "$load_5_int" -ge 30 ]]; then
[[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
warn "PVE 5-min load ${load_5} in warn band (30-37 of 44 threads). $detail"
json_add "pve_load" "WARN" "$detail"
else
pass "PVE load avg $detail (< 30/44 threads)"
json_add "pve_load" "PASS" "$detail"
fi
}
check_external_traefik_5xx() {
section 42 "External — Traefik 5xx Rate (15m)"
local query_result detail="" status="PASS"
@ -2463,6 +2722,8 @@ main() {
check_monitoring_css
check_external_replicas
check_external_divergence
check_pve_thermals
check_pve_load
check_external_traefik_5xx
print_summary

View file

@ -207,7 +207,15 @@ else
dst="${BACKUP_ROOT}/pvc-data/${WEEK}/${ns_pvc}"
mkdir -p "${dst}"
rsync_rc=0
rsync -az --delete \
# Per-PVC rsync timeout (30 min). Without this, a single hung
# PVC blocks the entire backup until systemd's TimeoutStartSec
# kills the script (4h ceiling), leaving every later PVC
# unbacked and silently triggering WeeklyBackupFailing. Picked
# 30 min as well above the largest PVC's normal copy time
# (immich-postgres ~10 GiB, ~3 min on local ext4) and well
# below the unit-level budget so we still have headroom to
# finish the rest.
timeout 1800 rsync -az --delete \
${PREV:+--link-dest="${PREV}/${ns_pvc}/"} \
"${PVC_MOUNT}/" "${dst}/" 2>&1 || rsync_rc=$?
if [ "$rsync_rc" -eq 0 ]; then
@ -217,6 +225,12 @@ else
# (in-flight writes have corrupt metadata from skipped journal replay)
PVC_COUNT=$((PVC_COUNT + 1))
log " partial rsync (LUKS noload) for ${ns_pvc} — OK"
elif [ "$rsync_rc" -eq 124 ]; then
# `timeout` exit 124 = wall-clock killed the rsync. Track
# separately so the next run still produces a metric and
# doesn't pretend nothing happened.
warn "rsync timed out for ${ns_pvc} after 30 min — moving on"
PVC_FAIL=$((PVC_FAIL + 1))
else
warn "rsync failed for ${ns_pvc} (rc=$rsync_rc)"
PVC_FAIL=$((PVC_FAIL + 1))
@ -232,7 +246,11 @@ else
relpath="${dbfile#${PVC_MOUNT}/}"
dest_file="${BACKUP_ROOT}/sqlite-backup/${WEEK}/${ns_pvc}/${relpath}"
mkdir -p "$(dirname "${dest_file}")"
if sqlite3 "file://${dbfile}?mode=ro" ".backup '${dest_file}'" 2>/dev/null; then
# 5-min sqlite timeout — same hang-prevention idea
# as rsync above. A corrupted SQLite or one held
# open by a writer in the snapshot can otherwise
# block .backup indefinitely.
if timeout 300 sqlite3 "file://${dbfile}?mode=ro" ".backup '${dest_file}'" 2>/dev/null; then
log " SQLite: ${ns_pvc}/${relpath}"
else
cp "${dbfile}" "${dest_file}" 2>/dev/null || true
@ -326,7 +344,7 @@ fi
# ============================================================
log "--- Step 4: PVE host config ---"
mkdir -p "${BACKUP_ROOT}/pve-config/scripts"
rsync -az --delete /etc/pve/ "${BACKUP_ROOT}/pve-config/etc-pve/" 2>&1 || { warn "Failed to sync /etc/pve"; STATUS=1; }
timeout 300 rsync -az --delete /etc/pve/ "${BACKUP_ROOT}/pve-config/etc-pve/" 2>&1 || { warn "Failed to sync /etc/pve"; STATUS=1; }
for script in /usr/local/bin/lvm-pvc-snapshot /usr/local/bin/daily-backup /usr/local/bin/offsite-sync-backup; do
[ -f "${script}" ] && cp "${script}" "${BACKUP_ROOT}/pve-config/scripts/" 2>/dev/null || true
done

View file

@ -102,6 +102,30 @@ for arg in "$@"; do
esac
done
# Detect if this is a plan/apply/destroy/refresh — anything that reads or
# writes infra state. Cheap pre-flight check below scans only the current
# stack's .tf files for the ingress_factory auth-comment convention. Other
# tg verbs (init, fmt, validate) skip the check.
is_tf_op=false
for arg in "$@"; do
case "$arg" in
plan|apply|destroy|refresh) is_tf_op=true ;;
esac
done
# Anti-exposure guard: every `auth = "app"` or `auth = "none"` in this stack
# must have a preceding `# auth = "<tier>":` comment documenting what gates
# the app or why the endpoint is intentionally public. See:
# - infra/modules/kubernetes/ingress_factory/main.tf (variable description)
# - infra/.claude/CLAUDE.md "Auth" section
# Stack-scoped: untouched stacks aren't blocked from future applies until
# they're actually edited, at which point the convention applies.
if $is_tf_op && [ -n "$STACK_NAME" ]; then
if ! "$REPO_ROOT/scripts/check-ingress-auth-comments.py" "$REPO_ROOT/stacks/$STACK_NAME"; then
exit 1
fi
fi
# Acquire lock for mutating operations (Tier 0 only — Tier 1 uses pg_advisory_lock)
if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
if command -v vault &>/dev/null && [ -n "${VAULT_TOKEN:-}" ]; then

View file

@ -1,36 +1,114 @@
#!/usr/bin/env bash
#
# K8s component upgrader. Run on a single node (master OR worker) at a time.
# The caller is responsible for:
# - draining + uncordoning the node (this script does not touch kubectl)
# - sequencing nodes (master first, then workers one at a time)
# - pre-flight checks (etcd snapshot, halt-on-alert, etc)
#
# Used by:
# - the k8s-version-upgrade agent (infra/.claude/agents/k8s-version-upgrade.md)
# - manual operators following the runbook (infra/docs/runbooks/k8s-version-upgrade.md)
#
# Old manual orchestration loop (kept for reference — the agent does the
# equivalent now):
# for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do
# kb drain $n --ignore-daemonsets --delete-emptydir-data
# s wizard@$n 'bash -s' < update_k8s.sh --role worker --release 1.34.5
# kb uncordon $n
# done
# run for all nodes using :
# for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do echo $n; kb drain $n --ignore-daemonsets --delete-emptydir-data; s wizard@$n 'bash -s' <update_k8s.sh; kb uncordon $n; done
set -euo pipefail
set -e
export stable_version='1.34' # change me
export release="$stable_version.2" # change me
ROLE=""
RELEASE=""
echo "Upgrading to $stable_version"
usage() {
cat <<EOF
Usage: $0 --role <master|worker> --release <X.Y.Z>
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo mkdir -p /etc/apt/keyrings
curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/Release.key" | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
--role master|worker (required)
--release kubeadm/kubelet/kubectl target patch version, e.g. 1.34.5
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y kubeadm="$release-*"
Behavior:
- Rewrites /etc/apt/sources.list.d/kubernetes.list to the v\$MINOR/deb repo
derived from --release (so a 1.34.x release uses v1.34/deb, 1.35.x uses
v1.35/deb, etc).
- apt-get install kubeadm=<release>-* (apt-mark unhold first).
- master: kubeadm upgrade plan && kubeadm upgrade apply v<release> -y
- worker: kubeadm upgrade node
- apt-get install kubelet=<release>-* kubectl=<release>-* then re-hold.
- systemctl daemon-reload && systemctl restart kubelet
EOF
}
HOSTNAME=$(hostname)
SEARCH_STR="master"
while [[ $# -gt 0 ]]; do
case "$1" in
--role) ROLE="$2"; shift 2;;
--release) RELEASE="$2"; shift 2;;
-h|--help) usage; exit 0;;
*) echo "Unknown arg: $1" >&2; usage; exit 2;;
esac
done
if [[ "$HOSTNAME" == *"$SEARCH_STR"* ]]; then
echo "Upgrading master"
sudo kubeadm upgrade plan && sudo kubeadm upgrade apply v$release -y
else
echo "Upgrading worker"
sudo kubeadm upgrade node
if [[ -z "$ROLE" || -z "$RELEASE" ]]; then
echo "ERROR: --role and --release are required" >&2
usage
exit 2
fi
sudo apt-get install -y kubelet="$release-*" kubectl="$release-*"
sudo apt-mark hold kubeadm kubelet kubectl
if [[ "$ROLE" != "master" && "$ROLE" != "worker" ]]; then
echo "ERROR: --role must be 'master' or 'worker' (got: $ROLE)" >&2
exit 2
fi
# Derive minor track (e.g. 1.34.5 → 1.34)
STABLE_VERSION="$(echo "$RELEASE" | awk -F. '{print $1"."$2}')"
echo "==> Upgrading $(hostname) ($ROLE) to v$RELEASE (track v$STABLE_VERSION)"
# Apt repo URL is pinned per minor track. Rewrite + re-import the signing key
# every run — cheap, idempotent, and handles the minor-bump case where the
# old track's repo no longer carries the target version.
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/ /" \
| sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo mkdir -p /etc/apt/keyrings
curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/Release.key" \
| sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y "kubeadm=$RELEASE-*"
if [[ "$ROLE" == "master" ]]; then
echo "==> Master path: kubeadm upgrade plan + apply"
sudo kubeadm upgrade plan
# The first apply may fail with "static Pod hash for component <X> did
# not change after 5m0s" — kubeadm's 5min wait for the kubelet to reload
# a static pod is too tight on our cluster (apiserver-to-kubelet status
# sync latency post-master-reboot can exceed it). The etcd image IS
# actually updated by then, so a 2nd attempt sees etcd already on
# target and skips it. Up to 3 attempts with a 30s delay between.
attempt=1
while ! sudo kubeadm upgrade apply "v$RELEASE" -y; do
if (( attempt >= 3 )); then
echo "ERROR: kubeadm upgrade apply failed after 3 attempts" >&2
exit 1
fi
echo "==> kubeadm apply attempt $attempt failed (likely static-pod-hash 5m timeout). Sleeping 30s then retrying — the previous attempt's manifest writes usually take hold on the 2nd try."
sleep 30
attempt=$(( attempt + 1 ))
done
echo "==> kubeadm upgrade apply succeeded on attempt $attempt"
else
echo "==> Worker path: kubeadm upgrade node"
sudo kubeadm upgrade node
fi
sudo apt-get install -y "kubelet=$RELEASE-*" "kubectl=$RELEASE-*"
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet
echo "==> Done: $(hostname) is on v$RELEASE"

View file

@ -1,8 +1,14 @@
#!/usr/bin/env bash
#
# OS-major upgrade (Ubuntu do-release-upgrade). NOT in the auto-upgrade
# pipeline — minor apt patches are handled by unattended-upgrades + kured;
# K8s component bumps are handled by the k8s-version-upgrade agent. Run this
# script manually when bumping Ubuntu LTS major versions.
#
# See:
# - infra/docs/runbooks/k8s-node-auto-upgrades.md (apt + reboot)
# - infra/docs/runbooks/k8s-version-upgrade.md (kubeadm/kubelet/kubectl)
# sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y
sudo do-release-upgrade
sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y

619
scripts/upgrade_state.sh Executable file
View file

@ -0,0 +1,619 @@
#!/usr/bin/env bash
#
# upgrade_state.sh — survey the three autonomous-upgrade pipelines.
#
# Companion to cluster_healthcheck.sh, surfaced via the /upgrade-state skill.
# Read-only by design — no --fix.
#
# The three pipelines:
# 1. Apps — Keel polls registries hourly and rolls Deployments tagged
# keel.sh/policy. Metrics on container :9300/metrics.
# 2. OS — unattended-upgrades patches in-release per node; kured
# reboots within a daily 02:00-06:00 London window.
# 3. K8s — k8s-version-check CronJob (Sun 12:00 UTC) detects new
# kubeadm patch/minor releases; Job-chain drains+upgrades
# node-by-node. Pushgateway holds k8s_upgrade_* gauges.
#
# Exit codes: 0 healthy, 1 attention warranted, 2 something stalled.
set -euo pipefail
# --- Colors ---
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
BOLD='\033[1m'
NC='\033[0m'
# --- Globals ---
JSON=false
KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="/home/wizard/code/infra/config"
KUBECTL=""
NODES=(k8s-master:10.0.20.100 k8s-node1:10.0.20.101 k8s-node2:10.0.20.102 k8s-node3:10.0.20.103 k8s-node4:10.0.20.104)
SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no)
NOW_EPOCH=$(date -u +%s)
HIGHEST_EXIT=0 # 0 healthy, 1 attention, 2 stalled
# Results — collectors fill these.
APPS_STATUS_ICON=""; APPS_STATUS_TEXT=""
APPS_LAST_CHECK=""; APPS_NEXT=""; APPS_NOTES=""
APPS_ENROLLED=0; APPS_PENDING=0; APPS_UPDATES_LINE=""; APPS_ERROR_LINE=""
OS_STATUS_ICON=""; OS_STATUS_TEXT=""
OS_LAST_CHECK=""; OS_NEXT=""; OS_NOTES=""
OS_DISTRO_SUMMARY=""; OS_KERNEL_SUMMARY=""
OS_PENDING_REBOOT_NODES=""; OS_HELD_DETAIL=""
OS_LAST_UU=""; OS_LAST_KURED=""
K8S_STATUS_ICON=""; K8S_STATUS_TEXT=""
K8S_LAST_CHECK=""; K8S_NEXT=""; K8S_NOTES=""
K8S_RUNNING=""; K8S_PATCH=""; K8S_MINOR=""
K8S_LAST_DETECT_LINE=""; K8S_IN_FLIGHT="no"; K8S_LAST_CHAIN=""
# --- Helpers ---
log() { [[ "$JSON" == true ]] && return 0; echo -e "$*"; }
raise_exit() {
local n="$1"
if [[ "$n" -gt "$HIGHEST_EXIT" ]]; then HIGHEST_EXIT="$n"; fi
return 0
}
usage() {
cat <<EOF
Usage: $0 [--json] [--kubeconfig <path>]
Read-only audit of the three autonomous-upgrade pipelines (apps, OS, k8s).
--json machine-readable JSON
--kubeconfig PATH override kubeconfig
Exit codes: 0 healthy, 1 attention warranted, 2 something stalled.
EOF
}
parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--json) JSON=true; shift ;;
--kubeconfig) KUBECONFIG_PATH="$2"; shift 2 ;;
-h|--help) usage; exit 0 ;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
KUBECTL="kubectl --kubeconfig $KUBECONFIG_PATH"
}
# Prometheus query — Prometheus + reload + backup share a network namespace,
# so reaching localhost:9090 works from any of the three sidecars.
prom_q() {
local q="$1"
$KUBECTL -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- "http://localhost:9090/api/v1/query?query=${q}" 2>/dev/null || true
}
pg_metrics() {
$KUBECTL -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- "http://prometheus-prometheus-pushgateway:9091/metrics" 2>/dev/null || true
}
ssh_node() {
local ip="$1"; shift
ssh "${SSH_OPTS[@]}" "wizard@$ip" "$@" 2>/dev/null || true
}
human_age() {
local secs="$1"
if [[ "$secs" -lt 60 ]]; then printf '%ds ago' "$secs"
elif [[ "$secs" -lt 3600 ]]; then printf '%dm ago' $((secs/60))
elif [[ "$secs" -lt 86400 ]]; then printf '%dh ago' $((secs/3600))
else printf '%dd ago' $((secs/86400))
fi
}
# Pushgateway emits floats and scientific notation — coerce to integer
# epoch seconds. Returns 0 if the input is empty / zero / unparseable.
to_epoch_int() {
local v="${1:-}"
if [[ -z "$v" || "$v" == "0" ]]; then echo 0; return; fi
python3 -c "import sys; v=sys.argv[1]; print(int(float(v)))" "$v" 2>/dev/null || echo 0
}
# --- 1. Apps (Keel) ---
collect_apps() {
local pending tracked enrolled updates_24h errors
# Enrolled: count Deployments with keel.sh/policy != never (Keel itself
# is policy=never). The Kyverno auto-injection labels namespaces
# keel.sh/enrolled=true, but the annotation is what Keel watches.
enrolled=$($KUBECTL get deploy -A -o json 2>/dev/null | python3 -c '
import json, sys
data = json.load(sys.stdin)
n = sum(1 for d in data["items"]
if (d["metadata"].get("annotations") or {}).get("keel.sh/policy", "never") != "never")
print(n)
' 2>/dev/null || echo 0)
APPS_ENROLLED="$enrolled"
# Pending approvals (sum across Keel pods).
pending=$(prom_q 'sum(pending_approvals)' | python3 -c '
import json, sys
try:
r = json.load(sys.stdin)["data"]["result"]
print(int(float(r[0]["value"][1])) if r else 0)
except Exception:
print(0)
' 2>/dev/null || echo 0)
APPS_PENDING="$pending"
# Tracked images — proxy for "is the scrape live?".
tracked=$(prom_q 'count(count by (image) (registries_scanned_total))' | python3 -c '
import json, sys
try:
r = json.load(sys.stdin)["data"]["result"]
print(int(float(r[0]["value"][1])) if r else 0)
except Exception:
print(0)
' 2>/dev/null || echo 0)
# Last scrape age — `up{job="kubernetes-pods", app="keel"}` is 1 if the
# most recent scrape succeeded. We surface the wallclock age via a tiny
# `time() - timestamp(up{...})` query.
APPS_LAST_CHECK=$(prom_q 'time()-timestamp(up{job="kubernetes-pods",app="keel"})' | python3 -c '
import json, sys
try:
r = json.load(sys.stdin)["data"]["result"]
if not r: print("scrape not live")
else:
secs = int(float(r[0]["value"][1]))
if secs < 60: print(f"{secs}s ago")
elif secs < 3600: print(f"{secs//60}m ago")
else: print(f"{secs//3600}h ago")
except Exception:
print("?")
' 2>/dev/null || echo "?")
# Recent updates: count lines in Keel logs that report a successful
# rollout. Keel logs an "update completed" message per rollout.
local log_24h
log_24h=$($KUBECTL -n keel logs deploy/keel --since=24h --tail=2000 2>/dev/null || true)
updates_24h=$(echo "$log_24h" | grep -cE 'update completed|successfully updated|deployment updated' 2>/dev/null || true)
[[ -z "$updates_24h" ]] && updates_24h=0
APPS_UPDATES_LINE="$updates_24h in last 24h (tracked images: $tracked)"
# Known-benign Keel error patterns to suppress. Each is a real error
# line Keel emits, but the surrounding behaviour is fine, so flagging
# them in /upgrade-state is just noise.
# - `bot.Run(): can not get configuration for bot [slack]` — Keel
# 1.2.0 registers a Slack socket-mode bot whenever SLACK_BOT_TOKEN
# is set, then fails because we don't supply an `xapp-` app-level
# token. We don't want the interactive bot (no approvals; opt-out
# auto-update). The Slack NOTIFICATION sender works independently
# of the bot, so rollout messages still post to #general.
# - `failed to check digest` with a transient network error —
# Keel polls ~175 image manifests against public registries
# hourly. Occasional `i/o timeout` / `connection refused` /
# `TLS handshake timeout` / `no such host` / `EOF` /
# `context deadline exceeded` are inherent to public-internet
# polling at that scale and auto-recover on the next poll.
# Actionable digest-check failures surface as HTTP 401/404
# (auth, removed-tag) — those are NOT filtered.
# - `failed to check digest` with HTTP 5xx — upstream registry
# having a problem (DockerHub maintenance, Forgejo restart,
# etc.). Same recovery pattern as network errors: next hourly
# poll succeeds once upstream is back. Persistent 5xx for >24h
# would indicate a real registry-side issue, but that surfaces
# via the registry's own monitoring (e.g. forgejo-integrity-probe
# + RegistryCatalogInaccessible), not via Keel logs.
local benign_re='bot\.Run\(\): can not get configuration for bot \[slack\]'
benign_re+='|SLACK_APP_TOKEN must have the (previf|prefix)'
benign_re+='|failed to check digest.*(i/o timeout|connection refused|connection reset|context deadline exceeded|TLS handshake timeout|no such host|: EOF)'
benign_re+='|failed to check digest.*non-successful response \(status=5[0-9][0-9]'
errors=$(echo "$log_24h" | grep -iE '"level":"(error|fatal)"|level=error' | grep -vE "$benign_re" | tail -3 || true)
if [[ -z "$errors" ]]; then
APPS_ERROR_LINE="(none in last 24h)"
else
APPS_ERROR_LINE="$(echo "$errors" | wc -l | tr -d ' ') error(s); newest: $(echo "$errors" | tail -1 | cut -c1-120)"
fi
# Keel pod state.
local pod_status
pod_status=$($KUBECTL -n keel get pods -l app=keel -o jsonpath='{.items[*].status.phase}' 2>/dev/null || true)
if [[ "$pod_status" != *"Running"* ]]; then
APPS_STATUS_ICON="✗"; APPS_STATUS_TEXT="down"
APPS_NOTES="Keel pod not Running ($pod_status)"
raise_exit 2
elif [[ "$pending" -gt 0 || -n "$errors" ]]; then
APPS_STATUS_ICON="⚠"; APPS_STATUS_TEXT="attn"
APPS_NOTES="$enrolled enrolled; $pending pending; $(echo "$errors" | wc -l | tr -d ' ') recent error(s)"
raise_exit 1
else
APPS_STATUS_ICON="✓"; APPS_STATUS_TEXT="healthy"
APPS_NOTES="$enrolled enrolled, 0 pending, 0 errors"
fi
APPS_NEXT="rolling, hourly poll"
}
# --- 2. OS (apt + kured) ---
collect_os() {
local distros kernels distro_uniq kernel_uniq
distros=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.osImage}{"\n"}{end}' 2>/dev/null)
kernels=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' 2>/dev/null)
distro_uniq=$(echo "$distros" | sort -u | tr '\n' ',' | sed 's/,$//; s/,/, /g')
kernel_uniq=$(echo "$kernels" | sort -u | tr '\n' ',' | sed 's/,$//; s/,/, /g')
OS_DISTRO_SUMMARY="$distro_uniq"
OS_KERNEL_SUMMARY="$kernel_uniq"
# SSH fan-out — parallel background subshells, write per-node results to tmp files.
local tmpdir; tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' RETURN
local entry name ip
for entry in "${NODES[@]}"; do
name="${entry%%:*}"; ip="${entry##*:}"
(
local out reboot held upgradable uu_log
reboot=$(ssh_node "$ip" 'test -f /var/run/reboot-required && echo yes || echo no')
held=$(ssh_node "$ip" 'apt-mark showhold 2>/dev/null')
upgradable=$(ssh_node "$ip" 'apt list --upgradable 2>/dev/null | tail -n +2')
uu_log=$(ssh_node "$ip" 'tail -1 /var/log/unattended-upgrades/unattended-upgrades.log 2>/dev/null')
printf 'reboot=%s\n' "$reboot" > "$tmpdir/$name"
printf 'held<<<EOF\n%s\nEOF\n' "$held" >> "$tmpdir/$name"
printf 'upgradable<<<EOF\n%s\nEOF\n' "$upgradable" >> "$tmpdir/$name"
printf 'uu_log=%s\n' "$uu_log" >> "$tmpdir/$name"
) &
done
wait
# Aggregate.
local pending_reboots=() held_with_bumps_lines=() newest_uu_ts=0 newest_uu_iso=""
for entry in "${NODES[@]}"; do
name="${entry%%:*}"
[[ -f "$tmpdir/$name" ]] || continue
local reboot held upgradable uu_log uu_ts
reboot=$(awk -F= '/^reboot=/{print $2}' "$tmpdir/$name")
held=$(awk '/^held<<<EOF$/,/^EOF$/' "$tmpdir/$name" | sed '1d;$d')
upgradable=$(awk '/^upgradable<<<EOF$/,/^EOF$/' "$tmpdir/$name" | sed '1d;$d')
uu_log=$(awk -F= '/^uu_log=/{sub(/^uu_log=/,""); print}' "$tmpdir/$name")
[[ "$reboot" == "yes" ]] && pending_reboots+=("$name")
# Held + upgradable, excluding k8s components (managed by k8s pipeline).
local pkg from to bump
while IFS= read -r line; do
[[ -z "$line" ]] && continue
pkg=$(echo "$line" | awk -F/ '{print $1}')
# Skip k8s and kernel/linux-image — the chain handles those.
case "$pkg" in
kubeadm|kubectl|kubelet) continue ;;
linux-image-*|linux-headers-*|linux-modules-*|linux-generic|linux-headers-generic|linux-image-generic) continue ;;
esac
# Only flag if the package is held.
if echo "$held" | grep -qx "$pkg"; then
to=$(echo "$line" | awk '{print $2}')
from=$(echo "$line" | sed -n 's/.*from: \([^ ]*\).*/\1/p')
bump="$pkg ${from%-*}${to%-*}"
held_with_bumps_lines+=("$name: $bump")
fi
done <<<"$upgradable"
# Newest uu timestamp (ISO at start of log line).
uu_ts=$(echo "$uu_log" | sed -E 's/^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}).*/\1/')
if [[ -n "$uu_ts" ]]; then
local epoch; epoch=$(date -u -d "$uu_ts" +%s 2>/dev/null || echo 0)
if [[ "$epoch" -gt "$newest_uu_ts" ]]; then
newest_uu_ts="$epoch"; newest_uu_iso="$uu_ts"
fi
fi
done
OS_PENDING_REBOOT_NODES="${pending_reboots[*]:-}"
if [[ ${#held_with_bumps_lines[@]} -gt 0 ]]; then
OS_HELD_DETAIL=$(printf '%s\n' "${held_with_bumps_lines[@]}" | sort -u | paste -sd '; ' -)
fi
if [[ "$newest_uu_ts" -gt 0 ]]; then
local age=$((NOW_EPOCH - newest_uu_ts))
OS_LAST_UU="$newest_uu_iso UTC ($(human_age "$age"))"
OS_LAST_CHECK="$(human_age "$age") (uu daily)"
else
OS_LAST_UU="(no uu log accessible)"
OS_LAST_CHECK="?"
fi
# Last kured reboot — newest Ready transition across worker nodes.
# `Ready -> True` is what kured causes when the node returns; we surface
# the most recent timestamp and the node it belongs to.
local kured_raw kured_iso kured_node kured_ep kured_age
kured_raw=$($KUBECTL get nodes -o json 2>/dev/null | python3 -c '
import json, sys
from datetime import datetime
data = json.load(sys.stdin)
best = (0, "", "")
for n in data["items"]:
name = n["metadata"]["name"]
for c in n["status"].get("conditions", []):
if c["type"] == "Ready":
dt = datetime.strptime(c["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ")
ep = int(dt.timestamp())
if ep > best[0]:
best = (ep, name, c["lastTransitionTime"])
print(f"{best[0]}|{best[1]}|{best[2]}")
' 2>/dev/null || echo "0||")
kured_ep="${kured_raw%%|*}"
kured_node=$(echo "$kured_raw" | cut -d'|' -f2)
kured_iso=$(echo "$kured_raw" | cut -d'|' -f3)
if [[ "$kured_ep" -gt 0 ]]; then
kured_age=$((NOW_EPOCH - kured_ep))
OS_LAST_KURED="$kured_iso ($kured_node, $(human_age "$kured_age"))"
else
OS_LAST_KURED="?"
fi
OS_NEXT="daily 02:00-06:00 London"
# Kured pod health.
local kured_pods kured_unhealthy
kured_pods=$($KUBECTL -n kured get pods -l app.kubernetes.io/name=kured -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' 2>/dev/null)
kured_unhealthy=$(echo "$kured_pods" | grep -cv '^Running$' 2>/dev/null || true)
local notes=()
[[ -n "$OS_HELD_DETAIL" ]] && notes+=("held with bumps: $OS_HELD_DETAIL")
[[ -n "$OS_PENDING_REBOOT_NODES" ]] && notes+=("pending reboot: $OS_PENDING_REBOOT_NODES")
if [[ "$kured_unhealthy" -gt 0 ]]; then
OS_STATUS_ICON="✗"; OS_STATUS_TEXT="kured down"
OS_NOTES="kured pods not all Running"
raise_exit 2
elif [[ ${#notes[@]} -gt 0 ]]; then
OS_STATUS_ICON="⚠"; OS_STATUS_TEXT="attn"
OS_NOTES="${notes[*]}"
raise_exit 1
else
OS_STATUS_ICON="✓"; OS_STATUS_TEXT="healthy"
OS_NOTES="distros uniform; no held bumps; no pending reboots"
fi
}
# --- 3. K8s (kubeadm/kubelet/kubectl) ---
collect_k8s() {
local kver_list kver_uniq metrics target_patch target_minor last_run in_flight started
kver_list=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' 2>/dev/null)
kver_uniq=$(echo "$kver_list" | sort -u)
local n_uniq; n_uniq=$(echo "$kver_uniq" | wc -l | tr -d ' ')
if [[ "$n_uniq" -eq 1 ]]; then
K8S_RUNNING="$kver_uniq across $(echo "$kver_list" | wc -l | tr -d ' ')/$(echo "$kver_list" | wc -l | tr -d ' ') nodes"
else
K8S_RUNNING="mixed: $(echo "$kver_uniq" | paste -sd', ' -)"
fi
local running_ver; running_ver=$(echo "$kver_uniq" | head -1)
metrics=$(pg_metrics)
# All five may legitimately be absent (cluster never ran the upgrade
# chain, kind="minor" not detected, etc.) — `|| true` keeps pipefail
# from killing the script on no-match.
target_patch=$(echo "$metrics" | { grep -E '^k8s_upgrade_available\{[^}]*kind="patch"' || true; } | sed -n 's/.*target="\([^"]*\)".*/\1/p' | head -1)
target_minor=$(echo "$metrics" | { grep -E '^k8s_upgrade_available\{[^}]*kind="minor"' || true; } | sed -n 's/.*target="\([^"]*\)".*/\1/p' | head -1)
# Pushgateway emits these with `{instance="",job="..."}` labels — the
# `awk '$1 ~ /^name(\{|$)/'` form matches both bare and labelled metrics.
last_run=$(echo "$metrics" | awk '$1 ~ /^k8s_version_check_last_run_timestamp(\{|$)/{print $2}' | head -1 || true)
in_flight=$(echo "$metrics" | awk '$1 ~ /^k8s_upgrade_in_flight(\{|$)/{print $2}' | head -1 || true)
started=$(echo "$metrics" | awk '$1 ~ /^k8s_upgrade_started_timestamp(\{|$)/{print $2}' | head -1 || true)
# Pushgateway timestamps come back in scientific notation
# (e.g. 1.779052159e+09) — convert to plain integer seconds.
local last_run_int started_int
last_run_int=$(to_epoch_int "$last_run")
started_int=$(to_epoch_int "$started")
if [[ "$last_run_int" -gt 0 ]]; then
local age=$((NOW_EPOCH - last_run_int))
K8S_LAST_CHECK="$(human_age "$age") (daily cron)"
if [[ -n "$target_patch" ]]; then
K8S_LAST_DETECT_LINE="last run $(human_age "$age"): available v$target_patch (patch)"
elif [[ -n "$target_minor" ]]; then
K8S_LAST_DETECT_LINE="last run $(human_age "$age"): available v$target_minor (minor)"
else
K8S_LAST_DETECT_LINE="last run $(human_age "$age"): no upgrade available"
fi
else
K8S_LAST_CHECK="(metric missing)"
K8S_LAST_DETECT_LINE="(no k8s_version_check_last_run_timestamp in Pushgateway)"
fi
K8S_PATCH="${target_patch:-none}"
K8S_MINOR="${target_minor:-none}"
# In-flight / last chain.
if [[ "${in_flight:-0}" == "1" ]]; then
K8S_IN_FLIGHT="yes"
local since=0
[[ "$started_int" -gt 0 ]] && since=$((NOW_EPOCH - started_int))
K8S_LAST_CHAIN="in-flight (started $(human_age "$since"))"
else
K8S_IN_FLIGHT="no"
if [[ "$started_int" -gt 0 ]]; then
local age=$((NOW_EPOCH - started_int))
K8S_LAST_CHAIN="$(human_age "$age")"
else
K8S_LAST_CHAIN="never (or zeroed)"
fi
fi
K8S_NEXT="$(next_daily_noon_utc)"
# Status logic.
local stalled=0
if [[ "${in_flight:-0}" == "1" && "$started_int" -gt 0 ]]; then
# K8sUpgradeStalled fires after 5400s (90m) per monitoring stack.
local since=$((NOW_EPOCH - started_int))
[[ "$since" -gt 5400 ]] && stalled=1
fi
local last_run_age=999999999
[[ "$last_run_int" -gt 0 ]] && last_run_age=$((NOW_EPOCH - last_run_int))
if [[ "$stalled" == "1" ]]; then
K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="stalled"
K8S_NOTES="K8sUpgradeStalled would fire — chain in-flight >90m"
raise_exit 2
elif [[ "$last_run_age" -gt $((9*86400)) ]]; then
K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="detection stale"
K8S_NOTES="last detection >9d ago"
raise_exit 2
elif [[ "${in_flight:-0}" == "1" ]]; then
K8S_STATUS_ICON="…"; K8S_STATUS_TEXT="in-flight"
K8S_NOTES="upgrade chain running"
raise_exit 1
elif [[ -n "$target_patch" ]]; then
K8S_STATUS_ICON="→"; K8S_STATUS_TEXT="$target_patch"
K8S_NOTES="running $running_ver → v$target_patch (patch) available"
raise_exit 1
elif [[ -n "$target_minor" ]]; then
K8S_STATUS_ICON="→"; K8S_STATUS_TEXT="$target_minor"
K8S_NOTES="running $running_ver → v$target_minor (minor) available"
raise_exit 1
else
K8S_STATUS_ICON="✓"; K8S_STATUS_TEXT="current"
K8S_NOTES="running $running_ver, nothing newer"
fi
}
# Next daily 12:00 UTC — pure bash date math, no croniter. Schedule was
# weekly Sunday until 2026-05-18; now `0 12 * * *` in the
# k8s-version-upgrade stack. If we're still before today's 12:00 UTC,
# the next run is today; otherwise it's tomorrow.
next_daily_noon_utc() {
local hr days_ahead
hr=$(date -u +%H)
if [[ "$hr" -lt 12 ]]; then days_ahead=0; else days_ahead=1; fi
date -u -d "+$days_ahead days" +"%a %Y-%m-%d 12:00 UTC"
}
# --- Renderers ---
# The table uses `column -t` so we don't have to compute visual widths
# manually (the status icons are multi-byte UTF-8 and ANSI escapes don't
# play nice with `printf %-Xs`). Trade-off: no in-cell colour, but the
# icon character already carries the signal.
render_table() {
echo
printf "${BOLD}Upgrade state — %s${NC}\n" "$(date -u +'%Y-%m-%d %H:%M UTC')"
echo
{
echo "Layer|Status|Last check|Next upgrade|Notes"
echo "-----|------|----------|------------|-----"
printf 'Apps|%s %s|%s|%s|%s\n' "$APPS_STATUS_ICON" "$APPS_STATUS_TEXT" "$APPS_LAST_CHECK" "$APPS_NEXT" "$APPS_NOTES"
printf 'OS |%s %s|%s|%s|%s\n' "$OS_STATUS_ICON" "$OS_STATUS_TEXT" "$OS_LAST_CHECK" "$OS_NEXT" "$OS_NOTES"
printf 'K8s |%s %s|%s|%s|%s\n' "$K8S_STATUS_ICON" "$K8S_STATUS_TEXT" "$K8S_LAST_CHECK" "$K8S_NEXT" "$K8S_NOTES"
} | column -t -s '|' -o ' | '
echo
printf "${BOLD}--- Apps (Keel) ---${NC}\n"
echo "Enrolled deployments: $APPS_ENROLLED"
echo "Recent rollouts: $APPS_UPDATES_LINE"
echo "Pending approvals: $APPS_PENDING"
echo "Last Keel error: $APPS_ERROR_LINE"
echo
printf "${BOLD}--- OS (apt + kured) ---${NC}\n"
echo "Ubuntu per node: $OS_DISTRO_SUMMARY"
echo "Kernel per node: $OS_KERNEL_SUMMARY"
echo "Pending reboot: ${OS_PENDING_REBOOT_NODES:-none}"
echo "Held packages with upstream bumps: ${OS_HELD_DETAIL:-none (excluding k8s components)}"
echo "Last uu run (newest across nodes): $OS_LAST_UU"
echo "Last kured reboot (newest Ready transition): $OS_LAST_KURED"
echo "Next kured window: $OS_NEXT"
echo
printf "${BOLD}--- K8s (kubeadm/kubelet/kubectl) ---${NC}\n"
echo "Running: $K8S_RUNNING"
echo "Latest patch (apt): ${K8S_PATCH}"
echo "Next minor available: ${K8S_MINOR}"
echo "Detection: $K8S_LAST_DETECT_LINE"
echo "In-flight: $K8S_IN_FLIGHT | Last chain start: $K8S_LAST_CHAIN"
echo "Next detection: $K8S_NEXT"
echo
}
render_json() {
# Pipe values into Python via env vars so we don't need to worry about
# embedded quotes/backslashes in error lines.
APPS_STATUS_ICON="$APPS_STATUS_ICON" APPS_STATUS_TEXT="$APPS_STATUS_TEXT" \
APPS_LAST_CHECK="$APPS_LAST_CHECK" APPS_NEXT="$APPS_NEXT" APPS_NOTES="$APPS_NOTES" \
APPS_ENROLLED="$APPS_ENROLLED" APPS_PENDING="$APPS_PENDING" \
APPS_UPDATES_LINE="$APPS_UPDATES_LINE" APPS_ERROR_LINE="$APPS_ERROR_LINE" \
OS_STATUS_ICON="$OS_STATUS_ICON" OS_STATUS_TEXT="$OS_STATUS_TEXT" \
OS_LAST_CHECK="$OS_LAST_CHECK" OS_NEXT="$OS_NEXT" OS_NOTES="$OS_NOTES" \
OS_DISTRO_SUMMARY="$OS_DISTRO_SUMMARY" OS_KERNEL_SUMMARY="$OS_KERNEL_SUMMARY" \
OS_PENDING_REBOOT_NODES="$OS_PENDING_REBOOT_NODES" OS_HELD_DETAIL="$OS_HELD_DETAIL" \
OS_LAST_UU="$OS_LAST_UU" OS_LAST_KURED="$OS_LAST_KURED" \
K8S_STATUS_ICON="$K8S_STATUS_ICON" K8S_STATUS_TEXT="$K8S_STATUS_TEXT" \
K8S_LAST_CHECK="$K8S_LAST_CHECK" K8S_NEXT="$K8S_NEXT" K8S_NOTES="$K8S_NOTES" \
K8S_RUNNING="$K8S_RUNNING" K8S_PATCH="$K8S_PATCH" K8S_MINOR="$K8S_MINOR" \
K8S_LAST_DETECT_LINE="$K8S_LAST_DETECT_LINE" K8S_IN_FLIGHT="$K8S_IN_FLIGHT" K8S_LAST_CHAIN="$K8S_LAST_CHAIN" \
HIGHEST_EXIT="$HIGHEST_EXIT" \
python3 -c '
import json, os
from datetime import datetime, timezone
def env(k): return os.environ.get(k, "")
out = {
"as_of_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
"highest_exit": int(env("HIGHEST_EXIT")),
"apps": {
"status": env("APPS_STATUS_ICON"),
"status_text": env("APPS_STATUS_TEXT"),
"last_check": env("APPS_LAST_CHECK"),
"next_upgrade": env("APPS_NEXT"),
"notes": env("APPS_NOTES"),
"enrolled": int(env("APPS_ENROLLED") or 0),
"pending_approvals": int(env("APPS_PENDING") or 0),
"updates_line": env("APPS_UPDATES_LINE"),
"errors_line": env("APPS_ERROR_LINE"),
},
"os": {
"status": env("OS_STATUS_ICON"),
"status_text": env("OS_STATUS_TEXT"),
"last_check": env("OS_LAST_CHECK"),
"next_upgrade": env("OS_NEXT"),
"notes": env("OS_NOTES"),
"distros": env("OS_DISTRO_SUMMARY"),
"kernels": env("OS_KERNEL_SUMMARY"),
"pending_reboot_nodes": env("OS_PENDING_REBOOT_NODES"),
"held_with_bumps": env("OS_HELD_DETAIL"),
"last_uu_run": env("OS_LAST_UU"),
"last_kured_reboot": env("OS_LAST_KURED"),
},
"k8s": {
"status": env("K8S_STATUS_ICON"),
"status_text": env("K8S_STATUS_TEXT"),
"last_check": env("K8S_LAST_CHECK"),
"next_upgrade": env("K8S_NEXT"),
"notes": env("K8S_NOTES"),
"running": env("K8S_RUNNING"),
"patch_target": env("K8S_PATCH"),
"minor_target": env("K8S_MINOR"),
"last_detection_line": env("K8S_LAST_DETECT_LINE"),
"in_flight": env("K8S_IN_FLIGHT"),
"last_chain": env("K8S_LAST_CHAIN"),
},
}
print(json.dumps(out, indent=2))
'
}
main() {
parse_args "$@"
collect_apps
collect_os
collect_k8s
if [[ "$JSON" == true ]]; then
render_json
else
render_table
fi
exit "$HIGHEST_EXIT"
}
main "$@"

Binary file not shown.

Binary file not shown.

View file

@ -87,5 +87,5 @@ module "ingress" {
name = "<app-name>"
tls_secret_name = var.tls_secret_name
dns_type = "proxied" # "proxied" (Cloudflare CDN), "non-proxied" (direct A/AAAA), or "none"
protected = false # Set true to require Authentik login
auth = "required" # "required" (Authentik login), "public" (anonymous bound to guest), or "none" (no auth)
}

View file

@ -29,6 +29,20 @@ provider "registry.terraform.io/goauthentik/authentik" {
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
"zh:090260dc7889ea822ec1d899344e1ee23eba5290461989c0796149c9511f2316",
"zh:13c2655ff824b0dc4b9bb832b5ca6d41dba97cb280330258c5fef4115e236209",
"zh:166a73c3a810c9c895d68a8ff968158f339f8a2c1c03e20ec9fc5ed99cc64e20",
"zh:203777eae1cdc711233315499643180604cff2324411b186b7cf07fdbe16f655",
"zh:3b2f18c9a8d28dac74dc6bbf168c946855ab9c68f053578d4630c50d5eaf30a0",
"zh:4822275985f6b74b6196c47112316a4252db22cf4ceaef7c9ab4c66d488abf2f",
"zh:53ea97562666c8a5a2f6d63d418a302a7f8ee4b7bb7da35dedaa89aa5708b7f0",
"zh:56b8a230901e3550c92a1d3f58ee9dafe9853f30fe4315af3ab28ae63262e15d",
"zh:6293ab7b1fd8206a0c853591f50186aca4a1eff117b2a773e10760a23a2c83e9",
"zh:9433970f79fb92d8aae3ee436db5630ab312c78b6dc9df9c1db3273a18f8aaa1",
"zh:95df406214f79b3b98222d7c7fe8fc319a3d90b7a9d53e1d5abbda5dfb8b9436",
"zh:a85880da0552a42c8f449390fbd7d8b03541d1a13e04bba9f1404fa658754260",
"zh:a95f6e9bd62c67e70eba1b1a14728856b9a6a28cd1e5e3be54a7718882c87e7f",
"zh:dd599b51c5beb34a4c6feece244fde07d2558d69929449ab1fd39a5ebe738781",
]
}
@ -56,6 +70,18 @@ provider "registry.terraform.io/hashicorp/kubernetes" {
version = "3.1.0"
hashes = [
"h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
"zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
"zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
"zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
"zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
"zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
"zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
"zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
"zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
"zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
"zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
]
}

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "actualbudget"
}
}

View file

@ -18,6 +18,11 @@ variable "budget_encryption_password" {
# and are unknown at plan time on first apply, so we cannot base `count` on
# them directly. Callers pass these booleans as hardcoded plan-time constants
# that reflect whether the corresponding credentials are expected to exist.
variable "enabled" {
type = bool
default = true
description = "Deploy this instance. When false, only the PVC is kept (data preservation); deployment, service, ingress, http-api, and cronjob are not created. Flip back to true to bring the instance back."
}
variable "enable_http_api" {
type = bool
default = false
@ -44,7 +49,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "actualbudget-${var.name}-data-encrypted"
namespace = "actualbudget"
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -58,9 +63,17 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "actualbudget" {
count = var.enabled ? 1 : 0
metadata {
name = "actualbudget-${var.name}"
namespace = "actualbudget"
@ -127,6 +140,7 @@ resource "kubernetes_deployment" "actualbudget" {
}
resource "kubernetes_service" "actualbudget" {
count = var.enabled ? 1 : 0
metadata {
name = "budget-${var.name}"
namespace = "actualbudget"
@ -148,7 +162,12 @@ resource "kubernetes_service" "actualbudget" {
}
module "ingress" {
source = "../../../modules/kubernetes/ingress_factory"
count = var.enabled ? 1 : 0
source = "../../../modules/kubernetes/ingress_factory"
# auth = "app": Actual Budget enforces a server password + per-user login
# on its own sync API. Authentik forward-auth was 302-ing the mobile/web
# sync clients; Actual's own auth gates users.
auth = "app"
namespace = "actualbudget"
name = "budget-${var.name}"
tls_secret_name = var.tls_secret_name
@ -163,7 +182,7 @@ resource "random_string" "api-key" {
}
resource "kubernetes_deployment" "actualbudget-http-api" {
count = var.enable_http_api ? 1 : 0
count = var.enabled && var.enable_http_api ? 1 : 0
metadata {
name = "actualbudget-http-api-${var.name}"
namespace = "actualbudget"
@ -229,6 +248,7 @@ resource "kubernetes_deployment" "actualbudget-http-api" {
}
resource "kubernetes_service" "actualbudget-http-api" {
count = var.enabled && var.enable_http_api ? 1 : 0
metadata {
name = "budget-http-api-${var.name}"
namespace = "actualbudget"
@ -250,7 +270,7 @@ resource "kubernetes_service" "actualbudget-http-api" {
}
resource "kubernetes_cron_job_v1" "bank-sync" {
count = var.enable_bank_sync ? 1 : 0
count = var.enabled && var.enable_bank_sync ? 1 : 0
metadata {
name = "bank-sync-${var.name}"
namespace = "actualbudget"
@ -271,48 +291,93 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
spec {
container {
name = "bank-sync"
image = "curlimages/curl"
image = "alpine:3.20"
command = ["/bin/sh", "-c", <<-EOT
PUSHGATEWAY="http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/bank-sync-${var.name}"
set -u
apk add --no-cache curl jq >/dev/null 2>&1
USER_NAME='${var.name}'
SYNC_ID='${var.sync_id}'
API_KEY='${random_string.api-key.result}'
PW='${var.budget_encryption_password}'
PG="http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/bank-sync-$USER_NAME"
API="http://budget-http-api-$USER_NAME"
START=$(date +%s)
HTTP_CODE=$(curl -s -o /tmp/response.txt -w '%%{http_code}' \
-X POST --location \
'http://budget-http-api-${var.name}/v1/budgets/${var.sync_id}/accounts/banksync' \
--header 'accept: application/json' \
--header 'budget-encryption-password: ${var.budget_encryption_password}' \
--header 'x-api-key: ${random_string.api-key.result}')
# Enumerate active accounts: open + on-budget.
ACCOUNTS=$(curl -fsS "$API/v1/budgets/$SYNC_ID/accounts" \
-H "x-api-key: $API_KEY" \
-H "budget-encryption-password: $PW" \
| jq -c '.data[] | select(.closed == false and .offbudget == false) | {id, name}')
END=$(date +%s)
DURATION=$((END - START))
if [ "$HTTP_CODE" = "200" ]; then
SUCCESS=1
LAST_SUCCESS=$END
else
SUCCESS=0
echo "Bank sync failed with HTTP $HTTP_CODE:"
cat /tmp/response.txt
echo ""
if [ -z "$ACCOUNTS" ]; then
echo "ERROR: GET /accounts returned no eligible accounts; aborting"
exit 1
fi
# Pushgateway POST preserves metrics not in the payload, so on
# failure we omit bank_sync_last_success_timestamp to keep the
# prior success value this prevents BankSyncStale from firing
# alongside BankSyncFailing after a single failed run.
{
printf '# HELP bank_sync_success Whether the last bank sync succeeded (1=ok, 0=fail)\n'
printf '# TYPE bank_sync_success gauge\n'
printf 'bank_sync_success %s\n' "$SUCCESS"
printf '# HELP bank_sync_duration_seconds Duration of the last bank sync run\n'
printf '# TYPE bank_sync_duration_seconds gauge\n'
printf 'bank_sync_duration_seconds %s\n' "$DURATION"
if [ "$SUCCESS" = "1" ]; then
printf '# HELP bank_sync_last_success_timestamp Unix timestamp of the last successful sync\n'
printf '# TYPE bank_sync_last_success_timestamp gauge\n'
printf 'bank_sync_last_success_timestamp %s\n' "$LAST_SUCCESS"
: > /tmp/payload
rm -f /tmp/any_success
# Per-account sync. Each account has its own PSD2/GoCardless
# quota (4 successful pulls per 24h), so we treat them
# independently one rate-limited account doesn't mark the
# run as a failure.
echo "$ACCOUNTS" | while IFS= read -r ACCT; do
[ -z "$ACCT" ] && continue
ID=$(echo "$ACCT" | jq -r '.id')
NAME=$(echo "$ACCT" | jq -r '.name')
LABEL=$(echo "$NAME" | sed -E 's/[^a-zA-Z0-9]+/_/g')
HTTP_CODE=$(curl -s -o /tmp/r.txt -w '%%{http_code}' \
-X POST "$API/v1/budgets/$SYNC_ID/accounts/$ID/banksync" \
-H 'accept: application/json' \
-H "x-api-key: $API_KEY" \
-H "budget-encryption-password: $PW") || HTTP_CODE=0
NOW=$(date +%s)
if [ "$HTTP_CODE" = "200" ]; then
echo "OK account=$NAME"
printf 'bank_sync_account_success{account="%s"} 1\n' "$LABEL" >> /tmp/payload
printf 'bank_sync_account_last_success_timestamp{account="%s"} %s\n' "$LABEL" "$NOW" >> /tmp/payload
: > /tmp/any_success
else
echo "FAIL account=$NAME http=$HTTP_CODE body=$(cat /tmp/r.txt)"
printf 'bank_sync_account_success{account="%s"} 0\n' "$LABEL" >> /tmp/payload
fi
} | curl -s --data-binary @- "$PUSHGATEWAY"
done
END=$(date +%s)
DUR=$((END - START))
if [ -f /tmp/any_success ]; then
ANY=1
else
ANY=0
fi
# Pushgateway POST preserves prior values for label sets not
# in the payload, so per-account last_success_timestamp values
# for accounts that failed this run keep their prior good
# values that's what BankSyncAccountStale alerts on.
{
printf '# HELP bank_sync_account_success Per-account sync result (1=ok, 0=fail)\n'
printf '# TYPE bank_sync_account_success gauge\n'
printf '# HELP bank_sync_account_last_success_timestamp Per-account Unix timestamp of last successful sync\n'
printf '# TYPE bank_sync_account_last_success_timestamp gauge\n'
cat /tmp/payload
printf '# HELP bank_sync_success 1 if at least one account synced this run\n'
printf '# TYPE bank_sync_success gauge\n'
printf 'bank_sync_success %s\n' "$ANY"
printf '# HELP bank_sync_duration_seconds Total duration of the cron run\n'
printf '# TYPE bank_sync_duration_seconds gauge\n'
printf 'bank_sync_duration_seconds %s\n' "$DUR"
if [ "$ANY" = "1" ]; then
printf '# HELP bank_sync_last_success_timestamp Unix timestamp of the most recent successful sync of any account\n'
printf '# TYPE bank_sync_last_success_timestamp gauge\n'
printf 'bank_sync_last_success_timestamp %s\n' "$END"
fi
} | curl -fsS --data-binary @- "$PG"
EOT
]
}
@ -326,3 +391,24 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# State migration for the new `enabled` toggle (2026-05-13): adding
# count to these resources shifts their addresses to [0]. Without
# moved {}, Terraform would destroy+recreate. Existing http-api / bank-sync
# resources already had count, so no migration needed there.
moved {
from = kubernetes_deployment.actualbudget
to = kubernetes_deployment.actualbudget[0]
}
moved {
from = kubernetes_service.actualbudget
to = kubernetes_service.actualbudget[0]
}
moved {
from = kubernetes_service.actualbudget-http-api
to = kubernetes_service.actualbudget-http-api[0]
}
moved {
from = module.ingress
to = module.ingress[0]
}

View file

@ -57,6 +57,7 @@ resource "kubernetes_namespace" "actualbudget" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -120,6 +121,10 @@ module "anca" {
}
# https://budget-emo.viktorbarzin.me/
# Disabled 2026-05-13: Emo isn't using this instance. PVC is preserved so
# we can flip enabled back to true to bring the instance back as-was.
# The empty accounts list (vs. anca/viktor) was causing the daily bank-sync
# CronJob to fail and trigger BankSyncStale.
module "emo" {
source = "./factory"
name = "emo"
@ -128,16 +133,10 @@ module "emo" {
nfs_server = var.nfs_server
depends_on = [kubernetes_namespace.actualbudget]
tier = local.tiers.edge
enable_http_api = true
enable_bank_sync = true
enabled = false
enable_http_api = false
enable_bank_sync = false
budget_encryption_password = lookup(local.credentials["emo"], "password", null)
sync_id = lookup(local.credentials["emo"], "sync_id", null)
homepage_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Budget Emo"
"gethomepage.dev/description" = "Personal budget"
"gethomepage.dev/icon" = "actual-budget.png"
"gethomepage.dev/group" = "Finance & Personal"
"gethomepage.dev/pod-selector" = ""
}
homepage_annotations = {}
}

View file

@ -88,6 +88,7 @@ resource "kubernetes_namespace" "affine" {
name = "affine"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -155,7 +156,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "affine-data-encrypted"
namespace = kubernetes_namespace.affine.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -169,6 +170,13 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "affine" {
@ -324,8 +332,12 @@ resource "kubernetes_deployment" "affine" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -351,7 +363,11 @@ resource "kubernetes_service" "affine" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
source = "../../modules/kubernetes/ingress_factory"
# auth = "app": AFFiNE has its own workspace auth + bearer-token API
# used by desktop/mobile sync clients. Authentik forward-auth was 302-ing
# those API callers; AFFiNE's own auth gates users.
auth = "app"
dns_type = "non-proxied"
namespace = kubernetes_namespace.affine.metadata[0].name
name = "affine"

View file

@ -53,11 +53,130 @@ resource "authentik_provider_proxy" "catchall" {
# doesn't require an HCL edit.
authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
invalidation_flow = data.authentik_flow.default_provider_invalidation.id
# Cookie / proxysession TTL. Drives `Max-Age` on `authentik_proxy_*`
# cookies and the `expires` column in `authentik_providers_proxy_proxysession`.
# See note on the embedded outpost below bumping this requires an outpost
# pod restart for the gorilla session store to rebind.
access_token_validity = "weeks=4"
lifecycle {
ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth, access_token_validity]
ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth]
}
}
# -----------------------------------------------------------------------------
# Embedded outpost record. Adopted into Terraform 2026-05-10 as part of the
# postgres-session-backend fix:
# - `managed` is set server-side to `goauthentik.io/outposts/embedded` so
# the outpost binary's `IsEmbedded()` check returns true it loads the
# PostgreSQL session backend (PR #16628). The Terraform provider does
# NOT expose `managed` in the schema, so the field is preserved across
# applies (TF only writes fields it knows about).
# - kubernetes_json_patches.deployment carries:
# * dshm 2Gi tmpfs (covers the 2026-04-18 ENOSPC class of issues)
# * resources requests/limits
# * `app.kubernetes.io/component=server` pod label so the K8s service
# selector lights up endpoints (works around goauthentik 2026.2.2
# service.py:52 selector mismatch on standalone embedded outposts).
# * AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME} envFrom the
# shared `goauthentik` Secret so the postgres session backend has
# credentials to connect to the dbaas cluster.
# - kubernetes_json_patches.service replaces the controller-set selector
# (which incorrectly targets `app.kubernetes.io/name=authentik`, i.e.
# the goauthentik-server pods) with the outpost's own labels.
# -----------------------------------------------------------------------------
resource "authentik_outpost" "embedded" {
name = "authentik Embedded Outpost"
type = "proxy"
protocol_providers = [authentik_provider_proxy.catchall.id]
service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
config = jsonencode({
log_level = "trace"
docker_labels = null
authentik_host = "https://authentik.viktorbarzin.me/"
docker_network = null
container_image = null
docker_map_ports = true
refresh_interval = "minutes=5"
kubernetes_replicas = 1
kubernetes_namespace = "authentik"
authentik_host_browser = ""
object_naming_template = "ak-outpost-%(name)s"
authentik_host_insecure = false
kubernetes_service_type = "ClusterIP"
kubernetes_ingress_path_type = null
kubernetes_image_pull_secrets = []
kubernetes_ingress_class_name = null
kubernetes_disabled_components = []
kubernetes_ingress_annotations = {}
kubernetes_ingress_secret_name = "authentik-outpost-tls"
kubernetes_httproute_annotations = {}
kubernetes_httproute_parent_refs = []
kubernetes_json_patches = {
deployment = [
{
op = "add"
path = "/spec/template/spec/volumes"
value = [{ name = "dshm", emptyDir = { medium = "Memory", sizeLimit = "2Gi" } }]
},
{
op = "add"
path = "/spec/template/spec/containers/0/volumeMounts"
value = [{ name = "dshm", mountPath = "/dev/shm" }]
},
{
op = "add"
path = "/spec/template/spec/containers/0/resources"
value = { limits = { memory = "2560Mi" }, requests = { cpu = "100m", memory = "128Mi" } }
},
{
op = "add"
path = "/spec/template/metadata/labels/app.kubernetes.io~1component"
value = "server"
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__HOST", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__HOST" } } }
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__PORT", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__PORT" } } }
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__USER", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__USER" } } }
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__PASSWORD", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__PASSWORD" } } }
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__NAME", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__NAME" } } }
},
]
service = [
{
op = "replace"
path = "/spec/selector"
value = {
"app.kubernetes.io/managed-by" = "goauthentik.io"
"app.kubernetes.io/name" = "authentik-outpost-proxy"
"goauthentik.io/outpost-name" = "authentik-embedded-outpost"
"goauthentik.io/outpost-type" = "proxy"
"goauthentik.io/outpost-uuid" = "0eecac0797c7443c892505f2f4fe3e47"
}
},
]
}
})
}
# -----------------------------------------------------------------------------
# Default User Login stage bound to default-authentication-flow.
# Adopted into Terraform 2026-05-01 to set session_duration=weeks=4 so users

217
stacks/authentik/guest.tf Normal file
View file

@ -0,0 +1,217 @@
# =============================================================================
# Public Guest user + auto-login flow + public proxy provider + dedicated
# outpost.
#
# Backs the `auth = "public"` tier of the ingress_factory module. Architecture:
#
# * `guest` user (in `Public Guests` group, NOT `Allow Login Users`).
# * `public-auto-login` flow: anonymous user enters expression policy sets
# `pending_user = guest` user_login stage logs them in. No UI shown.
# * `Provider for Public` proxy provider (forward_domain, cookie_domain
# `viktorbarzin.me`) with `authentication_flow = public-auto-login`.
# * Dedicated `Public Outpost` Deployment+Service (managed by Authentik's
# K8s controller). Bound to the public provider only there is no other
# provider claiming `viktorbarzin.me` on this outpost, so every request
# it sees runs the public flow regardless of host.
# * `public-auth.viktorbarzin.me` ingress exposes the public outpost's
# `/outpost.goauthentik.io/*` path so OAuth callbacks land on it (the
# embedded outpost doesn't know about the public provider, so callbacks
# can't go to authentik.viktorbarzin.me).
#
# Traffic flow for a stranger hitting an `auth = "public"` ingress:
# 1. Traefik's `authentik-forward-auth-public` middleware public outpost.
# 2. No session cookie 302 to `https://authentik.viktorbarzin.me/...`
# with redirect_uri = `https://public-auth.viktorbarzin.me/.../callback`.
# 3. Authentik runs `public-auto-login` (no UI), issues session.
# 4. 302 public-auth.viktorbarzin.me callback public outpost validates
# state and sets `authentik_proxy_<public-hash>` cookie on `viktorbarzin.me`.
# 5. 302 original URL Traefik retries forward_auth public outpost
# validates cookie 200 with `X-authentik-username: guest`.
#
# A user already logged into anything else on viktorbarzin.me (the catchall)
# still gets recognised here Authentik prefers an existing session and the
# public provider's authorization_flow auto-approves anyone, so their real
# username shows up in `X-authentik-username`. Strangers get `guest`.
# =============================================================================
resource "authentik_user" "guest" {
username = "guest"
name = "Guest"
path = "users/system"
is_active = true
type = "internal"
# No password set: the user_login stage in `public_auto_login` logs the
# request in via pending_user pre-set by an expression policy. There is no
# UI path for `guest` to authenticate via password the user is also kept
# out of `Allow Login Users`, so even a leaked password cannot be used to
# complete the standard login flow.
lifecycle {
ignore_changes = [attributes, email]
}
}
resource "authentik_group" "public_guests" {
name = "Public Guests"
users = [authentik_user.guest.id]
# NOT a child of "Allow Login Users" keeps a hypothetical leaked password
# from promoting `guest` to a real user via the standard login flow.
}
# Pre-stage policy: sets pending_user = guest before user_login stage runs.
# Mutates `request.context["flow_plan"].context["pending_user"]` the
# canonical pattern (the user_login stage reads pending_user from
# `flow_plan.context`). Direct `request.context["pending_user"]` mutations
# don't propagate, since policy request.context is not the same dict as
# flow_plan.context.
resource "authentik_policy_expression" "set_guest_user" {
name = "set-public-guest-user"
expression = trimspace(<<-EOT
request.context["flow_plan"].context["pending_user"] = ak_user_by(username="guest")
return True
EOT
)
}
# Dedicated user_login stage for the public flow. 4-week session matches the
# default authentication stage; means a stranger only goes through the auto-
# bind once per ~month per device.
resource "authentik_stage_user_login" "public_guest_login" {
name = "public-guest-login"
session_duration = "weeks=4"
}
# `authentication = "none"` lets anonymous requests run the flow.
# `designation = "authentication"` because the flow's outcome is "request is
# now authenticated as guest"; the public proxy provider's authorization_flow
# then runs implicit consent.
resource "authentik_flow" "public_auto_login" {
name = "Public Auto Login"
slug = "public-auto-login"
title = "Public Guest Login"
designation = "authentication"
authentication = "none"
}
resource "authentik_flow_stage_binding" "public_login" {
target = authentik_flow.public_auto_login.uuid
stage = authentik_stage_user_login.public_guest_login.id
order = 10
# Re-evaluate at stage runtime: at plan time, flow_plan may not yet be in
# request.context, so the expression policy's mutation would no-op. With
# evaluate_on_plan=false + re_evaluate_policies=true, the policy fires
# right before the stage runs, when flow_plan is fully populated.
evaluate_on_plan = false
re_evaluate_policies = true
}
resource "authentik_policy_binding" "set_guest_before_login" {
target = authentik_flow_stage_binding.public_login.id
policy = authentik_policy_expression.set_guest_user.id
order = 0
}
# -----------------------------------------------------------------------------
# Public proxy provider forward_domain so it claims any host on
# viktorbarzin.me. Used only on the dedicated `public` outpost (where it is
# the sole bound provider), so there's no dispatch ambiguity with the
# catchall (which lives on the embedded outpost).
# -----------------------------------------------------------------------------
resource "authentik_provider_proxy" "public" {
name = "Provider for Public"
mode = "forward_domain"
external_host = "https://public-auth.viktorbarzin.me"
cookie_domain = "viktorbarzin.me"
# When a request hits with NO Authentik session, this flow runs first and
# auto-binds the request to the `guest` user (no UI prompt).
authentication_flow = authentik_flow.public_auto_login.uuid
# Once authenticated (or already authenticated), implicit-consent auto-approves.
authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
invalidation_flow = data.authentik_flow.default_provider_invalidation.id
access_token_validity = "weeks=4"
lifecycle {
ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth]
}
}
resource "authentik_application" "public" {
name = "Public"
slug = "public"
protocol_provider = authentik_provider_proxy.public.id
# No bound policies. policy_engine_mode = "any" + zero bindings = everyone
# passes (the auto-login flow has already established `guest` as the user).
policy_engine_mode = "any"
lifecycle {
ignore_changes = [meta_description, meta_launch_url, meta_icon, group, backchannel_providers, open_in_new_tab]
}
}
# Dedicated outpost so the public provider can claim viktorbarzin.me without
# colliding with the catchall (which already claims viktorbarzin.me on the
# embedded outpost). Authentik's K8s controller deploys this as
# `ak-outpost-public` (Deployment + Service in the `authentik` namespace).
resource "authentik_outpost" "public" {
name = "public"
type = "proxy"
protocol_providers = [authentik_provider_proxy.public.id]
service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
config = jsonencode({
log_level = "info"
docker_labels = null
authentik_host = "https://authentik.viktorbarzin.me/"
docker_network = null
container_image = null
docker_map_ports = true
refresh_interval = "minutes=5"
kubernetes_replicas = 1
kubernetes_namespace = "authentik"
authentik_host_browser = ""
object_naming_template = "ak-outpost-%(name)s"
authentik_host_insecure = false
kubernetes_service_type = "ClusterIP"
kubernetes_ingress_path_type = null
kubernetes_image_pull_secrets = []
kubernetes_ingress_class_name = null
kubernetes_disabled_components = []
kubernetes_ingress_annotations = {}
kubernetes_ingress_secret_name = "authentik-outpost-tls"
kubernetes_httproute_annotations = {}
kubernetes_httproute_parent_refs = []
kubernetes_json_patches = {
deployment = [
{
op = "add"
path = "/spec/template/spec/containers/0/resources"
value = { limits = { memory = "256Mi" }, requests = { cpu = "10m", memory = "64Mi" } }
},
]
}
})
}
# Ingress for `public-auth.viktorbarzin.me` exposes the public outpost's
# /outpost.goauthentik.io/* path so OAuth callbacks land on it. The
# `Provider for Public` external_host points here, so all redirect_uris in
# the OAuth flow resolve to this hostname.
module "ingress_public_outpost" {
source = "../../modules/kubernetes/ingress_factory"
# Public-tier outpost callback the OAuth flow's redirect_uris all resolve
# here; gating it with forward-auth would loop the public outpost onto itself.
# auth = "none": Public outpost callback path for OAuth flow; protecting with forward-auth creates circular dependency.
auth = "none"
namespace = "authentik"
name = "public-outpost"
host = "public-auth"
service_name = "ak-outpost-public"
port = 9000
ingress_path = ["/outpost.goauthentik.io"]
tls_secret_name = var.tls_secret_name
dns_type = "proxied"
anti_ai_scraping = false
exclude_crowdsec = true
homepage_enabled = false
depends_on = [authentik_outpost.public]
}

View file

@ -29,6 +29,7 @@ resource "kubernetes_namespace" "authentik" {
labels = {
tier = var.tier
"resource-governance/custom-quota" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -70,8 +71,12 @@ resource "helm_release" "authentik" {
module "ingress" {
source = "../../../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
source = "../../../../modules/kubernetes/ingress_factory"
# Authentik's own UI cannot be gated by Authentik forward-auth that
# creates a chicken-and-egg loop (users can't reach the login page).
# auth = "none": Authentik UI cannot be gated by Authentik forward-auth (chicken-and-egg loop prevents login).
auth = "none"
dns_type = "proxied"
namespace = kubernetes_namespace.authentik.metadata[0].name
name = "authentik"
service_name = "goauthentik-server"
@ -91,7 +96,11 @@ module "ingress" {
}
module "ingress-outpost" {
source = "../../../../modules/kubernetes/ingress_factory"
source = "../../../../modules/kubernetes/ingress_factory"
# Authentik forward-auth outpost callback path protecting this with
# forward-auth would loop the outpost back onto itself.
# auth = "none": Authentik outpost callback path for forward-auth flow; protecting with forward-auth creates circular dependency.
auth = "none"
namespace = kubernetes_namespace.authentik.metadata[0].name
name = "authentik-outpost"
host = "authentik"

View file

@ -66,9 +66,13 @@ resource "kubernetes_deployment" "pgbouncer" {
}
}
container {
name = "pgbouncer"
image = "edoburu/pgbouncer:latest"
image_pull_policy = "IfNotPresent"
name = "pgbouncer"
image = "edoburu/pgbouncer:latest"
# `:latest` tag keep `Always` so pod restarts pick up upstream
# updates. The previous `IfNotPresent` value was declared at module
# creation but the live cluster has reconciled to `Always` (likely
# via a Helm/operator default). Match reality to drop the drift.
image_pull_policy = "Always"
port {
container_port = 6432

View file

@ -78,7 +78,10 @@ global:
addPrometheusAnnotations: true
worker:
replicas: 3
# 2 replicas: workers handle background tasks (LDAP sync, email,
# certificate renewal) — no user-facing traffic, so 2-of-3 isn't
# needed for availability. Drop saves ~100m sustained CPU.
replicas: 2
# Same unauthenticated_age cap as server — both the server (Django session
# middleware) and worker (cleanup tasks) need to see the value.
env:

View file

@ -29,6 +29,7 @@ resource "kubernetes_namespace" "beads" {
name = "beads-server"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -43,7 +44,7 @@ resource "kubernetes_persistent_volume_claim" "dolt_data" {
name = "dolt-data"
namespace = kubernetes_namespace.beads.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
@ -55,6 +56,13 @@ resource "kubernetes_persistent_volume_claim" "dolt_data" {
requests = { storage = "2Gi" }
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_config_map" "dolt_init" {
@ -67,6 +75,23 @@ resource "kubernetes_config_map" "dolt_init" {
CREATE USER IF NOT EXISTS 'beads'@'%' IDENTIFIED BY '';
GRANT ALL PRIVILEGES ON *.* TO 'beads'@'%' WITH GRANT OPTION;
EOT
"02-create-presence-table.sql" = <<-EOT
CREATE DATABASE IF NOT EXISTS beads;
USE beads;
CREATE TABLE IF NOT EXISTS presence_claims (
session_id VARCHAR(128) NOT NULL,
resource_label VARCHAR(255) NOT NULL,
purpose TEXT NOT NULL,
claimed_at DATETIME(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
expires_at DATETIME(3) NOT NULL,
host VARCHAR(128) NOT NULL,
user VARCHAR(64) NOT NULL,
agent_name VARCHAR(64) DEFAULT 'claude-code',
PRIMARY KEY (session_id, resource_label),
INDEX idx_resource (resource_label),
INDEX idx_expires (expires_at)
);
EOT
}
}
@ -78,6 +103,16 @@ resource "kubernetes_deployment" "dolt" {
app = "dolt"
tier = local.tiers.aux
}
annotations = {
# Keel is namespace-enrolled (keel.sh/enrolled=true on the namespace),
# but this deployment opts OUT of auto-updates: dolthub/dolt-sql-server:latest
# currently resolves to a broken 0.50.10 build. Pinned image lives in the
# container spec below. Codified here so TF state matches live, no drift.
"keel.sh/policy" = "never"
"keel.sh/match-tag" = "true"
"keel.sh/trigger" = "poll"
"keel.sh/pollSchedule" = "@every 1h"
}
}
spec {
replicas = 1
@ -98,7 +133,12 @@ resource "kubernetes_deployment" "dolt" {
spec {
container {
name = "dolt"
image = "dolthub/dolt-sql-server:latest"
# Pinned to 2.0.3 :latest currently resolves to 0.50.10 on dolthub
# (different versioning stream) whose docker-entrypoint.sh references
# an undefined docker_process_sql function and crash-loops on every
# init script in /docker-entrypoint-initdb.d. Keel can upgrade this
# tag in-cluster; the lifecycle.ignore_changes below preserves that.
image = "dolthub/dolt-sql-server:2.0.3"
port {
name = "mysql"
@ -170,7 +210,59 @@ resource "kubernetes_deployment" "dolt" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
# Keel annotations are codified in metadata.annotations above (policy=never
# opts this deployment out of auto-updates see the comment there).
]
}
}
# One-shot Job to apply the presence_claims schema to the running Dolt server.
# The dolt_init ConfigMap only fires on fresh PVCs; since Dolt already exists
# with persistent state, this Job is the only path to update the live schema.
# The job name is hashed off the SQL content so a new Job runs whenever the
# schema changes; the SQL itself is idempotent (CREATE ... IF NOT EXISTS).
resource "kubernetes_job" "presence_schema_migrate" {
metadata {
name = "presence-schema-${substr(sha256(kubernetes_config_map.dolt_init.data["02-create-presence-table.sql"]), 0, 8)}"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
backoff_limit = 3
template {
metadata {}
spec {
restart_policy = "OnFailure"
container {
name = "migrate"
image = "mysql:8.4"
command = ["sh", "-c"]
args = [
"mysql -h dolt.beads-server.svc.cluster.local -P 3306 -u root < /sql/02-create-presence-table.sql"
]
volume_mount {
name = "sql"
mount_path = "/sql"
}
}
volume {
name = "sql"
config_map {
name = kubernetes_config_map.dolt_init.metadata[0].name
}
}
}
}
}
wait_for_completion = true
timeouts {
create = "5m"
}
depends_on = [kubernetes_deployment.dolt]
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
]
}
}
@ -374,7 +466,11 @@ resource "kubernetes_deployment" "workbench" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
]
}
}
@ -416,7 +512,8 @@ module "ingress" {
namespace = kubernetes_namespace.beads.metadata[0].name
name = "dolt-workbench"
tls_secret_name = var.tls_secret_name
protected = false
# auth = "none": Dolt Workbench is client-side encrypted task database; no backend user auth required; Anubis PoW fronts ingress.
auth = "none"
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
@ -566,7 +663,7 @@ resource "kubernetes_deployment" "beadboard" {
}
container {
name = "beadboard"
name = "beadboard"
# Phase 3 cutover 2026-05-07 Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/beadboard:${var.beadboard_image_tag}"
@ -646,7 +743,11 @@ resource "kubernetes_deployment" "beadboard" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
]
}
}
@ -677,7 +778,7 @@ module "beadboard_ingress" {
namespace = kubernetes_namespace.beads.metadata[0].name
name = "beadboard"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"

View file

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "blog"
}
}

View file

@ -10,6 +10,7 @@ resource "kubernetes_namespace" "website" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -76,8 +77,12 @@ resource "kubernetes_deployment" "blog" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -116,23 +121,25 @@ resource "kubernetes_service" "blog" {
# tiny PoW (~250ms desktop), get a 30-day cookie, and pass through. Replaces
# the global ai-bot-block forwardAuth for this site.
module "anubis" {
source = "../../modules/kubernetes/anubis_instance"
name = "blog"
namespace = kubernetes_namespace.website.metadata[0].name
target_url = "http://${kubernetes_service.blog.metadata[0].name}.${kubernetes_namespace.website.metadata[0].name}.svc.cluster.local"
source = "../../modules/kubernetes/anubis_instance"
name = "blog"
namespace = kubernetes_namespace.website.metadata[0].name
target_url = "http://${kubernetes_service.blog.metadata[0].name}.${kubernetes_namespace.website.metadata[0].name}.svc.cluster.local"
shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/10"
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
namespace = kubernetes_namespace.website.metadata[0].name
name = "blog"
service_name = module.anubis.service_name
port = module.anubis.service_port
extra_middlewares = ["traefik-x402@kubernetescrd"]
full_host = "viktorbarzin.me"
dns_type = "proxied"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false # Anubis is the gatekeeper now drop the redundant ai-bot-block forwardAuth.
full_host = "viktorbarzin.me"
dns_type = "proxied"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false # Anubis is the gatekeeper now drop the redundant ai-bot-block forwardAuth.
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Blog"
@ -145,12 +152,24 @@ module "ingress" {
module "ingress-www" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
namespace = kubernetes_namespace.website.metadata[0].name
name = "blog-www"
service_name = module.anubis.service_name
port = module.anubis.service_port
extra_middlewares = ["traefik-x402@kubernetescrd"]
full_host = "www.viktorbarzin.me"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
full_host = "www.viktorbarzin.me"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -9,6 +9,10 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
}
}

View file

@ -12,6 +12,7 @@ resource "kubernetes_namespace" "broker_sync" {
labels = {
"istio-injection" = "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -61,7 +62,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "broker-sync-data-encrypted"
namespace = kubernetes_namespace.broker_sync.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -73,6 +74,13 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
requests = { storage = "1Gi" }
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
locals {
@ -660,8 +668,13 @@ resource "kubernetes_cron_job_v1" "fidelity" {
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
# Suspended until the broker-sync image ships with Playwright + Chromium.
suspend = true
# Unsuspended 2026-05-17 after the delta gains-offset emission landed
# (broker-sync @98c4729). Manual trigger:
# kubectl -n broker-sync create job fid-now \
# --from=cronjob/broker-sync-fidelity
# NB: storage_state expires every 30-90 days see code-r9n for the
# chrome-service-driven re-seed runbook.
suspend = false
job_template {
metadata {}
spec {

View file

@ -22,6 +22,9 @@ resource "kubernetes_namespace" "calico_system" {
name = "calico-system"
labels = {
name = "calico-system"
# calico-system namespace is managed by tigera-operator auto-update is
# incompatible (operator reverts DaemonSet image from its Installation CR).
# "keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -65,3 +68,66 @@ resource "kubernetes_namespace" "tigera_operator" {
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Wave 1 W1.6 (beads code-8ywc): observation phase via Calico GlobalNetworkPolicy
# `action: Log`. This is the supported primitive on Calico OSS v3.26 the
# Calico-Enterprise FelixConfiguration.flowLogsFileEnabled approach is NOT
# accepted by the OSS CRD (verified 2026-05-19: "strict decoding error").
#
# How it works:
# - GNP selects pods by namespaceSelector
# - egress rule action=Log writes an iptables NFLOG entry that lands in the
# kernel log / journald with prefix "calico-packet:" on each node
# - Alloy DaemonSet already ships node-journal to Loki (job=node-journal)
# - LogQL query: {job="node-journal"} |= "calico-packet" surfaces egress flows
# - After ~1 week of observation, build the empirical per-namespace egress
# allowlist; then flip the same GNP to [Allow specific dests, Deny rest]
#
# Started with `recruiter-responder` as the pilot on 2026-05-19; expanded
# 2026-05-19 to all tier 3+4 namespaces (per locked plan tier 3-edge has
# 17 ns, tier 4-aux has 65 ns, all use Calico's WorkloadEndpoint policy
# path). Tier 0/1/2 stay out of observation in wave 1 (cluster infra +
# GPU workloads, deferred per the plan).
#
# `apply_only = true` on the kubectl_manifest means renaming the TF resource
# does NOT destroy the old GNP via TF we kubectl delete the legacy pilot
# GNP after this applies to clean it up. (Tracked manually.)
resource "kubectl_manifest" "wave1_egress_observe_tier34" {
yaml_body = yamlencode({
apiVersion = "projectcalico.org/v3"
kind = "GlobalNetworkPolicy"
metadata = {
name = "wave1-egress-observe-tier34"
annotations = {
"security.viktorbarzin.me/wave" = "1"
"security.viktorbarzin.me/purpose" = "observe-then-enforce egress for tier 3-edge + 4-aux"
}
}
spec = {
order = 2000
selector = "all()"
namespaceSelector = "tier in {\"3-edge\", \"4-aux\"}"
types = ["Egress"]
egress = [
# Rule 1: log every egress packet (LOG target writes to kernel/journal,
# alloy ships to Loki with job=node-journal,transport=kernel).
# LogQL: {job="node-journal"} |~ "calico-packet"
{ action = "Log" },
# Rule 2: allow everything (observation must NOT break workloads).
{ action = "Allow" },
]
}
})
apply_only = true
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -9,6 +9,7 @@ resource "kubernetes_namespace" "changedetection" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -68,7 +69,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "changedetection-data-proxmox"
namespace = kubernetes_namespace.changedetection.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "8Gi"
}
@ -82,6 +83,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "changedetection" {
@ -187,8 +195,13 @@ resource "kubernetes_deployment" "changedetection" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -218,7 +231,7 @@ module "ingress" {
namespace = kubernetes_namespace.changedetection.metadata[0].name
name = "changedetection"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Changedetection"

View file

@ -24,6 +24,7 @@ resource "kubernetes_namespace" "chrome_service" {
"istio-injection" = "disabled"
tier = local.tiers.aux
"chrome-service.viktorbarzin.me/server" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -74,7 +75,7 @@ resource "kubernetes_persistent_volume_claim" "profile_encrypted" {
name = "chrome-service-profile-encrypted"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
@ -88,6 +89,13 @@ resource "kubernetes_persistent_volume_claim" "profile_encrypted" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
# --- NFS backup target ---
@ -107,6 +115,12 @@ resource "kubernetes_deployment" "chrome_service" {
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = merge(local.labels, {
tier = local.tiers.aux
# Deliberate pin: chrome-service's playwright image MUST match
# the playwright Python version in f1-stream (see local.image
# comment above). Opt out of Keel auto-update via this label
# the inject-keel-annotations ClusterPolicy excludes workloads
# selector-matching keel.sh/policy=never.
"keel.sh/policy" = "never"
})
annotations = {
"reloader.stakater.com/auto" = "true"
@ -304,8 +318,12 @@ resource "kubernetes_deployment" "chrome_service" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -354,7 +372,7 @@ module "ingress" {
namespace = kubernetes_namespace.chrome_service.metadata[0].name
name = "chrome"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
# noVNC defaults to /vnc.html auto-redirect / there.
ingress_path = ["/"]
extra_annotations = {

View file

@ -10,6 +10,7 @@ resource "kubernetes_namespace" "city-guesser" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -67,8 +68,13 @@ resource "kubernetes_deployment" "city-guesser" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -99,7 +105,7 @@ module "ingress" {
namespace = "city-guesser"
name = "city-guesser"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "City Guesser"

View file

@ -12,7 +12,7 @@ locals {
namespace = "claude-agent"
# Phase 3 cutover 2026-05-07 see infra/docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md.
image = "forgejo.viktorbarzin.me/viktor/claude-agent-service"
image_tag = "2fd7670d"
image_tag = "191ed5dd"
labels = {
app = "claude-agent-service"
}
@ -191,27 +191,25 @@ resource "kubernetes_cluster_role_binding" "claude_agent" {
}
# --- Storage ---
resource "kubernetes_persistent_volume_claim" "workspace" {
wait_until_bound = false
metadata {
name = "claude-agent-workspace-encrypted"
namespace = kubernetes_namespace.claude_agent.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "20Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "10Gi"
}
}
}
#
# The `workspace` volume in the deployment is intentionally emptyDir agent
# jobs do fresh git clones each run, so a per-pod scratch dir on node disk
# is faster and isolated. The 10Gi `claude-agent-workspace-encrypted` PVC
# that previously sat next to this comment was created but never wired
# into the deployment (sat idle from 2026-04-15 to 2026-05-11).
#
# For cases where the agent DOES need to persist state across pod restarts
# (caches, ad-hoc outputs, anything that should survive a pod reschedule),
# `module.persistent` below provides a 5Gi NFS-backed RWX volume mounted
# at /persistent. RWX so all 3 replicas can read/write the same dir;
# sequential job mutex in the service prevents concurrent writes.
module "persistent" {
source = "../../modules/kubernetes/nfs_volume"
name = "claude-agent-persistent"
namespace = kubernetes_namespace.claude_agent.metadata[0].name
nfs_server = "192.168.1.127"
nfs_path = "/srv/nfs/claude-agent-persistent"
storage = "5Gi"
}
# --- Deployment ---
@ -251,11 +249,15 @@ resource "kubernetes_deployment" "claude_agent" {
fs_group = 1000
}
# Fix workspace ownership (PVC may have root-owned files from prior run)
# Fix workspace ownership. Kubelet creates the Dockerfile WORKDIR
# (/workspace/infra) inside the emptyDir as root:gid=fsGroup with
# the setgid bit uid 1000 can't write into it without explicit
# chown + chmod. Pre-create so the path is guaranteed, then chown
# recursively and chmod the infra subdir for safety.
init_container {
name = "fix-perms"
image = "busybox:1.37"
command = ["sh", "-c", "chown -R 1000:1000 /workspace"]
command = ["sh", "-c", "mkdir -p /workspace/infra /persistent && chown -R 1000:1000 /workspace /persistent && chmod 0775 /workspace/infra /persistent"]
security_context {
run_as_user = 0
}
@ -263,6 +265,10 @@ resource "kubernetes_deployment" "claude_agent" {
name = "workspace"
mount_path = "/workspace"
}
volume_mount {
name = "persistent"
mount_path = "/persistent"
}
resources {
requests = {
memory = "32Mi"
@ -368,6 +374,7 @@ resource "kubernetes_deployment" "claude_agent" {
mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
cp /usr/share/agent-seed/beads-metadata.json /workspace/.beads/metadata.json
cp /usr/share/agent-seed/beads-task-runner.md /home/agent/.claude/agents/beads-task-runner.md
cp /usr/share/agent-seed/recruiter-triage.md /home/agent/.claude/agents/recruiter-triage.md
EOT
]
@ -431,6 +438,10 @@ resource "kubernetes_deployment" "claude_agent" {
name = "workspace"
mount_path = "/workspace"
}
volume_mount {
name = "persistent"
mount_path = "/persistent"
}
volume_mount {
name = "sops-age-key"
mount_path = "/home/agent/.config/sops/age"
@ -453,8 +464,16 @@ resource "kubernetes_deployment" "claude_agent" {
volume {
name = "workspace"
# Per-pod ephemeral scratch agent does fresh git clones each
# job, so node-disk emptyDir is faster than a network-backed PVC
# and avoids RWO contention across the 3 replicas.
empty_dir {}
}
volume {
name = "persistent"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.workspace.metadata[0].name
claim_name = module.persistent.claim_name
}
}

View file

@ -0,0 +1,18 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
dependency "vault" {
config_path = "../vault"
skip_outputs = true
}
dependency "external-secrets" {
config_path = "../external-secrets"
skip_outputs = true
}

View file

@ -6,6 +6,7 @@ variable "postgresql_host" { type = string }
variable "claude_memory_db_password" {
type = string
sensitive = true
default = "" # falls back to Vault `secret/claude-memory.db_password` below
}
data "vault_kv_secret_v2" "secrets" {
@ -18,6 +19,7 @@ resource "kubernetes_namespace" "claude-memory" {
name = "claude-memory"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -112,11 +114,13 @@ resource "kubernetes_job" "db_init" {
"sh", "-c",
<<-EOT
set -e
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -tc "SELECT 1 FROM pg_roles WHERE rolname='claude_memory'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "CREATE ROLE claude_memory WITH LOGIN PASSWORD '${var.claude_memory_db_password}'"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -tc "SELECT 1 FROM pg_database WHERE datname='claude_memory'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "CREATE DATABASE claude_memory OWNER claude_memory"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "GRANT ALL PRIVILEGES ON DATABASE claude_memory TO claude_memory"
# -d postgres: psql defaults database name to username; root user
# doesn't have a root-named database, so be explicit.
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='claude_memory'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE claude_memory WITH LOGIN PASSWORD '${coalesce(var.claude_memory_db_password, data.vault_kv_secret_v2.secrets.data["db_password"])}'"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='claude_memory'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE claude_memory OWNER claude_memory"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE claude_memory TO claude_memory"
echo "Database init complete"
EOT
]
@ -246,6 +250,9 @@ resource "kubernetes_deployment" "claude-memory" {
ignore_changes = [
spec[0].template[0].spec[0].container[0].image,
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -274,7 +281,11 @@ resource "kubernetes_service" "claude-memory" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
source = "../../modules/kubernetes/ingress_factory"
# MCP server called by Claude Code (and other tools/agents) via app-layer
# bearer-token auth; forward-auth would break programmatic clients.
# auth = "none": MCP server called by Claude Code via bearer-token auth; forward-auth would break programmatic clients.
auth = "none"
dns_type = "proxied"
namespace = kubernetes_namespace.claude-memory.metadata[0].name
name = "claude-memory"

View file

@ -50,6 +50,22 @@ locals {
}
}
# Zone-level Bot Management. ai_bots_protection was "block" CF returned
# 403 to declared AI bot UAs at the edge, so the in-cluster x402 gateway
# never got a chance to issue HTTP 402 with a payment offer. Flipped to
# "disabled" so AI bots reach Traefik x402, which returns 402 with the
# wallet address. Generic Bot Fight Mode + crawler protection stay on.
# (import {} stanza for adoption lives in the root stack TF restriction.)
resource "cloudflare_bot_management" "zone" {
zone_id = var.cloudflare_zone_id
enable_js = true
fight_mode = true
ai_bots_protection = "disabled"
# crawler_protection / is_robots_txt_managed are settable only via newer
# provider versions; they retain whatever the API currently has
# (crawler_protection=enabled, is_robots_txt_managed=true).
}
resource "cloudflare_zero_trust_tunnel_cloudflared_config" "sof" {
account_id = var.cloudflare_account_id
tunnel_id = var.cloudflare_tunnel_id
@ -152,57 +168,57 @@ resource "cloudflare_record" "mail_spf" {
}
resource "cloudflare_record" "mail_domainkey_rspamd" {
content = "\"v=DKIM1; h=sha256; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAs9XHeFBKhUAEJSikXx+P49Q3nEBbnaSpn6h/9TqIhKaZWSVa2uGUGYQieNdon7DEJZ0VFo0Tvm3/UFsy2qF7ZmF+E/+N8EmkcPrMlxgJT281dpk5DxrZ+kbzw/DosfHH71K6vCLB4rSexzxJHaAx0AUddI3bFUJGjMgCXXCMZF+p8YCx+DDGPIXz2FOTtlJlR7aeZ2xXavwE/lBfI3MLnsq7X+GhPjQEax070nndOdZI0S8HpZkVxdGWl1N2Ec6LukYm2RiUkEMMQHSYX7WF3JBc+CGqUyd706Iy/5oeC3UGwZSM2uLkrp8YBjmw/h1rAeyv/ITt6ZXraP/cIMRiVQIDAQAB\""
name = "mail._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
content = "\"v=DKIM1; h=sha256; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAs9XHeFBKhUAEJSikXx+P49Q3nEBbnaSpn6h/9TqIhKaZWSVa2uGUGYQieNdon7DEJZ0VFo0Tvm3/UFsy2qF7ZmF+E/+N8EmkcPrMlxgJT281dpk5DxrZ+kbzw/DosfHH71K6vCLB4rSexzxJHaAx0AUddI3bFUJGjMgCXXCMZF+p8YCx+DDGPIXz2FOTtlJlR7aeZ2xXavwE/lBfI3MLnsq7X+GhPjQEax070nndOdZI0S8HpZkVxdGWl1N2Ec6LukYm2RiUkEMMQHSYX7WF3JBc+CGqUyd706Iy/5oeC3UGwZSM2uLkrp8YBjmw/h1rAeyv/ITt6ZXraP/cIMRiVQIDAQAB\""
name = "mail._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "brevo_domainkey1" {
content = "b1.viktorbarzin-me.dkim.brevo.com."
name = "brevo1._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "CNAME"
zone_id = var.cloudflare_zone_id
content = "b1.viktorbarzin-me.dkim.brevo.com."
name = "brevo1._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "CNAME"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "brevo_domainkey2" {
content = "b2.viktorbarzin-me.dkim.brevo.com."
name = "brevo2._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "CNAME"
zone_id = var.cloudflare_zone_id
content = "b2.viktorbarzin-me.dkim.brevo.com."
name = "brevo2._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "CNAME"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "brevo_code" {
content = "\"brevo-code:a6ef1dd91b248559900246eb4e7ceebd\""
name = "viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
content = "\"brevo-code:a6ef1dd91b248559900246eb4e7ceebd\""
name = "viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "mail_mta_sts" {
content = "\"v=STSv1; id=20260412\""
name = "_mta-sts.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
content = "\"v=STSv1; id=20260412\""
name = "_mta-sts.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "mail_tlsrpt" {
content = "\"v=TLSRPTv1; rua=mailto:postmaster@viktorbarzin.me\""
name = "_smtp._tls.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
content = "\"v=TLSRPTv1; rua=mailto:postmaster@viktorbarzin.me\""
name = "_smtp._tls.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "mail_dmarc" {

View file

@ -6,7 +6,8 @@ resource "kubernetes_namespace" "cloudflared" {
metadata {
name = "cloudflared"
labels = {
tier = var.tier
tier = var.tier
"keel.sh/enrolled" = "true"
}
}
lifecycle {

View file

@ -52,6 +52,7 @@ resource "kubernetes_namespace" "coturn" {
name = "coturn"
labels = {
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -194,8 +195,13 @@ resource "kubernetes_deployment" "coturn" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}

View file

@ -29,6 +29,7 @@ resource "kubernetes_namespace" "crowdsec" {
labels = {
tier = var.tier
"resource-governance/custom-quota" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -282,7 +283,7 @@ module "ingress" {
dns_type = "proxied"
namespace = kubernetes_namespace.crowdsec.metadata[0].name
name = "crowdsec-web"
protected = true
auth = "required"
tls_secret_name = var.tls_secret_name
exclude_crowdsec = true
}

View file

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "cyberchef"
}
}

View file

@ -9,6 +9,7 @@ resource "kubernetes_namespace" "cyberchef" {
name = "cyberchef"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -77,8 +78,12 @@ resource "kubernetes_deployment" "cyberchef" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -105,22 +110,24 @@ resource "kubernetes_service" "cyberchef" {
module "anubis" {
source = "../../modules/kubernetes/anubis_instance"
name = "cc"
namespace = kubernetes_namespace.cyberchef.metadata[0].name
target_url = "http://${kubernetes_service.cyberchef.metadata[0].name}.${kubernetes_namespace.cyberchef.metadata[0].name}.svc.cluster.local"
source = "../../modules/kubernetes/anubis_instance"
name = "cc"
namespace = kubernetes_namespace.cyberchef.metadata[0].name
target_url = "http://${kubernetes_service.cyberchef.metadata[0].name}.${kubernetes_namespace.cyberchef.metadata[0].name}.svc.cluster.local"
shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/5"
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
dns_type = "proxied"
namespace = kubernetes_namespace.cyberchef.metadata[0].name
name = "cc"
service_name = module.anubis.service_name
port = module.anubis.service_port
extra_middlewares = ["traefik-x402@kubernetescrd"]
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "CyberChef"
@ -130,3 +137,14 @@ module "ingress" {
"gethomepage.dev/pod-selector" = ""
}
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -9,6 +9,10 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
}
}

View file

@ -16,6 +16,7 @@ resource "kubernetes_namespace" "dashy" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -100,8 +101,13 @@ resource "kubernetes_deployment" "dashy" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -132,5 +138,5 @@ module "ingress" {
namespace = kubernetes_namespace.dashy.metadata[0].name
name = "dashy"
tls_secret_name = var.tls_secret_name
protected = true # hidden as we use homepage now
auth = "required" # hidden as we use homepage now
}

View file

@ -17,6 +17,7 @@ resource "kubernetes_namespace" "dawarich" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
}
@ -325,7 +326,13 @@ resource "kubernetes_deployment" "dawarich" {
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -432,7 +439,13 @@ resource "kubernetes_service" "dawarich" {
# }
# }
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
source = "../../modules/kubernetes/ingress_factory"
# owntracks bridge hook posts to /api/v1/owntracks/points?api_key=... from
# outside the cluster; mobile location apps also POST programmatically with
# an api_key. Forward-auth would 302 these clients into a login they can't
# complete. Dawarich enforces api_key at app layer.
# auth = "none": Location tracking API mobile apps + OwnTracks bridge POST via api_key; forward-auth 302s break programmatic clients.
auth = "none"
dns_type = "proxied"
namespace = kubernetes_namespace.dawarich.metadata[0].name
name = "dawarich"

View file

@ -131,6 +131,18 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
"app.kubernetes.io/instance" = "mysql-standalone"
"app.kubernetes.io/component" = "primary"
}
# Explicit Keel opt-out. The dbaas namespace is already excluded
# from the `inject-keel-annotations` Kyverno ClusterPolicy, but the
# StatefulSet historically picked up Keel annotations anyway (from
# an earlier version of that policy that didn't have the exclusion
# list). `keel.sh/policy: never` makes Keel skip this resource even
# if those legacy annotations are still present, so we cannot be
# silently bumped to a new MySQL version again.
#
# Lifting this MUST go through docs/plans/2026-05-19-mysql-8.4.9-upgrade-*.
annotations = {
"keel.sh/policy" = "never"
}
}
spec {
service_name = "mysql-standalone"
@ -167,8 +179,28 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
}
container {
name = "mysql"
image = "mysql:8.4"
name = "mysql"
#
# DO NOT BUMP THIS IMAGE WITHOUT FOLLOWING THE PLAN
#
# Pinned to mysql:8.4.8 EXACTLY. The in-server DD upgrade from
# 80408 80409 stalls reliably on this hardware (24s of writes
# then no progress, no CPU, never completes). The 2026-05-18
# recovery from the failed auto-bump took ~25 min of full
# MySQL downtime + Forgejo/registry/7 apps cascade.
#
# To go to 8.4.9 (or any later version), follow:
# docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
# docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
# Beads: code-963q
#
# The upgrade path is wipe + re-init (NOT in-place DD upgrade).
# Requires: maintenance window, fresh dump, Vault user reset.
#
# History: code-eme8 (initial outage), code-k40p (recovery).
# See also: docs/runbooks/restore-mysql.md.
#
image = "mysql:8.4.8"
port {
container_port = 3306
@ -240,7 +272,7 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
metadata {
name = "data"
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "50Gi"
}
@ -346,7 +378,7 @@ resource "kubernetes_persistent_volume_claim" "pgadmin_encrypted" {
name = "dbaas-pgadmin-encrypted"
namespace = kubernetes_namespace.dbaas.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -360,6 +392,13 @@ resource "kubernetes_persistent_volume_claim" "pgadmin_encrypted" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
module "nfs_postgresql_backup_host" {
@ -791,7 +830,7 @@ module "ingress" {
namespace = kubernetes_namespace.dbaas.metadata[0].name
name = "pma"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {}
}
@ -1043,12 +1082,12 @@ module "ingress" {
# Ensure the CNPG cluster manifest exists (idempotent kubectl apply)
resource "null_resource" "pg_cluster" {
triggers = {
instances = "2"
instances = "3"
image = "ghcr.io/cloudnative-pg/postgis:16"
storage_size = "20Gi"
storage_class = "proxmox-lvm-encrypted"
memory_limit = "2Gi"
pg_params = "v2-shared512-walcomp-workmem16"
memory_limit = "3Gi"
pg_params = "v3-shared1024-walcomp-workmem16-max200"
}
provisioner "local-exec" {
@ -1060,13 +1099,26 @@ resource "null_resource" "pg_cluster" {
name: pg-cluster
namespace: dbaas
spec:
instances: 2
# 3 instances (1 primary + 2 replicas) so a single-node drain (e.g.
# kured's weekly OS-reboot wave) still leaves a primary candidate
# immediately available for switchover. Previously 2; CNPG would
# still failover with 2 but only if the lone replica was caught up
# during a long WAL backlog the failover would stall the drain.
# Bumped 2026-05-16 ahead of Monday's first post-fix kured cycle.
instances: 3
imageName: ghcr.io/cloudnative-pg/postgis:16
postgresql:
parameters:
search_path: '"$user", public'
shared_buffers: "512MB"
effective_cache_size: "1536MB"
# Cluster grew past the 100-conn default ceiling (~90/100 idle
# steady-state in May 2026; authentik+matrix alone hold ~55).
# Bumped to 200 with shared_buffers/effective_cache_size/memory
# scaled proportionally. work_mem stays at 16MB that's per
# sort/hash op, not per connection, so 16MB * 200 isn't the
# worst case.
max_connections: "200"
shared_buffers: "1024MB"
effective_cache_size: "2560MB"
work_mem: "16MB"
wal_compression: "on"
random_page_cost: "4"
@ -1075,7 +1127,9 @@ resource "null_resource" "pg_cluster" {
enableSuperuserAccess: true
inheritedMetadata:
annotations:
resize.topolvm.io/threshold: "80%"
# threshold = free-space % below which autoresizer expands.
# 10% means "expand when 90% used" (the conventional knob).
resize.topolvm.io/threshold: "10%"
resize.topolvm.io/increase: "20%"
resize.topolvm.io/storage_limit: "100Gi"
storage:
@ -1084,9 +1138,9 @@ resource "null_resource" "pg_cluster" {
resources:
requests:
cpu: "50m"
memory: "2Gi"
memory: "3Gi"
limits:
memory: "2Gi"
memory: "3Gi"
EOF
EOT
}
@ -1149,7 +1203,8 @@ resource "null_resource" "pg_terraform_state_db" {
provisioner "local-exec" {
command = <<-EOT
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'terraform_state'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE terraform_state WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1173,7 +1228,8 @@ resource "null_resource" "pg_payslip_ingest_db" {
provisioner "local-exec" {
command = <<-EOT
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'payslip_ingest'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE payslip_ingest WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1197,7 +1253,8 @@ resource "null_resource" "pg_job_hunter_db" {
provisioner "local-exec" {
command = <<-EOT
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'job_hunter'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE job_hunter WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1209,6 +1266,35 @@ resource "null_resource" "pg_job_hunter_db" {
}
}
# Postiz: 3 databases (postiz, temporal, temporal_visibility) all owned by the
# `postiz` role. Bundled bitnami PostgreSQL was retired 2026-05-09 in favour of
# this CNPG cluster covered by postgresql-backup-per-db automatically.
# Role password placeholder; Vault static role `pg-postiz` rotates 7d.
resource "null_resource" "pg_postiz_dbs" {
depends_on = [null_resource.pg_cluster]
triggers = {
role = "postiz"
dbs = "postiz,temporal,temporal_visibility"
}
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'postiz'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE postiz WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
for db in postiz temporal temporal_visibility; do
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'$db'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE DATABASE $db OWNER postiz"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE $db TO postiz"
done
'
EOT
}
}
# Create wealthfolio_sync database for the SQLitePG ETL sidecar that mirrors
# Wealthfolio's daily_account_valuation/accounts/activities into PG so Grafana
# can chart net worth, contributions, and growth.
@ -1264,6 +1350,35 @@ resource "null_resource" "pg_fire_planner_db" {
}
}
# Create instagram_poster database for the IG-curation pipeline. Initial use:
# benchmark_score table written by `instagram_poster.benchmark` CLI (vision-LLM
# scoring per Immich asset). Future: migrate story_queue/decision/ig_posted_media
# off the pod's sqlite PVC into this DB so the pod is fully stateless.
# Role password is managed by Vault Database Secrets Engine
# (static role `pg-instagram-poster`, 7d rotation).
resource "null_resource" "pg_instagram_poster_db" {
depends_on = [null_resource.pg_cluster]
triggers = {
db_name = "instagram_poster"
username = "instagram_poster"
}
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'instagram_poster'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE instagram_poster WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'instagram_poster'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE DATABASE instagram_poster OWNER instagram_poster"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE instagram_poster TO instagram_poster"
'
EOT
}
}
# Old PostgreSQL deployment kept commented for rollback reference
# resource "kubernetes_deployment" "postgres" {
# metadata {
@ -1400,7 +1515,7 @@ module "ingress-pgadmin" {
namespace = kubernetes_namespace.dbaas.metadata[0].name
name = "pgadmin"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
}

View file

@ -4,7 +4,8 @@ resource "kubernetes_namespace" "descheduler" {
metadata {
name = "descheduler"
labels = {
tier = local.tiers.cluster
tier = local.tiers.cluster
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -94,3 +95,14 @@ resource "helm_release" "descheduler" { # rename me
values = [templatefile("${path.module}/values.yaml", {})]
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -10,6 +10,7 @@ resource "kubernetes_namespace" "diun" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -91,7 +92,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "diun-data-proxmox"
namespace = kubernetes_namespace.diun.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -105,6 +106,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "diun" {
@ -230,6 +238,12 @@ resource "kubernetes_deployment" "diun" {
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}

View file

@ -17,6 +17,7 @@ resource "kubernetes_namespace" "ebook2audiobook" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.gpu
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -120,8 +121,13 @@ resource "kubernetes_deployment" "ebook2audiobook" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -254,7 +260,7 @@ module "ingress" {
namespace = kubernetes_namespace.ebook2audiobook.metadata[0].name
name = "ebook2audiobook"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Ebook2Audiobook"
@ -322,8 +328,13 @@ resource "kubernetes_deployment" "audiblez" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -412,8 +423,13 @@ resource "kubernetes_deployment" "audiblez-web" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -445,7 +461,7 @@ module "audiblez-web-ingress" {
host = "audiblez"
dns_type = "non-proxied"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
max_body_size = "500m" # Allow large EPUB uploads
extra_annotations = {
"gethomepage.dev/enabled" = "true"

View file

@ -9,6 +9,7 @@ resource "kubernetes_namespace" "ebooks" {
name = "ebooks"
labels = {
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -150,7 +151,7 @@ resource "kubernetes_persistent_volume_claim" "calibre_config_iscsi" {
name = "ebooks-calibre-config-proxmox"
namespace = kubernetes_namespace.ebooks.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "50%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
@ -164,6 +165,13 @@ resource "kubernetes_persistent_volume_claim" "calibre_config_iscsi" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
module "nfs_calibre_ingest_host" {
@ -205,7 +213,7 @@ resource "kubernetes_persistent_volume_claim" "abs_config_proxmox" {
name = "ebooks-abs-config-proxmox"
namespace = kubernetes_namespace.ebooks.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -219,6 +227,13 @@ resource "kubernetes_persistent_volume_claim" "abs_config_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
module "nfs_audiobookshelf_metadata_host" {
@ -350,7 +365,13 @@ resource "kubernetes_deployment" "calibre-web-automated" {
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -378,6 +399,7 @@ resource "kubernetes_service" "calibre" {
module "calibre_ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "required"
dns_type = "proxied"
namespace = kubernetes_namespace.ebooks.metadata[0].name
name = "calibre"
@ -470,7 +492,13 @@ resource "kubernetes_deployment" "annas-archive-stacks" {
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -502,7 +530,7 @@ module "stacks_ingress" {
name = "stacks"
service_name = "annas-archive-stacks"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {
"gethomepage.dev/enabled" = "false"
}
@ -619,7 +647,13 @@ resource "kubernetes_deployment" "audiobookshelf" {
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -646,7 +680,11 @@ resource "kubernetes_service" "audiobookshelf" {
}
module "audiobookshelf_ingress" {
source = "../../modules/kubernetes/ingress_factory"
source = "../../modules/kubernetes/ingress_factory"
# auth = "app": Audiobookshelf has its own user/password login + API
# tokens used by the iOS/Android Audiobookshelf app. Authentik forward-auth
# was 302-ing the mobile clients; ABS's own auth gates users.
auth = "app"
dns_type = "non-proxied"
namespace = kubernetes_namespace.ebooks.metadata[0].name
name = "audiobookshelf"
@ -890,7 +928,13 @@ resource "kubernetes_deployment" "book_search" {
}
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -921,7 +965,7 @@ module "book_search_ingress" {
namespace = kubernetes_namespace.ebooks.metadata[0].name
name = "book-search"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Book Search"
@ -940,6 +984,7 @@ module "book_search_api_ingress" {
host = "book-search"
service_name = "book-search"
tls_secret_name = var.tls_secret_name
protected = false
# auth = "none": Book Search API endpoints API key auth handled by backend; forward-auth would block downloads.
auth = "none"
ingress_path = ["/api/download-url", "/api/download-status", "/api/send-to-kindle", "/shortcut"]
}

View file

@ -10,6 +10,7 @@ resource "kubernetes_namespace" "echo" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -74,8 +75,13 @@ resource "kubernetes_deployment" "echo" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -101,7 +107,11 @@ resource "kubernetes_service" "echo" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
source = "../../modules/kubernetes/ingress_factory"
# echo is a header-reflecting diagnostic public so it's reachable for
# forward-auth smoke-testing. Anyone visiting echo.viktorbarzin.me sees
# exactly which X-authentik-* headers Traefik forwarded to backends.
auth = "public"
dns_type = "proxied"
namespace = kubernetes_namespace.echo.metadata[0].name
name = "echo"

View file

@ -11,6 +11,7 @@ resource "kubernetes_namespace" "excalidraw" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -32,7 +33,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "excalidraw-data-proxmox"
namespace = kubernetes_namespace.excalidraw.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -46,6 +47,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "excalidraw" {
@ -117,8 +125,13 @@ resource "kubernetes_deployment" "excalidraw" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -149,7 +162,7 @@ module "ingress" {
namespace = kubernetes_namespace.excalidraw.metadata[0].name
name = "draw"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Excalidraw"

View file

@ -3,6 +3,7 @@ resource "kubernetes_namespace" "external_secrets" {
name = "external-secrets"
labels = {
tier = local.tiers.cluster
"keel.sh/enrolled" = "true"
}
}
lifecycle {

View file

@ -13,6 +13,7 @@ resource "kubernetes_namespace" "f1-stream" {
"istio-injection" : "disabled"
tier = local.tiers.aux
"chrome-service.viktorbarzin.me/client" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -83,7 +84,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "f1-stream-data-proxmox"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -97,6 +98,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "f1-stream" {
@ -195,8 +203,12 @@ resource "kubernetes_deployment" "f1-stream" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -237,11 +249,12 @@ module "tls_secret" {
# (which load before any user has a chance to solve PoW), CHALLENGE
# everything else the HTML pages.
module "anubis" {
source = "../../modules/kubernetes/anubis_instance"
name = "f1"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
target_url = "http://${kubernetes_service.f1-stream.metadata[0].name}.${kubernetes_namespace.f1-stream.metadata[0].name}.svc.cluster.local"
policy_yaml = <<-EOT
source = "../../modules/kubernetes/anubis_instance"
name = "f1"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
target_url = "http://${kubernetes_service.f1-stream.metadata[0].name}.${kubernetes_namespace.f1-stream.metadata[0].name}.svc.cluster.local"
shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/6"
policy_yaml = <<-EOT
bots:
- import: (data)/bots/_deny-pathological.yaml
- import: (data)/bots/aggressive-brazilian-scrapers.yaml
@ -262,6 +275,11 @@ module "anubis" {
- name: f1-data-routes
path_regex: ^/(embed|embed-asset|extract|extractors|health|proxy|relay|schedule|streams)(/|\?|$)
action: ALLOW
# Allow non-GET methods unconditionally AI scrapers GET the body,
# they don't POST. Mutating XHRs and CORS preflight need to bypass.
- name: allow-non-get-methods
action: ALLOW
expression: method != "GET"
- name: catchall-challenge
path_regex: .*
action: CHALLENGE
@ -270,6 +288,7 @@ module "anubis" {
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
dns_type = "non-proxied"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
name = "f1"
@ -288,3 +307,14 @@ module "ingress" {
"gethomepage.dev/pod-selector" = ""
}
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -33,6 +33,8 @@ resource "kubernetes_namespace" "fire_planner" {
# for headless verification (NetworkPolicy in chrome-service ns admits
# any namespace carrying this label).
"chrome-service.viktorbarzin.me/client" = "true"
# Opt into Keel auto-update (inject-keel-annotations ClusterPolicy).
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -230,9 +232,10 @@ resource "kubernetes_deployment" "fire_planner" {
}
init_container {
name = "alembic-migrate"
image = local.image
command = ["python", "-m", "fire_planner", "migrate"]
name = "alembic-migrate"
image = local.image
image_pull_policy = "Always"
command = ["python", "-m", "fire_planner", "migrate"]
env_from {
secret_ref {
@ -310,7 +313,12 @@ resource "kubernetes_deployment" "fire_planner" {
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
depends_on = [
@ -420,6 +428,77 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" {
]
}
# Weekly refresh of the COL cache: walks col_snapshot for rows
# expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With
# the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most
# Sundays the job is a no-op until rows age out. Schedule Sunday 04:00
# UTC so Numbeo's contributor activity (mostly weekday) doesn't race
# our reads.
resource "kubernetes_cron_job_v1" "fire_planner_col_refresh" {
metadata {
name = "fire-planner-col-refresh"
namespace = kubernetes_namespace.fire_planner.metadata[0].name
}
spec {
schedule = "0 4 * * 0"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
starting_deadline_seconds = 600
job_template {
metadata {
labels = local.labels
}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
template {
metadata {
labels = local.labels
}
spec {
restart_policy = "OnFailure"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "col-refresh"
image = local.image
command = ["python", "-m", "fire_planner", "col-refresh-stale", "--within-days", "7"]
env_from {
secret_ref {
name = "fire-planner-db-creds"
}
}
resources {
requests = {
cpu = "100m"
memory = "256Mi"
}
limits = {
memory = "512Mi"
}
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
depends_on = [
kubernetes_manifest.db_external_secret,
]
}
# Public ingress at fire-planner.viktorbarzin.me. Authentik-protected
# (forward-auth at the Traefik layer); Cloudflare-proxied for CDN +
# DDoS shielding. Backend FastAPI serves the SPA at / and the API
@ -431,7 +510,7 @@ module "ingress" {
name = "fire-planner"
port = 8080
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "FIRE Planner"
@ -443,11 +522,14 @@ module "ingress" {
# Second ingress at the same host for the /api/ prefix WITHOUT Authentik
# forward-auth. The SPA loads under Authentik (main ingress at /), then its
# fetch() XHRs hit /api/* directly forward-auth on /api/* would 302 the
# XHR to a cross-origin Authentik login page, which fetch().json() can't
# parse. App-layer bearer auth still gates writes (POST/PATCH/DELETE on
# scenarios, /recompute, /simulate); read endpoints are open. Acceptable
# for a personal tool whose only data is anonymous numeric projections.
# fetch() XHRs hit /api/* directly ANY forward-auth here (required OR
# public-tier auto-bind) would 302 the XHR to a cross-origin Authentik
# login page, which fetch() rejects under CORS preflight rules. Even the
# `auth = "public"` flow needs a 302+cookie dance on first visit to set
# the guest session cookie, so it doesn't help XHR APIs. App-layer bearer
# auth still gates writes (POST/PATCH/DELETE on scenarios, /recompute,
# /simulate); read endpoints are open. Acceptable for a personal tool
# whose only data is anonymous numeric projections.
module "ingress_api" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "none"
@ -458,7 +540,8 @@ module "ingress_api" {
port = 8080
ingress_path = ["/api/"]
tls_secret_name = var.tls_secret_name
protected = false
# auth = "none": XHR-based API endpoints; forward-auth 302+cookie-dance breaks CORS preflight and browser fetch().
auth = "none"
}
# Plan-time read of the ESO-created K8s Secret for Grafana datasource
@ -514,3 +597,6 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" {
})
}
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00

View file

@ -9,6 +9,7 @@ resource "kubernetes_namespace" "foolery" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -65,7 +66,7 @@ module "ingress" {
namespace = kubernetes_namespace.foolery.metadata[0].name
name = "foolery"
tls_secret_name = var.tls_secret_name
protected = true
auth = "required"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Foolery"

View file

@ -10,6 +10,7 @@ resource "kubernetes_namespace" "forgejo" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -30,7 +31,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "forgejo-data-encrypted"
namespace = kubernetes_namespace.forgejo.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "50%"
"resize.topolvm.io/storage_limit" = "50Gi"
}
@ -140,6 +141,16 @@ resource "kubernetes_deployment" "forgejo" {
name = "FORGEJO__packages__ENABLED"
value = "true"
}
# Disable source archive ZIP/TAR generation. Bots crawling
# /<owner>/<repo>/archive/<sha>.zip on dot_files (and similar
# vim-plugin trees) caused 9.9s 500s and chewed ~440m sustained
# CPU. Git clone / OCI registry / API are unaffected only
# /archive/* URLs return 404 now. Toggle back to "false" if a
# legitimate consumer needs source ZIPs.
env {
name = "FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES"
value = "true"
}
volume_mount {
name = "data"
mount_path = "/data"
@ -169,8 +180,13 @@ resource "kubernetes_deployment" "forgejo" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -194,7 +210,12 @@ resource "kubernetes_service" "forgejo" {
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
source = "../../modules/kubernetes/ingress_factory"
# Git + OCI registry (/v2/) native clients (git, docker/podman) use HTTP
# basic-auth / bearer tokens, NOT browser sessions. Forward-auth would 302
# them into a redirect they can't follow.
# auth = "none": Git + OCI registry clients use HTTP Basic auth / bearer tokens; native CLI tools cannot follow forward-auth redirects.
auth = "none"
dns_type = "non-proxied"
namespace = kubernetes_namespace.forgejo.metadata[0].name
name = "forgejo"

View file

@ -225,7 +225,7 @@ module "ingress" {
name = "music-${var.name}"
tls_secret_name = var.tls_secret_name
dns_type = "non-proxied"
protected = var.protected
auth = var.protected ? "required" : "none"
extra_annotations = var.extra_annotations
}
@ -235,9 +235,9 @@ resource "kubernetes_ingress_v1" "stream-noauth" {
name = "music-${var.name}-stream"
namespace = "freedify"
annotations = {
"traefik.ingress.kubernetes.io/router.middlewares" = "traefik-retry@kubernetescrd,traefik-rate-limit@kubernetescrd"
"traefik.ingress.kubernetes.io/router.entrypoints" = "websecure"
"traefik.ingress.kubernetes.io/router.priority" = "100"
"traefik.ingress.kubernetes.io/router.middlewares" = "traefik-retry@kubernetescrd,traefik-rate-limit@kubernetescrd"
"traefik.ingress.kubernetes.io/router.entrypoints" = "websecure"
"traefik.ingress.kubernetes.io/router.priority" = "100"
}
}
spec {

View file

@ -55,6 +55,7 @@ resource "kubernetes_namespace" "freedify" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -98,14 +99,14 @@ module "viktor" {
# https://music-emo.viktorbarzin.me/
module "emo" {
source = "./factory"
name = "emo"
tag = "latest"
tls_secret_name = var.tls_secret_name
depends_on = [kubernetes_namespace.freedify]
tier = local.tiers.aux
protected = true
genius_token = lookup(local.credentials["emo"], "genius_token", null)
source = "./factory"
name = "emo"
tag = "latest"
tls_secret_name = var.tls_secret_name
depends_on = [kubernetes_namespace.freedify]
tier = local.tiers.aux
protected = true
genius_token = lookup(local.credentials["emo"], "genius_token", null)
gemini_api_key = lookup(local.credentials["emo"], "gemini_api_key", null)
navidrome_scan_url = data.kubernetes_secret.eso_secrets.data["navidrome_scan_url"]
ha_sofia_url = lookup(data.kubernetes_secret.eso_secrets.data, "ha_sofia_url", "")

View file

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

View file

@ -8,6 +8,7 @@ resource "kubernetes_namespace" "immich" {
name = "freshrss"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -67,7 +68,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "freshrss-data-proxmox"
namespace = kubernetes_namespace.immich.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -81,6 +82,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
@ -89,7 +97,7 @@ resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
name = "freshrss-extensions-proxmox"
namespace = kubernetes_namespace.immich.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -103,6 +111,13 @@ resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
@ -189,8 +204,12 @@ resource "kubernetes_deployment" "freshrss" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -214,7 +233,11 @@ resource "kubernetes_service" "freshrss" {
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
source = "../../modules/kubernetes/ingress_factory"
# auth = "app": FreshRSS has built-in user login and exposes Fever +
# GReader APIs (/api/fever.php, /api/greader.php) used by mobile RSS
# readers like Reeder/FeedMe. Authentik forward-auth was 302-ing those.
auth = "app"
dns_type = "proxied"
namespace = "freshrss"
name = "rss"
@ -233,3 +256,6 @@ module "ingress" {
"gethomepage.dev/widget.password" = local.homepage_credentials["freshrss"]["password"]
}
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00

View file

@ -9,6 +9,10 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
}
}

Some files were not shown because too many files have changed in this diff Show more