fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-09 08:45:33 +00:00
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions


@ -0,0 +1,74 @@
# Runbook: kube-apiserver Audit Logging
**Status:** enabled 2026-06-06 on `k8s-master` (10.0.20.100, the single
control-plane node). Motivated by the novelapp incident — a workload was
deleted with no way to attribute it, because apiserver audit logging had never
been on (see post-incident note below).
## What is configured
- **Audit policy:** `infra/scripts/k8s-apiserver-audit-policy.yaml` (source of
truth), deployed to `/etc/kubernetes/audit-policy.yaml` on k8s-master.
Low-write by design: drops reads (get/list/watch), high-churn resources
(events, leases, endpointslices, token/subjectaccess reviews), and probe
URLs; logs everything else (create/update/patch/delete) at **Metadata**
level (who/verb/resource/namespace/name/time/sourceIP — no bodies).
`omitStages: [RequestReceived]` → one line per mutating request.
- **kube-apiserver static-pod manifest** (`/etc/kubernetes/manifests/kube-apiserver.yaml`):
`--audit-policy-file=/etc/kubernetes/audit-policy.yaml`,
`--audit-log-path=/var/log/kubernetes/audit/audit.log`,
`--audit-log-maxage=30 --audit-log-maxbackup=10 --audit-log-maxsize=100`
(up to ~1.1 GB on disk: ten 100 MB backups plus the active file; rotated files are kept for at most 30 days), plus the `audit-policy` (File, RO) and
`audit-logs` (DirectoryOrCreate) hostPath volumes/mounts.
- **Persistence across `kubeadm upgrade`:** the same flags + volumes are in the
`kubeadm-config` ConfigMap (`kube-system`), `ClusterConfiguration.apiServer.{extraArgs,extraVolumes}`
(v1beta4). Without this, a control-plane upgrade regenerates the manifest and
silently drops the audit (and OIDC) flags. The OIDC flags are recorded there too (see
below).
- **Shipping to Loki:** the Alloy DaemonSet
(`infra/stacks/monitoring/modules/monitoring/alloy.yaml`) tails
`/var/log/kubernetes/audit/audit.log` (it schedules on the control-plane node
and mounts host `/var/log`). Query in Loki/Grafana with
`{job="kubernetes-audit"}`.
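
As a rough check on the disk bound implied by the rotation flags in the list above, the worst case is the active file plus every retained backup. A minimal sketch; the helper name `audit_max_mb` is illustrative and not part of the repo:

```bash
# Hypothetical helper: worst-case audit log footprint in MB.
# The active audit.log can grow to maxsize before it rotates, and up to
# maxbackup rotated files are retained alongside it.
audit_max_mb() {
  local maxsize_mb="$1" maxbackup="$2"
  echo $(( maxsize_mb * (maxbackup + 1) ))
}

audit_max_mb 100 10   # prints 1100
```

With `--audit-log-maxsize=100` and `--audit-log-maxbackup=10` this comes to about 1.1 GB, assuming rotated files are not compressed.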
## How to attribute a change ("who deleted X, when")
```
# In Loki (Grafana Explore or logcli), last 24h:
{job="kubernetes-audit"} |= "delete" |= "<resource-name>"
```
Each entry is a JSON `audit.k8s.io/v1` Event: `user.username`, `verb`,
`objectRef.{resource,namespace,name}`, `requestReceivedTimestamp`,
`sourceIPs`, `userAgent`. On-node fallback (Loki down):
`sudo grep <name> /var/log/kubernetes/audit/audit.log` on k8s-master.
Note: direct `kubectl`/dashboard calls now show the real identity (user SA or
OIDC email). Pre-2026-06-06 deletions are NOT recoverable (audit was off).
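
For the on-node fallback, the following is a minimal sketch of pulling "who and when" out of the raw log without `jq`. It assumes one compact JSON event per line, with `user` serialized before `requestReceivedTimestamp`; the function name `audit_who` is hypothetical.

```bash
# Hypothetical helper: print "<timestamp> <username>" for every audit line
# that matches a verb and an object name.
# Assumes compact JSON (no spaces after ':'), one event per line.
audit_who() {
  local verb="$1" name="$2" file="$3"
  grep -F "\"verb\":\"$verb\"" "$file" |
    grep -F "\"name\":\"$name\"" |
    sed -n 's/.*"username":"\([^"]*\)".*"requestReceivedTimestamp":"\([^"]*\)".*/\2 \1/p'
}

# Usage on k8s-master (not run here):
#   sudo cat /var/log/kubernetes/audit/audit.log > /tmp/audit.log
#   audit_who delete <resource-name> /tmp/audit.log
```

Where `jq` is installed it is the more robust choice, since it does not depend on field order.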
## CRITICAL gotcha that blocked this (and OIDC) for weeks
`kubelet` runs **every** non-dotfile in its `staticPodPath`
(`/etc/kubernetes/manifests`) as a static pod. A stray
`kube-apiserver.yaml.bak.<epoch>` left in that directory (from an earlier manual
edit) was a **second** manifest defining pod `kube-apiserver`. kubelet ran the
older `.bak` copy and ignored edits to the real `kube-apiserver.yaml` — so newly
added flags (the OIDC flags, then these audit flags) never reached the running
process even though the file clearly had them. Symptom: the running apiserver's
`/proc/<pid>/cmdline` (or `crictl inspect` args) is SHORTER than the manifest's
`command:` list. Fix: move any `*.bak`/backup OUT of `/etc/kubernetes/manifests/`.
**Always back up control-plane manifests to a sibling dir (e.g.
`/etc/kubernetes/`), never inside `manifests/`.** This also un-blocked OIDC
(memory id=4042) as a side effect.
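
A quick guard against this gotcha is to list anything in the manifests directory that is not a plain `*.yaml` file. The sketch below is a heuristic under that assumption (it will not catch a duplicate that itself ends in `.yaml`), and the function name `stray_manifests` is hypothetical.

```bash
# Hypothetical helper: print files that kubelet would load as static pods
# but that do not look like intended manifests.
# The shell glob skips dotfiles, which mirrors kubelet ignoring them.
stray_manifests() {
  local dir="$1" f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue        # skip directories and an unmatched glob
    case "${f##*/}" in
      *.yaml) ;;                   # expected manifest name
      *) printf '%s\n' "$f" ;;     # e.g. kube-apiserver.yaml.bak.<epoch>
    esac
  done
}

# Usage on k8s-master (not run here):
#   stray_manifests /etc/kubernetes/manifests
```

Any path it prints should be moved to a sibling directory such as `/etc/kubernetes/`.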
## Rollback
Backups live in `/etc/kubernetes/apiserver-manifest-archive/` on k8s-master
(the 27-arg pre-audit known-good, and the 36-arg desired). To disable audit:
remove the `--audit-*` flags + audit volumes from the manifest (kubelet
restarts the apiserver in ~30-40s), and remove them from `kubeadm-config`. A bad
manifest edit only needs the known-good copied back over
`/etc/kubernetes/manifests/kube-apiserver.yaml`.
Editing the apiserver manifest restarts the apiserver → ~30-40s API blip on this
single-control-plane cluster. Always edit from a backup + watch
`curl -sk https://10.0.20.100:6443/livez` before declaring success.
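
The "watch `/livez`" step can be scripted as a bounded poll. This is a sketch only: `wait_for_ok` is a hypothetical helper that runs whatever probe command it is given and succeeds once the probe prints exactly `ok`.

```bash
# Hypothetical helper: retry a probe command until it prints "ok".
#   $1 = max attempts, $2 = seconds between attempts, rest = probe command
wait_for_ok() {
  local tries="$1" delay="$2"
  shift 2
  local i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$("$@" 2>/dev/null)" = "ok" ]; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Usage after a manifest edit (not run here); 30 x 2s covers the ~30-40s blip:
#   wait_for_ok 30 2 curl -sk https://10.0.20.100:6443/livez || echo "apiserver not healthy"
```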


@ -0,0 +1,188 @@
# Beads Auto-Dispatch Runbook
Users can hand work to the headless `beads-task-runner` agent by assigning a
bead to the sentinel user `agent`. Two CronJobs in the `beads-server`
namespace drive the pipeline:
- **`beads-dispatcher`** — every 2 min: picks up the highest-priority
`assignee=agent`/`status=open` bead with non-empty acceptance criteria,
claims it by flipping to `in_progress`, and POSTs it to BeadBoard's
`/api/agent-dispatch`. BeadBoard forwards to `claude-agent-service` with
the existing bearer-token flow.
- **`beads-reaper`** — every 10 min: flips any `assignee=agent` +
`status=in_progress` bead whose `updated_at` is older than 30 min to
`status=blocked` with an explanatory note. Catches pod crashes mid-run.
The manual BeadBoard Dispatch button continues to work in parallel.
## Flow diagram
```
user: bd assign <id> agent
Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐
│ │
▼ │
CronJob: beads-dispatcher │
1. GET beadboard/api/agent-status (busy?) │
2. bd query 'assignee=agent AND status=open' │
3. bd update -s in_progress (claim) │
4. POST beadboard/api/agent-dispatch │
5. bd note "dispatched: job=…" │
│ │
▼ │
claude-agent-service /execute │
beads-task-runner agent runs; notes/closes bead │
│ │
▼ │
done ──► next tick picks up the next bead ───────────────┘
CronJob: beads-reaper (every 10 min)
for bead (assignee=agent, status=in_progress, updated_at > 30 min):
bd note "reaper: no progress for Nm — blocking"
bd update -s blocked
```
## Usage
### Hand a bead to the agent
```
bd create "Title" \
-d "Full context — files, services, error messages. Any agent with no prior context must be able to execute this." \
--acceptance "Concrete, verifiable criteria" \
-p 2
bd assign <new-id> agent
```
**Acceptance criteria is required.** Beads without it are skipped by the
dispatcher and stay in `open` forever. This is intentional — the
`beads-task-runner` agent expects clear done conditions.
### Take a bead back (unassign)
```
bd assign <id> ""
```
If the bead is already `in_progress`, also reset it:
```
bd update <id> -s open
```
### Pause auto-dispatch
```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
```
This sets `spec.suspend: true` on both CronJobs. Existing running jobs
continue; no new ticks fire. Re-enable by re-applying with
`beads_dispatcher_enabled=true` (the default). Manual BeadBoard Dispatch
remains available while paused.
### Read the logs
```
# Recent dispatcher runs
kubectl -n beads-server get jobs --selector=job-name --sort-by=.metadata.creationTimestamp | grep beads-dispatcher | tail
kubectl -n beads-server logs job/<dispatcher-job-name>
# Tail the underlying agent once a bead dispatches
kubectl -n claude-agent logs -l app=claude-agent-service -f
# Inspect reaper decisions
kubectl -n beads-server get jobs | grep beads-reaper | tail
kubectl -n beads-server logs job/<reaper-job-name>
```
### Inspect a specific bead's dispatch history
```
bd show <id> --json | jq '{status, assignee, notes, updated_at}'
```
Both the dispatcher and reaper write dated notes (`auto-dispatcher claimed
at…`, `dispatched: job=…`, `reaper: no progress for…`) so the audit trail
lives on the bead itself.
## Reaper semantics — when a bead becomes `blocked`
The reaper flips a bead to `blocked` if:
- `assignee = agent`, AND
- `status = in_progress`, AND
- `updated_at` is more than **30 minutes** in the past.
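
The three conditions above can be sketched as a single predicate. This is an illustration, not the reaper's actual script: `reaper_should_block` is a hypothetical name, and it takes epoch seconds, so a real implementation would first convert `updated_at` from whatever format `bd` emits.

```bash
# Hypothetical helper: succeeds only when a bead meets all three conditions.
#   $1 = assignee, $2 = status, $3 = updated_at (epoch s), $4 = now (epoch s)
reaper_should_block() {
  local assignee="$1" status="$2" updated_epoch="$3" now_epoch="$4"
  local max_age=$(( 30 * 60 ))   # 30 minutes, in seconds
  [ "$assignee" = "agent" ] || return 1
  [ "$status" = "in_progress" ] || return 1
  [ $(( now_epoch - updated_epoch )) -gt "$max_age" ]
}

# Example: a bead last touched 31 minutes ago is reaped.
reaper_should_block agent in_progress 0 1860 && echo "would block"
```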
Every `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner`
agent never trips the reaper — it notes progress as it works. A `blocked`
bead is a signal that:
- the agent pod crashed mid-run (`kubectl -n claude-agent delete pod` test),
- the job hit its 15-minute budget timeout inside `claude-agent-service`
without notes (rare — the agent usually notes failure before exiting),
- `claude-agent-service` was restarted during the run (in-memory job state
is lost; see [known risks](#known-risks)).
Recovery: read the reaper note, reopen manually if appropriate:
```
bd update <id> -s open
bd assign <id> agent # re-arm for next dispatcher tick
```
## Design choices
- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
client can set it (`bd assign <id> agent`).
- **One-bead-per-tick dispatch** — the dispatcher submits at most one bead
per 2-min tick, gating on `claude-agent-service`'s `/health` `busy` flag.
`busy` now means `active >= capacity` (bounded semaphore, default 10) — the
service no longer single-flight-locks via `asyncio.Lock`. So up to
~`capacity` beads can run concurrently; the 2-min poll cadence (not
single-slot execution) now bounds ramp-up.
- **Fixed agent (`beads-task-runner`)** — read-only rails, matches BeadBoard's
manual Dispatch button. Broader-privilege agents stay manual.
- **CronJob (not in-service polling, not n8n)** — matches existing infra
pattern (OpenClaw task-processor, certbot-renewal, backups), TF-managed,
easy to pause.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
the image-seeded file. The CronJob's init step copies it into `/tmp/.beads/`
because `bd` may touch the parent directory and ConfigMap mounts are
read-only.
## Known risks
- **In-memory job state in `claude-agent-service`** — if the pod restarts
mid-run, the job record is lost. The reaper catches this after 30 min.
Persistent job store is deferred.
- **Prompt injection via bead fields** — a malicious bead description could
try to steer the agent. The `beads-task-runner` rails + token budget +
timeout are the defense. Identical exposure as the manual Dispatch button.
- **Image tag drift** — `claude_agent_service_image_tag` in
`stacks/beads-server/main.tf` mirrors `local.image_tag` in
`stacks/claude-agent-service/main.tf`. Bump both when the image rebuilds,
or the dispatcher/reaper will run on an older layer. (They only need
`bd`, `curl`, `jq` — stable across rebuilds — so the drift is low-risk.)
- **`bd` JSON schema changes** — the reaper's `jq` reads `.id` and
`.updated_at`. If a future `bd` upgrade renames these, the reaper breaks
silently (no reaping, no alert). `BD_VERSION` is pinned in the image
Dockerfile.
## Verification after change
```
# Both CronJobs exist with the right schedule / SUSPEND state
kubectl -n beads-server get cronjob
# End-to-end smoke test
bd create "auto-dispatch smoke test" \
-d "Read /etc/hostname inside the agent sandbox and close." \
--acceptance "bd note includes 'hostname=' and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '.notes'
# → contains 'auto-dispatcher claimed' + 'dispatched: job=<uuid>'
```


@ -0,0 +1,211 @@
# Runbook — chrome-service snapshot pipeline
Operational playbook for the hourly cookie-snapshot pipeline that warms
external Claude Code sessions on the dev box. Architecture in
`architecture/chrome-service.md`.
## At a glance
| Component | Where | When | What |
|---|---|---|---|
| chrome-service Deployment | `chrome-service` ns | always-on | headed chromium, CDP :9222, persistent /profile/chromium-data |
| snapshot-server sidecar | same pod | always-on | serves `/api/snapshot`, bearer-gated, port 8088 |
| snapshot-harvester CronJob | `chrome-service` ns | `23 * * * *` | dumps `storage_state()` via CDP → `/profile/snapshots/storage-state.json` |
| dev-box refresh timer | each dev box | hourly | curls `chrome.viktorbarzin.me/api/snapshot` → `~/.cache/playwright-shared-storage-state.json` |
| dev-box `playwright-mcp.service` | each dev box | always-on | `@playwright/mcp --isolated --storage-state=…` per-MCP-connection contexts |
## Day-to-day
### Log into a new site (warm the profile)
1. Open `https://chrome.viktorbarzin.me/` (Authentik will gate).
2. The noVNC view of the in-cluster headed chromium loads. Click on the
browser window, navigate, log in.
3. Cookies land in `/profile/chromium-data/Default/Cookies` on the PVC.
4. Within ≤60 min, the snapshot-harvester CronJob picks them up and
writes the snapshot. Within ≤60 min after that, dev boxes pull the
new file. New Claude Code sessions see the new cookies.
5. To skip the wait: trigger the harvester now (next section).
### Trigger snapshot harvester manually
```bash
kubectl -n chrome-service create job \
--from=cronjob/chrome-service-snapshot-harvester \
snapshot-harvest-$(date +%s)
# Watch logs
kubectl -n chrome-service logs -f -l job-name=$(kubectl -n chrome-service get jobs --sort-by=.metadata.creationTimestamp -o name | tail -1 | cut -d/ -f2)
```
Expected: `wrote snapshot (… bytes) to /profile/snapshots/storage-state.json`.
### Trigger dev-box refresh manually
```bash
# On the dev box, as the user whose Claude Code sessions need the new state:
systemctl --user start playwright-snapshot-refresh.service
# Or directly:
/usr/local/bin/playwright-snapshot-refresh
# Verify
ls -la ~/.cache/playwright-shared-storage-state.json
```
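
To go one step beyond `ls -la`, a freshness check can confirm the file is non-empty and recent. A minimal sketch, assuming a 120-minute budget (one hourly harvest plus one hourly pull); `snapshot_is_fresh` is a hypothetical helper and relies on `find -mmin`, which GNU and BSD `find` both provide.

```bash
# Hypothetical helper: succeeds if the file exists, is non-empty, and was
# modified less than max_min minutes ago.
snapshot_is_fresh() {
  local file="$1" max_min="$2"
  [ -s "$file" ] || return 1
  [ -n "$(find "$file" -mmin "-$max_min" 2>/dev/null)" ]
}

# Usage on the dev box (not run here):
#   snapshot_is_fresh ~/.cache/playwright-shared-storage-state.json 120 \
#     || echo "snapshot missing or stale"
```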
### Inspect the current snapshot
```bash
# In-cluster (from any host with kubectl access; execs into the chrome-service pod):
kubectl -n chrome-service exec deploy/chrome-service -c snapshot-server -- \
cat /profile/snapshots/storage-state.json | jq '.cookies | length'
# Externally (via the bearer-gated endpoint):
TOKEN=$(vault kv get -field=api_bearer_token secret/chrome-service)
curl -fsSL -H "Authorization: Bearer $TOKEN" \
https://chrome.viktorbarzin.me/api/snapshot | jq '.cookies | length'
```
## Failure modes
### "no browser contexts found"
The harvester reports `no browser contexts found — chrome-service may
not have launched a persistent context yet` and exits non-zero.
**Cause**: chromium just started and hasn't created its default context
yet, or it crashed.
**Fix**: check chrome-service pod logs (`kubectl -n chrome-service logs
deploy/chrome-service -c chrome-service`). The next hourly run will
retry. If chromium is wedged: `kubectl -n chrome-service rollout restart
deploy/chrome-service` (strategy = Recreate, brief downtime).
### "connect_over_cdp failed"
Harvester or any in-cluster caller can't reach the CDP endpoint.
**Cause**: chrome-service pod not Ready, NetworkPolicy doesn't admit
the caller's namespace, or chromium isn't listening on :9222.
**Diagnose**:
```bash
kubectl -n chrome-service get pods
kubectl -n chrome-service describe networkpolicy chrome-service-ws-ingress
# From inside the cluster (e.g. a debug pod in chrome-service ns):
nc -zv chrome-service.chrome-service.svc.cluster.local 9222
curl -fsSL http://chrome-service.chrome-service.svc.cluster.local:9222/json/version
```
**Fix**: depends on the diagnosis. NetworkPolicy needs the caller's
namespace label or an explicit name-fallback. If chromium isn't
binding, check the container logs.
### Dev-box `playwright-snapshot-refresh` returns 401
The bearer token in `~/.config/playwright/token` doesn't match the
server's. Almost always means the Vault secret was rotated and the
local cache is stale.
**Fix**:
```bash
vault login -method=oidc # if needed
vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
chmod 600 ~/.config/playwright/token
systemctl --user start playwright-snapshot-refresh.service
```
### Dev-box `playwright-snapshot-refresh` returns 404 with "snapshot not yet available"
The harvester hasn't run successfully yet (fresh cluster, or all
recent runs failed). Trigger it manually (see "Trigger snapshot
harvester manually").
### Claude Code sessions still see old cookies
The MCP server reads the snapshot file at process start and seeds each
new context with it. **Existing MCP sessions don't hot-reload** — they
keep the cookies they were seeded with at session start. New sessions
get the fresh snapshot.
**Fix**: restart the MCP server on the dev box to pick up the new file:
```bash
systemctl --user restart playwright-mcp.service
```
### Snapshot file is suspiciously small or empty cookies array
The persistent chromium context isn't holding any cookies. Probably
means the user hasn't logged into anything via noVNC, or chromium was
relaunched without preserving `/profile/chromium-data`.
**Diagnose**:
```bash
kubectl -n chrome-service exec deploy/chrome-service -c chrome-service -- \
ls -la /profile/chromium-data/Default/Cookies
```
A populated `Cookies` SQLite file should be several hundred KB once
real logins exist. If it's missing or empty, log in via noVNC.
## Token rotation
```bash
# Rotate Vault secret (32-byte URL-safe random).
vault kv put secret/chrome-service \
api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')
# Reloader auto-restarts chrome-service pod (snapshot-server picks up new token).
# On EVERY dev box that pulls the snapshot:
vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
chmod 600 ~/.config/playwright/token
# Verify the next refresh succeeds:
systemctl --user start playwright-snapshot-refresh.service
journalctl --user -u playwright-snapshot-refresh.service -n 20
```
## Restore from a backup tarball
The 6-hourly backup CronJob writes `tar -czf /backup/YYYY_MM_DD_HH.tar.gz
-C /profile .` to NFS at `/srv/nfs/chrome-service-backup/`. To restore
the entire profile:
```bash
# 1. Scale chrome-service down so its lock is released.
kubectl -n chrome-service scale deploy/chrome-service --replicas=0
# 2. Mount the PVC in a helper pod and restore.
kubectl -n chrome-service apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata: {name: restore-helper, namespace: chrome-service}
spec:
  containers:
    - name: helper
      image: alpine:3.20
      command: [sleep, infinity]
      volumeMounts:
        - {name: profile, mountPath: /profile}
        - {name: backup, mountPath: /backup, readOnly: true}
  volumes:
    - name: profile
      persistentVolumeClaim: {claimName: chrome-service-profile-encrypted}
    - name: backup
      persistentVolumeClaim: {claimName: chrome-service-backup-host}
  restartPolicy: Never
EOF
kubectl -n chrome-service wait --for=condition=ready pod/restore-helper
kubectl -n chrome-service exec restore-helper -- sh -c '
rm -rf /profile/chromium-data /profile/snapshots &&
tar -xzf /backup/2026_06_04_18.tar.gz -C /profile
'
# 3. Cleanup helper, scale chrome-service back up.
kubectl -n chrome-service delete pod restore-helper
kubectl -n chrome-service scale deploy/chrome-service --replicas=1
```


@ -0,0 +1,122 @@
# Runbook — PVE R730 fan-control daemon
Presence-aware IPMI fan controller on the PVE host (192.168.1.127). Runs the
CPU cool when the garage is empty, quiet when someone's in the garage. Design:
`infra/docs/plans/2026-06-04-pve-fan-control-design.md`.
## What it is
- `/usr/local/bin/fan-control` — bash daemon (source: `infra/scripts/fan-control.sh`).
- `fan-control.service` — systemd unit (`Type=simple`, restarts on failure).
- `/etc/fan-control.env` — config incl. the ha-sofia token (chmod 600, not in git).
## HA control (Home Assistant)
Drive the fans from **dashboard-it → "Server" view → Fans**. The view is
deliberately minimal — it shows the current **fan speed** (% of capacity +
absolute RPM) and two controls:
- **Override %** (`input_number.r730_fan_manual_pct`) — the fan % to hold. While
**unlocked** it continuously mirrors the live commanded fan %, so it always
shows the actual *absolute* speed and updates as the fan moves (NOT a stale
value or a delta) — `automation.r730_fan_override_track_live_speed_while_unlocked`
syncs it to `sensor.r730_fan_control_target` (guarded to ignore
unavailable/unknown). While **locked** it stops tracking and becomes your
editable setpoint. A readout under the slider shows the live `% · rpm`.
- **Lock — freeze speed** (`input_boolean.r730_fan_lock`) — turn the algorithm
off and hold a fixed speed. Toggling it **ON** snapshots the *current*
commanded % into Override and switches the daemon to `manual`
(`automation.r730_fan_lock_freeze_current_speed_resume_algo`); toggling it
**OFF** switches back to `auto`, resuming the presence curve. Fine-tune the
held % with Override while locked. A 🔒 reminder appears on the view while
locked.
Under the hood the daemon still reads `input_select.r730_fan_mode`
(auto/cool/quiet/manual) + `input_number.r730_fan_manual_pct` each loop; the Lock
toggle just drives `mode` between `manual` (locked) and `auto` (unlocked).
`cool`/`quiet` remain valid modes if set directly (via the entity) but are no
longer surfaced on the simplified dashboard. `CEILING` (83 °C) still overrides
everything → Dell auto, **even when locked**. A stale non-`auto` mode left while
*unlocked* still auto-reverts to `auto` after 60 min
(`automation.r730_fan_mode_auto_revert`, now a dormant safety net). An HA change
is applied within one daemon loop (~15 s).
Monitoring sensors on the same view: `sensor.r730_fan_speed` (redfish exporter),
`sensor.r730_fan_control_target` + `sensor.r730_fan_control_mode` +
`sensor.r730_fan_power_est` (Pushgateway). Fan **% and RPM are merged into one
"Fan speed" card** (the two had identical trend shapes) — the % trend comes from
the stable Pushgateway sensor, while RPM reads `sensor.r730_fan_speed` but **falls
back to a calibrated estimate (shown with a `~` prefix) whenever the Redfish
sensor is `unavailable`** (it blips out intermittently), so the readout never goes
blank. `r730_fan_power_est` is an ESTIMATE of
total fan power (the iDRAC reports no per-fan power) — modelled from RPM via the
fan affinity law (∝ RPM³), calibrated to the power sweep (~2 W floor → ~99 W full).
The HA objects (helpers, the auto-revert automation, the REST sensors in
`rest_resources/{idrac_redfish_exporter,fan_control}.yaml`, and the dashboard
cards) live on **ha-sofia** and are auto-git-tracked there by the version-control
add-on — they are NOT in this repo.
## Quick status
```bash
ssh root@192.168.1.127 systemctl status fan-control
ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager'
ssh root@192.168.1.127 'ipmitool sdr type fan | grep ^Fan1; ipmitool sdr type temperature | grep "^Temp "'
```
Log lines look like `temp=60C ha_mode=auto eff=cool fan=50% (was 70%)`
(`ha_mode` = the HA setpoint; `eff` = the effective curve applied).
## Disable / roll back to stock firmware control
```bash
ssh root@192.168.1.127 'systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01'
```
The unit's `ExecStopPost` already restores Dell auto on stop, so the explicit
`raw ... 0x01` is belt-and-suspenders. The box is back to its stock curve.
## Tune
Edit `/etc/fan-control.env` on the host, then `systemctl restart fan-control`.
Common knobs:
- `HOLD_SECS` — how long to stay quiet after the garage door last moved (default 900 = 15 min).
- `CEILING` — temp at which we abandon manual control and let the firmware take over (default 83).
- Curve shape: **linear anchors** near the top of the script — `COOL_T_LO/COOL_P_LO/COOL_T_HI/COOL_P_HI` (default 50°C/30% → 83°C/100%) and `QUIET_*` (68°C/20% → 83°C/100%); fan% interpolates linearly between them (replaced the old discrete step-bands). `MIN_STEP` (default 3%) = smallest fan-% change worth an IPMI write (anti-jitter); `DEADBAND` (3°C) = ease-down hysteresis. Lower `COOL_P_HI` or raise `COOL_T_HI` to run the top end quieter; steepen by raising `COOL_P_LO` / lowering `COOL_T_LO`.
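
The linear-anchor curve and the `MIN_STEP` gate can be sketched in integer shell arithmetic. This is an illustration under assumptions, not the daemon's code: the function names are hypothetical, temperatures outside the anchors are clamped to the nearest anchor, and the `CEILING` hand-off to Dell auto is not modelled.

```bash
# Hypothetical helper: interpolate fan % between two (temp, pct) anchors.
#   $1 = temp C, $2 = T_LO, $3 = P_LO, $4 = T_HI, $5 = P_HI
fan_pct() {
  local temp="$1" t_lo="$2" p_lo="$3" t_hi="$4" p_hi="$5"
  if [ "$temp" -le "$t_lo" ]; then echo "$p_lo"; return 0; fi
  if [ "$temp" -ge "$t_hi" ]; then echo "$p_hi"; return 0; fi
  echo $(( p_lo + (temp - t_lo) * (p_hi - p_lo) / (t_hi - t_lo) ))
}

# Hypothetical helper: succeeds when the change is at least MIN_STEP points,
# i.e. large enough to justify an IPMI write.
worth_writing() {
  local old="$1" new="$2" min_step="$3"
  local diff=$(( new - old ))
  [ "${diff#-}" -ge "$min_step" ]   # ${diff#-} strips the sign
}

fan_pct 61 50 30 83 100   # cool curve at 61 C, prints 53
fan_pct 75 68 20 83 100   # quiet curve at 75 C, prints 57
```

Integer division truncates, so results land within one point of the exact line; that is well inside the default 3% `MIN_STEP`.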
## Deploy / update
```bash
cd infra
scp scripts/fan-control.sh root@192.168.1.127:/usr/local/bin/fan-control
ssh root@192.168.1.127 chmod +x /usr/local/bin/fan-control
scp scripts/fan-control.service root@192.168.1.127:/etc/systemd/system/fan-control.service
# first install only — create /etc/fan-control.env from fan-control.env.example with the HA token
ssh root@192.168.1.127 'systemctl daemon-reload && systemctl restart fan-control'
```
## HA token
`/etc/fan-control.env` holds a long-lived ha-sofia token used to read
`sensor.garage_door_state_bg`. Mint via Home Assistant → Profile → Security →
Long-lived access tokens, or reuse the existing ha-sofia token. If the token is
missing/empty, the daemon still runs but **COOL-only** (no quiet mode) and logs
`ha_reachable=0`.
## Symptoms & checks
| Symptom | Check |
|---------|-------|
| Fans stuck loud | `journalctl -u fan-control` — is `mode=fallback`? (ceiling breach or IPMI fail). Check CPU temp. |
| Never goes quiet | Token valid? `curl -H "Authorization: Bearer $TOKEN" http://192.168.1.8:8123/api/states/sensor.garage_door_state_bg`. Garage door reporting? |
| Fans flapping | Increase `DEADBAND`. |
| Service won't start | `systemctl status fan-control`; check `ipmitool` works: `ipmitool sdr type temperature`. |
| Box left in manual after crash | `ipmitool raw 0x30 0x30 0x01 0x01` to force Dell auto. |
## Verify presence wiring
```bash
# one iteration, real IPMI + HA, no daemon loop:
ssh root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control'
```
With the garage closed for >15 min you should see `mode=cool`; within 15 min of
the door moving, `mode=quiet`.
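
The hold-window rule behind that expectation can be sketched as follows. It assumes the daemon knows when the door last moved as an epoch timestamp and treats exactly `HOLD_SECS` as expired; `effective_curve` is a hypothetical name and the real script may derive presence differently.

```bash
# Hypothetical helper: pick the curve from the time since the door last moved.
#   $1 = last door movement (epoch s), $2 = now (epoch s), $3 = HOLD_SECS
effective_curve() {
  local last_move="$1" now="$2" hold_secs="${3:-900}"
  if [ $(( now - last_move )) -lt "$hold_secs" ]; then
    echo quiet   # someone was in the garage recently
  else
    echo cool    # garage has been idle for the whole hold window
  fi
}

effective_curve 0 600    # 10 min after the door moved, prints quiet
effective_curve 0 1200   # 20 min after the door moved, prints cool
```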


@ -0,0 +1,126 @@
# Runbook: Forgejo registry break-glass — recovering infra-ci
Last updated: 2026-05-07
## When to use this runbook
When **all** of the following are true:
1. Forgejo (`forgejo.viktorbarzin.me`) is unreachable.
2. `registry-private` is also gone (post-Phase 4 of the consolidation),
so you can't fall back to `registry.viktorbarzin.me:5050/infra-ci`.
3. You need to run an infra Woodpecker pipeline (apply, build-cli,
drift-detection, etc.) — but those pipelines pull `infra-ci` and
crash because the registry is down.
If only Forgejo is down but `registry-private` is still alive, the
pipelines work — `image:` references in `infra/.woodpecker/*.yml`
still hit `registry.viktorbarzin.me:5050/infra-ci` until Phase 3
flips them. Skip this runbook entirely.
## What's available
The `build-ci-image.yml` Woodpecker pipeline saves a tarball after
each successful push:
| Location | Path |
|---|---|
| Registry VM disk (10.0.20.10) | `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` |
| Registry VM disk (latest symlink) | `/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz` |
| Synology NAS (offsite copy via daily-backup sync) | `/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/` |
The registry VM keeps the last 5 tarballs. Synology mirrors them
through the existing offsite-sync-backup job (`/usr/local/bin/
offsite-sync-backup`).
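
A keep-last-N retention like the one described can be sketched as below. This is not necessarily how `build-ci-image.yml` prunes; `prune_tarballs` is a hypothetical helper that assumes file names without whitespace and leaves the `-latest` symlink untouched.

```bash
# Hypothetical helper: keep the newest N infra-ci tarballs in a directory.
#   $1 = directory, $2 = number of tarballs to keep
prune_tarballs() {
  local dir="$1" keep="$2" n=0 f
  # ls -t lists newest first; names are assumed to contain no whitespace.
  for f in $(ls -1t "$dir"/infra-ci-*.tar.gz 2>/dev/null); do
    [ -L "$f" ] && continue          # skip the infra-ci-latest symlink
    n=$(( n + 1 ))
    [ "$n" -gt "$keep" ] && rm -f -- "$f"
  done
  return 0
}

# Usage on the registry VM (not run here):
#   prune_tarballs /opt/registry/data/private/_breakglass 5
```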
## Recovery procedure
The goal is to get a working `infra-ci` image onto a k8s node so
Woodpecker pods can run it. Then run a Woodpecker pipeline that
restores Forgejo from PVC backup or rebuilds it.
### Step 1 — copy the tarball to a node
From your workstation (the registry VM is reachable but Forgejo is
not — the rest of the cluster might be in a similar partial state):
```bash
ssh wizard@10.0.20.103 # any responsive k8s node
sudo mkdir -p /var/breakglass
sudo scp root@10.0.20.10:/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz \
/var/breakglass/
```
If the registry VM is also down, fall back to Synology:
```bash
sudo scp 192.168.1.13:/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/infra-ci-latest.tar.gz \
/var/breakglass/
```
### Step 2 — load into containerd
`docker load` won't help on a k8s node — it loads into the docker
daemon, which kubelet/containerd doesn't see. Use `ctr`:
```bash
sudo ctr -n k8s.io images import /var/breakglass/infra-ci-latest.tar.gz
sudo ctr -n k8s.io images list | grep infra-ci
```
Confirm the image is tagged with the original repository name
(`registry.viktorbarzin.me:5050/infra-ci:<sha>` — the tarball was
saved with that tag, NOT the Forgejo name).
### Step 3 — pin pods to this node
Add a node selector or taint-toleration to whatever pipeline you
need to run. Simplest: cordon the other nodes briefly so Woodpecker
schedules onto this one.
```bash
for n in $(kubectl get nodes -o name | grep -v "$(hostname)"); do
  kubectl cordon "${n#node/}"
done
```
Run the pipeline. After it completes:
```bash
for n in $(kubectl get nodes -o name); do
  kubectl uncordon "${n#node/}"
done
```
### Step 4 — fix the underlying problem
The pipeline you just ran was meant to restore Forgejo. Common
options:
- **Forgejo PVC corrupt** — `docs/runbooks/forgejo-registry-rebuild-image.md`
walks through PVC restore from LVM snapshot or PVE backup.
- **Forgejo OOM-loop** — bump memory request+limit in
`infra/stacks/forgejo/main.tf` and apply.
- **Forgejo unreachable due to network** — check Traefik, MetalLB,
pfSense.
Once Forgejo is back, run `build-ci-image.yml` manually so the
tarball regenerates with the latest commit.
## Why this exists
The 2026-04-19 post-mortem on the registry-orphan-index incident
showed that a single registry going corrupt could block ALL infra
pipelines (because every pipeline pulls `infra-ci` from that
registry). The dual-push to Forgejo + registry-private removes that
single-point-of-failure during the bake. After Phase 4
decommissions registry-private, the tarball is the last line of
defense.
## Why on the registry VM and not in-cluster
The Forgejo pod and registry-private pod both depend on cluster
networking + storage. The registry VM is an independent
non-clustered VM with local storage. If the cluster is in a bad
state, the VM's disk is still readable from any other host on the
LAN.


@ -0,0 +1,128 @@
# Runbook: Rebuild an Image on the Forgejo OCI Registry
Last updated: 2026-05-07
## When to use this
Pipelines pulling from `forgejo.viktorbarzin.me/viktor/<image>` fail with:
- `failed to resolve reference … : not found`
- `manifest unknown`
- HEAD on a manifest/blob digest returns 404
- `forgejo-integrity-probe` CronJob in `monitoring` reports
`registry_manifest_integrity_failures > 0` for
`instance="forgejo.viktorbarzin.me"`
This is the Forgejo equivalent of the registry-private orphan-index
failure mode (`docs/post-mortems/2026-04-19-registry-orphan-index.md`).
Cause is usually package-version delete races with an in-flight pull,
or PVC corruption. Fix is to rebuild the image from source and
re-push, so Forgejo receives a complete, fresh upload.
If the symptom is different (Forgejo unreachable, PVC OOM,
authentication failure), use:
- `docs/runbooks/forgejo-registry-setup.md` for auth + token issues
- `docs/runbooks/forgejo-registry-breakglass.md` if Forgejo + the
cluster are both unreachable
- `docs/runbooks/restore-pvc-from-backup.md` for PVC corruption
## Phase 1 — Confirm the diagnosis
From any host:
```sh
REG=forgejo.viktorbarzin.me
USER=cluster-puller
PASS="$(vault kv get -field=forgejo_pull_token secret/viktor)"
IMAGE=viktor/payslip-ingest
TAG=latest
# 1. Confirm the manifest exists at all.
curl -sk -u "$USER:$PASS" \
-H 'Accept: application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json' \
"https://$REG/v2/$IMAGE/manifests/$TAG" | jq '.mediaType, .manifests[].digest // .config.digest'
# 2. HEAD each child manifest digest from the index (config and layer digests can be
#    checked the same way against /v2/$IMAGE/blobs/<digest>). Any non-200 = confirmed.
for d in $(curl -sk -u "$USER:$PASS" -H 'Accept: application/vnd.oci.image.index.v1+json' \
"https://$REG/v2/$IMAGE/manifests/$TAG" | jq -r '.manifests[].digest // empty'); do
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/$IMAGE/manifests/$d")
echo "$d → $code"
done
```
The log of the probe's most recent run is also a fast way to see what's affected:
```sh
kubectl -n monitoring logs \
  $(kubectl -n monitoring get pods -l job-name \
      --sort-by=.metadata.creationTimestamp -o name \
    | grep forgejo-integrity-probe | tail -1)
```
## Phase 2 — Rebuild and re-push
Forgejo lets you delete a specific package version through the API.
Doing this **before** the rebuild ensures the new push doesn't
collide with the half-broken existing entry.
```sh
# Delete the broken version (replace TAG with the actual tag).
curl -X DELETE -H "Authorization: token $(vault kv get -field=forgejo_cleanup_token secret/viktor)" \
  "https://$REG/api/v1/packages/viktor/container/$(basename "$IMAGE")/$TAG"
```
Rebuild via Woodpecker (manual run if the pipeline isn't triggered
by a code change):
1. Open `https://ci.viktorbarzin.me/repos/<repo>/manual` for the
project.
2. Click **Run pipeline** with `branch=master`.
3. Wait for the build-and-push step to complete.
4. Confirm the new version is visible in Forgejo Web UI under
`viktor/<image>` → Packages → Container.
## Phase 3 — Restart consumers
Pods that already cached the broken digest may continue using it.
Force a fresh pull:
```sh
kubectl rollout restart deploy/<service> -n <ns>
```
If the pod still fails, the new manifest digest may not have
propagated through containerd's cache. Drain + restart containerd on
the affected node:
```sh
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
ssh wizard@<node> sudo systemctl restart containerd
kubectl uncordon <node>
```
## Phase 4 — Verify integrity recovery
The next probe run (every 15 min) will report:
```
registry_manifest_integrity_failures{instance="forgejo.viktorbarzin.me"} 0
```
The `RegistryManifestIntegrityFailure` alert resolves automatically
30 minutes after the metric goes back to 0.
## Why this happens
Forgejo's OCI registry stores blobs in its own DB+filesystem. Unlike
`registry:2` + `distribution`, it doesn't have the
[`distribution#3324`](https://github.com/distribution/distribution/issues/3324)
GC-vs-tag-delete race. But it can still reach a broken state if:
- The retention CronJob deletes a version while a pull is in flight
on the same digest.
- The PVC fills up mid-push (`docs/runbooks/restore-pvc-from-backup.md`).
- A Forgejo upgrade migrates the package schema and a row is dropped.
In all cases the recovery procedure is identical: delete the broken
version through the API, rebuild from source, force consumers to
re-pull.

View file

@ -0,0 +1,163 @@
# Runbook: Forgejo OCI registry — initial setup
Last updated: 2026-05-07
This runbook covers the **one-time** bootstrap of Forgejo's container
registry, executed during Phase 0 of the registry consolidation plan
(`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md`).
After this runbook is complete, the Forgejo OCI registry at
`forgejo.viktorbarzin.me` accepts pushes from CI and pulls from the
cluster, with retention and integrity monitoring in place.
## Order of operations
The Terraform stacks reference Vault keys that don't exist on a fresh
cluster. Create the keys **before** running `scripts/tg apply`.
1. Apply the resource bumps (memory, PVC, ingress body size,
packages env vars) — these don't depend on the new Vault keys.
2. Create the service-account users + PATs in Forgejo.
3. Push the PATs to Vault.
4. Apply the rest of Phase 0 (registry-credentials extension,
monitoring probe, retention CronJob).
### Step 1 — apply Forgejo deployment bumps
```bash
cd infra/stacks/forgejo
scripts/tg apply
```
Wait for the new pod to come up at the bumped 1Gi memory request and
the resized 15Gi PVC. Verify packages are enabled:
```bash
kubectl exec -n forgejo deploy/forgejo -- forgejo manager flush-queues
kubectl exec -n forgejo deploy/forgejo -- env | grep PACKAGES
```
### Step 2 — create service-account users
`forgejo admin user create` is not idempotent: re-running it on an existing
user errors out. That's fine; skip it on rerun. `--must-change-password=false`
keeps Forgejo from forcing a password change at first login.
```bash
# cluster-puller — read:package PAT for in-cluster pulls.
kubectl exec -n forgejo deploy/forgejo -- \
forgejo admin user create \
--username cluster-puller \
--email cluster-puller@viktorbarzin.me \
--password "$(openssl rand -base64 24)" \
--must-change-password=false
# ci-pusher — write:package PAT for CI dual-push, also reused as the
# cleanup CronJob credential (write:package includes delete).
kubectl exec -n forgejo deploy/forgejo -- \
forgejo admin user create \
--username ci-pusher \
--email ci-pusher@viktorbarzin.me \
--password "$(openssl rand -base64 24)" \
--must-change-password=false
```
The user passwords are throwaway — we only ever auth via PAT. Forgejo
admin can reset them at any time from the Web UI.
### Step 3 — generate the PATs
PATs **must** be generated through the Web UI logged in as the
respective user (the CLI doesn't expose token creation). To log in
without OAuth (registration is disabled for everyone except `viktor`,
the admin), use the per-user temporary password from Step 2. Those
passwords are generated inline and never printed, so reset each one
from the admin Web UI first if you did not record it.
For each of `cluster-puller` and `ci-pusher`:
1. Sign out of `viktor`.
2. Go to `https://forgejo.viktorbarzin.me/user/login` and sign in
with the throwaway password.
3. Settings → Applications → Generate new token.
4. Name: `cluster-pull` / `ci-push`. **Expiration: never.**
5. Scopes:
- `cluster-puller`: `read:package`
- `ci-pusher`: `write:package` (covers read+write+delete)
6. Save the token shown on the next page — it is **not** displayed again.
For the cleanup CronJob, generate a third PAT on `ci-pusher`:
7. Repeat steps 3-6 with name `cleanup`, scope `write:package`.
### Step 4 — push PATs to Vault
```bash
vault login -method=oidc
# Read-only, used by the cluster-wide registry-credentials Secret and
# by the Forgejo integrity probe.
vault kv patch secret/viktor \
forgejo_pull_token=<paste cluster-puller PAT>
# Write+delete, used by the retention CronJob inside Forgejo's
# namespace.
vault kv patch secret/viktor \
forgejo_cleanup_token=<paste ci-pusher cleanup PAT>
# Write, propagated by vault-woodpecker-sync to all Woodpecker repos.
vault kv patch secret/ci/global \
forgejo_user=ci-pusher \
forgejo_push_token=<paste ci-pusher push PAT>
```
### Step 5 — apply the rest of Phase 0
```bash
# Registry credential Secret (now reads forgejo_pull_token).
# Each apply runs in a subshell so the next cd starts from the same directory.
(cd infra/stacks/kyverno && scripts/tg apply)
# Monitoring probe + retention CronJob.
(cd infra/stacks/monitoring && scripts/tg apply)
(cd infra/stacks/forgejo && scripts/tg apply)
# Containerd hosts.toml on each existing k8s node — VM cloud-init
# only fires on first boot.
infra/scripts/setup-forgejo-containerd-mirror.sh
```
## Verification
```bash
# Login from a workstation with docker.
echo "<ci-pusher PAT>" | docker login forgejo.viktorbarzin.me -u ci-pusher --password-stdin
# Push a smoketest image.
docker pull alpine:3.20
docker tag alpine:3.20 forgejo.viktorbarzin.me/viktor/smoketest:1
docker push forgejo.viktorbarzin.me/viktor/smoketest:1
# Pull from a k8s node.
ssh wizard@<node> sudo crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1
# Confirm the cluster-wide Secret was synced into a fresh namespace.
kubectl create namespace forgejo-smoketest
kubectl get secret -n forgejo-smoketest registry-credentials \
-o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths | keys'
# Expect: ["10.0.20.10:5050", "forgejo.viktorbarzin.me",
# "registry.viktorbarzin.me", "registry.viktorbarzin.me:5050"]
kubectl delete namespace forgejo-smoketest
# Delete the smoketest package via API.
curl -X DELETE -H "Authorization: token <ci-pusher cleanup PAT>" \
https://forgejo.viktorbarzin.me/api/v1/packages/viktor/container/smoketest/1
```
## When to revisit
- **PAT rotation**: PATs created here have no expiry by design. If a
PAT leaks, regenerate via the Web UI and `vault kv patch` the new
value into the same key. The next `terragrunt apply` syncs it to
in-cluster consumers within minutes (the Kyverno ClusterPolicy clones
the Secret); Woodpecker repos pick it up on the next
vault-woodpecker-sync run (every 6h).
- **New service account**: if a future workload needs different
scopes, add a parallel user/PAT here rather than expanding existing
PAT scope. Principle of least privilege.

View file

@ -0,0 +1,47 @@
# Runbook: Grow `/srv/nfs` LV (`pve/nfs-data`)
Use when `/srv/nfs` on the PVE host is filling up and the workloads writing to it cannot be slimmed down. The LV sits on the LVM-thin pool `pve/data` (10.54 TB total). Thin-pool free space is the real gate — confirm before extending.
## When to use
- `df -h /srv/nfs` shows usage > ~85 % and projected growth exceeds free space within a backup retention window.
- An upcoming bulk write (media import, restore) needs headroom that the current free space won't absorb.
## Steps
1. **Check thin-pool headroom on PVE host:**
```bash
ssh root@192.168.1.127 'lvs pve/data; lvs pve/nfs-data; df -h /srv/nfs'
```
The `pve/data` thin pool's `Data%` should leave room for the extension (target `Data%` after extend < 90 %).
2. **Extend the LV and online-resize ext4:**
```bash
ssh root@192.168.1.127 '
lvextend -L +1T pve/nfs-data &&
resize2fs /dev/pve/nfs-data
'
```
Both commands are safe online: `lvextend` only grows allocation, `resize2fs` extends ext4 while mounted.
3. **Verify:**
```bash
ssh root@192.168.1.127 'lvs pve/nfs-data; df -h /srv/nfs'
```
`df` should show the new size; `Use%` should drop proportionally.
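The headroom check in step 1 can be turned into a guard that refuses to extend when the pool is already too full. This is a minimal sketch, assuming the percentage is read with `lvs --noheadings -o data_percent pve/data`; `pool_has_headroom` is a hypothetical name, and the 90 % default mirrors the target stated above. It checks the current `Data%` only, since extending a thin LV does not by itself change pool fill.

```bash
# Hypothetical guard: succeed only while the thin pool's Data% is below a limit.
# data_pct is the number printed by: lvs --noheadings -o data_percent pve/data
pool_has_headroom() {
  data_pct=$1
  limit=${2:-90}
  # awk handles the decimal comparison and tolerates leading whitespace.
  awk -v p="$data_pct" -v l="$limit" 'BEGIN { exit !(p + 0 < l + 0) }'
}
```

On the PVE host this could gate the extension, e.g. `pool_has_headroom "$(lvs --noheadings -o data_percent pve/data)" && lvextend -L +1T pve/nfs-data`.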
## Notes
- **Not Terraform-managed.** PVE host LVs live outside the IaC tree (no `infra/stacks/pve-host/`). Record the new size in `docs/architecture/storage.md` (the "HDD NFS" line and the diagram label) in the same commit.
- **Thin-pool overcommit warning** from `lvextend` is informational — it reports the sum of all thin volume virtual sizes (currently ~12 TiB) vs. the physical pool (10.7 TiB). Real fill is `pve/data` `Data%`; ignore the overcommit warning unless `Data%` itself is climbing toward 100 %.
- **`/srv/nfs-ssd`** lives on a separate LV (`ssd/nfs-ssd-data`) backed by SSDs — the same `lvextend`/`resize2fs` pattern applies, but the source pool is `ssd/data`.
## Backout
Online shrinks are unsafe with active workloads. Don't try to shrink `pve/nfs-data` in place — restore from snapshot or migrate data out and rebuild the LV instead.

View file

@ -0,0 +1,83 @@
# Runbook: Immich 4K video stutters on playback/download
## Symptom
High-resolution (4K) videos stutter when streamed in the Immich mobile app or
downloaded — for **both** local-LAN and remote-internet clients.
## Root cause (diagnosed 2026-06-01)
Immich's transcoding was set to `ffmpeg.targetResolution=original` with
`maxBitrate=0` (no cap) and `preset=ultrafast`. The GPU (NVENC) faithfully
re-encoded 4K sources to **4K H.264**, and `ultrafast` is so inefficient it
produced **77-264 Mbps** "optimized" files, often larger than the originals.
The mobile app streams that `encoded-video` copy. A 100 Mbps stream needs
~12.5 MB/s sustained. All Immich video lives on `/srv/nfs/immich/{library,encoded-video}`
→ `pve-nfs-data` LV → the **shared 7200rpm `sdc` thin pool** (same pool as every
VM disk + etcd), reached over inter-VLAN NFS. Measured: a single cold read got
42-54 MB/s, but under 3 concurrent reads it collapsed to 17-24 MB/s each, and
real seeky multi-user playback drops below the needed bitrate → buffer underrun.
Remotely, 100 Mbps simply exceeds typical home **upload** bandwidth.
So the "transcode" was making streaming *worse*, not better.
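The bandwidth arithmetic above is a plain bits-to-bytes conversion (divide by 8). A small sketch for checking other bitrates against the disk's measured throughput; `mbps_to_mbs` is a hypothetical helper name.

```bash
# Convert a video bitrate in Mbps (megabits/s) to the sustained read rate in MB/s.
mbps_to_mbs() {
  awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 8 }'
}

mbps_to_mbs 100   # the ~100 Mbps stream discussed above
mbps_to_mbs 20    # a 20 Mbps stream for comparison
```

The two calls print `12.5` and `2.5`, so a 20 Mbps stream needs one fifth of the sustained read rate of a 100 Mbps one.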
## Fix
Transcode config is **DB-managed** (`system_metadata` key `system-config`, JSONB —
NOT Terraform). Apply via the system-config API (broadcasts a live reload — no pod
restart). Keep 4K, cap the bitrate, use an efficient preset:
```
ffmpeg.maxBitrate : "0" -> "20000k" # ~20 Mbps cap (2.5 MB/s)
ffmpeg.preset : "ultrafast"-> "medium" # ~2-3x more efficient
ffmpeg.transcode : "required" -> "bitrate" # transcode anything >maxBitrate or non-h264
ffmpeg.targetResolution : "original" # unchanged — 4K preserved
ffmpeg.accel=nvenc, accelDecode=true # unchanged
```
GET the full config, change only these keys, PUT it back (preserves SMTP/OAuth
secrets). Admin API key works; `me@viktorbarzin.me`'s homepage-widget token in
`immich-secrets.homepage_credentials.immich.token` has admin write.
**Originals are never touched** — only the `encoded-video/` streaming copy changes.
## Apply the new policy to EXISTING videos
Config changes only affect new/missing transcodes. `videoConversion force=false`
("Missing") only fills assets lacking a transcode row; it does NOT re-touch existing
oversized ones. `force=true` ("All") re-does all ~11k (wasteful). To regenerate only
the **non-conforming** subset:
1. Identify offenders: existing `encoded_video` files whose bitrate > 20 Mbps.
Bitrate = filesize×8 ÷ `asset.duration` (codec/bitrate are NOT in the DB; size is
on disk, filename = `<assetId>.mp4`). ~3296 offenders / 268 GB on 2026-06-01.
2. Delete their derived rows (regenerable; never originals):
`DELETE FROM asset_file WHERE type='encoded_video' AND "assetId" = ANY(:offenders);`
This makes them "missing." The deterministic `<assetId>.mp4` path is overwritten on
regen (reclaims space).
3. Trigger `PUT /api/jobs/videoConversion {"command":"start","force":false}`.
**Gotcha (seen 2026-06-02):** the enqueue is an async background scan. If a prior
scan is still in-flight when you delete the rows, the freshly-missing assets get
MISSED and the queue drains early (only 11/3296 offenders were picked up on the
first pass). After the queue first reaches `waiting:0`, **re-trigger `force=false`
once while the queue is idle** and confirm the still-missing/offender count actually
dropped — a fresh scan enqueues anything missed.
4. Per-asset API (`POST /api/assets/jobs`) is owner-scoped (admin can't drive other
users' assets) — hence the delete-then-missing approach via the admin global job.
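The offender test in step 1 is simple arithmetic: bitrate = filesize × 8 ÷ duration. A minimal sketch, assuming the file size is in bytes and the duration has already been converted to seconds; the function names are hypothetical, and the 20 Mbps default matches the `20000k` cap.

```bash
# Hypothetical helpers: estimate bitrate from size and duration, then compare
# it with the cap. duration_seconds must be greater than 0.
bitrate_mbps() {   # args: size_bytes duration_seconds
  awk -v b="$1" -v d="$2" 'BEGIN { printf "%.1f\n", (b * 8) / d / 1000000 }'
}

is_offender() {    # args: size_bytes duration_seconds [cap_mbps]
  awk -v b="$1" -v d="$2" -v cap="${3:-20}" \
    'BEGIN { exit !((b * 8) / d / 1000000 > cap) }'
}
```

For example, a 150,000,000-byte file lasting 30 s works out to 40.0 Mbps and counts as an offender, while a 60,000,000-byte file of the same length (16 Mbps) does not.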
## Verify
- New output bitrate: `ffprobe -show_entries format=bit_rate` on a freshly-written
`encoded-video/*.mp4` → should be ≤ ~20 Mbps (was 77-264 Mbps).
- Progress: `SELECT count(*) FROM asset_file WHERE type='encoded_video';` rises as
regeneration proceeds.
## Monitor while it runs (concurrency 1, can take 1-3 days)
- `videoConversion` runs at concurrency **1** (Immich default; gentle — do NOT raise,
protects sdc). Thumbnail/metadata/library are capped to 2 for the same reason.
- Watch sdc (`iostat -x` on 192.168.1.127) and apiserver latency
(`kubectl get --raw=/healthz`). The risk is sdc saturation → etcd starvation →
apiserver down (precedent: `post-mortems/2026-05-25-immich-anca-elements-io-storm.md`).
Healthy baseline during this job: sdc ~70% util, apiserver <100 ms.
- Pause if it suffers: `PUT /api/jobs/videoConversion {"command":"pause"}`; resume with
`{"command":"resume"}`.
## Real fix for the root contention
This is mitigation. The durable fix is moving Immich video storage (or the VM disks)
off the shared `sdc` 7200rpm pool — tracked in beads `code-oflt` (IO isolation).

317
docs/runbooks/job-hunter.md Normal file
View file

@ -0,0 +1,317 @@
# Runbook: job-hunter — passive job + comp scraper
Last updated: 2026-06-02
`job-hunter` is a passive job-market + compensation scraper in the `job-hunter`
namespace. It pulls open roles from ATS boards (Greenhouse / Lever / Ashby),
HN "Who is hiring", and levels.fyi comp medians into a CNPG Postgres DB, and
serves agent-friendly CLI queries (used by the `job-hunter` Claude skill). As
of 2026-06-02 it also accumulates **dated snapshots** so comp and hiring-volume
trends can be tracked over time.
## Where things live
| Thing | Location |
|---|---|
| Source code | Forgejo `https://forgejo.viktorbarzin.me/viktor/job-hunter` (NOT in the monorepo) |
| Image | `forgejo.viktorbarzin.me/viktor/job-hunter:latest` (CI builds on push; Keel rolls the Deployment) |
| Terraform stack | `infra/stacks/job-hunter/` (`main.tf` = Deployment/Service/ESO; `cronjob.tf` = weekly refresh) |
| Database | `pg-cluster-rw.dbaas.svc.cluster.local:5432/job_hunter`, role `job_hunter` (Vault `static-creds/pg-job-hunter`, 7d rotation) |
| App secrets | Vault `secret/job-hunter` → `webhook_bearer_token`, `cdio_api_key`, `smtp_username/password`, `digest_to/from_address` |
| Grafana | `https://grafana.viktorbarzin.me` → datasource **Job Hunter** (PG, read-only) |
| Claude skill | `~/.claude/skills/job-hunter/SKILL.md` |
| Weekly scrape | CronJob `job-hunter-refresh`, **Sundays 04:00 UTC** |
## Architecture
- **Sources** (`job_hunter/sources/`): `ats` (Greenhouse/Lever/Ashby JSON APIs, ~35 companies in `config/companies.yaml`), `hn` (Algolia), `levels_fyi` (comp medians), `linkedin_guest` (opt-in), `changedetection` (`/webhook/cdio` for non-ATS careers pages in `config/cdio_watches.yaml`).
- **Tables**: `companies`, `roles`, `comp_points`, `levels`, `fx_rates` (upsert-in-place, "current state"); `comp_snapshots`, `roles_snapshots` (append-only, one row per source-row per `snapshot_date` — the dated series). Snapshots are written as a side-effect of every upsert during a refresh.
- **The ATS fetch is resilient**: a board returning a permanent 4xx (404/410/403) is skipped with a warning; 5xx/network errors retry once then skip. One dead board cannot abort the whole run (regression fixed 2026-06-02 — Elastic's 404 had been taking down every refresh). Boards are fetched concurrently (bounded semaphore, default 8 in-flight).
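The skip/retry policy in the last bullet can be summarised as a status-code classifier. This is only an illustrative sketch: the real logic lives in the job-hunter Python source, which is not shown here, and `classify_status` is a hypothetical name. `000` stands for a network-level failure, which is what curl reports for `%{http_code}` when no response arrives.

```bash
# Hypothetical classifier mirroring the policy described in the runbook text.
classify_status() {   # arg: HTTP status code, or 000 for a network error
  case "$1" in
    2??)          echo ok ;;
    403|404|410)  echo skip ;;        # permanent 4xx: warn and skip the board
    5??|000)      echo retry-once ;;  # 5xx / network error: retry once, then skip
    *)            echo unhandled ;;   # not covered by the runbook text
  esac
}
```

The point of the design is that every non-2xx branch ends in a skip for that one board rather than an exception for the whole run.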
---
## OPS
### Is it healthy?
```bash
# CronJob exists + last schedule/success
kubectl -n job-hunter get cronjob job-hunter-refresh
# Most recent run's pods + logs
kubectl -n job-hunter get jobs -l app=job-hunter --sort-by=.metadata.creationTimestamp
kubectl -n job-hunter logs -l job-name=$(kubectl -n job-hunter get jobs --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].metadata.name}')
# Deployment (serves the CLI / webhook) is up
kubectl -n job-hunter get deploy job-hunter
# Data freshness — newest snapshot date should advance weekly
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter report --days 7 | jq '.source_mix'
```
Row-count sanity (via the read-only Grafana datasource or a direct exec):
```bash
kubectl -n job-hunter exec deploy/job-hunter -- python -c "import job_hunter" # smoke
```
### Manual refresh (off-schedule)
```bash
kubectl -n job-hunter exec deploy/job-hunter -- \
python -m job_hunter refresh --source ats --source hn --source levels_fyi
```
Or trigger the CronJob immediately:
```bash
kubectl -n job-hunter create job --from=cronjob/job-hunter-refresh jh-manual-$(date +%s)
```
### Seed / re-snapshot the dated series
Snapshots are written automatically on every refresh. To seed a baseline from
the current tables (idempotent — one row per source-row per day):
```bash
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot
# back-date a snapshot if needed:
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot --date 2026-06-01
```
### Add an ATS company
ATS companies are scraped from `config/companies.yaml` in the **Forgejo repo**
(not the monorepo). To add one:
1. Live-probe the slug and confirm it returns HTTP 200 with London roles before adding it:
```bash
curl -s "https://boards-api.greenhouse.io/v1/boards/<slug>/jobs?content=true" -o /dev/null -w '%{http_code}\n'
# Lever: https://api.lever.co/v0/postings/<slug>?mode=json
# Ashby: https://api.ashbyhq.com/posting-api/job-board/<slug>?includeCompensation=true
```
2. Add a `{slug, display_name, ats_type, ats_id, careers_url}` block to `config/companies.yaml`, commit, push.
3. CI builds the image; Keel rolls the Deployment. The next refresh picks it up. (No Terraform change — config ships in the image.)
A board that later starts 404ing is skipped automatically; remove its entry
when the 404 is permanent (keeps logs clean).
### Add a changedetection.io watch (non-ATS firms)
Firms without a public ATS JSON API (Citadel, Two Sigma, G-Research, HRT, xAI,
Wise, Revolut, …) are diff-monitored via CDIO. Add to `config/cdio_watches.yaml`
in the Forgejo repo, then reconcile:
```bash
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed --dry-run # preview
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed # create
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-reconcile # list
```
Changes hit `/webhook/cdio`; comp/role extraction from the diff is manual or
LLM-side (CDIO only captures the changed text).
### Deploying (build triggers the rollout)
Deploys are **automatic on push to master** — we build the image, so CI also
drives the rollout (`.woodpecker.yml`: `build-and-push` tags `latest` +
`${CI_COMMIT_SHA:0:8}`, then a `deploy` step runs
`kubectl set image deployment/job-hunter ...:${SHA}` + `rollout status`). The
woodpecker-agent SA is cluster-admin, so no kubeconfig/RBAC is wired into the
step. Keel stays enrolled in parallel as a redundant net (finds the SHA already
running → no-op). So to ship code:
```bash
# in the job-hunter source repo (forgejo viktor/job-hunter)
git push origin master # → lint+test → build (latest + :<sha>) → set image → rollout
```
The **Deployment** rolls to the just-built `:<sha>`. The **CronJob** runs
`:latest` with `imagePullPolicy: Always`, so its next scheduled pod pulls the
newest image (no rollout needed for a CronJob). `image_tag = "latest"` in
`terragrunt.hcl` is just the TF baseline; the running Deployment digest is
whatever CI last set (`kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'`).
**Versioning** is still semver — bump `pyproject.toml` and cut a `git tag
vX.Y.Z` to mark a release; that's the human version record, independent of the
`:<sha>` deploy tag (map a running SHA back to a version with `git describe`).
**Rollback**: `kubectl -n job-hunter rollout undo deployment/job-hunter` (last
ReplicaSet), or push a revert commit (CI redeploys the reverted SHA).
### Applying the Terraform stack
```bash
cd infra/stacks/job-hunter
scripts/tg plan # vault login -method=oidc first
scripts/tg apply
```
The DB password rotates every 7 days (Vault static role `pg-job-hunter`);
Reloader restarts the Deployment when the ESO-synced secret changes. The
Grafana datasource password is mirrored via a second ExternalSecret in the
`monitoring` namespace.
### Common failures
| Symptom | Cause | Fix |
|---|---|---|
| Refresh job `Error`, log shows `ats: skipping company=X — HTTP 404` | A board slug was renamed/removed | Expected — the run continues. Remove the dead slug from `companies.yaml` if permanent. |
| Refresh aborts with a traceback before any company | Pre-2026-06-02 image (no skip-on-404) | Confirm Keel rolled the new image: `kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'`. |
| `snapshot` / refresh fails: `relation "job_hunter.comp_snapshots" does not exist` | Migration 0004 not applied | The CronJob + Deployment run `migrate` on start. Run `kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter migrate`. |
| `/webhook/cdio` returns 401 | `webhook_bearer_token` mismatch between Vault and the CDIO notification URL | Re-run `cdio-seed` after rotating the token; it rebuilds the `jsons://...?+Authorization=` URL. |
| Non-GBP comp looks wrong / NULL | `fx_rates` gap for the role's `posted_at` date | `kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter backfill-fx --days 30` |
| Job OOMKilled | levels.fyi HTML parse spike across many companies | Bump the CronJob container memory limit in `cronjob.tf` (currently 1Gi). |
---
## ANALYST
### Weekly above-target Slack alert
The `job-hunter-alert` CronJob (Sundays 05:00 UTC, an hour after the refresh)
posts to Slack the companies whose London p50 total comp **≥ £500k**, flagging
any that **newly crossed** since last week's snapshot. Threshold is the
`--threshold` arg in `cronjob.tf` (default 500000 — well above the ~£267k move
floor, so only clearly-exceptional comp pings). Slack webhook comes from Vault
`secret/job-hunter` → `slack_webhook_url` (seeded from the shared workspace
webhook → currently posts to the same channel as Keel; repoint to a dedicated
channel by `vault kv patch secret/job-hunter slack_webhook_url=<url>`).
```bash
# Preview the message without posting
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter alert --stdout
# Different bar / location
kubectl -n job-hunter exec deploy/job-hunter -- \
python -m job_hunter alert --threshold 350000 --location london --stdout
# Fire it now (posts to Slack)
kubectl -n job-hunter create job --from=cronjob/job-hunter-alert jh-alert-manual-$(date +%s)
```
`newly_crossed` needs ≥2 snapshot dates — it's empty until the second weekly
run accumulates. To change the standing threshold, edit `--threshold` in
`infra/stacks/job-hunter/cronjob.tf` and apply.
### The periodic "market leaders in comp" report
This is the headline command — current leaders by p50 total comp, week-over-week
movers, new entrants, open-role counts, and sample-size caveats:
```bash
# London senior leaders, human-readable
kubectl -n job-hunter exec deploy/job-hunter -- \
python -m job_hunter analyze --level senior --top-n 10
# All levels, JSON for downstream tools
kubectl -n job-hunter exec deploy/job-hunter -- \
python -m job_hunter analyze --format json
```
`--trend-weeks N` sets the movers comparison window (default 12). Movers report
`available: false` until at least two snapshot dates spanning the window exist —
the series starts accumulating from the first refresh after 2026-06-02, so
12-week movers become meaningful around late August 2026.
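The "late August" estimate is the first snapshot date plus the trend window. A minimal sketch of that date arithmetic, assuming GNU `date` and taking 2026-06-02 as the first snapshot date; `window_ready_date` is a hypothetical helper name.

```bash
# First date on which a trend window of N weeks is fully covered, given the
# date of the first snapshot. Uses epoch arithmetic in UTC (GNU date assumed).
window_ready_date() {   # args: first_snapshot (YYYY-MM-DD) weeks
  start=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$(( start + $2 * 7 * 86400 ))" +%F
}

window_ready_date 2026-06-02 12
```

With the default 12-week window this prints `2026-08-25`; a shorter `--trend-weeks` value brings the date forward accordingly.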
### Query recipes
```bash
# Salary band for a slice
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter bands --title 'staff'
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --level senior
# Per-(company, level) comp table
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-table --location london
# Open roles, highest-confidence comp first
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter query --title sre --with-salary --limit 20
# Compare two firms
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company janestreet
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company optiver
```
### Trend queries (Grafana or psql against the snapshot tables)
The dated series lives in `comp_snapshots` / `roles_snapshots`. Examples (run in
Grafana's "Job Hunter" datasource, or `psql` as the `job_hunter` role):
```sql
-- Comp trend: median total comp per company over time (London)
SELECT s.snapshot_date, c.display_name,
percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50_gbp
FROM job_hunter.comp_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE s.location_bucket = 'london'
GROUP BY s.snapshot_date, c.display_name
ORDER BY s.snapshot_date, p50_gbp DESC;
-- Hiring-volume trend: open London roles per company per snapshot
SELECT s.snapshot_date, c.display_name, COUNT(*) AS open_roles
FROM job_hunter.roles_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE s.primary_location = 'london'
GROUP BY s.snapshot_date, c.display_name
ORDER BY s.snapshot_date, open_roles DESC;
-- Two-snapshot diff: p50 change for one company between two dates
SELECT c.display_name, s.snapshot_date,
percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50
FROM job_hunter.comp_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE c.slug = 'janestreet' AND s.snapshot_date IN ('2026-06-02', '2026-08-30')
GROUP BY c.display_name, s.snapshot_date;
```
### "Your comp vs the market" dashboard panel + your baselines
The Job Hunter Grafana dashboard (`grafana.viktorbarzin.me` → Job Hunter) has a
bar chart **"Your comp vs the market — London p50 total comp"** ranking every
company's London median TC with your comp shown in line. Your figures are
deliberately **not hardcoded in the committed dashboard JSON** — they live in
the DB as labeled comp_points with `source='self'` (the panel tags any
`source='self'` row as "You" and renders one bar each). There are **two**, by
design:
- `self-realized` → **"Me - realized gross" ≈ £409k**: your actual P60 gross
for the current tax year. **Source = `SUM(payslip_ingest.payslip.taxable_pay)`**
for the tax year (this equals the P60 "pay for tax"; do NOT use
`salary+bonus+rsu_vest`, where `rsu_vest` is net/partial and understates RSU
income by ~half). Inflated by concurrent stacked RSU vests + META price.
- `self-current` → **"Me - package (grant TC)" ≈ £267k**: base + bonus +
current-year RSU refresher *grant face* (£117,927). This is the basis
**levels.fyi uses for the company bars**, so it's the apples-to-apples figure
for comparing a job *offer*.
Both sit below the £500k alert bar (never ping Slack). Re-seed when comp changes
(realized: re-pull `taxable_pay`; grant-value: from the YE letter). The
grant-value seed (run the realized one the same way with `company_slug='self-realized'`,
`company_display_name='Me - realized gross'`, `total_value=<taxable_pay sum>`):
```bash
kubectl -n job-hunter exec deploy/job-hunter -- python -c "
import asyncio; from decimal import Decimal; from datetime import date
from job_hunter.db import create_engine_from_env, make_session_factory
from job_hunter.sources.comp.base import CompPoint
from job_hunter.storage_comp import upsert_comp_point
async def m():
    e=create_engine_from_env(); sf=make_session_factory(e)
    async with sf() as s:
        # total_value is what the comparison/bar uses; it MUST be full TC
        # (base + bonus + RSU). Store the components too for transparency.
        await upsert_comp_point(s, CompPoint(source='self', external_id='self-current',
            company_slug='self-current', company_display_name='Me (Meta IC5)',
            level_slug='senior', location_bucket='london',
            base_value=Decimal('123682'), bonus_value=Decimal('25734'),
            rsu_grant_value=Decimal('117927'), rsu_vesting_years=1,
            total_value=Decimal('267343'), currency='GBP', effective_date=date.today()))
        await s.commit()
    await e.dispose()
asyncio.run(m())"
```
### Interpreting the numbers — caveats
- **Sample size**: `analyze` flags companies with `n < 3` as `low_confidence`. A single self-reported datapoint is anecdote, not a band — chase the p50 only where n is healthy.
- **levels.fyi bias**: comp_points are self-reported medians; they skew toward people who report (often higher earners) and lag the market by a quarter or two.
- **HFT/quant**: base comp is the disclosed figure; bonus (often the larger half) is variable and usually absent from postings. Treat HFT base as a floor, not total.
- **Currency**: all figures are GBP-normalised via ECB rates looked up by `posted_at` (7-day fallback). An FX gap shows as NULL comp, not a wrong number.
- **Movers need history**: a delta is only as good as the two snapshot dates behind it; early deltas (< full `trend_weeks` of data) compare against the earliest available snapshot and are noted as such.
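The sample-size caveat can be made concrete with a small median calculation. This is a hypothetical sketch, not the `analyze` implementation: it computes a p50 the same way `percentile_cont(0.5)` does (averaging the two middle values for an even count) and applies the `n < 3` low-confidence rule described above. The output is rounded to whole pounds.

```bash
# Hypothetical sketch: median (p50) of newline-separated numbers on stdin,
# tagged low_confidence when fewer than 3 datapoints are present.
p50_with_confidence() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) { print "n=0 p50=NA low_confidence"; exit }
      mid = int((NR + 1) / 2)
      p50 = (NR % 2) ? v[mid] : (v[mid] + v[mid + 1]) / 2
      printf "n=%d p50=%.0f %s\n", NR, p50, (NR < 3 ? "low_confidence" : "ok")
    }'
}
```

Feeding it two datapoints, e.g. `printf '100000\n300000\n' | p50_with_confidence`, yields a p50 of 200000 flagged `low_confidence`, which illustrates why a band built on one or two reports should not be chased.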
## Related
- Skill: `~/.claude/skills/job-hunter/SKILL.md` (agent invocation patterns)
- Beads epic: `code-snp`
- Storage / backup context: this DB is on the shared CNPG cluster (`dbaas`), backed up by the per-db `postgresql-backup-per-db` CronJob.

View file

@ -0,0 +1,207 @@
# K8s Node Auto-Upgrades
## Overview
OS-level package upgrades on the 5 K8s VMs (master + 4 workers) are driven by `unattended-upgrades` and rebooted by `kured`, with multiple safety gates layered on top to prevent the failure mode that caused the March 2026 26h cluster outage.
## Architecture
```
apt-daily.timer (random within window)
│ apt-get update
apt-daily-upgrade.timer (random within window)
│ unattended-upgrades runs
│ - Allowed-Origins: -security, -updates, ESM
│ - Package-Blacklist: containerd*, runc, calico-*, cni-plugins-*, docker-ce
│ - apt-mark hold on kubelet, kubeadm, kubectl, containerd*, runc
│ - Automatic-Reboot=false (kured handles reboots)
▼ if kernel/glibc/systemd updated
/var/run/reboot-required appears on the host
▼ (sentinel-gate DaemonSet polls every 5min)
kured-sentinel-gate checks:
├── 1. Host has /var/run/reboot-required
├── 2. ALL nodes Ready
├── 3. ALL calico-node pods Running
└── 4. NO node Ready-transition in last 24h (soak window)
▼ all pass
touch /var/run/gated-reboot-required
▼ (kured polls every 1h within 02:00-06:00 London, any day of the week)
kured checks Prometheus before draining:
│ http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts
│ ANY firing alert (except ignore-list) blocks the drain
│ Ignore-list: ^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$
▼ no blockers
kured drains the node (priority-ordered, 310s budget)
kured runs /bin/systemctl reboot
▼ node returns
kured uncordons + posts Slack notification (configuration.notifyUrl)
▼ 24h cool-down begins (sentinel-gate Check 4)
```
## Components
### unattended-upgrades (in-guest)
- **Config**: `/etc/apt/apt.conf.d/52unattended-upgrades-k8s` + `/etc/apt/apt.conf.d/20auto-upgrades`
- **Source of truth**: `infra/modules/create-template-vm/cloud_init.yaml` (lines for `is_k8s_template`)
- **Day-2 push**: SSH-based — see "Restore / re-apply config" below
### kured (Helm release)
- **Stack**: `infra/stacks/kured/main.tf`
- **Helm chart**: `kured-5.11.0` (image `ghcr.io/kubereboot/kured:1.21.0`)
- **Window**: 02:00-06:00 Europe/London, every day of the week (was Mon-Fri until 2026-05-16), period=1h, concurrency=1
- **Sentinel**: `/sentinel/gated-reboot-required` (created by sentinel-gate DaemonSet)
- **Slack hook**: Vault `secret/kured` → `slack_kured_webhook`
### kured-sentinel-gate (DaemonSet)
- **Source**: `kubernetes_daemon_set_v1.kured_sentinel_gate` in `infra/stacks/kured/main.tf` (lines ~120-260)
- **Image**: `bitnami/kubectl:latest`
- **Loop period**: every 300s
- **Gate logic**: 4 checks — see Architecture diagram
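The four checks can be read as one decision function. The sketch below is not the DaemonSet's actual script (that lives in `infra/stacks/kured/main.tf`); `gate_decision`, `SOAK_SECONDS`, and the argument layout are hypothetical names chosen for illustration, and the caller is assumed to have gathered the inputs with `kubectl` already.

```bash
# Hypothetical sketch of the sentinel-gate decision; not the real script.
# Args: 1 = "yes" if /var/run/reboot-required exists on the host
#       2 = number of nodes that are not Ready
#       3 = number of calico-node pods that are not Running
#       4 = seconds since the most recent node Ready-transition
SOAK_SECONDS=86400   # 24h soak window

gate_decision() {
  [ "$1" = "yes" ]             || { echo "closed: no reboot required"; return 1; }
  [ "$2" -eq 0 ]               || { echo "closed: $2 node(s) not Ready"; return 1; }
  [ "$3" -eq 0 ]               || { echo "closed: $3 calico-node pod(s) not Running"; return 1; }
  [ "$4" -ge "$SOAK_SECONDS" ] || { echo "closed: inside 24h soak window"; return 1; }
  echo "open"   # the caller would now touch /var/run/gated-reboot-required
}
```

Ordering the checks from cheapest to most cluster-wide keeps the log line specific: the first failing check names the reason the gate stayed closed.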
### Upgrade Gates Prometheus alerts
- **Source**: `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` group `Upgrade Gates`
- **10 alerts**: KubeAPIServerDown, KubeStateMetricsDown, PrometheusRuleEvaluationFailing, PVCStuckPending, RecentNodeReboot, MysqlStandaloneDown, ClusterPodReadyRatioDropped, NodeMemoryPressure, NodeDiskPressure, KubeQuotaAlmostFull
- **Effect**: kured `--prometheus-url` polls Prometheus before each drain — any non-ignored firing alert halts the rollout
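To make the halt-on-alert rule concrete, the following sketch applies the ignore-list regex from the Architecture diagram to a list of firing alert names. `blocking_alerts` is a hypothetical helper, not part of kured; it only illustrates which names would still block a drain once the ignore-list has been applied.

```bash
# Ignore-list regex as documented in the Architecture diagram.
IGNORE_RE='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$'

# Hypothetical helper: read firing alert names on stdin (one per line) and
# print only those that would block a drain.
blocking_alerts() {
  grep -Ev "$IGNORE_RE" || true   # grep exits 1 when nothing is left; not an error here
}
```

For example, `printf 'Watchdog\nNodeDiskPressure\n' | blocking_alerts` prints only `NodeDiskPressure`; empty output means nothing is blocking.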
## Common Operations
### Verify the system is healthy
```bash
# kured pods + sentinel-gate Running on all 5 nodes
kubectl -n kured get pods
# kured binary exposes the Prometheus flags (confirms the option exists; it does not prove reachability)
kubectl -n kured exec ds/kured -- /usr/bin/kured --help | grep prometheus
# Upgrade Gates rules loaded + state
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
# Per-node unattended-upgrades status
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
echo "=== $n ==="
ssh $n "systemctl is-active unattended-upgrades; apt list --upgradable 2>/dev/null | wc -l"
done
```
### Halt rollout in an emergency
```bash
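# Option 1: stop kured on every node (most decisive). A DaemonSet cannot be scaled,
# so pin its pods to a label no node carries ("kured-paused" is an arbitrary name):
kubectl -n kured patch ds kured --type=merge \
-p '{"spec":{"template":{"spec":{"nodeSelector":{"kured-paused":"true"}}}}}'
# When ready, remove the selector again:
# kubectl -n kured patch ds kured --type=json \
#   -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector/kured-paused"}]'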
# Option 2: silence the gate via Alertmanager (allows kured to retry once silence expires)
# Use Alertmanager UI at https://prometheus.viktorbarzin.me/alertmanager/
```
### Force halt by adding a custom blocker alert
- Add a PrometheusRule expression that's always-1 (e.g. `vector(1)`) to the `Upgrade Gates` group temporarily.
- Apply, wait for sync (~120s), kured will block on the next poll.
- Remove when ready.
### Pause apt upgrades on a single node
```bash
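# unattended-upgrades.service is only the shutdown helper; the periodic run is
# started by apt-daily-upgrade.timer, so disable the timer as well.
ssh <node> sudo systemctl disable --now unattended-upgrades apt-daily-upgrade.timer
# Re-enable when ready:
ssh <node> sudo systemctl enable --now unattended-upgrades apt-daily-upgrade.timer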
```
### Restore / re-apply unattended-upgrades config to existing nodes
Cloud-init only runs on first boot. To bring existing nodes into compliance with the IaC:
```bash
# Per node — installs uu, drops apt config, holds k8s/runtime packages, enables service
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
ssh $n sudo bash -s <<'EOF'
set -e
systemctl unmask unattended-upgrades 2>/dev/null || true
DEBIAN_FRONTEND=noninteractive apt-get install -y unattended-upgrades update-notifier-common
cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'CONF'
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}";
"${distro_id}:${distro_codename}-security";
"${distro_id}:${distro_codename}-updates";
"${distro_id}ESMApps:${distro_codename}-apps-security";
"${distro_id}ESM:${distro_codename}-infra-security";
};
Unattended-Upgrade::Package-Blacklist {
"^containerd(\.io)?$";
"^runc$";
"^cri-tools$";
"^kubernetes-cni$";
"^calico-.*";
"^cni-plugins-.*";
"^docker-ce$";
};
Unattended-Upgrade::DevRelease "false";
Unattended-Upgrade::Automatic-Reboot "false";
CONF
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'CONF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
CONF
apt-mark hold kubelet kubeadm kubectl
apt-mark hold containerd containerd.io runc 2>/dev/null || true
systemctl enable --now unattended-upgrades
EOF
done
```
### Roll back a bad apt upgrade
1. Identify the package(s) that broke things from `/var/log/apt/history.log` on the affected node.
2. Hold them: `sudo apt-mark hold <pkg>`.
3. Downgrade: `sudo apt-get install -y --allow-downgrades <pkg>=<previous-version>` (find versions via `apt-cache madison <pkg>`).
4. Reboot the node manually if the package needs it.
5. Add the package to the `Unattended-Upgrade::Package-Blacklist` in `cloud_init.yaml` AND roll the same change out to existing nodes via the SSH push above (cloud-init only runs on first boot), so future apt runs skip it.
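Steps 1 and 3 both need the version that was installed before the bad upgrade. As a minimal sketch, assuming the usual `Upgrade: pkg:arch (old, new), ...` line format of apt's `history.log`, the hypothetical helper below turns those lines into `pkg=previous-version` pairs that can be pasted into the downgrade command. `previous_versions` is an illustrative name, not an existing script in this repo.

```bash
# Hypothetical helper: read apt history.log on stdin and print "pkg=old-version"
# for every package listed on an "Upgrade:" line.
previous_versions() {
  awk '/^Upgrade: / {
    sub(/^Upgrade: /, "")
    n = split($0, entries, /\), /)
    for (i = 1; i <= n; i++) {
      e = entries[i]
      sub(/\)$/, "", e)            # the last entry keeps its closing paren
      split(e, parts, / \(/)       # parts[1] = "pkg:arch", parts[2] = "old, new"
      pkg = parts[1]; sub(/:.*/, "", pkg)
      old = parts[2]; sub(/, .*/, "", old)
      print pkg "=" old
    }
  }'
}
```

Usage, under the same assumption about the log format: `ssh <node> cat /var/log/apt/history.log | previous_versions`. Check the printed versions against `apt-cache madison <pkg>` before downgrading, since an old version may no longer be in the repository.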
### kured halted — investigate which alert is blocking
```bash
# Show kured logs — it logs "blocking alerts" when halting
kubectl -n kured logs ds/kured --tail=100 | grep -i alert
# List currently firing alerts (any of these blocks kured):
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/alerts' | \
jq -r '.data.alerts[] | select(.state == "firing") | " \(.labels.alertname) (\(.labels.severity // "info"))"' | sort -u
```
The alert is either:
- One of the 10 `Upgrade Gates` (genuine cluster-health issue — fix it),
- A pre-existing alert (any of the ~211 in the library — investigate),
- Or `RecentNodeReboot` — expected for 24h after each node reboot. This is the soak window.
### Verify the 24h soak is enforcing
```bash
# Sentinel-gate logs Check 4 outcome
kubectl -n kured logs ds/kured-sentinel-gate --tail=20 | grep -E "soak|cool-down|24"
# kured won't drain another node until the most recent Ready-transition is >24h ago.
# If you need to override (e.g. emergency security patch), shorten the cool-down by
# editing infra/stacks/kured/main.tf (sentinel script: 86400 → smaller) and applying.
```
## Past Incidents
- **2026-03-16 SEV-1**: Kured + Containerd Cascade Outage (26h). See `infra/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html`. Root cause: unattended-upgrades pushed a kernel update → kured rebooted nodes → containerd's overlayfs snapshotter corrupted → image pulls failed → calico broke → cascading outage. Remediations now baked into this system: 24h soak, Prometheus halt-on-alert, Package-Blacklist for runtime components, sentinel-gate health checks.
## File Pointers
| What | Where |
|------|-------|
| kured Helm + sentinel-gate | `infra/stacks/kured/main.tf` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Cloud-init for new nodes | `infra/modules/create-template-vm/cloud_init.yaml` |
| Slack webhook | Vault `secret/kured` → `slack_kured_webhook` |
| Post-mortem | `infra/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (OS section) |


@ -0,0 +1,345 @@
# K8s Version Upgrade Pipeline
## Overview
Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
VMs are upgraded automatically by a weekly detection CronJob that seeds a
chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
drain target** — so no pod in the chain can preempt itself.
The chain (Sun 12:00 UTC weekly):
```
detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
```
This is **independent** of the OS-side `unattended-upgrades + kured`
pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
Schedules can overlap (kured runs daily 02:00-06:00 London; detection
here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
group blocks the version-upgrade preflight, so the chain self-defers
to the next Sunday rather than rolling on top of a half-fresh node.
## Architecture
```
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
│ kubectl get nodes → running version
│ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
│ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
│ push k8s_upgrade_available{kind,running,target} → Pushgateway
▼ if a target is detected
envsubst on /template/job-template.yaml | kubectl apply -f -
│ creates k8s-upgrade-preflight-<target_version>
Job 0 — preflight (pinned: k8s-node1)
├── All nodes Ready + no Mem/Disk pressure
├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers)
├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor)
└── spawn_next → k8s-upgrade-master-<target_version>
Job 1 — master upgrade (pinned: k8s-node1)
├── halt-on-alert recheck (no firing alerts)
├── drain k8s-master (predrain_unstick deletes PDB-blocked pods)
├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z
├── kubectl uncordon k8s-master; wait Ready + version match
├── verify control-plane pods Running
├── halt-on-alert recheck (allows RecentNodeReboot)
└── spawn_next → k8s-upgrade-worker-<v>-k8s-node4
Job 2 — worker k8s-node4 (pinned: k8s-node1)
Job 3 — worker k8s-node3 (pinned: k8s-node1)
Job 4 — worker k8s-node2 (pinned: k8s-node1)
(identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next)
Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration)
└── spawn_next → k8s-upgrade-postflight-<target_version>
Job 6 — postflight (no pinning)
├── Verify all 5 nodes at target version
├── Verify no firing Upgrade Gates alerts
├── Compute pod-ready ratio (should be ≥ 0.9)
├── Clear k8s-upgrade-* annotations on namespace
├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0
└── Slack: ✅ K8s upgrade complete
```
**Pin choices summarised:**
- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
itself is upgraded **last**.
- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
- If anyone reorders the worker sequence, the pin for Job 5 needs to track
whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
→ the `case "${PHASE}:${TARGET_NODE:-}"` block.
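The pin rules above can be sketched as a small lookup. This is a hypothetical reconstruction from the description in this runbook, not a copy of the `case` block in `scripts/upgrade-step.sh`, which may differ in detail; `pin_node_for` is an illustrative name.

```bash
# Hypothetical reconstruction of the pin mapping described above.
# Args: 1 = phase (preflight|master|worker|postflight), 2 = drain target (optional)
pin_node_for() {
  phase=$1 target_node=${2:-}
  case "${phase}:${target_node}" in
    worker:k8s-node1) echo "k8s-master" ;;   # last worker: run from the control plane
    postflight:*)     echo "" ;;             # no pinning
    *)                echo "k8s-node1" ;;    # preflight, master, and the other workers
  esac
}
```

The invariant to preserve when reordering workers is that `pin_node_for <phase> <target>` never returns `<target>` itself; otherwise the Job would drain the node it is running on.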
## Components
### Shared resources (one-time, Terraform-managed)
| Resource | Purpose |
|---|---|
| **ConfigMap `k8s-upgrade-scripts`** | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body — same script the old manual loop used) in every Job pod. |
| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
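The "next minor available?" probe needs the next minor version derived from the running one. A minimal sketch of that arithmetic, assuming versions of the form `v1.34.2` or `1.34.2`; `next_minor` is a hypothetical helper and the real detection script may compute this differently.

```bash
# Hypothetical helper: derive the next minor from a running version,
# e.g. v1.34.2 -> 1.35.
next_minor() {
  version=${1#v}            # tolerate the leading "v" of a gitVersion string
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  echo "${major}.$(( minor + 1 ))"
}
```

The result is what would be substituted for `<NEXT_MINOR>` in the repository probe; patch detection stays within the current minor and does not need it.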
### Pushgateway metrics
Pushed by upgrade-step.sh during phase execution; observed by the
`Upgrade Gates` alert group in `stacks/monitoring/.../prometheus_chart_values.tpl`:
| Metric | Pushed by | Cleared by |
|---|---|---|
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
### Upgrade Gates alerts (`Upgrade Gates` group in prometheus_chart_values.tpl)
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
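The `K8sUpgradeStalled` threshold is plain arithmetic: 5400 s is 90 minutes since the preflight Job pushed `k8s_upgrade_started_timestamp`. The sketch below mirrors the alert expression in shell so the condition can be checked by hand against values scraped from the Pushgateway; `upgrade_stalled` is a hypothetical helper, not part of the stack.

```bash
STALL_SECONDS=5400   # 90 minutes, the threshold in the K8sUpgradeStalled expression

# Hypothetical helper mirroring:
#   k8s_upgrade_in_flight == 1 && time() - k8s_upgrade_started_timestamp > 5400
# Args: 1 = in_flight gauge, 2 = started timestamp (epoch s), 3 = now (epoch s)
upgrade_stalled() {
  [ "$1" -eq 1 ] && [ $(( $3 - $2 )) -gt "$STALL_SECONDS" ]
}
```

Example: `upgrade_stalled 1 "$started" "$(date +%s)" && echo stalled`. The alert itself additionally requires the condition to hold for 5m before it fires.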
### Vault secrets
- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys`
- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL
Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE — the chain does not POST to `claude-agent-service`.
## Common Operations
### Post-upgrade: restore apiserver OIDC (REQUIRED after any control-plane bump)
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
and drops the `--authentication-config` flag**, silently disabling apiserver
OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
401). This is not auto-detected (the `rbac` stack's `null_resource` trigger is a
content hash that doesn't change). After any control-plane upgrade, re-apply:
```bash
cd stacks/rbac
TF_VAR_ssh_private_key="$(cat ~/.ssh/id_ed25519)" \
VAULT_ADDR=https://vault.viktorbarzin.me ../../scripts/tg apply \
--non-interactive -target=module.rbac.null_resource.apiserver_oidc_config
```
(`ssh_private_key` must be a key authorized for `wizard@<master>`; it is not yet
wired from Vault.) The provisioner re-writes `/etc/kubernetes/pki/auth-config.yaml`
(both `kubernetes` + `k8s-dashboard` issuers), re-adds the flag, and
health-gates `/livez` with auto-rollback. Verify: `curl -sk
https://localhost:6443/livez` on the master = `ok`, and the apiserver manifest
contains `--authentication-config`. See `docs/plans/2026-06-04-k8s-dashboard-sso-design.md`.
### Verify the pipeline is healthy
```bash
# CronJob present + not suspended
kubectl -n k8s-upgrade get cronjob k8s-version-check
# Latest detection run output
kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200
# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished)
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
# Pushgateway — running detection metric
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \
grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)'
# Upgrade Gates rules loaded
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
```
### Manually trigger detection (no upgrade)
Use `detection_dry_run=true` to short-circuit before spawning Job 0:
```bash
# Toggle var in TF, apply, and trigger
# (in stacks/k8s-version-upgrade/main.tf)
# variable "detection_dry_run" { default = true }
# scripts/tg apply
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
# When done, flip back to false.
```
### Manually trigger the chain (skip detection)
Useful for testing or to force a specific target. Render Job 0 directly:
```bash
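# envsubst only substitutes exported variables, so export everything the template uses.
export TARGET=1.34.7
export KIND=patch
export IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \
-o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}')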
cat <<EOF | envsubst | kubectl apply -f -
$(kubectl -n k8s-upgrade get cm k8s-upgrade-job-template -o jsonpath='{.data.job-template\.yaml}')
EOF
# Note: export JOB_NAME, PHASE_NEXT, etc. first — see the CronJob's command for
# the full env block. Easier: just trigger detection with the right inputs.
```
### Kill a stuck Job (chain halted mid-flight)
The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled`
fires after 90 min. Recovery:
```bash
# 1. Identify the failed Job
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
kubectl -n k8s-upgrade describe job/<failed-job-name> | tail -50
kubectl -n k8s-upgrade logs job/<failed-job-name>
# 2. Diagnose. Common causes:
# - drain stuck on PDB-violating pod (predrain_unstick should handle this;
# but a brand-new PDB pattern could escape it — manually delete the pod)
# - SSH from Job pod failing (node restarted? known_hosts mismatch?)
# - kubeadm upgrade failed on a node (check journalctl + apt history on that node)
# 3. Fix the root cause first.
# 4. Delete the failed Job + re-spawn it. Naming is deterministic so
# `kubectl apply` of the same name reconciles to a single Job.
kubectl -n k8s-upgrade delete job/<failed-job-name>
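# 5. Manually render + apply the same Job. Pull the template + spec from the
# next-Job-creation block in upgrade-step.sh; easiest is to copy from a
# sibling Job's YAML. The selector and the controller-uid/job-name pod labels
# are server-generated and must be stripped, or the API server rejects the copy:
kubectl -n k8s-upgrade get job/<sibling-job-name> -o yaml \
| yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \
| yq 'del(.spec.selector, .spec.template.metadata.labels["controller-uid"], .spec.template.metadata.labels["job-name"], .spec.template.metadata.labels["batch.kubernetes.io/controller-uid"], .spec.template.metadata.labels["batch.kubernetes.io/job-name"])' \
| yq '.metadata.name = "<failed-job-name>"' \
| yq '(.spec.template.spec.containers[0].env[] | select(.name == "PHASE") | .value) = "<right-phase>"' \
| kubectl apply -f -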
# The chain will continue from there. The next-Job-creation step in upgrade-step.sh
# is idempotent (deterministic name) so re-running won't duplicate downstream.
```
### Skip a phase (advanced; use sparingly)
If you've already done the work for a phase manually and want the chain to
jump past it, manually create the NEXT phase's Job with the deterministic
name. The previous phase's spawn-next will see the Job already exists and
short-circuit. Example: master already on target; jump straight to worker:
```bash
TARGET=1.34.7
TGT_LBL=${TARGET//./-}
# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1)
```
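Because skipping a phase relies on hitting the exact deterministic name, it helps to compute that name rather than type it. The sketch below infers the scheme from the Job names shown in this runbook (`k8s-upgrade-<phase>-<version-label>[-<node>]`); `chain_job_name` is a hypothetical helper, so confirm the result against the `spawn_next` code in `upgrade-step.sh` before relying on it.

```bash
# Hypothetical helper; naming scheme inferred from the Job names in this runbook.
# Args: 1 = phase, 2 = target version (e.g. 1.34.7), 3 = node (workers only)
chain_job_name() {
  phase=$1
  label=$(printf '%s' "$2" | tr . -)   # 1.34.7 -> 1-34-7, same as ${TARGET//./-}
  name="k8s-upgrade-${phase}-${label}"
  [ -n "${3:-}" ] && name="${name}-$3"
  echo "$name"
}
```

For the example above, `chain_job_name worker 1.34.7 k8s-node4` yields `k8s-upgrade-worker-1-34-7-k8s-node4`, which can then be checked with `kubectl -n k8s-upgrade get job <name>` before and after creating it.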
### Halt the pipeline in an emergency
```bash
# Option 1: suspend the detection CronJob (won't stop an in-flight chain)
kubectl -n k8s-upgrade patch cronjob k8s-version-check \
-p '{"spec":{"suspend":true}}' --type=merge
# Re-enable: -p '{"spec":{"suspend":false}}'
# Option 2: delete all in-flight chain Jobs
kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain
# This leaves the in-flight annotation + Pushgateway gauge intact —
# K8sUpgradeStalled will fire to surface the halt.
# Option 3: force a blocker alert (same regex kured uses)
# — see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
```
### Clear orphaned in-flight state
After deciding NOT to retry a halted chain:
```bash
kubectl annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path-
# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear:
kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 &
sleep 2   # give the port-forward a moment to bind before pushing
printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \
| curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade
kill %1
```
### Rollback paths
`kubeadm` does **not** support in-place downgrade. If a run fails:
#### Master broke during/after kubeadm upgrade
1. Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
2. Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
3. Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
```bash
ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
# Pre-upgrade versions are in the most recent "Commandline: apt-get install"
# Run the remaining commands on k8s-master itself:
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get install --allow-downgrades -y \
kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet
```
#### Worker broke
1. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
2. Downgrade apt packages on that node only (see above)
3. `kubectl uncordon <node>`
4. The cluster continues running on the master + remaining workers throughout
### One-shot SSH key rotation
1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
2. Update Vault:
```bash
vault kv patch secret/k8s-upgrade \
ssh_key=@/tmp/k8s-upgrade \
ssh_key_pub=@/tmp/k8s-upgrade.pub
```
3. Push the new pubkey to every node:
```bash
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
# Double quotes here so $(cat ...) expands locally, where the new key file lives
ssh wizard@$n "echo '$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key' >> ~/.ssh/authorized_keys"
done
```
4. ESO refreshes within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`
## Past Incidents
### 2026-05-11 — Self-preemption (agent → Job-chain rewrite)
- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself.
- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4.
- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery.
- **Mitigation**: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min).
## File Pointers
| What | Where |
|------|-------|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (K8s Version Upgrades section) |
| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |
| Deprecated agent prompt (reference) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` |


@ -0,0 +1,249 @@
# Runbook: KMS public exposure (vlmcs.viktorbarzin.me:1688)
`vlmcs.viktorbarzin.me:1688/TCP` is intentionally open to the internet so any
visitor can activate Volume License Microsoft products. The webpage at
`https://kms.viktorbarzin.me/` documents how to use it.
**Two hostnames, on purpose** (do not merge them):
- `kms.viktorbarzin.me` — the **website** (Traefik). Serves the docs and the
`/scripts/*.ps1` activators. Internally resolves to the Traefik LB
(`10.0.20.203`), which has **no** `:1688` listener.
- `vlmcs.viktorbarzin.me` — the **KMS endpoint** (vlmcsd). A-only (no AAAA —
the IPv6 tunnel doesn't forward 1688). Resolves to `10.0.20.202` on the LAN
(Technitium split-horizon, set via API — `cloudflare_record.vlmcs` in
`stacks/kms` owns the public A) and to `176.12.22.76` on the internet
(Cloudflare → pfSense WAN NAT :1688). Every `slmgr` / `ospp` command on the
page points here.
Pointing a client at `kms.viktorbarzin.me:1688` fails from the LAN with "KMS
server cannot be reached" — that name is the website, not the KMS server.
This runbook covers operations on the public exposure: where to find logs,
how to tune the rate limit, how to revoke if abused.
## Architecture
- **K8s service**: `windows-kms` in namespace `kms`, MetalLB **dedicated**
LB IP `10.0.20.202:1688`. ETP=Local, so vlmcsd sees real WAN client IPs
in its log (pfSense WAN forwards do DNAT-only, no SNAT; ETP=Local skips
the kube-proxy SNAT too). Same pattern mailserver used pre-2026-04-19.
Sharing `10.0.20.200` isn't an option — all 10 services there are
ETP=Cluster and MetalLB requires a single ETP per shared IP.
- **Native DNS auto-discovery for LAN clients**: any Windows client with
DNS suffix `viktorbarzin.lan` activates with zero config — Windows
queries `_vlmcs._tcp.viktorbarzin.lan` SRV by default, the SRV target
resolves to `vlmcs.viktorbarzin.lan` → `10.0.20.202`, and `slmgr /ato`
succeeds. Records:
- `_vlmcs._tcp.viktorbarzin.lan` SRV 0 0 1688 vlmcs.viktorbarzin.lan
- `vlmcs.viktorbarzin.lan` A `10.0.20.202`
- `kms.viktorbarzin.lan` A `10.0.20.200` (Traefik — for the user-facing
website at `https://kms.viktorbarzin.lan/`; **not** the KMS server)
Manual override (e.g., for clients without the suffix or for clients
on the public internet): `slmgr /skms vlmcs.viktorbarzin.me:1688` (works
LAN + WAN) or `slmgr /skms 10.0.20.202:1688` (LAN, direct). Do **not** use
`kms.viktorbarzin.me:1688` — that name is the website (Traefik), not the
KMS server. To revert a manually-overridden client back to auto-discovery:
`slmgr /ckms`.
- **Pod fluidity**: deployment has `replicas=1` (notifier dedup state is
per-pod) with no node affinity. TCP readiness/liveness probes on 1688
gate Pod Ready on the listener actually being up, so MetalLB only
advertises `10.0.20.202` from a node where vlmcsd is serving.
- **pfSense WAN forward**: `WAN TCP/1688 → k8s_kms_lb:1688`
(alias = `10.0.20.202`, dedicated to KMS). Description: `KMS public —
kms.viktorbarzin.me`. Other forwards using `k8s_shared_lb` (WireGuard,
HTTPS, shadowsocks, smtps, etc.) are unaffected.
- **Filter rule** on the WAN interface, TCP/1688 destination
`<k8s_kms_lb>`, with state-table per-source caps:
- `max-src-conn 50` — concurrent connections per source IP
- `max-src-conn-rate 10/60` — 10 new connections per 60 seconds per
source
- `overload <virusprot>` flush — sources that exceed either cap get added
to pfSense's stock `virusprot` pf table and have their existing states
flushed. (`virusprot` is the only table pfSense's filter generator
targets for `overload`; see `/etc/inc/filter.inc`. Don't try to point
it at a custom table — the schema doesn't expose that knob.)
- **Probe filter in slack-notifier**: a bare TCP open/close (no
Application/Activation block from vlmcsd) is treated as a probe — Uptime
Kuma's port-type monitor on `windows-kms.kms.svc:1688` and the kubelet
readiness/liveness probes both hit this path. Probes increment
`kms_connection_probes_total{source}` (`source` ∈ `internal_pod`,
`cluster_node`, `external`) and log to stdout, but never post to Slack.
Real activations still post.
- **Website `/scripts` + `/keys.json` carve-out**: the website is Anubis-fronted
(PoW challenge). `/scripts/*` and `/keys.json` are carved out to the bare
nginx backend (`module.ingress_scripts` in `stacks/kms`, `ingress_path`)
because PowerShell `iwr | iex` / `ConvertFrom-Json` are non-JS clients that
can't solve the PoW — without the carve-out they'd download the Anubis
challenge HTML and choke. Everything else stays behind Anubis. Verify:
`curl -A curl https://kms.viktorbarzin.me/scripts/setup-kms.ps1` and
`.../keys.json` both return real content (not "Making sure you're not a bot!").
- **Auto-key selection**: the scripts no longer require the user to pick a GVLK.
`/keys.json` is `data/products.yaml` rendered to JSON (Hugo KEYS output format).
When no Volume License key is installed, `setup-kms.ps1` / `kms-bootstrap.ps1`
detect the edition — Windows via registry `EditionID` (+ `CurrentBuildNumber`
for LTSC/Server, which share an EditionID across releases), Office via the
Click-to-Run `ProductReleaseIds` — fetch `/keys.json`, and `slmgr /ipk` /
`ospp /inpkey` the matching key before activating. Only fires when not already
licensed (never clobbers a working retail key). Azure-Edition server SKUs are
intentionally unmapped (they collide with Datacenter and KMS may fail there).
- **Edition switch (kms-bootstrap.ps1, consent-gated)**: when the installed
product *can't* KMS-activate (Windows Home/retail; no VL Office), the bootstrap
shows the consequences and asks before changing anything (default No). Windows →
`changepk.exe /ProductKey <target GVLK>` (default Pro; `$env:KMS_EDITION`
overrides) — in-place edition UPGRADE, **needs a reboot then re-run**, one-way
(no in-place downgrade). Office → slim ODT `setup.exe /configure` to a VL
product (default ProPlus2024Volume; `$env:KMS_OFFICE_PRODUCT` overrides) — ~3 GB
download, closes Office. If an INCOMPATIBLE Click-to-Run Office is installed
(retail/M365 — `ProductReleaseIds` not ending in `Volume`), it's named in the
prompt and **uninstalled first** via ODT `<Remove>` of just those products (VL
products of other families are kept), then the VL product installs. The ODT run
is one shared `Invoke-Odt` for both `<Add>` and `<Remove>`. **Removing the bundled
consumer Office leaves a pending reboot**, so a VL install in the same run — or a
re-run before rebooting — fails with `setup.exe` exit **1603**. Two guards: a
hard-reboot (CBS/WU) gate before the ~3 GB download, and a reboot-aware 1603
message telling the user to reboot + re-run (idempotent — the incompatible Office
is already gone). `Invoke-Odt` checks the setup.exe exit code and on failure
captures the C2R log from `%TEMP%` into telemetry; `Wait-OfficeInstalled` polls
on-disk state (ospp.vbs + ProductReleaseIds) because `setup.exe` can return before
the C2R install finishes. Non-interactive runs only proceed with an explicit env
override. setup-kms.ps1 stays minimal and points non-VL editions at the bootstrap.
NOTE: real-hardware status (2026-06-01) — the incompatible-uninstall path DID run
on a real M365/Office-Home box (`O365HomePremRetail` removed cleanly); the VL
install then needs a reboot first (hit 1603, now guided). changepk edition-switch
remains untested (no Home test box; the Pro test VM can't be switched reversibly).
- **SXSMSI/1603 deep-repair + escalation (2026-06-02):** when the VL install fails
`[Failing PreReq=SXSMSI]`/1603 with NO pending reboot (the C2R bootstrap MSI fails),
the script offers a consent-gated deep repair (`Repair-OfficePrereq`: `msiexec
/unregister`+`/regserver` and reset `SoftwareDistribution`+`catroot2` — the level
past DISM/SFC; uninstalls nothing; `$env:KMS_DEEP_REPAIR=1` auto-consents). It
persists `HKLM\SOFTWARE\kms-bootstrap\DeepRepairDone`; if 1603 recurs AFTER a deep
repair it stops looping and shows the in-place-Windows-repair guidance
(`Show-InPlaceRepairHint`, telemetry `sxsmsi-unrecoverable`). **Pilot on PVE VM 300
(2026-06-02) proved SXSMSI is client-machine-specific, not the script:** the
identical script + the exact user journey both reach `office/ok` on a healthy
Win10 — CF1 = clean (Remove-All+reboot) → VL install; CF2 = retail
`O365HomePremRetail` → script targeted-remove → reboot → VL install. So a
persistent SXSMSI is the client's corrupted Windows servicing/Installer subsystem
(below DISM/SFC), fixed only by an in-place Windows repair-install. Also learned:
the targeted retail uninstall is itself flaky under low disk (exit -1) and the
qemu guest-agent drops during heavy C2R installs (poll telemetry/state, not
guest-exec, for results).
- **Self-hosted ODT bootstrapper**: the Office reinstall path fetches the Office
Deployment Tool from `https://kms.viktorbarzin.me/scripts/odt-setup.exe` (a
committed copy in `kms-website/static/scripts/`), NOT from Microsoft —
`download.microsoft.com`'s ODT URL is build-numbered and rotates every release
(the old hardcoded one 404'd). `$env:KMS_ODT_URL` overrides. The bootstrapper
self-updates the Office payload, so refresh the committed copy only occasionally.
- **Client telemetry → Loki**: the scripts POST a small ANONYMOUS diagnostics
event per run to `https://kms.viktorbarzin.me/diag` (action, outcome, error +
exit codes, EditionID/build/locale, detected Office products, script version;
NO hostname/user/keys). Fire-and-forget (3s timeout, errors swallowed), so it never affects
activation. `$env:KMS_NO_TELEMETRY=1` opts out; `$env:KMS_DIAG_URL` overrides.
Collector: standalone `kms-diag` Deployment (`stacks/kms`, python stdlib HTTP
on :9102) reachable via the `/diag` ingress carve-out (bypasses Anubis like
`/scripts`); it prints `KMSDIAG <json>` to stdout → Loki. Query in Grafana:
`{namespace="kms",pod=~"kms-diag.*"} |= "KMSDIAG"`. Disclosed in the site FAQ.
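A minimal sketch for pulling the diagnostics events straight from the collector's stdout when Grafana is not at hand. It assumes each event is one log line containing the `KMSDIAG ` marker followed by the JSON payload, as described above. The helper name `kmsdiag_extract` and the sample field names are illustrative and not taken from the collector source.

```bash
# Hypothetical helper: keep only lines carrying the KMSDIAG marker and strip
# everything up to and including "KMSDIAG ", leaving the raw JSON payload.
kmsdiag_extract() {
  sed -n 's/^.*KMSDIAG //p'
}

# Usage against the collector Deployment (needs cluster access):
#   kubectl logs -n kms deployment/kms-diag | kmsdiag_extract
```

The output is one JSON document per line, so it can be piped into any JSON tool for ad-hoc filtering.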
## Where the logs are
### vlmcsd (kms namespace, k8s)
```bash
# Live tail
kubectl logs -n kms -l app=kms-service -c windows-kms --tail=50 -f
# All activations in the running pod
kubectl logs -n kms -l app=kms-service -c windows-kms | grep "Incoming KMS request"
```
Source IPs from the WAN are real client IPs (pfSense DNAT-only + ETP=Local
preserve them through the chain). LAN clients hitting the LB IP directly
appear as their own IP. Pod-source probes (Uptime Kuma) appear as a Calico
pod IP in `10.10.0.0/16`. Kubelet readiness/liveness probes appear as the
hosting node IP in `10.0.20.0/24`.
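The source-IP rules above can be sketched as a small classifier for triaging `Incoming KMS request` log lines. This is an illustration only: the function name and the category labels are made up here, and a VLAN20 client hitting the LB IP directly is indistinguishable from a node probe by address alone.

```bash
# Hypothetical classifier for a source IP seen in the vlmcsd logs.
kms_source_kind() {
  case "$1" in
    10.10.*)   echo "pod-probe" ;;       # Calico pod CIDR 10.10.0.0/16 (e.g. Uptime Kuma)
    10.0.20.*) echo "node-or-vlan20" ;;  # kubelet probes from the hosting node, or a VLAN20 client
    *)         echo "client" ;;          # WAN or other LAN client; the address is the real source
  esac
}
```

For example, `kms_source_kind 10.10.3.7` prints `pod-probe`, which is usually safe to ignore when looking for real activations.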
### Slack notifier (kms namespace, k8s)
```bash
kubectl logs -n kms -l app=kms-service -c slack-notifier --tail=50 -f
```
Posts to `#alerts`, dedup window 1h per (source-IP, product). Activations
also increment the Prometheus counter `kms_activations_total{product,status}`
exposed on the same pod at `:9101/metrics` (scraped by the cluster-wide
`kubernetes-pods` job; query via Prometheus or Grafana directly).
Probe-only TCP connections (open+close, no KMS RPC) are silently filtered
out of Slack and counted in `kms_connection_probes_total{source}`. Useful
queries:
```promql
# Probe rate by source
rate(kms_connection_probes_total[5m])
# Probes from the public WAN (a non-zero rate here means real port-scans
# are reaching us, not just internal monitoring)
rate(kms_connection_probes_total{source="external"}[5m])
```
### pfSense — virusprot table and filter hits
```bash
# SSH to 10.0.20.1 as root
pfctl -t virusprot -T show # who's currently in the virusprot table
pfctl -t virusprot -T expire 86400 # boot anyone added more than 24h ago
pfctl -t virusprot -T flush # nuke the entire table
# Filter rule hit counts (find the KMS public rule, look at Evaluations / States)
pfctl -sr -v | grep -A 4 1688
# State table — current TCP/1688 connections, per source
pfctl -ss | grep ':1688 '
```
## Tightening or loosening the rate limit
The filter rule is configured via the pfSense web UI
(`Firewall → Rules → WAN`, look for the `KMS public — kms.viktorbarzin.me`
rule) under **Advanced Options → "Maximum new connections per source per
seconds"** and **"Maximum state entries per source"**.
- **Default**: `max-src-conn 50`, `max-src-conn-rate 10/60`
- To **tighten** (suspected abuse): drop to `max-src-conn 10`,
`max-src-conn-rate 3/60`. Flush state and existing virusprot afterwards
(`pfctl -k 0.0.0.0/0 -K 0.0.0.0/0` is overkill — just save+apply the
rule; pfSense reloads pf, and existing `virusprot` entries stay blocked).
- To **loosen** (legitimate users blocked): bump to
`max-src-conn-rate 30/60`. The `virusprot` table flush still applies on
overload; reduce its lifetime via
`Firewall → Advanced → State Timeouts` if entries linger.
The `overload` table entry survives pf reloads. Running
`pfctl -t virusprot -T flush` after a tuning change clears the slate.
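To get a feel for what a setting such as `max-src-conn-rate 10/60` means before changing it, the sketch below replays a list of connection timestamps from one source through a simple sliding window. It is an approximation under stated assumptions: pf's internal rate accounting may differ in detail, "more than LIMIT new connections inside WINDOW seconds" is the interpretation used here, and `trips_rate_limit` is a hypothetical helper name.

```bash
# trips_rate_limit LIMIT WINDOW
# Reads sorted epoch timestamps (one per line, one source IP) on stdin and
# prints "overload" if more than LIMIT of them fall inside any WINDOW-second
# span, otherwise "ok".
trips_rate_limit() {
  awk -v limit="$1" -v window="$2" '
    {
      t[NR] = $1
      # Slide the window start past timestamps that are WINDOW or more seconds old.
      while (t[NR] - t[start + 1] >= window) start++
      if (NR - start > limit) { tripped = 1; exit }
    }
    END { print (tripped ? "overload" : "ok") }'
}
```

Feeding it the timestamps of a suspect client (for instance extracted from the vlmcsd log) with `trips_rate_limit 3 60` shows whether the tightened limit would have blocked that client.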
## Revoking the public exposure
If the activation surface needs to come down (abuse, legal, audit):
1. **pfSense web UI** → `Firewall → NAT → Port Forward` → find
`WAN TCP/1688 → k8s_kms_lb` → **delete** (or disable). Apply.
2. **pfSense web UI** → `Firewall → Rules → WAN` → find
`KMS public — kms.viktorbarzin.me` → **delete** (or disable). Apply.
3. Verify externally: from a phone tether, `nc -zw3 kms.viktorbarzin.me 1688`
should now fail.
The k8s service stays reachable on the LAN
(`10.0.20.202:1688` directly, and the website at `kms.viktorbarzin.lan`
via Traefik on `10.0.20.203:443`) — only the WAN port-forward is removed.
To put it back, recreate the NAT rule (target alias `k8s_kms_lb`,
port `1688`) and the filter rule with the same per-source caps. The alias
itself is independent of any forward and persists across delete/restore.
## Related
- Stack: `stacks/kms/` (Terraform; deployment, MetalLB Service, ingress,
ExternalSecret for the Slack webhook)
- Webpage source: `kms-website/` repo (Hugo + nginx; Woodpecker builds +
pushes to forgejo, then `kubectl set image deployment/kms-web-page`)
- Networking architecture footnote:
`docs/architecture/networking.md` § "MetalLB & Load Balancing"

View file

@ -0,0 +1,222 @@
# pfSense HAProxy for Mailserver — Runbook
Last updated: 2026-04-19 (Phase 6 complete)
## What & why
External mail traffic (SMTP/IMAP) requires **real client IP visibility** for
CrowdSec + Postfix rate-limiting. MetalLB cannot inject PROXY-protocol
headers (see [`mailserver-proxy-protocol.md`](./mailserver-proxy-protocol.md)),
so pfSense runs a small HAProxy that:
1. Listens on the pfSense VLAN20 IP (`10.0.20.1`) on all 4 mail ports,
2. Forwards each connection to a k8s node's NodePort with `send-proxy-v2`,
3. Injects PROXY v2 framing so Postfix/Dovecot see the original client IP,
4. TCP-checks every k8s worker via dedicated **non-PROXY healthcheck NodePorts**
(30145/30146/30147 → pod stock 25/465/587 listeners, no PROXY required).
This split path avoids the `smtpd_peer_hostaddr_to_sockaddr` fatal that
used to fire on every PROXY-aware health probe and throttled real client
connections.
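For reference, the PROXY v2 framing from step 3 is a short binary header placed in front of the TCP payload. The sketch below prints that header as hex for a TCP-over-IPv4 connection with no TLV extensions, following the public PROXY protocol specification. HAProxy builds the real header itself; `proxy_v2_hex` is a hypothetical helper meant only for reading packet captures.

```bash
# proxy_v2_hex SRC_IP SRC_PORT DST_IP DST_PORT
# Prints the 28-byte PROXY v2 header (IPv4/TCP, no TLVs) as lowercase hex.
proxy_v2_hex() {
  local src_ip=$1 src_port=$2 dst_ip=$3 dst_port=$4
  local a b c d
  printf '0d0a0d0a000d0a515549540a'   # fixed 12-byte signature
  printf '21'                         # version 2, command PROXY
  printf '11'                         # address family AF_INET, transport STREAM
  printf '000c'                       # length of the address block: 12 bytes
  IFS=. read -r a b c d <<<"$src_ip"
  printf '%02x%02x%02x%02x' "$a" "$b" "$c" "$d"
  IFS=. read -r a b c d <<<"$dst_ip"
  printf '%02x%02x%02x%02x' "$a" "$b" "$c" "$d"
  printf '%04x%04x\n' "$src_port" "$dst_port"
}
```

Comparing this output with the first bytes of a capture on a data NodePort is a quick way to confirm that the original client address really is being carried through.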
Corresponding k8s-side setup (`stacks/mailserver/modules/mailserver/`):
- ConfigMap `mailserver-user-patches` → `user-patches.sh` appends 3 alt
`master.cf` services to Postfix:
- `:2525` postscreen (alt :25) with `postscreen_upstream_proxy_protocol=haproxy`
- `:4465` smtpd (alt :465 SMTPS) with `smtpd_upstream_proxy_protocol=haproxy`
- `:5587` smtpd (alt :587 submission) with `smtpd_upstream_proxy_protocol=haproxy`
- ConfigMap `mailserver.config` adds Dovecot `inet_listener imaps_proxy` on
port 10993 with `haproxy = yes` and `haproxy_trusted_networks = 10.0.20.0/24`.
- Service `mailserver-proxy` (NodePort, ETP:Cluster) — 4 PROXY data ports +
3 non-PROXY healthcheck ports:
- Data (PROXY v2):
- `port 25 → targetPort 2525 → nodePort 30125`
- `port 465 → targetPort 4465 → nodePort 30126`
- `port 587 → targetPort 5587 → nodePort 30127`
- `port 993 → targetPort 10993 → nodePort 30128`
- Healthcheck (no PROXY, stock SMTP/SMTPS/Submission listeners):
- `port 2500 → targetPort 25 → nodePort 30145` (smtp-check)
- `port 4650 → targetPort 465 → nodePort 30146` (smtps-check)
- `port 5870 → targetPort 587 → nodePort 30147` (sub-check)
- Service `mailserver` (ClusterIP) — unchanged stock ports 25/465/587/993
for intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
CronJob, book-search). These listeners are PROXY-free.
bd: `code-yiu`.
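The port mapping above can be captured as a small lookup, which is handy when scripting checks against a node. The values are copied from the list above; the function name `mail_ports` and the `-` placeholder for "no healthcheck port documented" are choices made for this sketch.

```bash
# mail_ports PUBLIC_PORT
# Prints "<pod targetPort> <data nodePort> <healthcheck nodePort>" for a
# public mail port, or fails for anything else.
mail_ports() {
  case "$1" in
    25)  echo "2525 30125 30145" ;;
    465) echo "4465 30126 30146" ;;
    587) echo "5587 30127 30147" ;;
    993) echo "10993 30128 -" ;;     # no non-PROXY healthcheck NodePort listed for IMAPS
    *)   echo "unknown mail port: $1" >&2; return 1 ;;
  esac
}
```

For example, `read -r target data health <<<"$(mail_ports 587)"` yields the three ports for the submission path.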
## Steady-state architecture
```
External mail (WAN) path — PROXY v2
┌─────────────────────────────────────────────────────────────────────┐
│ Client (real IP) │
│ │ SMTP/SMTPS/Sub/IMAPS │
│ ▼ │
│ pfSense WAN:{25,465,587,993} │
│ │ NAT rdr → 10.0.20.1:{same} │
│ ▼ │
│ pfSense HAProxy (mode tcp, 4 frontends, 4 backend pools) │
│ │ data: send-proxy-v2 → :{30125..30128} (PROXY-aware pod) │
│ │ health: TCP-check → :{30145..30147} (no-PROXY pod) │
│ │ inter 5000 │
│ ▼ │
│ k8s-node<1-4>:{30125..30128} ← any node (ETP:Cluster) │
│ │ kube-proxy SNAT (source IP lost on the wire) │
│ ▼ │
│ mailserver pod :{2525,4465,5587,10993} │
│ │ postscreen / smtpd / Dovecot parse PROXY v2 header │
│ │ → real client IP recovered despite kube-proxy SNAT │
│ ▼ │
│ CrowdSec + Postfix / Dovecot see the true source IP ✓ │
└─────────────────────────────────────────────────────────────────────┘
Intra-cluster path — no PROXY
┌─────────────────────────────────────────────────────────────────────┐
│ Roundcube pod / email-roundtrip-monitor CronJob │
│ │ SMTP/IMAP │
│ ▼ │
│ mailserver.mailserver.svc.cluster.local:{25,465,587,993} │
│ │ ClusterIP — bypasses LoadBalancer/NodePort layer entirely │
│ ▼ │
│ mailserver pod stock :{25,465,587,993} (PROXY-free) │
└─────────────────────────────────────────────────────────────────────┘
```
## Validation
```sh
# All HAProxy frontends listening
ssh admin@10.0.20.1 'sockstat -l | grep haproxy'
# Expect: *:25, *:465, *:587, *:993, *:2525 (test port)
# All backend pools healthy
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
| awk 'NR>1 {print $3, $4, $6}'
# srv_op_state 2 = UP, 0 = DOWN
# Container listens on all 8 ports
kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'
# pf rdr points at pfSense (10.0.20.1), not <mailserver> alias
ssh admin@10.0.20.1 'pfctl -sn' | grep -E 'port = (25|submission|imaps|smtps)'
# E2E probe — Brevo → external MX :25 → IMAP fetch
kubectl create job --from=cronjob/email-roundtrip-monitor probe-test -n mailserver
kubectl wait --for=condition=complete --timeout=90s job/probe-test -n mailserver
kubectl logs job/probe-test -n mailserver | grep SUCCESS
kubectl delete job probe-test -n mailserver
# Real client IP in maillog post-delivery
kubectl logs -c docker-mailserver deployment/mailserver -n mailserver \
| grep 'smtpd-proxy25.*CONNECT from' | tail -5
# Expect external source IPs (e.g., Brevo 77.32.148.x), NOT 10.0.20.x
```
## Bootstrap / restore from scratch
pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is scp'd nightly to
`/mnt/backup/pfsense/config-YYYYMMDD.xml` by `scripts/daily-backup.sh`, then
synced to Synology. To rebuild from source of truth (git):
```sh
scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'
```
The script is idempotent — re-runs reset the mailserver frontends + backends
to the declared state.
Expected output:
```
haproxy_check_and_run rc=OK
```
## Operations
### Change backend k8s node IPs / NodePorts
Edit `infra/scripts/pfsense-haproxy-bootstrap.php` → `$NODES` array + the
`build_pool()` port arguments. Re-run the bootstrap command above. Don't
hand-edit `/var/etc/haproxy/haproxy.cfg` — it is regenerated from XML on
every apply.
### Check health of backends
```sh
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"
```
`srv_op_state=2` means UP, `0` means DOWN.
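A minimal sketch that reduces the `show servers state` dump to just the servers that are not UP. It assumes the standard column order of that command (backend name in field 2, server name in field 4, `srv_op_state` in field 6); the helper name and the backend names in the example are hypothetical.

```bash
# Reads `show servers state` output on stdin and prints "backend/server"
# for every server whose srv_op_state is not 2 (UP).
haproxy_down_servers() {
  # Skip the version line (single field) and the "# be_id ..." header line.
  awk '$1 !~ /^#/ && NF >= 6 && $6 != 2 { print $2 "/" $4 }'
}

# Usage (needs SSH access to pfSense):
#   ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
#     | haproxy_down_servers
```

Empty output means every server in every pool is reported UP.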
### View live HAProxy stats (WebUI)
`https://pfsense.viktorbarzin.me` → Services → HAProxy → Stats.
### Reload after config.xml edit
```sh
ssh admin@10.0.20.1 'pfSsh.php playback svc restart haproxy'
```
### Rollback (flip NAT back to MetalLB, post-Phase-6 only partial)
There is no Phase-6 rollback one-liner. Phase 6 removed the MetalLB
LoadBalancer 10.0.20.202 entirely, so un-flipping NAT now would send
traffic to a dead alias. To regress:
1. Re-add `metallb.io/loadBalancerIPs = "10.0.20.202"` + `type = "LoadBalancer"`
+ `external_traffic_policy = "Local"` to `kubernetes_service.mailserver`,
apply.
2. Re-add the `mailserver` host alias in pfSense pointing at 10.0.20.202
(Firewall → Aliases → Hosts).
3. Run `infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php` on pfSense.
For rollback of just the NAT (Phase 4) without touching the Service, only
the third step is needed — but only meaningful BEFORE Phase 6.
### Restore from backup
pfSense config backup is a plain XML file:
```
/mnt/backup/pfsense/config-YYYYMMDD.xml # sda host copy (1.1TB RAID1)
/volume1/Backup/Viki/pve-backup/pfsense/... # Synology offsite
```
Full restore: pfSense WebUI → Diagnostics → Backup & Restore → Upload that
`config.xml`. The `<installedpackages><haproxy>` section is included.
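When several nightly copies exist, a sketch like the following picks the newest one by name. It relies only on the `config-YYYYMMDD.xml` naming shown above, where lexical order equals date order; `latest_pfsense_backup` is a hypothetical helper name.

```bash
# Reads file names on stdin and prints the newest config-YYYYMMDD.xml.
latest_pfsense_backup() {
  grep -E '^config-[0-9]{8}\.xml$' | sort | tail -n 1
}

# Usage on the backup host:
#   ls /mnt/backup/pfsense/ | latest_pfsense_backup
```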
## Phase history (bd code-yiu)
| Phase | Status | Description |
|---|---|---|
| 1a | ✅ commit `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2 | ✅ 2026-04-19 | pfSense HAProxy pkg installed (`pfSense-pkg-haproxy-devel-0.63_2`, HAProxy 2.9-dev6) |
| 3 | ✅ commit `ba697b02` | HAProxy config persisted in pfSense XML (bootstrap script + this runbook) |
| 4+5 | ✅ commit `9806d515` | 4-port alt listeners + HAProxy frontends for 25/465/587/993 + NAT flip |
| 6 | ✅ this commit | Mailserver Service downgraded LoadBalancer → ClusterIP; `10.0.20.202` released back to MetalLB pool; orphan `mailserver` pfSense alias removed; monitors retargeted |
## Known warts
- ~~HAProxy TCP health-check with `send-proxy-v2` generates `getpeername:
Transport endpoint not connected` warnings on postscreen every check cycle.~~
**Resolved 2026-05-05**: dedicated non-PROXY healthcheck NodePorts
(30145/30146/30147 → stock pod 25/465/587) added; HAProxy now checks
those, eliminating both the `getpeername` postscreen warnings and the
`smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported` fatals
that were throttling smtpd respawns and causing ~50% client timeouts on
the public 587 path. `inter` dropped 120000 → 5000 (fast failover, no
log-spam concern). `option smtpchk` was tried but flapped against
postscreen (multi-line greet + DNSBL silence + anti-pre-greet detection
trip HAProxy's parser → L7RSP). Plain TCP check on the no-PROXY ports
is sufficient.
- Frontend binds on all pfSense interfaces (`bind :25` instead of
`10.0.20.1:25`). `<extaddr>` is set in XML but pfSense templates it
port-only. Low concern in practice because WAN firewall rules plus the
NAT rdr gate external access; internal VLAN clients SHOULD be able to
reach HAProxy on any pfSense-local IP.
- k8s-node5 doesn't exist — cluster has master + 4 workers. Backend pool
capped at 4 servers.
- Postscreen still logs `improper command pipelining` for legitimate
clients that send `EHLO\r\nQUIT\r\n` as a single TCP write. This is
unchanged pre/post-migration — postscreen's anti-bot heuristic.

View file

@ -0,0 +1,181 @@
# Mailserver PROXY protocol — research & decision
Last updated: 2026-04-18 (original research). **Outcome implemented 2026-04-19 — see [UPDATE](#update-2026-04-19) below.**
> ## UPDATE (2026-04-19)
>
> This doc describes the research that led to the Phase-6 rollout. **Option C
> (pfSense HAProxy + PROXY v2)** was chosen and is now live. Operational
> state, cutover history, bootstrap, and rollback procedures live in
> [`mailserver-pfsense-haproxy.md`](mailserver-pfsense-haproxy.md).
>
> This file is retained as a decision record — it explains *why* Option A
> (pod-pinning via nodeSelector) was rejected mid-session in favour of
> Option C, and documents the MetalLB upstream limitation (PROXY injection
> is explicitly won't-implement). Future debates of "why don't we just pin
> the pod?" should land here first.
## TL;DR
**MetalLB does not and will not inject PROXY protocol headers.** The original plan
(`/home/wizard/.claude/plans/let-s-work-on-linking-temporal-valiant.md`, task
`code-rtb`) assumed MetalLB could be configured to emit PROXY v1/v2 on behalf of
the `mailserver` LoadBalancer Service. That assumption is wrong at the product
level. MetalLB is a control-plane-only announcer (ARP/NDP for L2 mode, BGP for
L3 mode); it never touches the L4 payload.
As a result, there is no single Terraform change that can flip
`externalTrafficPolicy: Local` → `Cluster` on the `mailserver` Service while
preserving the real client IP for Postfix/postscreen and Dovecot. Three
alternative paths exist (see below); none is trivial.
## Environment (verified 2026-04-18)
- **MetalLB version**: `quay.io/metallb/controller:v0.15.3` /
`quay.io/metallb/speaker:v0.15.3` (5 speakers).
- **Advertisement type**: L2Advertisement `default` bound to IPAddressPool
`default` (10.0.20.200–10.0.20.220). No BGPAdvertisements.
- **Service**: `mailserver/mailserver` — type `LoadBalancer`, `loadBalancerIPs:
10.0.20.202`, `externalTrafficPolicy: Local`,
`healthCheckNodePort: 30234`, 5 ports (25, 465, 587, 993, 9166/dovecot-metrics).
- **Pod**: single replica today, RWO PVCs prevent horizontal scale without
further work (`mailserver-data-encrypted`, `mailserver-letsencrypt-encrypted`).
## Why the original plan fails
### MetalLB never touches packets
> *"MetalLB is controlplane only, making it part of the dataplane means we
> would be responsible for the performance of the system, so more bugs to
> fight, I personally don't see that happening."*
> — MetalLB maintainer `champtar`, 2021-01-06
> (issue [#797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797))
Issue #797 is closed as "won't implement". Repeat asks in 2022–2023 got the
same answer. The v0.15.3 API surface confirms this: no
`proxyProtocol`/`haproxy`/`protocol: proxy` field exists on `IPAddressPool`,
`L2Advertisement`, `BGPAdvertisement`, or as a Service annotation.
Only managed-cloud LBs (AWS NLB, Azure LB, OCI, DO, OVH, Scaleway, etc.) offer
PROXY protocol as a tick-box. MetalLB's equivalents are:
| MetalLB feature | Does it preserve client IP? | Comment |
|---|---|---|
| `externalTrafficPolicy: Local` (current) | Yes, via iptables DNAT on the speaker node | Forces pod↔speaker colocation on L2 mode. This is the pain we wanted to avoid. |
| `externalTrafficPolicy: Cluster` | No — kube-proxy SNATs to the node IP | The problem we would re-introduce if we flipped without PROXY injection. |
| PROXY protocol injection | N/A — not implemented | Dead end. |
### The `Local` trap is real, but narrower than it seems
Today's `Local` policy means the ARP announcer node must also host the mailserver
pod. MetalLB always picks a single speaker to advertise the VIP (leader
election per IP), so in practice exactly one node matters at any moment. A pod
rescheduled to a different node silently drops inbound SMTP/IMAP until a GARP
flip or node cordon.
The only pods on our cluster that see this same class of risk are Traefik
(3 replicas + PDB `minAvailable=2`, so 2 of 3 nodes always have a pod) and
mailserver (1 replica). Traefik survives because the pods outnumber the nodes
that could be the speaker at once; the mailserver cannot.
## Alternative paths (ranked by effort)
### Option A — Pin the mailserver pod to a specific node (SIMPLEST)
Add `nodeSelector` on the mailserver Deployment pointing at a label that's also
stamped on the MetalLB speaker we want to advertise the VIP from, and use
MetalLB's [node selector](https://metallb.io/configuration/_advanced_l2_configuration/#specify-network-interfaces-that-lb-ip-can-be-announced-from)
on `L2Advertisement.spec.nodeSelectors` to pin the announcer to the same node.
Trade-offs:
- Zero changes to Postfix/Dovecot configs.
- Keeps `externalTrafficPolicy: Local` — real client IP keeps arriving.
- Loses HA (the whole point of the MetalLB layer) but reflects reality — one
replica, one PVC, no HA today anyway.
- Drain of that node requires a planned cutover, but that's no worse than
today's silent failure mode.
Implementation (~10 lines of Terraform):
```hcl
# In stacks/mailserver/modules/mailserver/main.tf, on the Deployment:
node_selector = { "viktorbarzin.me/mailserver-anchor" = "true" }
# In stacks/platform (or wherever the MetalLB CRs live):
resource "kubernetes_manifest" "mailserver_l2ad" {
manifest = {
apiVersion = "metallb.io/v1beta1"
kind = "L2Advertisement"
metadata = { name = "mailserver", namespace = "metallb-system" }
spec = {
ipAddressPools = ["default"]
nodeSelectors = [{ matchLabels = { "viktorbarzin.me/mailserver-anchor" = "true" } }]
}
}
}
```
Plus a node label via `kubectl label node k8s-node3 viktorbarzin.me/mailserver-anchor=true`.
**Recommendation: this is the shortest path to eliminating the silent-drop
failure mode** without taking on a new proxy tier.
### Option B — Put a HAProxy sidecar in front of Postfix/Dovecot
Stand up an in-cluster HAProxy with PROXY v2 enabled on the frontend and
`send-proxy-v2` on the backend to `mailserver:25/465/587/993`. Expose HAProxy
via a new MetalLB Service with `externalTrafficPolicy: Cluster` + kube-proxy
DSR workaround (still loses client IP at that layer), or run HAProxy on the
host-network of the same node (back to Option A's colocation).
Trade-offs:
- Introduces one more network hop and TLS-termination decision for every
SMTP connect.
- HAProxy needs its own cert rotation (or `tls-passthrough`) — adds moving
parts to an already crowded mailserver module.
- Doesn't actually solve the colocation problem on its own — HAProxy itself
needs to receive the client IP, so we are back to externalTrafficPolicy
constraints for HAProxy.
**Recommendation: avoid unless we also get HA for mailserver itself, which
needs RWX storage + DB split-brain work — out of scope.**
### Option C — Replace MetalLB with a different LB for this Service
Candidates: [kube-vip](https://kube-vip.io/) (supports eBPF-based DSR but not
PROXY injection either), [Cilium LB](https://docs.cilium.io/en/stable/network/lb-ipam/)
(preserves client IP via DSR in hybrid mode), or a dedicated HAProxy running on
pfSense and NAT-forwarding 25/465/587/993 with PROXY headers to a
ClusterIP-exposed mailserver. Cilium requires a CNI migration (we run Calico
today); pfSense HAProxy is genuinely feasible but belongs in a different bd
task.
**Recommendation: track as P3 follow-up under a new bd task if Option A proves
insufficient.**
## Decision
Do nothing in this session beyond this runbook + the bd note. The `code-rtb`
task as written is not executable — MetalLB cannot inject PROXY headers, and
the Postfix/Dovecot listeners, reconfigured as the plan proposed, would never
receive the header they expect: they would hang waiting for it and then time
out (5s per connection).
Follow-up work filed as bd child tasks (if user wants to pursue):
- **Option A — pin mailserver + L2Advertisement nodeSelectors** (new bd task)
- **Option C — HAProxy on pfSense with PROXY v2 to a ClusterIP** (new bd task)
## References
- [MetalLB issue #797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797) (closed, won't implement)
- [MetalLB PR #796 — Source IP Preservation discussion](https://github.com/metallb/metallb/issues/796)
- Postfix [postscreen_upstream_proxy_protocol](https://www.postfix.org/postconf.5.html#postscreen_upstream_proxy_protocol) — expects the PROXY header *on every incoming connection*; if absent, postscreen drops after `postscreen_upstream_proxy_timeout`.
- Dovecot [haproxy_trusted_networks](https://doc.dovecot.org/settings/core/#core_setting-haproxy_trusted_networks) — treats the header as mandatory for listed source networks.
- Cluster state verified against: `kubectl -n metallb-system get pods`,
`kubectl get ipaddresspools.metallb.io -A`,
`kubectl get l2advertisements.metallb.io -A`,
`kubectl get bgpadvertisements.metallb.io -A`,
`kubectl -n mailserver get svc mailserver -o yaml`.

View file

@ -0,0 +1,57 @@
# Runbook: Add a new archive to Nextcloud / PVE NFS
Use this runbook when you need to surface a new directory under `/srv/nfs/` or `/srv/nfs-ssd/` to specific Nextcloud users as a dedicated External mount. Each archive gets its own NC mount; only the listed `applicableUsers` can see and access it.
## Steps
1. **Create the directory on PVE.**
```bash
ssh root@192.168.1.127
mkdir -p /srv/nfs/<archive-name>
# Use /srv/nfs-ssd/<archive-name> for the SSD pool instead.
```
2. **Populate the directory.**
Rsync from a remote source, copy from another NFS path, or let the granted user upload via the NC web UI after step 5. Example rsync:
```bash
rsync -avP --info=progress2 user@source:/path/ /srv/nfs/<archive-name>/
```
3. **Add a manifest entry.**
Edit `infra/stacks/nextcloud/external_storage.tf`. In the `kubernetes_config_map_v1.nextcloud_external_storage_manifest` resource, append a new entry to `archiveMounts`:
```json
{ "mountPoint": "/<archive-name>", "dataDir": "/mnt/pve-nfs/<archive-name>", "applicableUsers": ["<owner1>", "admin"], "applicableGroups": [], "enableSharing": false }
```
Use `/mnt/pve-nfs-ssd/<archive-name>` for the SSD pool. NC usernames are `admin`, `anca`, `emo` — not display names (`admin` is Viktor). `admin` is included so the owner of the homelab can always assist with the archive. Set `enableSharing: true` only if you want recipients to re-share subfolders.
4. **Plan and apply.**
```bash
cd infra/stacks/nextcloud
scripts/tg plan
scripts/tg apply
```
The bootstrap Job re-runs and applies the new mount plus `applicable_users` idempotently via `occ files_external:*` and `occ files_external:applicable`. No manual `occ` invocation needed.
5. **Verify.**
Log in as a granted user — `/<archive-name>` must appear in their NC sidebar; read, upload, and delete must all work. Log in as a non-granted user and confirm the mount is not visible at all.
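To avoid typos in step 3, the manifest entry can be generated rather than hand-written. This is a sketch only: the field layout is copied from the example entry above, `admin` is always appended as the runbook recommends, and `archive_mount_entry` is a hypothetical helper, not part of the stack.

```bash
# archive_mount_entry NAME POOL_DIR USER...
# Prints one archiveMounts entry; POOL_DIR is /mnt/pve-nfs or /mnt/pve-nfs-ssd.
archive_mount_entry() {
  local name=$1 pool_dir=$2
  shift 2
  local users="" u
  # Quote each NC username and always add admin as the last entry.
  for u in "$@" admin; do
    users+="${users:+, }\"$u\""
  done
  printf '{ "mountPoint": "/%s", "dataDir": "%s/%s", "applicableUsers": [%s], "applicableGroups": [], "enableSharing": false }\n' \
    "$name" "$pool_dir" "$name" "$users"
}
```

For example, `archive_mount_entry photos /mnt/pve-nfs anca` prints an entry ready to paste into `archiveMounts`; flip `enableSharing` by hand if re-sharing is wanted.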
## Backout
Remove the entry from `archiveMounts` in the manifest ConfigMap, then `scripts/tg apply`. The bootstrap Job re-runs and removes the mount. The root mounts (`PVE NFS Pool`, `PVE NFS-SSD Pool`, visible to group `admin` only) are unaffected throughout.
After the mount is gone there is no NC trash to clean. The directory on PVE (`/srv/nfs/<archive-name>`) can be `rmdir`'d once you have confirmed the data is safe elsewhere.
## Related
- Architecture: `docs/architecture/storage.md` — "Nextcloud as PVE-NFS browser" section
- Original design/plan: `infra/docs/plans/2026-05-23-anca-elements-{design,plan}.md` <!-- TODO: confirm path once orchestrator files the plan docs -->
- Manifest source: `infra/stacks/nextcloud/external_storage.tf` (`kubernetes_config_map_v1.nextcloud_external_storage_manifest`)

View file

@ -0,0 +1,66 @@
# NFS Prerequisites for `modules/kubernetes/nfs_volume`
The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
If the path does not exist, the first pod that tries to mount the resulting
PVC gets stuck in `ContainerCreating` with the kubelet event:
```
MountVolume.SetUp failed for volume "<name>" : mount failed: exit status 32
mount.nfs: mounting 192.168.1.127:/srv/nfs/<path> failed, reason given by
server: No such file or directory
```
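A small sketch for turning that kubelet event into the server-side path that needs creating. It assumes the `mount.nfs: mounting <server>:<path> failed` wording shown above; `missing_nfs_path` is a hypothetical helper name.

```bash
# Reads event text on stdin and prints the NFS path from any
# "mounting <server>:<path> failed" message.
missing_nfs_path() {
  sed -n 's/^.*mounting [^:]*:\([^ ]*\) failed.*$/\1/p'
}

# Usage (pod and namespace are placeholders):
#   kubectl describe pod <pod> -n <namespace> | missing_nfs_path
```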
## Bootstrap before first apply
Before adding a new `nfs_volume` consumer (backup CronJob, data PV, etc.),
create the export root on the PVE host:
```sh
# Replace <app> with the backup stack name, e.g. mailserver-backup,
# roundcube-backup, immich-backup, etc.
ssh root@192.168.1.127 'mkdir -p /srv/nfs/<app> && chmod 755 /srv/nfs/<app>'
# Confirm exports are live (no change to /etc/exports needed — `/srv/nfs`
# is already exported via the root entry in pve-nfs-exports).
ssh root@192.168.1.127 exportfs -v | grep '/srv/nfs\b'
```
`/srv/nfs` is exported with the root entry. Subdirectories inherit the
export automatically; they just have to exist on disk.
## Known consumers
| Consumer | NFS path | Owning stack |
|--------------------------------|---------------------------------|--------------------------|
| `mailserver-backup` | `/srv/nfs/mailserver-backup` | `stacks/mailserver/` |
| `roundcube-backup` | `/srv/nfs/roundcube-backup` | `stacks/mailserver/` |
| `mysql-backup` | `/srv/nfs/mysql-backup` | `stacks/dbaas/` |
| `postgresql-backup` | `/srv/nfs/postgresql-backup` | `stacks/dbaas/` |
| `vaultwarden-backup` | `/srv/nfs/vaultwarden-backup` | `stacks/vaultwarden/` |
Use `grep -rn 'nfs_volume' infra/stacks/` to find all active consumers.
## Why not auto-create?
Two options were considered for automating this:
1. `null_resource` + `local-exec` SSH `mkdir` in the `nfs_volume` module —
works but adds an SSH dependency to every Terraform run, makes the
module non-hermetic, and fails if the operator does not have SSH to
the PVE host.
2. `nfs-subdir-external-provisioner` — handles subdirs automatically but
changes the PV/PVC shape and would require migrating all existing
consumers.
Neither is worth the churn for a one-time operation per new backup stack.
Document + checklist is the current call; re-evaluate if we start adding
one NFS consumer per week.
## Related tasks
- `code-yo4` — this runbook
- `code-z26` — mailserver backup CronJob (first-time setup hit this)
- `code-1f6` — Roundcube backup CronJob (also hit this)

View file

@ -0,0 +1,72 @@
# Runbook: Offboard a User
Removing a user can span two surfaces — the **in-cluster** namespace-owner model
(Vault `k8s_users` / RBAC / namespace) and the **devvm Workstation** (roster /
OS account / t3 instance). Both are **staged**: a *reversible cut* (revoke access,
delete nothing) first, then an explicit, gated *destructive removal*. Do the
reversible cut immediately; only do the destructive step once you're sure.
> Architecture: `../architecture/multi-tenancy.md`. Workstation design:
> `../plans/2026-06-07-multi-user-workstation-design.md`.
---
## Part A — DevVM Workstation offboarding
Driven by removing the user's entry from `infra/scripts/workstation/roster.yaml`.
`roster_engine.py offboard_plan` computes the staged actions (reversible cut vs the
gated `userdel_archive`, which is **never** auto-applied).
### A1. Reversible cut (revoke access; delete nothing)
1. **Delete the user's entry** from `roster.yaml`; commit + push.
2. **Reconcile** (`sudo /usr/local/bin/t3-provision-users`, or wait for the hourly
timer). This **regenerates** `/etc/ttyd-user-map` + `dispatch.json` *without* the
user → `t3-dispatch` now returns **403** for them. *(Automated.)*
3. **Disable their instance + lock login** *(manual today; Phase 7 will fold this into
the reconcile):*
```bash
sudo systemctl disable --now t3-serve@<os_user>.service
sudo passwd -l <os_user>
```
4. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302 → Authentik, then
denied once removed from the `T3 Users` group — Part C) and cannot log in. Nothing
is deleted; re-adding the roster entry + reconcile fully restores them.
### A2. Destructive removal (explicit, gated — NEVER automatic)
Only after the reversible cut and a deliberate decision:
```bash
sudo tar czf /mnt/backup/offboard/<os_user>-$(date +%Y%m%d).tar.gz /home/<os_user>
sudo userdel -r <os_user> # removes home + mail spool — IRREVERSIBLE
```
Rollback before this step: re-add the roster entry + reconcile. After it: restore
from the archive.
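The staging in Part A can be sketched as a dry-run that only prints the commands, tagged by stage, so the destructive ones are never run by accident. This is an illustration and not the output format of `roster_engine.py offboard_plan`; the function name and the `REVERSIBLE`/`GATED` tags are assumptions of this sketch.

```bash
# offboard_plan OS_USER
# Prints the A1 manual commands and the A2 destructive commands without
# executing anything.
offboard_plan() {
  local os_user=$1
  printf 'REVERSIBLE sudo systemctl disable --now t3-serve@%s.service\n' "$os_user"
  printf 'REVERSIBLE sudo passwd -l %s\n' "$os_user"
  printf 'GATED sudo tar czf /mnt/backup/offboard/%s-%s.tar.gz /home/%s\n' \
    "$os_user" "$(date +%Y%m%d)" "$os_user"
  printf 'GATED sudo userdel -r %s\n' "$os_user"
}
```

Reviewing the printed plan with a second person before running any `GATED` line keeps the destructive step an explicit decision.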
---
## Part B — In-cluster (namespace-owner) offboarding
1. **Reversible cut:** remove the user's Authentik group membership (edge/RBAC blocked)
and their entry from the Vault `k8s_users` map (`secret/platform`).
2. **Apply:** `scripts/tg apply` the `vault` → `platform` → `woodpecker` stacks (drops the
RBAC binding, Vault identity/policy, and per-user CI). Their OIDC kubeconfig stops
authorizing immediately.
3. **Destructive (gated):** deleting their namespace(s) removes all their workloads +
data — back up first (PVCs, DBs), then delete only on explicit decision.
---
## Part C — Authentik (both surfaces)
Remove the user from the relevant Authentik group(s) — `kubernetes-namespace-owners`
(cluster) and/or `T3 Users` (workstation edge gate). This is the edge revocation; do
it as part of the reversible cut so they're locked out at the front door.
---
## Order of operations
Reversible cut on **all** relevant surfaces first (Authentik group → roster removal +
reconcile → `k8s_users` removal + apply) → verify access is gone → only then the gated
destructive steps (`userdel -r`, namespace deletion), each after its own archive.


@ -0,0 +1,281 @@
# pfSense Unbound DNS Resolver
Last updated: 2026-04-19
## Overview
pfSense runs **Unbound** (DNS Resolver) as its sole DNS service, replacing
dnsmasq (DNS Forwarder) as of 2026-04-19 (DNS hardening Workstream D,
bd `code-k0d`).
Unbound AXFR-slaves the `viktorbarzin.lan` zone from the Technitium primary
via the `10.0.20.201` LoadBalancer, so LAN-side `.lan` resolution survives
a full Kubernetes outage. Public queries go to Cloudflare via DNS-over-TLS
(`1.1.1.1` + `1.0.0.1` on port 853, SNI `cloudflare-dns.com`).
## Listeners
Unbound binds on:
| Interface | IP | Purpose |
|-----------|-----|---------|
| WAN | `192.168.1.2:53` | LAN (192.168.1.0/24) clients querying via pfSense WAN |
| LAN | `10.0.10.1:53` | Management VLAN clients |
| OPT1 | `10.0.20.1:53` | K8s VLAN clients (CoreDNS upstream) |
| lo0 | `127.0.0.1:53` | pfSense itself |
The prior WAN NAT `rdr` rule (`192.168.1.2:53 → 10.0.20.201`) was removed in
the same change — Unbound now answers directly on WAN.
## Config Summary
Relevant `<unbound>` keys in `/cf/conf/config.xml`:
| Key | Value | Meaning |
|-----|-------|---------|
| `enable` | flag | Enable Unbound |
| `dnssec` | flag | DNSSEC validation on |
| `forwarding` | flag | Forwarding mode (send recursive queries to upstream) |
| `forward_tls_upstream` | flag | Use DoT for upstream forwarders |
| `prefetch` | flag | Prefetch records near expiry |
| `prefetchkey` | flag | Prefetch DNSKEY records |
| `dnsrecordcache` | flag | Serve expired records (emits `serve-expired: yes`) |
| `active_interface` | `lan,opt1,wan,lo0` | Listen interfaces |
| `msgcachesize` | `256` (MB) | Message cache (rrset-cache auto-doubles to 512MB) |
| `cache_max_ttl` | `604800` | 7 days |
| `cache_min_ttl` | `60` | 60 seconds |
| `custom_options` | base64 | Contains `serve-expired-ttl: 259200` + `auth-zone:` block |
Upstream DoT forwarders live in `<system>`:
- `dnsserver[0] = 1.1.1.1`
- `dnsserver[1] = 1.0.0.1`
- `dns1host = cloudflare-dns.com`
- `dns2host = cloudflare-dns.com`
## Auth-Zone for viktorbarzin.lan
The custom_options block declares:
```
server:
  serve-expired-ttl: 259200
auth-zone:
  name: "viktorbarzin.lan"
  master: 10.0.20.201
  fallback-enabled: yes
  for-downstream: yes
  for-upstream: yes
  zonefile: "viktorbarzin.lan.zone"
  allow-notify: 10.0.20.201
```
- `master: 10.0.20.201` — AXFR source (Technitium LoadBalancer)
- `fallback-enabled: yes` — if the zone can't refresh from master, fall back to normal recursion for this name (prevents hard-fail if AXFR breaks)
- `for-downstream: yes` — answer queries for this zone with AA flag
- `for-upstream: yes` — Unbound's internal iterator also uses this zone
- `zonefile` is relative to the chroot (`/var/unbound/viktorbarzin.lan.zone`)
- `allow-notify: 10.0.20.201` — accept NOTIFY from Technitium
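One way to confirm that the AXFR copy is current is to compare the SOA serial served by Technitium with the one served by Unbound. This is a sketch only: `soa_serial` is a hypothetical helper, and it assumes `dig +short <zone> SOA` prints the record on one line with the serial as the third field.

```bash
# Hypothetical helper: extract the serial from one `dig +short <zone> SOA` line,
# e.g. "ns1.example. admin.example. 2026041901 900 300 604800 900".
soa_serial() {
  printf '%s\n' "$1" | awk '{print $3}'
}

# Usage from a host that can reach both resolvers:
#   primary=$(soa_serial "$(dig +short @10.0.20.201 viktorbarzin.lan SOA)")
#   unbound=$(soa_serial "$(dig +short @10.0.20.1 viktorbarzin.lan SOA)")
#   [ "$primary" = "$unbound" ] && echo "auth-zone in sync" || echo "serial mismatch"
```

A mismatch that persists past the zone's refresh interval would suggest that NOTIFY or the transfer ACL is not working.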
## Technitium-side ACL
Zone `viktorbarzin.lan` on Technitium has `zoneTransfer = UseSpecifiedNetworkACL`
with ACL entries:
- `10.0.20.1` (pfSense OPT1)
- `10.0.10.1` (pfSense LAN)
- `192.168.1.2` (pfSense WAN)
Verify via the Technitium API:
```
curl -sk "http://127.0.0.1:5380/api/zones/options/get?token=$TOK&zone=viktorbarzin.lan" | jq .response.zoneTransfer
```
## Operational Checks
```bash
# Is Unbound listening?
ssh admin@10.0.20.1 "sockstat -l -4 -p 53"
# Auth-zone loaded?
ssh admin@10.0.20.1 "unbound-control -c /var/unbound/unbound.conf list_auth_zones"
# Expected: viktorbarzin.lan. serial NNNNN
# LAN record via auth-zone? (aa flag = authoritative / from auth-zone)
dig @192.168.1.2 idrac.viktorbarzin.lan +norec
# Public record via DoT? (ad flag = DNSSEC validated, via 1.1.1.1/1.0.0.1)
dig @192.168.1.2 example.com +dnssec
# Zonefile has all records?
ssh admin@10.0.20.1 "wc -l /var/unbound/viktorbarzin.lan.zone"
```
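The `aa` flag check above can be scripted instead of read by eye. A minimal sketch, assuming the usual `;; flags: qr aa rd; QUERY: ...` header line in full `dig` output; `has_aa_flag` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if dig output on stdin has the aa flag set
# in its ";; flags:" header line.
has_aa_flag() {
  grep -E '^;; flags:' | grep -Eq ' aa[ ;]'
}

# Usage from a LAN client (needs full output, so no +short):
#   dig @192.168.1.2 idrac.viktorbarzin.lan +norec | has_aa_flag \
#     && echo "answered from the auth-zone"
```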
## K8s Outage Drill
Tests that `.lan` resolution survives a full Technitium outage:
```bash
# Scale Technitium primary to 0
kubectl -n technitium scale deploy/technitium --replicas=0
# Wait ~5 seconds, then test from a LAN client
ssh devvm.viktorbarzin.lan "dig @192.168.1.2 idrac.viktorbarzin.lan +short"
# Expected: 192.168.1.4 (served from Unbound's cached auth-zone)
# Restore immediately
kubectl -n technitium scale deploy/technitium --replicas=1
```
Completed successfully on 2026-04-19 during the initial deployment.
Note: secondary/tertiary Technitium pods remain up and continue to serve
queries via the `10.0.20.201` LoadBalancer even when the primary is down —
so the strongest proof that Unbound's auth-zone serves locally is to also
scale those down (optional, not part of the routine drill).
## Backup & Rollback
### Backups
- **On-box**: `/cf/conf/config.xml.2026-04-19-pre-unbound` (created before this
workstream ran — keep for 30 days, then delete)
- **Daily**: PVE `daily-backup` script copies `/cf/conf/config.xml` and a full
pfSense config tar to `/mnt/backup/pfsense/` on the Proxmox host at 05:00
- **Offsite**: Synology `pve-backup/pfsense/` (synced daily by
`offsite-sync-backup`)
### Rollback to dnsmasq
If Unbound misbehaves, revert to dnsmasq + NAT rdr:
```bash
# On pfSense
cp /cf/conf/config.xml.2026-04-19-pre-unbound /cf/conf/config.xml
# Tell pfSense to re-read config and reload services
php -r 'require_once("config.inc"); require_once("config.lib.inc"); disable_path_cache();'
/etc/rc.restart_webgui # reloads PHP config caches
# Restart services
php -r 'require_once("config.inc"); require_once("services.inc"); services_dnsmasq_configure(); services_unbound_configure(); filter_configure();'
/etc/rc.filter_configure # re-applies NAT rules (brings back rdr)
```
Verify:
```bash
sockstat -l -4 -p 53 | grep dnsmasq # expect dnsmasq on 10.0.10.1 and 10.0.20.1
pfctl -sn | grep '53' # expect rdr on wan UDP 53 → 10.0.20.201
```
### Rollback without wiping new changes
If you only want to stop Unbound without restoring the whole config, edit
config.xml and remove `<enable/>` from `<unbound>` + add it back to `<dnsmasq>`,
then re-run `services_unbound_configure()` + `services_dnsmasq_configure()`.
You also need to re-add the WAN NAT rdr in `<nat><rule>` (see the backup XML
for the exact shape — tracker `1775670025`).
## Known Gotchas
1. **pfSense regenerates `/var/unbound/unbound.conf`** on every service reload
from `<unbound>` in `config.xml`. Edits to unbound.conf are NOT durable.
2. **`unbound-control` default config path is wrong**. Always use
`unbound-control -c /var/unbound/unbound.conf <cmd>`.
3. **`custom_options` is base64-encoded** in config.xml. Use `base64 -d` to
decode in a shell, or `base64_decode()` in PHP.
4. **`interface-automatic: yes` is NOT used** when `active_interface` is
explicitly set to a list — pfSense emits explicit `interface: <ip>` lines.
5. **`auth-zone`'s `zonefile` path is relative to the Unbound chroot**
(`/var/unbound`), NOT absolute. Using an absolute path silently fails.
6. **DoT forwarders need `forward_tls_upstream`** flag AND `dns1host` /
`dns2host` set in `<system>` for SNI — without the hostname, pfSense emits
`forward-addr: 1.1.1.1@853` (no `#`) which Cloudflare rejects with
certificate hostname mismatch.
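Gotcha 3 can be illustrated with a small decoder. This is a sketch under assumptions, not a pfSense-provided tool: `decode_custom_options` is a hypothetical helper, and it assumes the `<custom_options>` element inside `<unbound>` sits on a single line of `config.xml`.

```bash
# Hypothetical helper: print the decoded Unbound custom_options from a pfSense
# config file. Assumes the element is on one line inside <unbound>...</unbound>.
decode_custom_options() {
  local cfg="${1:-/cf/conf/config.xml}"
  # Restrict the match to the <unbound> section so custom_options elements
  # belonging to other services are ignored.
  sed -n '/<unbound>/,/<\/unbound>/s:.*<custom_options>\(.*\)</custom_options>.*:\1:p' "$cfg" \
    | base64 -d
}

# Usage on pfSense:
#   decode_custom_options                 # reads /cf/conf/config.xml
#   decode_custom_options /tmp/copy.xml   # or a copied backup
```

Reading from a copy of the config is the safer habit, since the helper is then guaranteed not to touch the live file.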
## Kea DHCP-DDNS TSIG (WS E, 2026-04-19)
Kea DHCP-DDNS on pfSense signs its RFC 2136 dynamic updates with an
HMAC-SHA256 TSIG key (`kea-ddns`). Technitium's `viktorbarzin.lan` zone
and reverse zones (`10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`,
`1.168.192.in-addr.arpa`) require both a pfSense-source IP (10.0.20.1 /
10.0.10.1 / 192.168.1.2) AND a valid TSIG signature.
### Config locations
| Side | File | Notes |
|------|------|-------|
| pfSense | `/usr/local/etc/kea/kea-dhcp-ddns.conf` | Hand-managed. Pre-WS-E backup: `.2026-04-19-pre-tsig`. Daemon: `kea-dhcp-ddns` (`pkill -x kea-dhcp-ddns && /usr/local/sbin/kea-dhcp-ddns -c /usr/local/etc/kea/kea-dhcp-ddns.conf -d &`) |
| Technitium | Zone options API: `POST /api/zones/options/set?zone=<z>&updateSecurityPolicies=kea-ddns\|*.<z>\|ANY&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&update=UseSpecifiedNetworkACL` | Set on primary; replicates to secondary/tertiary via AXFR |
| Technitium settings | TSIG keys array: `POST /api/settings/set` with `tsigKeys: [{keyName: "kea-ddns", sharedSecret: <b64>, algorithmName: "hmac-sha256"}]` | Must be set on all 3 Technitium instances (primary, secondary, tertiary) |
| Vault | `secret/viktor/kea_ddns_tsig_secret` | Authoritative copy of the base64 secret |
### Rotating the TSIG key
1. Generate a new 32-byte secret, base64-encoded: `openssl rand -base64 32` (any base64-encoded blob of reasonable length works; HMAC-SHA256 zero-pads short keys and hashes keys longer than its 64-byte block).
2. Write it to Vault: `vault kv patch secret/viktor kea_ddns_tsig_secret=<new-secret>`.
3. Add the new key under a **new name** (e.g., `kea-ddns-v2`) via the Technitium settings API on all 3 instances. Do NOT overwrite `kea-ddns` while Kea still uses it — you'd orphan in-flight updates.
4. Update `/usr/local/etc/kea/kea-dhcp-ddns.conf` on pfSense to reference both keys in `tsig-keys`, set `key-name: kea-ddns-v2` on each `forward-ddns` / `reverse-ddns` domain, restart `kea-dhcp-ddns`.
5. Update each affected zone's `updateSecurityPolicies` to use the new key name.
6. After a lease-renewal cycle (default Kea lease = 7200s / 2h), verify with `kubectl -n technitium exec <primary-pod> -- grep "TSIG KeyName: kea-ddns-v2" /etc/dns/logs/<today>.log`.
7. Remove the old `kea-ddns` key from Technitium settings + Kea config.
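Before writing a new secret to Vault in step 2, its decoded length can be sanity-checked. A minimal sketch, assuming a `base64` binary that accepts `-d`; `secret_bytes` is a hypothetical helper name.

```bash
# Hypothetical helper: number of raw bytes encoded by a base64 string.
secret_bytes() {
  printf '%s' "$1" | base64 -d | wc -c | tr -d ' '
}

# Usage during rotation step 1:
#   NEW=$(openssl rand -base64 32)
#   [ "$(secret_bytes "$NEW")" -eq 32 ] && echo "32-byte secret generated"
```

This catches copy-paste truncation early, before the same value has to be set on Kea and on all three Technitium instances.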
### Emergency TSIG bypass (if rotation breaks DDNS)
If DDNS updates are failing and you cannot quickly fix the key, temporarily
downgrade the zone policy to IP-ACL only (pfSense source IPs) without
TSIG:
```bash
kubectl -n technitium port-forward pod/<primary-pod> 5380:5380 &
TOKEN=$(curl -s -X POST http://127.0.0.1:5380/api/user/login \
-d "user=admin&pass=$(vault kv get -field=technitium_password secret/platform)&includeInfo=false" | jq -r .token)
for Z in viktorbarzin.lan 10.0.10.in-addr.arpa 20.0.10.in-addr.arpa 1.168.192.in-addr.arpa; do
curl -s -X POST "http://127.0.0.1:5380/api/zones/options/set?token=$TOKEN&zone=$Z&update=UseSpecifiedNetworkACL&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&updateSecurityPolicies="
done
```
This clears `updateSecurityPolicies` while keeping the IP ACL. Updates
now flow unsigned from pfSense IPs — **weaker** than TSIG but restores
service. Re-enable TSIG as soon as the key issue is resolved.
### Verify TSIG is enforced
```bash
# Unsigned update should fail
nsupdate <<EOF
server 10.0.20.201 53
zone viktorbarzin.lan
update delete tsig-test.viktorbarzin.lan.
update add tsig-test.viktorbarzin.lan. 300 A 10.99.99.99
send
EOF
# Expected: "update failed: REFUSED"
# Signed update should succeed
cat > /tmp/kea-ddns.key <<EOF
key "kea-ddns" {
algorithm hmac-sha256;
secret "$(vault kv get -field=kea_ddns_tsig_secret secret/viktor)";
};
EOF
nsupdate -k /tmp/kea-ddns.key <<EOF
server 10.0.20.201 53
zone viktorbarzin.lan
update delete tsig-test.viktorbarzin.lan.
update add tsig-test.viktorbarzin.lan. 300 A 10.99.99.99
send
EOF
dig @10.0.20.201 +short tsig-test.viktorbarzin.lan
# Expected: 10.99.99.99
rm -f /tmp/kea-ddns.key
```
## Related Docs
- `docs/architecture/dns.md` — overall DNS architecture (K8s side, Technitium, CoreDNS)
- `docs/architecture/networking.md` — VLAN layout, pfSense interface mapping
- `.claude/skills/pfsense/skill.md` — SSH / CLI patterns for pfSense management


@ -0,0 +1,103 @@
# Runbook: Proxmox host (pve, 192.168.1.127)
Last updated: 2026-04-19
The Proxmox host is a baremetal hypervisor on the storage LAN
(192.168.1.0/24) with a single IP `192.168.1.127`. It hosts every
Kubernetes node VM and the NFS exports that back PVCs. It does **not**
receive DHCP — its network config is static in
`/etc/network/interfaces` (ifupdown). Because of that, DNS must be
configured manually and stays out of the scope of Kea/DHCP-DDNS.
## DNS configuration
The host uses a plain `/etc/resolv.conf` with two nameservers. No
`systemd-resolved`, no `resolvconf`, no NetworkManager — nothing
manages `/etc/resolv.conf`; it is a regular file owned by root.
### Why plain `/etc/resolv.conf` and not systemd-resolved
1. Installing `systemd-resolved` on an active Proxmox node during
business hours is the kind of change that risks breaking the NFS
server or VM networking. PVE's Debian base does not ship
`systemd-resolved` by default.
2. The ifupdown `/etc/network/interfaces` file does not manage
`/etc/resolv.conf` here — ifupdown's resolvconf integration is
only active if the `resolvconf` package is installed, which it is
not (`dpkg -l resolvconf` returns `un`).
3. A plain file is the simplest mental model and avoids a second
layer of "which tool is running now" confusion during an incident.
If you ever want to migrate to `systemd-resolved`, install the
package, enable the service, symlink `/etc/resolv.conf` to
`/run/systemd/resolve/stub-resolv.conf`, and drop the config in
`/etc/systemd/resolved.conf.d/10-internal-dns.conf` — but do this
during a maintenance window, not reactively.
### Current state
```
# /etc/resolv.conf
search viktorbarzin.lan
nameserver 192.168.1.2
nameserver 94.140.14.14
options timeout:2 attempts:2
```
| Field | Value | Purpose |
|---|---|---|
| Primary | `192.168.1.2` | pfSense WAN interface on the 192.168.1.0/24 LAN (Unbound resolver as of 2026-04-19); resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS; recursive only, used if `192.168.1.2` is unreachable |
| `search` | `viktorbarzin.lan` | Unqualified names (`technitium`, `idrac`, etc.) resolve against the internal zone |
| `timeout:2 attempts:2` | — | Cap glibc resolver at 2s per server, 2 tries — reasonable fallback latency |
### Verification commands
```sh
ssh root@192.168.1.127 '
cat /etc/resolv.conf # should show the two nameservers
dig +short idrac.viktorbarzin.lan # expect an A record (192.168.1.4)
dig +short github.com # expect an A record
'
```
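The `cat /etc/resolv.conf` check can also be turned into a drift test. The sketch below is illustrative only; `nameservers_of` is a hypothetical helper, and the expected list is taken from the "Current state" block above.

```bash
# Hypothetical helper: print nameserver IPs from a resolv.conf-style file, in order.
nameservers_of() {
  awk '$1 == "nameserver" {print $2}' "${1:-/etc/resolv.conf}"
}

# Usage on the Proxmox host:
#   [ "$(nameservers_of | paste -sd, -)" = "192.168.1.2,94.140.14.14" ] \
#     || echo "resolv.conf drifted from the documented state"
```

Order matters here because the resolver tries servers in file order, so the comparison is done on the ordered list rather than as a set.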
Simulated failover — force the primary unreachable and verify the
fallback answers:
```sh
ssh root@192.168.1.127 '
ip route add blackhole 192.168.1.2
dig +short +time=3 github.com # dig gives up on the primary and retries against 94.140.14.14 → A record returned
ip route del blackhole 192.168.1.2 # cleanup
'
```
Expected behaviour: `dig` prints a warning about the UDP setup failing
for 192.168.1.2 and then prints the GitHub A record (answered by
94.140.14.14).
## Rollback
A pre-change backup of `/etc/resolv.conf`, `/etc/network/interfaces`,
and `/etc/network/interfaces.d/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
host. To roll back:
```sh
ssh root@192.168.1.127 '
# pick the backup you want (there may be multiple if this runbook has been applied more than once)
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
tar -xzf "$BACKUP" -C /
cat /etc/resolv.conf
'
```
No service restart is needed — glibc re-reads `/etc/resolv.conf` per
lookup.
## Related docs
- `docs/architecture/dns.md` — where each resolver IP lives and which
subnet it serves.
- `docs/runbooks/nfs-prerequisites.md` — other operations on this
host; read before adding new NFS exports.


@ -0,0 +1,188 @@
# RAM Upgrade — Dell R730 Proxmox Host (Completed 2026-04-01)
**Host**: Dell R730 @ 192.168.1.127 (Proxmox)
**CPU**: Single Xeon E5-2699 v4 (CPU2 unpopulated — B-side slots unavailable)
**Before**: 144 GB (4x32G Samsung BB1 + 2x8G SK Hynix) @ 2400 MHz
**After**: 272 GB (4x32G Samsung BB1 + 4x32G Samsung CB1 + 2x8G SK Hynix) @ 2400 MHz
## Lessons Learned
1. **3 DPC downclock**: Adding DIMMs to the 3rd slot per channel (A11/A12) caused automatic downclocking to 1866 MHz. Dell R730 BIOS allows manual override back to 2400 MHz via **System BIOS > Memory Settings > Memory Frequency > Max Performance**.
2. **MySQL InnoDB Cluster CR recreation**: Deleting and recreating the InnoDBCluster CR generates new admin secrets that don't match the existing data on PVCs. Fix: manually create the new admin user in MySQL and configure GR recovery channel credentials.
3. **CNPG primary label**: After restarting the CNPG operator, it may not immediately label the primary pod with `role=primary`. Deleting the pod forces the operator to recreate it with the correct labels.
4. **LimitRange blocks MySQL**: The `dbaas` namespace LimitRange (4Gi max) blocks MySQL pods that need 5Gi. Kyverno policy resets LimitRange patches. Fix: reduce MySQL memory limit in CR to 4Gi.
## Physical DIMM Slot Map (looking down at motherboard, front of server at bottom)
```
╔══════════════════════════════════════════════════════════════════════════════╗
║ CPU1 DIMM SLOTS ║
║ ║
║ ┌─── WHITE (1st per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A1 │ │ A2 │ │ A3 │ │ A4 │ ║
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ ◄── KEEP (existing Samsung 32G) ║
║ │ │██████│ │██████│ │██████│ │██████│ ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ ┌─── BLACK (2nd per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A5 │ │ A6 │ │ A7 │ │ A8 │ ║
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ ◄── INSTALL NEW 32G Samsung ║
║ │ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ (remove old 8G from A5/A6) ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ ┌─── GREEN (3rd per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A9 │ │ A10 │ │ A11 │ │ A12 │ ║
║ │ │ │ │ │ │ 8G │ │ 8G │ ◄── MOVE old 8G Hynix here ║
║ │ │ empty│ │ empty│ │░░░░░░│ │░░░░░░│ (from A5 → A11, A6 → A12) ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ Legend: ██ = existing 32G (keep in place) ║
║ ▓▓ = NEW 32G Samsung M393A4K40BB1-CRC (install) ║
║ ░░ = relocated 8G SK Hynix HMA81GR7AFR8N-UH (moved from A5/A6) ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
```
## Channel Summary After Install
```
Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched
Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched
Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus
Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus
───────── ───────── ──────────
WHITE BLACK GREEN TOTAL: 272 GB
(keep) (new 32G) (moved 8G)
```
**Performance**: ~1-2% bandwidth penalty on Ch2/Ch3 due to mixed DIMM sizes. Ch0/Ch1 fully matched.
## Shutdown Sequence
### Phase 0: Gracefully Stop Stateful Services
Scale down databases, caches, and secrets engines before draining nodes to ensure clean shutdown with no data loss.
```bash
export KUBECONFIG=/path/to/config
# 1. Vault — seal all instances (flushes WAL, closes connections)
kubectl -n vault exec vault-0 -- vault operator step-down 2>/dev/null
kubectl -n vault exec vault-0 -- vault operator seal
kubectl -n vault exec vault-1 -- vault operator seal
kubectl -n vault exec vault-2 -- vault operator seal
# 2. MySQL InnoDB Cluster — set super_read_only, scale router to 0
kubectl -n dbaas scale deploy mysql-cluster-router --replicas=0
kubectl -n dbaas exec mysql-cluster-0 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
kubectl -n dbaas exec mysql-cluster-1 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
kubectl -n dbaas exec mysql-cluster-2 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
# innodb_fast_shutdown=0 forces full purge + change buffer merge on stop
# 3. PostgreSQL CNPG — trigger checkpoint on primaries
kubectl -n dbaas exec pg-cluster-2 -- psql -U postgres -c "CHECKPOINT;"
kubectl -n dbaas exec pg-cluster-4 -- psql -U postgres -c "CHECKPOINT;"
kubectl -n immich exec deploy/immich-postgresql -- psql -U postgres -c "CHECKPOINT;"
# 4. Redis — trigger BGSAVE then scale down
kubectl -n redis exec redis-node-0 -- redis-cli BGSAVE
kubectl -n redis exec redis-node-1 -- redis-cli BGSAVE
sleep 5 # wait for RDB flush
kubectl -n redis scale deploy redis-haproxy --replicas=0
# 5. ClickHouse — flush
kubectl -n rybbit exec deploy/clickhouse -- clickhouse-client --query "SYSTEM FLUSH LOGS"
# 6. Scale down stateful workloads
kubectl -n dbaas scale sts mysql-cluster --replicas=0
kubectl -n redis scale sts redis-node --replicas=0
kubectl -n vault scale sts vault --replicas=0
# 7. Verify all stateful pods terminated
kubectl get pods -A | grep -iE 'mysql-cluster-[0-9]|pg-cluster|redis-node|vault-[0-9]|clickhouse'
```
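Step 7 is a one-shot check; termination can take a while, so a polling loop may be more convenient. The following is a sketch, not the procedure that was actually run: `wait_until_empty` and `list_scaled_pods` are hypothetical helpers, and the pod pattern is narrowed to the three StatefulSets scaled down in step 6.

```bash
# Hypothetical helper: poll a command once per second until it prints nothing.
# $1 = timeout in seconds; remaining args = command (or function) to run.
# Caveat: a failing command also prints nothing, so confirm that kubectl can
# reach the cluster before trusting a "success" result.
wait_until_empty() {
  local timeout="$1"; shift
  local waited=0
  while [ -n "$("$@" 2>/dev/null)" ]; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Hypothetical wrapper: pods of the StatefulSets scaled to 0 in step 6.
list_scaled_pods() {
  kubectl get pods -A --no-headers \
    | grep -E 'mysql-cluster-[0-9]|redis-node-[0-9]|vault-[0-9]'
}

# Usage: wait_until_empty 300 list_scaled_pods || echo "pods still terminating"
```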
### Phase 1: Drain K8s Nodes
```bash
# Drain workers (reverse order)
kubectl drain k8s-node4 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node3 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node2 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node1 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
# Cordon master
kubectl cordon k8s-master
```
### Phase 2: Shutdown VMs (via Proxmox)
```bash
ssh root@192.168.1.127
# K8s workers
for VMID in 201 202 203 204; do qm shutdown $VMID && echo "Shutdown VMID $VMID"; done
sleep 30
# K8s master
qm shutdown 200; sleep 15
# Docker registry
qm shutdown 220; sleep 10
# Secondary VMs
for VMID in 102 300 103; do qm shutdown $VMID; done
sleep 20
# TrueNAS (decommissioned 2026-04-13 — VM 9000 should already be stopped; skip if absent)
qm shutdown 9000 2>/dev/null || true
# pfSense (last — network gateway)
qm shutdown 101; sleep 15
# Verify all VMs stopped
qm list
```
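The final `qm list` check can be filtered down to the VMs that have not stopped yet. A minimal sketch, assuming the `qm list` column order `VMID NAME STATUS ...` with one header line; `not_stopped_vms` is a hypothetical helper.

```bash
# Hypothetical helper: read `qm list` output on stdin and print the VMIDs whose
# STATUS column is anything other than "stopped". Skips the header line.
not_stopped_vms() {
  awk 'NR > 1 && $3 != "stopped" {print $1}'
}

# Usage on the Proxmox host, before Phase 3:
#   qm list | not_stopped_vms    # no output means every VM is stopped
```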
### Phase 3: Shutdown Proxmox Host
```bash
shutdown -h now
```
## Physical RAM Installation
| Step | Action | Slot(s) | DIMM |
|------|--------|---------|------|
| 1 | Power off host | — | Completed via Phase 3 above |
| 2 | **Remove** | A5 (black clip) | Take out 8G Hynix, set aside |
| 3 | **Remove** | A6 (black clip) | Take out 8G Hynix, set aside |
| 4 | **Install NEW** | A5 (black clip) | Insert 32G Samsung |
| 5 | **Install NEW** | A6 (black clip) | Insert 32G Samsung |
| 6 | **Install NEW** | A7 (black clip) | Insert 32G Samsung |
| 7 | **Install NEW** | A8 (black clip) | Insert 32G Samsung |
| 8 | **Install MOVED** | A11 (green clip) | Insert 8G Hynix (was in A5) |
| 9 | **Install MOVED** | A12 (green clip) | Insert 8G Hynix (was in A6) |
| 10 | Power on | — | — |
## Post-Boot Verification
```bash
# Verify all 10 DIMMs detected
ssh root@192.168.1.127 'dmidecode -t memory | grep -E "Locator:|Size:" | grep -v Bank'
# Verify total RAM (~268 GiB usable)
ssh root@192.168.1.127 'free -h'
```
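The DIMM count can be extracted rather than counted by eye. This sketch assumes that `dmidecode -t memory` reports populated slots as `Size: <n> GB` (or `MB`) and empty ones as `Size: No Module Installed`; `populated_dimms` is a hypothetical helper.

```bash
# Hypothetical helper: count populated DIMM slots in `dmidecode -t memory`
# output read from stdin. The anchored pattern skips "Volatile Size:" and
# similar lines that newer dmidecode versions also print.
populated_dimms() {
  grep -cE '^[[:space:]]*Size: [0-9]+ [MGT]B'
}

# Usage from a workstation:
#   ssh root@192.168.1.127 'dmidecode -t memory' | populated_dimms
#   # this upgrade expects 10 populated slots
```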


@ -0,0 +1,170 @@
# Runbook: Rebuild an Image After a Registry Orphan-Index Incident
Last updated: 2026-04-19
## When to use this
Pipelines that pull from `registry.viktorbarzin.me:5050` are failing with
messages like:
- `failed to resolve reference … : not found`
- `manifest unknown`
- `image can't be pulled` (Woodpecker exit 126)
- `error pulling image`: HEAD on a child manifest digest returns 404
…and `skopeo inspect --tls-verify --creds "$USER:$PASS" docker://registry.viktorbarzin.me:5050/<image>:<tag>`
returns an OCI image index whose `manifests[].digest` references are 404
on the registry.
This is the **orphan OCI-index** failure mode documented in
`docs/post-mortems/2026-04-19-registry-orphan-index.md`. The fix is to
rebuild the affected image from source so the registry receives a fresh,
complete push.
If the symptom is different (e.g., registry container down, TLS expiry,
auth failure), use `docs/runbooks/registry-vm.md` instead.
## Phase 1 — Confirm the diagnosis
From any host with `skopeo`:
```sh
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest
# 1. Confirm the index exists.
skopeo inspect --tls-verify --creds "$USER:$PASS" \
--raw "docker://$REG/$IMAGE:$TAG" | jq '.mediaType, .manifests[].digest'
# 2. HEAD each child. Any non-200 = confirmed orphan.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
"docker://$REG/$IMAGE:$TAG" | jq -r '.manifests[].digest'); do
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/$IMAGE/manifests/$d")
echo "$d → $code"
done
```
If every child is 200, the problem is elsewhere — stop here and check
the registry VM, TLS, or auth.
The `registry-integrity-probe` CronJob in the `monitoring` namespace
runs this same check every 15 minutes across every tag in the catalog;
its last run is also a fast way to see which image(s) are affected:
```sh
kubectl -n monitoring logs \
$(kubectl -n monitoring get pods -l job-name -o name \
| grep registry-integrity-probe | head -1)
```
## Phase 2 — Rebuild
### Option A (preferred): rebuild via CI
Find the `build-*.yml` pipeline that produces the image:
| Image | Pipeline | Repo ID |
|---|---|---|
| `infra-ci` | `.woodpecker/build-ci-image.yml` | 1 (infra) |
| `infra` (cli) | `.woodpecker/build-cli.yml` | 1 (infra) |
| `k8s-portal` | `.woodpecker/k8s-portal.yml` | 1 (infra) |
Trigger a manual build. The Woodpecker API expects a numeric repo ID
(paths with `owner/name` return HTML):
```sh
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_admin_token secret/viktor)
# Kick off a manual build against master.
curl -s -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
-H "Content-Type: application/json" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines" \
-d '{"branch":"master"}' | jq .number
# Follow the pipeline at https://ci.viktorbarzin.me/repos/1/pipeline/<number>
```
The pipeline's `verify-integrity` step walks every blob the push
references. If it passes, the registry now has a clean index; pull
consumers will recover on next attempt.
### Option B (fallback): build on the registry VM
Only use this if Woodpecker itself is broken (its own pipeline runs
from the same `infra-ci` image, so a corrupted `infra-ci:latest` can
prevent Option A from recovering).
```sh
ssh root@10.0.20.10 '
cd /tmp
git clone --depth 1 https://github.com/ViktorBarzin/infra
cd infra/ci
docker build -t registry.viktorbarzin.me:5050/infra-ci:manual -t registry.viktorbarzin.me:5050/infra-ci:latest .
# $USER and $PASS expand on the VM, not locally: set them there first, or log in interactively.
docker login -u "$USER" -p "$PASS" registry.viktorbarzin.me:5050
docker push registry.viktorbarzin.me:5050/infra-ci:manual
docker push registry.viktorbarzin.me:5050/infra-ci:latest
'
```
Then re-run any pipelines that failed — Woodpecker UI → Restart, or:
```sh
curl -s -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines/<failed-pipeline-number>"
```
## Phase 3 — Verify
```sh
# 1. Pull the image fresh (bypassing containerd cache) and check its index.
REG=registry.viktorbarzin.me:5050
skopeo inspect --tls-verify --creds "$USER:$PASS" \
--raw "docker://$REG/infra-ci:latest" \
| jq '.manifests[] | {digest, platform}'
# 2. HEAD every child digest; all should be 200.
broken=0
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
  "docker://$REG/infra-ci:latest" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -I "https://$REG/v2/infra-ci/manifests/$d")
  [ "$code" = "200" ] || { echo "STILL BROKEN: $d → $code"; broken=1; }
done
# Only report success when no child digest failed.
[ "$broken" -eq 0 ] && echo "verified"
# 3. Trigger an on-demand probe run for good measure.
# Compute the job name once: calling $(date +%s) twice can yield two different names.
JOB="registry-integrity-probe-verify-$(date +%s)"
kubectl -n monitoring create job --from=cronjob/registry-integrity-probe "$JOB"
kubectl -n monitoring logs -f -l job-name="$JOB"
```
The `RegistryManifestIntegrityFailure` alert clears automatically when
the probe's next run returns zero failures.
## Phase 4 — Investigate orphans
Once the immediate fix is in, check whether any OTHER images on the
registry have orphan children:
```sh
ssh root@10.0.20.10 'python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | grep "ORPHAN INDEX"'
```
Each hit is a separate image that will eventually fail to pull. Rebuild
them in the same way (Option A preferred). If the list is long, open a
beads task — do NOT batch-delete the indexes; that's a destructive
registry operation outside this runbook's scope.
## Related
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — why this
happens.
- `docs/runbooks/registry-vm.md` — VM-level operations (DNS,
`docker compose` restarts).
- `modules/docker-registry/fix-broken-blobs.sh` — the scanner cron
itself, runs nightly and after each GC.
- `stacks/monitoring/modules/monitoring/main.tf`: the
`registry_integrity_probe` CronJob definition.


@ -0,0 +1,227 @@
# Runbook: Registry VM (docker-registry, 10.0.20.10)
Last updated: 2026-05-07
The registry VM is an Ubuntu 24.04 VM on the cluster LAN subnet
`10.0.20.0/24`, with a static netplan config (no DHCP). Because it
sits on a subnet that only has pfSense as its gateway, its DNS must
be statically configured.
**As of Phase 4 of forgejo-registry-consolidation (2026-05-07)** the VM
no longer hosts the private R/W registry. It hosts pull-through
caches only:
| Port | Upstream |
|---|---|
| 5000 | docker.io (Docker Hub) — auth via dockerhub_registry_password |
| 5010 | ghcr.io |
| 5020 | quay.io |
| 5030 | registry.k8s.io |
| 5040 | reg.kyverno.io |
The private registry formerly served on port 5050 has been decommissioned;
its images are now hosted on Forgejo at `forgejo.viktorbarzin.me/viktor/<image>`. See
`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md` for the
migration. Break-glass tarballs of `infra-ci` are still produced on
each build to `/opt/registry/data/private/_breakglass/` — see
`docs/runbooks/forgejo-registry-breakglass.md`.
## DNS configuration
Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
`nameservers`. Netplan writes systemd-networkd or NetworkManager
configs that resolved reads at runtime. There is **no automatic
merging** of netplan DNS with the `[Resolve]` section of
`/etc/systemd/resolved.conf` — per-link settings override the global
ones. So both layers must be in sync:
| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status` |
### Current state
`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:
```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```
`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):
```yaml
nameservers:
  addresses:
    - 10.0.20.1
    - 94.140.14.14
  search:
    - viktorbarzin.lan
```
`resolvectl status` output after the change:
```
Global
resolv.conf mode: stub
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1
Fallback DNS Servers: 94.140.14.14
DNS Domain: viktorbarzin.lan
Link 2 (eth0)
Current Scopes: DNS
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1 94.140.14.14
DNS Domain: viktorbarzin.lan
```
| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (Unbound resolver as of 2026-04-19); resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS — used if pfSense unreachable (e.g., OPT1 flap) |
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |
### Why this matters for the registry
Container builds on this VM reference `.lan` hostnames (Technitium,
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
hardening the netplan had `1.1.1.1` / `8.8.8.8` only, which meant:
1. Internal hostname lookups silently failed (slow timeout) — the
VM could not resolve `idrac.viktorbarzin.lan` or any internal
helper.
2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS
entirely.
With the new config the VM can resolve both zones and keeps working
if the primary DNS server is unreachable.
## Apply / re-apply
```sh
ssh root@10.0.20.10 '
netplan generate
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -20
'
```
`netplan apply` is not disruptive when only `nameservers` change — it
does not bounce the link.
## Verification
```sh
ssh root@10.0.20.10 '
dig +short idrac.viktorbarzin.lan # 192.168.1.4
dig +short github.com # GitHub A record
dig +short registry.viktorbarzin.me # 10.0.20.10 + external A
'
```
Fallback test — blackhole the primary and confirm external lookups
still succeed through 94.140.14.14:
```sh
ssh root@10.0.20.10 '
ip route add blackhole 10.0.20.1
dig +short +time=5 +tries=2 github.com # should still answer
ip route del blackhole 10.0.20.1
'
```
Internal lookups do fail during the blackhole (the fallback is a
public resolver and does not know about the internal zone), which is
expected — the fallback buys availability for external pulls, not
internal hostnames.
## Rollback
A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
and `/etc/netplan/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
VM. To roll back:
```sh
ssh root@10.0.20.10 '
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
tar -xzf "$BACKUP" -C /
rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -10
'
```
## Auto-sync pipeline
Changes to `modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh,
cleanup-tags.sh, nginx_registry.conf, config-private.yml}` deploy
automatically via `.woodpecker/registry-config-sync.yml`:
- Fires on `push` to master touching any of those paths, or via `manual`
event (Woodpecker UI / API).
- SCPs every managed file to `/opt/registry/` on `10.0.20.10`.
- Bounces containers + nginx when a compose-visible file changed; leaves
them alone when only scripts changed (cron picks up automatically).
- Runs a dry-run `fix-broken-blobs.sh` at the end to verify the registry
is still coherent.
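The bounce decision described in the list above can be sketched as a filter over the changed file names. This is a guess at the shape of the logic, not a copy of the pipeline: `needs_bounce` is a hypothetical name, and treating `docker-compose.yml`, `nginx_registry.conf`, and `config-private.yml` as the "compose-visible" set is an assumption.

```sh
# Hypothetical sketch; the real logic lives in
# .woodpecker/registry-config-sync.yml and may differ.
# Reads changed file names on stdin, one per line. Exit 0 = bounce needed.
# Assumption: "compose-visible" means the compose file plus the configs it mounts.
needs_bounce() {
  grep -Eq '(^|/)(docker-compose\.yml|nginx_registry\.conf|config-private\.yml)$'
}

# Usage:
#   git diff --name-only HEAD~1 HEAD | needs_bounce && echo "bounce containers + nginx"
```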
SSH credentials: Woodpecker repo-secret `registry_ssh_key` (ed25519,
provisioned 2026-04-19). Public key at `/root/.ssh/authorized_keys` on
`10.0.20.10`. Private key mirrored at `secret/woodpecker/registry_ssh_key`
in Vault (subkeys `private_key` / `public_key` / `known_hosts_entry`).
Manual override if you need to sync right now:
```sh
curl -sf -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines" \
-d '{"branch":"master"}' | jq .number
```
## Bouncing registry containers — the nginx DNS trap
`docker compose up -d` on `/opt/registry/docker-compose.yml` recreates
`registry-*` containers when their image tag changes, which assigns them
new IPs on the `registry` bridge network. **`registry-nginx` resolves its
upstream DNS names (`registry-private`, `registry-dockerhub`, …) ONCE at
startup and caches the results** — it does not re-resolve after a
recreate.
Symptom if you forget: `/v2/_catalog` on `:5050` returns
`{"repositories": []}`, `/v2/` returns 200 without auth, pulls return
the wrong image. nginx is forwarding to a stale IP that now belongs to a
different registry-* backend (commonly the pull-through ghcr or
dockerhub cache, which have empty catalogs from the htpasswd-auth user's
perspective).
**Always follow a registry-* bounce with `docker restart registry-nginx`.**
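Or prevent the problem in `nginx_registry.conf`: add a `resolver`
directive (Docker's embedded DNS listens on `127.0.0.11`) and pass the
upstream name to `proxy_pass` through a variable, so nginx re-resolves
it at request time (subject to the resolver's cache TTL) instead of only
at startup.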
```sh
ssh root@10.0.20.10 '
cd /opt/registry && docker compose up -d
docker restart registry-nginx
sleep 3
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
| grep -E "registry-"
'
```
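The empty-catalog symptom can be detected from a script. A minimal sketch, assuming the private registry is expected to hold at least one repository; `catalog_is_empty`, `REGISTRY_URL`, `REG_USER`, and `REG_PASS` are hypothetical names introduced for illustration.

```sh
# Hypothetical helper: exit 0 when the JSON on stdin is an empty catalog,
# i.e. {"repositories": []} (whitespace ignored).
catalog_is_empty() {
  tr -d '[:space:]' | grep -q '"repositories":\[\]'
}

# Usage (REGISTRY_URL / REG_USER / REG_PASS are placeholders):
#   curl -s -u "$REG_USER:$REG_PASS" "$REGISTRY_URL/v2/_catalog" | catalog_is_empty \
#     && echo "empty catalog: nginx may be pointing at a stale upstream IP"
```

An empty catalog is only a hint; confirm by restarting `registry-nginx` and querying again.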
## Related docs
- `docs/architecture/dns.md` — resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
and `containerd` `hosts.toml` redirects.
- `docs/runbooks/registry-rebuild-image.md` — rebuild an image after an
orphan OCI-index incident (different class of problem than DNS).
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — root cause
+ detection gaps behind the recurring missing-blob incidents.


@ -0,0 +1,96 @@
# Restore etcd
## Prerequisites
- SSH access to `k8s-master` node
- etcd snapshot available on NFS at `/mnt/main/etcd-backup/`
- etcd PKI certs at `/etc/kubernetes/pki/etcd/` on master node
## Backup Location
- NFS: `/mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db`
- Replicated to Synology NAS (192.168.1.13) via Proxmox host offsite-sync-backup (inotify-driven rsync)
- Retention: 30 days
- Schedule: Daily at 00:00
## CRITICAL: etcd is the foundation of the cluster
Restoring etcd will reset the entire Kubernetes state to the snapshot time. All objects created after the snapshot will be lost. This is a last-resort operation.
**Only restore etcd if the control plane is completely broken.**
## Restore Procedure
### 1. SSH to the master node
```bash
ssh k8s-master
```
### 2. Identify the snapshot to restore
```bash
ls -lt /mnt/main/etcd-backup/etcd-snapshot-*.db | head -10
```
### 3. Stop the API server and etcd
```bash
# Move static pod manifests to stop them
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/
# Confirm both containers are gone (re-run until this prints nothing)
sudo crictl ps | grep -E "etcd|apiserver"
```
### 4. Back up current etcd data
```bash
sudo mv /var/lib/etcd /var/lib/etcd.bak.$(date +%Y%m%d-%H%M%S)
```
### 5. Restore the snapshot
```bash
sudo ETCDCTL_API=3 etcdctl snapshot restore /mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db \
--data-dir=/var/lib/etcd \
--name=k8s-master \
--initial-cluster=k8s-master=https://127.0.0.1:2380 \
--initial-advertise-peer-urls=https://127.0.0.1:2380
```
### 6. Fix permissions
```bash
sudo chown -R root:root /var/lib/etcd
```
### 7. Restart etcd and API server
```bash
sudo mv /etc/kubernetes/etcd.yaml /etc/kubernetes/manifests/
# Wait for etcd to be ready
sleep 30
sudo mv /etc/kubernetes/kube-apiserver.yaml /etc/kubernetes/manifests/
```
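The fixed `sleep 30` could be replaced by polling etcd until it reports healthy. The helper below is a sketch under that assumption; `retry` is a hypothetical name, and the retry count and delay in the usage comment are arbitrary.

```bash
# Hypothetical helper: run a command up to TRIES times, DELAY seconds apart,
# and succeed as soon as it exits 0. The ( ... ) body keeps variables local.
retry() (
  tries=$1
  delay=$2
  shift 2
  attempt=0
  while [ "$attempt" -lt "$tries" ]; do
    if "$@"; then
      exit 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  exit 1
)

# Usage on k8s-master, with the same etcdctl flags as the health check in step 8:
#   retry 30 2 sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
#     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
#     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key endpoint health
```

Move `kube-apiserver.yaml` back into the manifests directory only once the call succeeds.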
### 8. Verify restoration
```bash
# Check etcd health
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
endpoint health
# Check cluster status
kubectl get nodes
kubectl get pods -A | head -20
```
### 9. Reconcile state
After etcd restore, some objects may be stale:
```bash
# Re-apply critical infrastructure
cd /path/to/infra
scripts/tg apply stacks/platform
# Check for orphaned resources
kubectl get pods -A | grep -E "Terminating|Error|Unknown"
```
## Estimated Time
- Snapshot restore: ~10-15 minutes
- Full reconciliation: ~30-60 minutes (depends on drift)


@ -0,0 +1,173 @@
# Full Cluster Rebuild
Last updated: 2026-04-06
## When to Use
- Complete cluster failure (all VMs lost)
- etcd corruption requiring full rebuild
- Proxmox host failure requiring fresh VM provisioning
## Prerequisites
- Proxmox host (192.168.1.127) accessible, with NFS exports on `/srv/nfs` and `/srv/nfs-ssd`
- Synology NAS (192.168.1.13) accessible for offsite backup restore if the PVE host backup disk is also lost
- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
- Git repo with infra code
- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)
- Vault unseal keys (emergency kit)
## Rebuild Order
The rebuild must follow dependency order. Each layer depends on the one before it.
### Phase 1: Infrastructure (Proxmox VMs)
```bash
# 1. Provision VMs via Terraform
cd infra
scripts/tg apply stacks/infra
# 2. Wait for VMs to boot and be reachable
# k8s-master, k8s-node3, k8s-node4, k8s-node5
# (node1 has GPU workloads, node2 excluded from MySQL anti-affinity only — both are active cluster members)
```
### Phase 2: Kubernetes Control Plane
```bash
# 3. Initialize kubeadm on master (if starting fresh)
sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
# 4. Join worker nodes
# Get join command from master, run on each node
# 5. OR restore etcd from snapshot (see restore-etcd.md)
# This restores all K8s objects from the snapshot time
```
### Phase 3: Storage Layer
```bash
# 6. Deploy CSI drivers (NFS + Proxmox)
scripts/tg apply stacks/nfs-csi
scripts/tg apply stacks/proxmox-csi
# 7. Verify PVs are accessible
kubectl get pv
kubectl get pvc -A | grep -v Bound
```
### Phase 3.5: Restore PVC Data from sda Backup
After storage layer is deployed, restore PVC data from the sda backup disk:
```bash
# 8a. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 8b. For each critical PVC, restore files:
# Example: vaultwarden-data-proxmox
WEEK="2026-14" # Use most recent week
NAMESPACE="vaultwarden"
PVC_NAME="vaultwarden-data-proxmox"
# Find the PV LV name
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep $PVC_NAME
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
LV_NAME="vm-999-pvc-abc123"
# Mount the LV
lvchange -ay pve/$LV_NAME
mkdir -p /mnt/restore-temp
mount /dev/pve/$LV_NAME /mnt/restore-temp
# Restore from backup
rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/
# Unmount
umount /mnt/restore-temp
lvchange -an pve/$LV_NAME
# 8c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud)
```
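Instead of typing `WEEK` by hand, the most recent backup week can be derived from the directory names. A minimal sketch, assuming the `<YYYY-WW>` directories use zero-padded week numbers so that lexical order matches chronological order; `latest_week` is a hypothetical helper.

```bash
# Hypothetical helper: print the newest <YYYY-WW> directory name under a backup root.
# Assumes week numbers are zero-padded, so lexical order equals chronological order.
latest_week() {
  for dir in "$1"/[0-9][0-9][0-9][0-9]-[0-9][0-9]/; do
    [ -d "$dir" ] || continue   # skips the unexpanded glob when nothing matches
    basename "$dir"
  done | sort | tail -n 1
}

# Usage on the PVE host:
#   WEEK=$(latest_week /mnt/backup/pvc-data)
```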
**Note on pfSense restore**: If pfSense needs restoration, restore `config.xml` from `/mnt/backup/pfsense/<week>/config.xml` via web UI, or full filesystem tar for custom scripts.
**Note on PVE config restore**: If custom scripts/timers are lost, restore from `/mnt/backup/pve-config/` (daily-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers).
### Phase 4: Vault (secrets foundation)
```bash
# 8. Deploy Vault (see restore-vault.md for full procedure)
scripts/tg apply stacks/vault
# 9. Initialize/unseal/restore raft snapshot
# 10. Verify ESO can connect
scripts/tg apply stacks/external-secrets
kubectl get externalsecrets -A
```
### Phase 5: Platform Services
```bash
# 11. Deploy platform stack (Traefik, monitoring, Kyverno, etc.)
scripts/tg apply stacks/platform
# 12. Verify ingress is working
curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me/
```
### Phase 6: Databases
```bash
# 13. Deploy database stack
scripts/tg apply stacks/dbaas
# 14. Wait for CNPG and InnoDB clusters to initialize
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=600s
# 15. Restore PostgreSQL from dump (see restore-postgresql.md)
# 16. Restore MySQL from dump (see restore-mysql.md)
```
### Phase 7: Application Services
```bash
# 17. Deploy remaining stacks in any order
for stack in vaultwarden immich nextcloud linkwarden health; do
scripts/tg apply stacks/$stack
done
# 18. Restore Vaultwarden (see restore-vaultwarden.md)
```
### Phase 8: Verification
```bash
# 19. Check all pods are running
kubectl get pods -A | grep -v Running | grep -v Completed
# 20. Check all ingresses respond
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[0].host}{"\n"}{end}' | while read host; do
code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host/" 2>/dev/null)
echo "$host: $code"
done
# 21. Check monitoring
# Verify Prometheus targets: https://prometheus.viktorbarzin.me/targets
# Verify Alertmanager: https://alertmanager.viktorbarzin.me/
# 22. Run backup CronJobs manually to establish baseline
kubectl create job --from=cronjob/backup-etcd manual-etcd-backup -n default
kubectl create job --from=cronjob/postgresql-backup manual-pg-backup -n dbaas
kubectl create job --from=cronjob/mysql-backup manual-mysql-backup -n dbaas
kubectl create job --from=cronjob/vault-raft-backup manual-vault-backup -n vault
kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwarden
```
## Dependency Graph
```
etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps
                                                                                        ↑
                                                                              Restore DB dumps from
                                                                             /mnt/backup/nfs-mirror
                                                                             or Synology/pve-backup
```
## Estimated Time
- Full cluster rebuild from scratch: ~2-4 hours
- With etcd restore (objects preserved): ~1-2 hours
- Individual service restore: ~10-30 minutes each


@ -0,0 +1,159 @@
# Runbook: Restore PVC from LVM Thin Snapshot
Last updated: 2026-04-06
## When to Use
- Rolling back a PVC to a previous state after a bad migration, data corruption, or accidental deletion
- Pre-upgrade safety: snapshot before upgrade, restore if upgrade fails
- Fast recovery for data changed within the last 7 days
## Prerequisites
- SSH access to PVE host (192.168.1.127)
- The `lvm-pvc-snapshot` script at `/usr/local/bin/lvm-pvc-snapshot`
- kubectl configured on PVE host (`/root/.kube/config`)
## Snapshot Retention
- **Daily snapshots**: Created at 03:00 via systemd timer
- **Retention**: 7 days (older snapshots automatically pruned)
- **Coverage**: All proxmox-lvm PVCs except `dbaas` and `monitoring` namespaces
**If you need data older than 7 days**, see "Alternative: Restore from sda Backup" below.
## Procedure
### 1. List Available Snapshots
```bash
ssh root@192.168.1.127 lvm-pvc-snapshot list
```
Output shows all snapshots with their original LV, age, and data divergence percentage.
### 2. Identify the PVC LV Name
Find the LV name for your PVC:
```bash
# From your workstation (with kubectl):
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle'
# The HANDLE column shows "local-lvm:<lv-name>"
```
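The LV name is the part of the handle after the storage prefix. A small sketch, assuming the `<storage>:<lv-name>` shape described above; `lv_from_handle` is a hypothetical helper and the handle in the usage line is an illustrative value.

```bash
# Hypothetical helper: strip the storage prefix from a CSI volumeHandle.
# Assumes the "<storage>:<lv-name>" shape described above.
lv_from_handle() {
  printf '%s\n' "${1##*:}"   # remove everything up to and including the last ':'
}

# Usage:
#   lv_from_handle "local-lvm:vm-999-pvc-abc123"   # prints vm-999-pvc-abc123
```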
### 3. Run the Restore
```bash
ssh root@192.168.1.127
lvm-pvc-snapshot restore <pvc-lv-name> <snapshot-lv-name>
```
The script will:
1. Look up the K8s PV/PVC/workload for the LV
2. Show a dry-run of all actions
3. Ask for confirmation (type `yes`)
4. Scale down the workload (Deployment or StatefulSet)
5. Rename the current LV to `<name>_pre_restore_<timestamp>`
6. Rename the snapshot LV to the original name
7. Scale the workload back up
8. Wait for pod to become Ready
### 4. Verify
```bash
# Check pod is running
kubectl get pods -n <namespace> -l app=<workload>
# Check the application is working correctly
# (service-specific verification)
```
### 5. Clean Up
Once you've verified the restore is correct, remove the pre-restore backup:
```bash
ssh root@192.168.1.127 lvremove -f pve/<original-lv>_pre_restore_<timestamp>
```
## Manual Restore (if script fails)
If the automated restore fails, perform these steps manually:
```bash
# 1. Scale down the workload
kubectl scale deployment/<name> -n <ns> --replicas=0
# or for StatefulSets:
kubectl scale statefulset/<name> -n <ns> --replicas=0
# 2. Wait for pods to terminate
kubectl wait --for=delete pod -l app=<name> -n <ns> --timeout=120s
# 3. SSH to PVE host
ssh root@192.168.1.127
# 4. Verify LV is inactive
lvs -o lv_name,lv_active pve | grep <lv-name>
# 5. Rename LVs
lvrename pve <original-lv> <original-lv>_pre_restore_$(date +%Y%m%d_%H%M)
lvrename pve <snapshot-lv> <original-lv>
# 6. Scale back up
kubectl scale deployment/<name> -n <ns> --replicas=1
```
## Database-Specific Notes
- **MySQL InnoDB**: After restore, InnoDB will replay redo logs automatically on startup. Check `SHOW ENGINE INNODB STATUS` for recovery progress.
- **PostgreSQL**: WAL replay happens automatically. Check `pg_is_in_recovery()` and PostgreSQL logs.
- **Redis**: Redis loads the RDB file on startup. Check `INFO persistence` for load status.
For databases, prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless you need a very recent point-in-time that predates the last dump.
## Alternative: Restore from sda Backup
If LVM snapshots are too old or missing (data lost >7 days ago), use the weekly file-level backup on sda:
**Location**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
**Retention**: 4 weekly versions (weeks 0-3)
### Procedure
```bash
# 1. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 2. Identify the PVC backup directory
ls -l /mnt/backup/pvc-data/2026-14/<namespace>/
# 3. Scale down the workload
kubectl scale deployment/<name> -n <ns> --replicas=0
# 4. Mount the live PVC LV on PVE host
lvchange -ay pve/<pvc-lv-name>
mkdir -p /mnt/restore-temp
mount /dev/pve/<pvc-lv-name> /mnt/restore-temp
# 5. Restore from backup
rsync -avP --delete /mnt/backup/pvc-data/2026-14/<namespace>/<pvc-name>/ /mnt/restore-temp/
# 6. Unmount and scale up
umount /mnt/restore-temp
lvchange -an pve/<pvc-lv-name>
kubectl scale deployment/<name> -n <ns> --replicas=1
```
See `restore-pvc-from-backup.md` for detailed walkthrough.
## Troubleshooting
| Problem | Cause | Fix |
|---------|-------|-----|
| "Another instance is running" | Concurrent snapshot/restore | Wait for timer to finish: `systemctl status lvm-pvc-snapshot.service` |
| LV still active after scale-down | Proxmox CSI hasn't detached | Wait 30s, or `lvchange -an pve/<lv>` |
| Pod stuck in ContainerCreating | Volume not attached to node | `kubectl describe pod` — check events for attach errors |
| No PV found for volume handle | LV name doesn't match any PV | Check `kubectl get pv -o yaml` for the correct volumeHandle format |


@ -0,0 +1,256 @@
# Restore MySQL (Standalone)
Last updated: 2026-05-18 (after the 8.4.9 DD-upgrade disaster recovery)
Applies to the `mysql-standalone` StatefulSet in the `dbaas` namespace
(raw `kubernetes_stateful_set_v1`, migrated from InnoDB Cluster on
2026-04-16). The historic InnoDB-Cluster recovery flow is gone.
## Prerequisites
- `kubectl` against the cluster
- Root password: `kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d`
- A backup dump on NFS at `/srv/nfs/mysql-backup/` (exported via
`dbaas-mysql-backup-host` PVC inside the cluster)
## Backup Locations
| Location | Purpose | Retention |
|---|---|---|
| `/srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz` | Full daily dump (CronJob `mysql-backup`, daily 00:30 UTC) | 14 days |
| `/srv/nfs/mysql-backup/per-db/<dbname>/dump_*.sql.gz` | Per-DB dumps (CronJob `mysql-backup-per-db`, daily 00:45 UTC) | 14 days |
| Synology `Backup/Viki/nfs/mysql-backup/` | Offsite mirror via inotify-tracked rsync | unlimited |
Latest full dump is ~230MB compressed (~3GB uncompressed). Restore
of a full dump into a fresh MySQL pod takes ~3 minutes.
## Scenario A — Single database restored alongside the others
When one DB is corrupted but MySQL is otherwise fine.
```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# List per-db dumps for the affected database
kubectl -n dbaas exec mysql-standalone-0 -- ls -lt /backup/per-db/<dbname>/
# Pipe a chosen dump into MySQL (REPLACE existing data in <dbname>):
kubectl -n dbaas exec -i mysql-standalone-0 -- \
sh -c "zcat /backup/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -uroot -p\"$ROOT_PWD\" <dbname>"
# Restart consumers
kubectl -n <ns> rollout restart deployment
```
## Scenario B — Full disaster: data dictionary corrupt or PVC unsalvageable
This is the path executed on 2026-05-18 when a Keel-driven bump to
`mysql:8.4.9` left the data dictionary half-upgraded and 8.4.8 refused
to start (`Server upgrade of version 80408 is still pending`
MY-013379). Wipes the PVC and rehydrates from the daily dump.
**Estimated downtime: 25 minutes.** Plan accordingly — Forgejo +
registry + every MySQL app go offline during this.
### B.1 Stop the failing MySQL pod
```bash
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
```
### B.2 Verify the dump you intend to restore is healthy
```bash
ssh root@192.168.1.127 'ls -la /srv/nfs/mysql-backup/dump_*.sql.gz | tail -5'
# Sanity-check the header
ssh root@192.168.1.127 'zcat /srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz | head -20'
# Should show "MySQL dump 10.13 ... Server version 8.4.X"
```
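Beyond eyeballing the header, a dump can be checked for truncation before the PVC is wiped. The sketch below is an assumption-laden convenience, not part of the repo: `dump_looks_complete` is a hypothetical helper, and it relies on mysqldump's default behaviour of ending the file with a `-- Dump completed` comment (absent if comments were disabled when the dump was taken).

```bash
# Hypothetical helper: cheap integrity check for a gzipped mysqldump file.
# 1) gzip -t verifies the compressed stream,
# 2) the first lines must name mysqldump,
# 3) the last lines must contain the "Dump completed" trailer comment.
dump_looks_complete() {
  gzip -t "$1" 2>/dev/null || return 1
  gzip -dc "$1" | head -n 5 | grep -q 'MySQL dump' || return 1
  gzip -dc "$1" | tail -n 5 | grep -q 'Dump completed'
}

# Usage on the PVE host:
#   dump_looks_complete /srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz && echo OK
```

Passing this check does not prove the dump restores cleanly; it only rules out a corrupt archive or a dump that stopped partway.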
### B.3 Pin MySQL image in Terraform (if it auto-bumped)
If the upgrade was triggered by a Keel bump on a floating tag
(`mysql:8.4`), edit `stacks/dbaas/modules/dbaas/main.tf` to pin to a
known-good exact version (`mysql:8.4.8`). Commit but don't apply yet.
### B.4 Wipe the corrupted PVC
The PV reclaim policy defaults to **Retain** on
`proxmox-lvm-encrypted`, so `kubectl delete pvc` alone leaves the PV
attached to the (corrupted) disk. Flip to `Delete` first so the CSI
driver actually cleans up the underlying LV.
```bash
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
kubectl -n dbaas delete pvc data-mysql-standalone-0
```
The PV transitions to `Released` then gets cleaned up by the CSI
controller; confirm with `kubectl get pv | grep <PV>` (eventually
disappears).
### B.5 Scale MySQL back up via Terraform
```bash
cd stacks/dbaas && /home/wizard/code/infra/scripts/tg apply
```
This recreates the PVC fresh (5Gi initial; pvc-autoresizer grows it
on demand) and starts a brand-new MySQL pod. The pod initializes an
empty datadir using `MYSQL_ROOT_PASSWORD` from the `cluster-secret`
K8s Secret — ~30s to ready.
### B.6 Restore the full dump via a one-shot Job
```bash
cat <<'YAML' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  # The heredoc delimiter is quoted, so $(date ...) would NOT expand here.
  # Use a literal lowercase name and change the date by hand.
  name: mysql-restore-2026-05-18
  namespace: dbaas
spec:
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mysql:8.4.8
          command: ["bash","-c"]
          args:
            - |
              set -euo pipefail
              gunzip -c /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | \
                mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD"
              mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD }
          volumeMounts:
            - { name: backup, mountPath: /backup, readOnly: true }
      volumes:
        - name: backup
          persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
```
Watch progress: `kubectl -n dbaas logs -f job/<name>`. Takes ~3 min
for a 230MB compressed dump.
### B.7 Reset static MySQL users with passwords from Vault
**This step is mandatory.** `mysqldump` restores rows in `mysql.user`
verbatim, including password hashes. But `null_resource.mysql_static_user`
in Terraform writes the **current Vault password** to `forgejo` and
`roundcubemail` — and that current password rarely matches the dump's
hash. The apps will fail auth (forgejo logs `Error 1045 (28000): Access
denied for user 'forgejo'@'...'`) until you reset them.
```bash
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
DROP USER IF EXISTS 'forgejo'@'%';
DROP USER IF EXISTS 'roundcubemail'@'%';
CREATE USER 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
CREATE USER 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
FLUSH PRIVILEGES;
SQL
```
`ALTER USER` sometimes hits `ERROR 1396 Operation ALTER USER failed`
on freshly-restored DBs (stale grant-table cache); `DROP USER` +
`CREATE USER` is the reliable form.
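If more static users are added later, the `DROP USER` + `CREATE USER` pattern can be generated per user. A minimal sketch: `reset_user_sql` is a hypothetical helper, it doubles single quotes in the password but does not handle backslashes, and it assumes each user gets full privileges on one schema.

```bash
# Hypothetical helper: print the DROP/CREATE/GRANT statements for one static user.
# $1 = user, $2 = schema, $3 = password. Single quotes in the password are doubled;
# backslashes are NOT handled, so this assumes the password contains none.
reset_user_sql() {
  pw_escaped=$(printf '%s' "$3" | sed "s/'/''/g")
  cat <<SQL
DROP USER IF EXISTS '$1'@'%';
CREATE USER '$1'@'%' IDENTIFIED WITH caching_sha2_password BY '$pw_escaped';
GRANT ALL PRIVILEGES ON \`$2\`.* TO '$1'@'%';
SQL
}

# Usage (pipe into the same mysql invocation as above):
#   reset_user_sql forgejo forgejo "$FORGEJO_PW" | \
#     kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"'
```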
Vault-rotated app users (nextcloud, codimd, grafana, paperless,
phpipam, etc.) are managed by Vault DB engine and their dump password
already matches the live K8s secret, so they need no manual fixup.
### B.8 Restart MySQL-dependent apps
The dump restore brings MySQL up, but app pods still hold stale
connections (and forgejo has been crash-looping). Roll the
deployments to force fresh connections:
```bash
for ns_app in \
"forgejo:deploy/forgejo" \
"nextcloud:deploy/nextcloud" \
"hackmd:deploy/hackmd" \
"monitoring:deploy/grafana" \
"paperless-ngx:deploy/paperless-ngx" \
"uptime-kuma:deploy/uptime-kuma" \
"url:deploy/shlink" \
"realestate-crawler:deploy/realestate-crawler-api" \
"realestate-crawler:deploy/realestate-crawler-celery" \
"realestate-crawler:deploy/realestate-crawler-celery-beat" \
"realestate-crawler:deploy/realestate-crawler-ui"; do
ns=${ns_app%%:*}; app=${ns_app##*:}
kubectl -n "$ns" rollout restart "$app" &
done
wait
```
If any deployments stay stuck in `ImagePullBackOff` (e.g.
`chrome-service`, `fire-planner`, `freedify`), those rely on the
Forgejo registry — once forgejo is back, just delete their pods to
force a fresh pull:
```bash
kubectl -n chrome-service delete pod --all
kubectl -n fire-planner delete pod --all
kubectl -n freedify delete pod --all
```
### B.9 Verify recovery
```bash
# All workloads ready
kubectl get deploy,sts -A -o json | jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | "\(.metadata.namespace)/\(.metadata.name)"'
# (empty output = healthy)
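# Database integrity: table counts per schema
# (ROOT_PWD is otherwise only set in Scenario A, so fetch it here)
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)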
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys') \
GROUP BY table_schema;"
# Forgejo's registry catalog (catches the cascade alert)
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe manual-postrestore-$(date +%s)
kubectl -n monitoring logs job/manual-postrestore-<timestamp> --tail=10
# Expect "Probe complete: 0 failures across N repos / M tags / K indexes"
# Cluster-health re-run
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
```
### B.10 Clean up failed CronJob pods from the outage window
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```
## Why the 8.4.9 upgrade got us — and the version pin
The MySQL 8.4.9 data-dictionary upgrade from 80408 → 80409 stalls
reliably on this hardware. ~24s of writes to `mysql.ibd` and the redo
log, then no further progress, no CPU, no completion. We bumped the
liveness probe to 600s (`initial_delay_seconds`) and still no
progress. Hypothesised root cause: `innodb_io_capacity=100` combined
with `innodb_page_cleaners=1` — the upgrade's spatial-reference-system
flush phase is IO-starved. **Don't retry 8.4.9 without first bumping
IO capacity and pinning a proper maintenance window.**
Until then, the StatefulSet pins to `mysql:8.4.8` exactly, not the
floating `mysql:8.4` tag. Keel will not silently bump it.
## See also
- `docs/runbooks/forgejo-registry-breakglass.md` — companion runbook
for when the cascade has reached the registry layer.
- Beads `code-eme8` / `code-k40p` — incident tracker entries (closed
in commit ea475c3d).


@ -0,0 +1,160 @@
# Restore PostgreSQL (CNPG)
Last updated: 2026-04-06
## Prerequisites
- `kubectl` access to the cluster
- CNPG operator running in the cluster
- Backup dump available on NFS at `/mnt/main/postgresql-backup/`
- PostgreSQL superuser password (from `pg-cluster-superuser` secret in `dbaas` namespace)
## Backup Location
- NFS: `/mnt/main/postgresql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
- Mirrored to sda: `/mnt/backup/nfs-mirror/postgresql-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/`
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
## Restore from pg_dumpall
### 1. Identify the backup to restore
```bash
# List available backups (from any node with NFS access)
ls -lt /mnt/main/postgresql-backup/dump_*.sql.gz | head -20
# Or via a pod:
kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["ls","-lt","/backup/"]}]}}' \
-n dbaas
```
### 2. Get the superuser password
```bash
kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d
```
### 3. Option A: Restore into existing CNPG cluster
```bash
# Port-forward to the CNPG primary
kubectl port-forward svc/pg-cluster-rw -n dbaas 5433:5432 &
# Restore (decompress and pipe to psql; this will overwrite existing data).
# Export the password: a VAR=value prefix would apply only to zcat, not to psql.
export PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
zcat /path/to/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h 127.0.0.1 -p 5433 -U postgres
```
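In a pipeline such as `zcat dump | psql`, `psql` is the process that needs `PGPASSWORD`. A `VAR=value` prefix is passed only to the command it directly precedes, which is easy to get wrong when the prefix sits on the first stage. The demonstration below uses a made-up variable name, `DEMO_SECRET`, and needs no database.

```bash
# A VAR=value prefix is exported only to the command it precedes,
# not to later stages of the pipeline. DEMO_SECRET is a made-up name.
unset DEMO_SECRET
prefix_only=$(DEMO_SECRET=hunter2 true | sh -c 'printf "%s" "${DEMO_SECRET:-unset}"')

# Exporting first makes the variable visible to every stage.
export DEMO_SECRET=hunter2
exported=$(true | sh -c 'printf "%s" "${DEMO_SECRET:-unset}"')
unset DEMO_SECRET

echo "prefix on first stage: $prefix_only"   # second stage saw no value
echo "exported beforehand:   $exported"      # second stage saw the value
```

The same applies to `$PGPASSWORD` used inside another command's arguments: the shell expands arguments before a prefix assignment takes effect, so the variable must be set in a separate statement first.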
### 3. Option B: Rebuild CNPG cluster from scratch
```bash
# 1. Delete the existing cluster
kubectl delete cluster pg-cluster -n dbaas
# 2. Wait for PVCs to be cleaned up
kubectl get pvc -n dbaas -l cnpg.io/cluster=pg-cluster
# 3. Re-apply the cluster manifest (via terragrunt)
cd infra && scripts/tg apply -target=null_resource.pg_cluster stacks/dbaas
# 4. Wait for cluster to be ready
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=300s
# 5. Restore the dump (set PGPASSWORD as its own statement so that
#    $PGPASSWORD expands inside --overrides below)
PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","env":[{"name":"PGPASSWORD","value":"'$PGPASSWORD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h pg-cluster-rw.dbaas -U postgres"]}]}}' \
-n dbaas
```
### 4. Verify restoration
```bash
# Check databases exist
PGPASSWORD=$PGPASSWORD psql -h 127.0.0.1 -p 5433 -U postgres -c "\l"
# Check table counts for critical databases
for db in health linkwarden affine woodpecker claude_memory; do
echo "=== $db ==="
PGPASSWORD=$PGPASSWORD psql -h 127.0.0.1 -p 5433 -U postgres -d $db -c \
"SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 5;"
done
```
### 5. Restart dependent services
After restore, restart services that connect to PostgreSQL to pick up fresh connections:
```bash
kubectl rollout restart deployment -n health
kubectl rollout restart deployment -n linkwarden
# ... repeat for all PG-dependent services (excluding trading — disabled)
```
## Restore Single Database (from per-db backup)
Per-database backups use `pg_dump -Fc` (custom format) and are stored at `/mnt/main/postgresql-backup/per-db/<dbname>/`.
### 1. List available per-db backups
```bash
ls -lt /mnt/main/postgresql-backup/per-db/<dbname>/
# Or via a pod:
kubectl exec -n dbaas pg-cluster-1 -c postgres -- ls -lt /backup/per-db/<dbname>/ 2>/dev/null || \
echo "Mount a backup pod — see Option A below"
```
### 2. Restore a single database
```bash
# Port-forward to the CNPG primary
kubectl port-forward svc/pg-cluster-rw -n dbaas 5433:5432 &
# Restore single database (drops and recreates objects in that DB only)
# Export the password so the verification step can reuse $PGPASSWORD
export PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
pg_restore -h 127.0.0.1 -p 5433 -U postgres -d <dbname> --clean --if-exists \
/path/to/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.dump
```
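Custom-format archives need `pg_restore`, while the full `.sql.gz` dumps are plain SQL for `psql`. A quick way to tell them apart is the `PGDMP` magic string that `pg_dump -Fc` writes at the start of the file. The helper name `is_pg_custom_dump` is hypothetical.

```bash
# Hypothetical helper: exit 0 if the file starts with the "PGDMP" magic bytes
# that pg_dump writes at the start of custom-format (-Fc) archives.
is_pg_custom_dump() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "PGDMP" ]
}

# Usage:
#   if is_pg_custom_dump /path/to/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.dump; then
#     echo "use pg_restore"
#   else
#     echo "not a custom-format archive; check the file before restoring"
#   fi
```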
### 3. Verify
```bash
PGPASSWORD=$PGPASSWORD psql -h 127.0.0.1 -p 5433 -U postgres -d <dbname> -c \
"SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"
```
### 4. Restart the affected service only
```bash
kubectl rollout restart deployment -n <namespace>
```
**Advantages over full restore**: Only the target database is affected. All other databases continue running with their current data.
## Alternative: Restore from sda Backup
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
```bash
# 1. SSH to PVE host
ssh root@192.168.1.127
# 2. Find the latest backup
ls -lt /mnt/backup/nfs-mirror/postgresql-backup/
# 3. Mount sda backup on a pod
PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","env":[{"name":"PGPASSWORD","value":"'$PGPASSWORD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h pg-cluster-rw.dbaas -U postgres"]}],"nodeName":"k8s-master"}}' \
-n dbaas
```
## Alternative: Restore from Synology (if PVE host is down)
If the PVE host itself is unavailable:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/nfs/postgresql-backup/
# 3. Copy dump to a temporary location accessible from cluster
# (e.g., via rsync to a surviving node, or restore PVE host first)
```
## Estimated Time
- Restore into existing cluster: ~10 minutes (depends on dump size)
- Full rebuild: ~20-30 minutes

@ -0,0 +1,231 @@
# Runbook: Restore PVC from sda File Backup
Last updated: 2026-04-06
## When to Use
- LVM snapshots are too old (>7 days) or missing
- Need to restore data from a specific week (up to 4 weeks back)
- LVM snapshot restore failed or snapshot is corrupt
- Granular file-level restore (not full PVC)
## Prerequisites
- SSH access to PVE host (192.168.1.127)
- kubectl configured (either on PVE host or your workstation)
- sda backup disk mounted at `/mnt/backup` on PVE host
## Backup Location
**Path**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
**Retention**: 4 weekly versions (weeks 0-3)
**Deduplication**: `--link-dest` hardlink dedup (unchanged files share inodes across weeks)
## Procedure
### 1. List Available Backup Weeks
```bash
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# Output shows week directories like:
# 2026-13
# 2026-14
# 2026-15
# 2026-16
```
### 2. Identify the PVC Backup Directory
```bash
# List namespaces in a specific week
ls -l /mnt/backup/pvc-data/2026-14/
# List PVCs in a namespace
ls -l /mnt/backup/pvc-data/2026-14/vaultwarden/
# Example: vaultwarden-data-proxmox/
```
### 3. Find the Live PVC LV Name
From your workstation (or PVE host with kubectl):
```bash
# Get the PV volumeHandle (contains LV name)
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep <pvc-name>
# Example output:
# pvc-abc123 vaultwarden-data-proxmox vaultwarden local-lvm:vm-999-pvc-abc123
# The LV name is the part after "local-lvm:", here vm-999-pvc-abc123
```
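The LV name can be peeled off the volumeHandle with plain parameter expansion. This is a minimal sketch, assuming the handle has the `<storage>:<lv-name>` shape shown in the example output; `lv_from_handle` is a hypothetical helper, not part of any tooling in this repo.

```bash
# Hypothetical helper: strip everything up to the last ':' of a volumeHandle.
# Assumes the "<storage>:<lv-name>" shape from the example output.
lv_from_handle() {
  printf '%s\n' "${1##*:}"
}

lv_from_handle "local-lvm:vm-999-pvc-abc123"   # prints vm-999-pvc-abc123
```

The result is what steps 5 and 7 expect in place of `<lv-name>`.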
### 4. Scale Down the Workload
```bash
# Find the workload using the PVC
kubectl get deployment,statefulset -n <namespace> -o json | jq '.items[] | select(.spec.template.spec.volumes[]?.persistentVolumeClaim.claimName == "<pvc-name>") | .metadata.name'
# Scale down (Deployment example)
kubectl scale deployment/<workload-name> -n <namespace> --replicas=0
# Or StatefulSet:
kubectl scale statefulset/<workload-name> -n <namespace> --replicas=0
# Wait for pod to terminate
kubectl wait --for=delete pod -l app=<workload-name> -n <namespace> --timeout=120s
```
### 5. Mount the Live PVC LV
```bash
ssh root@192.168.1.127
# Activate the LV (should already be inactive after pod termination)
lvchange -ay pve/<lv-name>
# Create mount point
mkdir -p /mnt/restore-temp
# Mount the LV
mount /dev/pve/<lv-name> /mnt/restore-temp
```
### 6. Restore from Backup
**Option A: Full PVC restore (replace all data)**
```bash
# This will delete existing files in the PVC and replace with backup
rsync -avP --delete /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/ /mnt/restore-temp/
# Example:
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
```
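Because `--delete` mirrors the source exactly, an empty or mistyped source directory could wipe the PVC. One possible guard is to refuse to continue unless the backup directory exists and contains at least one entry. The sketch below uses a hypothetical `assert_nonempty_dir` helper that is not part of the backup scripts.

```bash
# Hypothetical guard: succeed only if $1 is a directory with at least one entry.
assert_nonempty_dir() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Usage sketch (substitute the real week, namespace and PVC name):
# SRC="/mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/"
# assert_nonempty_dir "$SRC" && rsync -avP --delete "$SRC" /mnt/restore-temp/
```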
**Option B: Selective file restore (merge)**
```bash
# Restore specific files or directories without deleting existing data
rsync -avP /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/path/to/file /mnt/restore-temp/path/to/
# Example: Restore only db.sqlite3
rsync -avP /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/db.sqlite3 /mnt/restore-temp/
```
### 7. Unmount and Deactivate LV
```bash
# Unmount
umount /mnt/restore-temp
# Deactivate LV (optional, kubelet will activate it when pod starts)
lvchange -an pve/<lv-name>
```
### 8. Scale Up the Workload
```bash
# From your workstation:
kubectl scale deployment/<workload-name> -n <namespace> --replicas=1
# Or StatefulSet:
kubectl scale statefulset/<workload-name> -n <namespace> --replicas=1
# Wait for pod to be ready
kubectl wait --for=condition=Ready pod -l app=<workload-name> -n <namespace> --timeout=120s
```
### 9. Verify
```bash
# Check pod logs for startup errors
kubectl logs -n <namespace> -l app=<workload-name> --tail=20
# Test application functionality (service-specific)
curl -s -o /dev/null -w "%{http_code}" https://<service>.viktorbarzin.me/
```
## Example: Full Vaultwarden Restore
```bash
# 1. List backups
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 2. Scale down
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
kubectl wait --for=delete pod -l app=vaultwarden -n vaultwarden --timeout=120s
# 3. Find LV name
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
# Output: pvc-xyz vaultwarden-data-proxmox local-lvm:vm-105-pvc-xyz456
# 4. Mount and restore
ssh root@192.168.1.127
lvchange -ay pve/vm-105-pvc-xyz456
mkdir -p /mnt/restore-temp
mount /dev/pve/vm-105-pvc-xyz456 /mnt/restore-temp
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
umount /mnt/restore-temp
lvchange -an pve/vm-105-pvc-xyz456
# 5. Scale up
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
kubectl wait --for=condition=Ready pod -l app=vaultwarden -n vaultwarden --timeout=120s
# 6. Test
curl -s -o /dev/null -w "%{http_code}" https://vaultwarden.viktorbarzin.me/
```
## Database-Specific Notes
For databases (MySQL, PostgreSQL), prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless:
- You need a very recent point-in-time that predates the last dump
- The database dump is corrupt or missing
- You're restoring a non-SQL database (e.g., Redis RDB)
## Troubleshooting
| Problem | Cause | Fix |
|---------|-------|-----|
| "LV is active" during mount | Workload pod still running or stuck | `kubectl get pods -A \| grep <pvc-name>`, delete pod if stuck |
| "No such file or directory" in backup | PVC not backed up (in excluded namespace) | Check `daily-backup` script EXCLUDE_NAMESPACES |
| rsync shows 0 files transferred | Wrong backup week or PVC name | Double-check paths: `ls /mnt/backup/pvc-data/<week>/<ns>/<pvc>/` |
| Pod stuck in ContainerCreating after restore | LV still active on PVE host | `lvchange -an pve/<lv-name>`, wait 30s, check pod again |
| Backup week missing | Daily backup hasn't run for that week | Check `systemctl status daily-backup.service`, verify retention |
## Restore from Synology (if PVE host sda is unavailable)
If the PVE host sda backup disk is unavailable or corrupt:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/pve-backup/pvc-data/
# 3. Find the PVC backup
ls -l 2026-14/<namespace>/<pvc-name>/
# 4. Copy to a temporary location accessible from cluster
# Option A: Restore sda on PVE host first
# Option B: rsync to a surviving node's local disk
# Option C: Mount Synology NFS share on a pod (if network accessible)
```
## Estimated Time
- Small PVC (<1GB): ~5 minutes
- Medium PVC (1-10GB): ~10-15 minutes
- Large PVC (>10GB): ~30+ minutes (depends on size and network)
## Related
- **`restore-lvm-snapshot.md`** — Fast restore for recent changes (<7 days)
- **`restore-full-cluster.md`** — Disaster recovery procedure (uses this runbook in Phase 3.5)
- **`docs/architecture/backup-dr.md`** — Backup architecture overview

@ -0,0 +1,146 @@
# Restore Vault (Raft)
Last updated: 2026-04-06
## Prerequisites
- `kubectl` access to the cluster
- Vault root token (from `vault-root-token` secret in `vault` namespace — manually created, independent of automation)
- Raft snapshot available on NFS at `/mnt/main/vault-backup/`
- Unseal keys (stored securely — check `secret/viktor` in Vault or emergency kit)
## Backup Location
- NFS: `/mnt/main/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db`
- Mirrored to sda: `/mnt/backup/nfs-mirror/vault-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vault-backup/`
- Retention: 30 days (on NFS), latest only (on sda), unlimited (on Synology)
- Schedule: Weekly on Sundays at 02:00 (`0 2 * * 0`)
## CRITICAL: Vault is a dependency for many services
Vault provides secrets to the entire cluster via ESO (External Secrets Operator). A Vault outage affects:
- All ExternalSecrets (43 secrets + 9 DB-creds secrets)
- Vault DB engine password rotation
- K8s credentials engine
- CI/CD secret sync
**Priority: Restore Vault before any other service (except etcd).**
## Restore Procedure
### 1. Identify the snapshot to restore
```bash
# List available snapshots
ls -lt /mnt/main/vault-backup/vault-raft-*.db | head -10
```
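Since the timestamp in `vault-raft-YYYYMMDD-HHMMSS.db` sorts lexicographically, the newest snapshot can also be picked by name rather than by mtime, which is not always preserved when files are copied around. A sketch, with `newest_snapshot` as a hypothetical helper:

```bash
# Hypothetical helper: print the newest vault-raft-*.db in directory $1,
# judged by the timestamp embedded in the file name (not by mtime).
newest_snapshot() {
  ls -1 "$1"/vault-raft-*.db 2>/dev/null | sort | tail -n 1
}

# Usage sketch:
# SNAPSHOT=$(newest_snapshot /mnt/main/vault-backup)
```

An empty result means no snapshot matched the naming pattern in that directory.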
### 2. Restore Raft snapshot
```bash
# Get root token
VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)
# Port-forward to Vault
kubectl port-forward svc/vault-active -n vault 8200:8200 &
# Restore the snapshot (this will overwrite current state)
export VAULT_ADDR=http://127.0.0.1:8200
export VAULT_TOKEN
vault operator raft snapshot restore -force /path/to/vault-raft-YYYYMMDD-HHMMSS.db
```
### 3. Unseal Vault (if sealed after restore)
> **Note:** Vault now has an auto-unseal sidecar that automatically unseals pods
> using the `vault-unseal-key` K8s Secret. The manual procedure below is a
> fallback if auto-unseal fails.
```bash
# Check seal status
vault status
# If sealed, unseal with keys (need threshold number of keys)
vault operator unseal <key1>
vault operator unseal <key2>
vault operator unseal <key3>
```
### 4. Verify restoration
```bash
# Check Vault health
vault status
# Check raft peers
vault operator raft list-peers
# Verify key secrets exist
vault kv get secret/viktor
vault kv list secret/
# Check DB engine
vault list database/roles
# Check K8s engine
vault list kubernetes/roles
```
### 5. Trigger ESO refresh
After Vault restore, ExternalSecrets may need a refresh:
```bash
# Restart ESO to force re-sync
kubectl rollout restart deployment -n external-secrets
# Check ExternalSecret status
kubectl get externalsecrets -A | grep -v "SecretSynced"
```
## Alternative: Restore from sda Backup
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
```bash
# 1. Find the latest snapshot on the PVE host
ssh root@192.168.1.127 'ls -lt /mnt/backup/nfs-mirror/vault-backup/'
# 2. From your workstation, copy the snapshot off the PVE host
scp root@192.168.1.127:/mnt/backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
# 3. Port-forward to Vault and restore
kubectl port-forward svc/vault-active -n vault 8200:8200 &
export VAULT_ADDR=http://127.0.0.1:8200
export VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)
vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
```
## Alternative: Restore from Synology (if PVE host is down)
If the PVE host itself is unavailable:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/nfs/vault-backup/
# 3. Copy snapshot to local workstation (run this from the workstation, not from the NAS session)
scp Administrator@192.168.1.13:/volume1/Backup/Viki/nfs/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
# 4. Restore via port-forward (same as above)
```
## Full Vault Rebuild (from zero)
If Vault needs to be rebuilt from scratch:
1. Comment out data sources + OIDC config in `stacks/vault/main.tf`
2. Apply Helm release: `scripts/tg apply -target=helm_release.vault stacks/vault`
3. Initialize: `vault operator init`
4. Unseal with generated keys
5. Restore raft snapshot (step 2 above)
6. Populate `secret/vault` with OIDC credentials
7. Uncomment data sources + OIDC
8. Re-apply: `scripts/tg apply stacks/vault`
## Estimated Time
- Snapshot restore + unseal: ~10 minutes
- Full rebuild: ~30-45 minutes

@ -0,0 +1,128 @@
# Restore Vaultwarden
Last updated: 2026-04-06
## Prerequisites
- `kubectl` access to the cluster
- Backup available on NFS at `/mnt/main/vaultwarden-backup/`
## Backup Location
- NFS: `/mnt/main/vaultwarden-backup/YYYY_MM_DD_HH_MM/` (directory per backup)
- Each backup contains: `db.sqlite3`, `rsa_key.pem`, `rsa_key.pub.pem`, `attachments/`, `sends/`, `config.json`
- Mirrored to sda: `/mnt/backup/nfs-mirror/vaultwarden-backup/` (PVE host 192.168.1.127)
- PVC file backup (alternative): `/mnt/backup/pvc-data/<YYYY-WW>/vaultwarden/vaultwarden-data-proxmox/`
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vaultwarden-backup/`
- Retention: 30 days (on NFS), latest only (on sda nfs-mirror), 4 weeks (on sda pvc-data), unlimited (on Synology)
- Schedule: Every 6 hours (00:00, 06:00, 12:00, 18:00)
- Integrity check: Both source and backup are verified before/after each backup
## Backup Contents
| File | Purpose | Critical? |
|------|---------|-----------|
| `db.sqlite3` | All passwords, TOTP seeds, org data | Yes |
| `rsa_key.pem` / `rsa_key.pub.pem` | JWT signing keys | Yes — without these, all sessions invalidate |
| `attachments/` | File attachments on vault items | Yes |
| `sends/` | Bitwarden Send files | No |
| `config.json` | Server configuration | No — can be recreated |
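Before restoring, it may be worth checking that the chosen backup directory actually holds the single files marked critical in the table. A minimal sketch: `check_vw_backup` is a hypothetical helper, and `attachments/` is left out on the assumption that the directory can legitimately be absent when no attachments exist.

```bash
# Hypothetical pre-restore check: report any critical file that is missing
# or empty in backup directory $1. Returns non-zero if at least one is.
check_vw_backup() {
  local dir=$1 missing=0 f
  for f in db.sqlite3 rsa_key.pem rsa_key.pub.pem; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage sketch, from a pod or host that has the backup mounted at /backup:
# check_vw_backup /backup/YYYY_MM_DD_HH_MM && echo "backup looks complete"
```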
## Restore Procedure
### 1. Identify the backup to restore
```bash
# List available backups (directories sorted by date)
kubectl run vw-ls --rm -it --image=alpine \
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"vaultwarden-backup"}}],"containers":[{"name":"vw-ls","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["ls","-lt","/backup/"]}]}}' \
-n vaultwarden
```
### 2. Scale down Vaultwarden
```bash
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
```
### 3. Restore the backup
```bash
BACKUP_DIR="YYYY_MM_DD_HH_MM" # Set to desired backup
kubectl run vw-restore --rm -it --image=alpine \
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"vaultwarden-backup"}},{"name":"data","persistentVolumeClaim":{"claimName":"vaultwarden-data-proxmox"}}],"containers":[{"name":"vw-restore","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"},{"name":"data","mountPath":"/data"}],"command":["/bin/sh","-c","cp /backup/'$BACKUP_DIR'/db.sqlite3 /data/db.sqlite3 && cp /backup/'$BACKUP_DIR'/rsa_key.pem /data/ && cp /backup/'$BACKUP_DIR'/rsa_key.pub.pem /data/ && cp -a /backup/'$BACKUP_DIR'/attachments /data/ 2>/dev/null; echo Restore complete"]}]}}' \
-n vaultwarden
```
### 4. Scale up Vaultwarden
```bash
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
# Wait for pod to be ready
kubectl wait --for=condition=Ready pod -l app=vaultwarden -n vaultwarden --timeout=120s
```
### 5. Verify restoration
```bash
# Check pod logs for startup errors
kubectl logs -n vaultwarden -l app=vaultwarden --tail=20
# Test web UI access
curl -s -o /dev/null -w "%{http_code}" https://vaultwarden.viktorbarzin.me/
```
### 6. Test login
Log in to the Vaultwarden web UI and verify:
- [ ] Can log in with your account
- [ ] Vault items are present and readable
- [ ] Attachments are accessible
- [ ] TOTP codes are generating correctly
## Alternative: Restore from PVC File Backup
If the NFS backup is unavailable or corrupt, restore from the weekly PVC file backup on sda:
```bash
# 1. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 2. Scale down Vaultwarden
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
# 3. Mount the live PVC LV on PVE host
# Find the LV name first:
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
LV_NAME="vm-999-pvc-abc123"
lvchange -ay pve/$LV_NAME
mkdir -p /mnt/restore-temp
mount /dev/pve/$LV_NAME /mnt/restore-temp
# 4. Restore from backup (pick a week)
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
# 5. Unmount and scale up
umount /mnt/restore-temp
lvchange -an pve/$LV_NAME
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
```
## Alternative: Restore from sda Backup Mirror
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
```bash
# 1. SSH to PVE host
ssh root@192.168.1.127
# 2. Find the latest backup
ls -lt /mnt/backup/nfs-mirror/vaultwarden-backup/
# 3. Mount sda backup on a pod
BACKUP_DIR="YYYY_MM_DD_HH_MM" # Set to desired backup
kubectl run vw-restore --rm -it --image=alpine \
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/vaultwarden-backup"}},{"name":"data","persistentVolumeClaim":{"claimName":"vaultwarden-data-proxmox"}}],"containers":[{"name":"vw-restore","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"},{"name":"data","mountPath":"/data"}],"command":["/bin/sh","-c","cp /backup/'$BACKUP_DIR'/db.sqlite3 /data/db.sqlite3 && cp /backup/'$BACKUP_DIR'/rsa_key.pem /data/ && cp /backup/'$BACKUP_DIR'/rsa_key.pub.pem /data/ && cp -a /backup/'$BACKUP_DIR'/attachments /data/ 2>/dev/null; echo Restore complete"]}],"nodeName":"k8s-master"}}' \
-n vaultwarden
```
## Estimated Time
- Restore: ~5 minutes
- Verification: ~5 minutes

@ -0,0 +1,196 @@
# Runbook: Scale K8s worker count (PVC capacity headroom)
Use when block-PVC pressure, memory pressure, or planned workload growth requires adding or removing K8s worker VMs. The cluster currently runs **6 workers (k8s-node1..6) + 1 control plane (k8s-master)**, sized to absorb the 2026-05-26 proxmox-csi LUN-cap incident with sustained headroom.
## Current shape
| Node | VMID | Memory | Disk | Special |
|------|------|--------|------|---------|
| k8s-master | 200 | 32 GiB | 64G | Control plane, no worker workloads |
| k8s-node1 | 201 | 48 GiB | 256G | GPU host (NVIDIA Tesla T4 passthrough), DNS primary |
| k8s-node2 | 202 | 32 GiB | 256G | |
| k8s-node3 | 203 | 32 GiB | 256G | |
| k8s-node4 | 204 | 32 GiB | 256G | |
| k8s-node5 | 205 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
| k8s-node6 | 206 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
Capacity envelope (6 workers): **174 block-PVC slots**, ~208 GiB memory (48 + 5 × 32), ~96 vCPU, GPU on node1 only. Pod cap is kubelet-default 110/node.
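The slot figure is simply the worker count times the per-VM cap of 29. A small arithmetic sketch (the function names are hypothetical) that also reproduces the percentages quoted next to the scale-up and scale-down thresholds:

```bash
# Per-VM block-PVC cap from proxmox-csi-plugin, as stated in this runbook.
LUN_CAP=29

# Total block-PVC slots for $1 workers.
pvc_slots() {
  echo $(( $1 * LUN_CAP ))
}

# Rounded percentage of the cap used by $1 attachments on one node.
cap_pct() {
  echo $(( ($1 * 100 + LUN_CAP / 2) / LUN_CAP ))
}

pvc_slots 6   # 174
cap_pct 25    # 86
cap_pct 20    # 69
```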
## Binding constraints — read these first
The cluster has 6 capacity dimensions. The one that bites first depends on workload shape; check each before adding/removing nodes.
1. **Per-VM block-PVC ceiling = 29** — hardcoded by `sergelogvinov/proxmox-csi-plugin` at `pkg/csi/utils.go:394` (`for lun = 1; lun < 30; lun++`). Symptom: pods stuck `ContainerCreating` with `FailedAttachVolume … no free lun found`. `CSINode.allocatable.count` advertises `28`/node. Switching `scsihw` to `virtio-scsi-single` does NOT raise this — it's a plugin constraint, not a Proxmox/QEMU one. See `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap".
2. **Memory commitment** — node1 has historically run hot (was 117% of limits before the 2026-06 memory bump to 48 GiB). Treat memory as the next-binding constraint after PVC slots, especially since limits-vs-requests divergence isn't enforced by the scheduler.
3. **sdc IO contention** — every K8s VM disk + TrueNAS NFS LV live on the same Proxmox thin pool on sdc (10.7 TB RAID1 HDD). Three IO storms in 17 days (2026-05-09, 2026-05-16/17, 2026-05-25) — see `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. Adding workers redistributes block PVCs but does NOT relieve underlying disk contention; that's beads `code-oflt`.
4. **GPU concentration** — Tesla T4 is passthrough-only on node1. Frigate ML / Immich ML / Whisper / Piper / llama-cpp all schedule there via `nvidia.com/gpu.present` label. Cannot be spread without provisioning a second GPU node.
5. **PVE host memory** — total PVE RAM 320 GiB. K8s VMs claim 240 GiB; TrueNAS / pfsense / Windows VMs claim ~80 GiB more. Adding a 32-GiB worker requires verifying PVE has the headroom (`free -h`).
6. **Per-stack Terraform state** — adding/removing nodes does NOT live in any single Terragrunt stack today. VMs are created via `scripts/provision-k8s-worker` (which calls `qm clone`). They are *not* managed declaratively in TF. Consequence: removal is a manual `kubectl delete node` + `qm stop` + `qm destroy`, not `tg destroy`.
## When to scale UP (add a worker)
Add a worker when **any** of these is true for ≥7 days:
| Trigger | Threshold | How to observe |
|---------|-----------|----------------|
| PVC slots per node | `max(per-node VA count) ≥ 25` (~86% of 29 cap) | `kubectl get volumeattachment -o json \| jq -r '.items[].spec.nodeName' \| sort \| uniq -c` |
| Cluster memory requests | `> 90%` | `kubectl describe nodes \| grep -A4 "Allocated resources"` or Goldilocks dashboard |
| Planned PVC additions | ≥3 net-new block PVCs in next sprint AND current max VA ≥ 22 | Project-tracker / beads |
| LUN-cap incident | Even one `no free lun found` event | Prometheus alert `ProxmoxCSILunsExhausted` (added 2026-05-31, commit `aded77d5`) |
| Sustained pod-eviction churn | Eviction count > 20/day for ≥3 days | `kubectl get events -A --field-selector reason=Evicted` |
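
The first trigger can be checked mechanically by feeding the `sort | uniq -c` output from the table into a small filter. A sketch, assuming the usual `<count> <node>` layout of `uniq -c`; `flag_hot_nodes` and its default threshold of 25 are illustrative only.

```bash
# Hypothetical filter: read "<count> <node>" lines on stdin and print the
# nodes whose VolumeAttachment count is at or above the threshold
# ($1, default 25) as "<node> <count>".
flag_hot_nodes() {
  awk -v limit="${1:-25}" '$1 + 0 >= limit + 0 { print $2 " " $1 }'
}

# Usage sketch:
# kubectl get volumeattachment -o json \
#   | jq -r '.items[].spec.nodeName' | sort | uniq -c | flag_hot_nodes
```

Empty output means no node has reached the threshold.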
### Playbook — add a worker
```bash
# 1. Choose VMID + IP (next free in 10.0.20.0/22 worker range, 10.0.20.105+ used)
NEXT_VMID=207
NEXT_IP=10.0.20.107
NAME=k8s-node7
# 2. Verify PVE memory headroom (need ≥34 GiB free for a 32-GiB VM with overhead)
ssh root@192.168.1.127 'free -h; pvesh get /nodes/pve/status --output-format=json | jq .memory'
# 3. Verify thin pool has space (the 256 GiB disk is thin-provisioned, so only actual growth consumes pool space)
ssh root@192.168.1.127 'lvs pve/data'
# 4. Clone + cloud-init + auto-join (idempotent — aborts if VMID or IP exists)
scp scripts/provision-k8s-worker root@192.168.1.127:/tmp/
ssh root@192.168.1.127 'bash /tmp/provision-k8s-worker '"$NAME $NEXT_VMID $NEXT_IP"
# 5. Wait for node to appear Ready (3-5 min for cloud-init + kubeadm join)
kubectl get nodes -w
# 6. Verify CSI registration (proxmox-csi + nfs-csi node pods)
kubectl get pods -A -o wide --field-selector spec.nodeName=$NAME | grep -E "csi|calico"
# 7. Confirm Goldilocks / Kyverno / Prometheus targets it (DaemonSets populate within ~2 min)
kubectl get ds -A -o wide | awk '{print $7,$8}' | head -20
# 8. Update this runbook's "Current shape" table
```
**Post-add validation:**
- `kubectl top node $NAME` reports stats (kubelet metrics OK)
- A test pod with a `proxmox-lvm` PVC schedules there and binds
- No new alerts firing in monitoring
## When to scale DOWN (drain a worker)
Scale down when **all** of these hold for ≥30 days:
| Condition | Threshold |
|-----------|-----------|
| Max-node PVC count | `≤ 20` (≈70% of cap) |
| Cluster memory requests | `< 70%` |
| Cluster memory limits | `< 95%` (no over-committed node) |
| No upcoming workload additions | Confirmed via beads / project tracker |
Scaling down is also reasonable as a deliberate trade-off (cost, IO reduction, consolidation) even if thresholds aren't met — but accept that the next scale-up cycle will incur the LUN-cap risk again.
### Playbook — drain + remove a worker
**Pick the lightest node first.** Survey before draining:
```bash
NODE=k8s-node5
# 1. Inventory what's there
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE \
| awk 'NR>1 {print $1}' | sort | uniq -c # pods per namespace
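# 2. List drain blockers (local-path PVCs in use, GPU pods, single-replica services)
# Filters on the selected-node annotation, assuming local-path binds with WaitForFirstConsumer
kubectl get pvc -A -o json | jq -r --arg n "$NODE" '.items[]
| select(.spec.storageClassName == "local-path")
| select(.status.phase == "Bound")
| select(.metadata.annotations["volume.kubernetes.io/selected-node"] == $n)
| "\(.metadata.namespace)/\(.metadata.name)"'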
# 3. Check presence board — is anyone mutating workloads on this node right now?
~/code/scripts/presence list
# If a `service:*` claim covers any pod on $NODE, DEFER until released.
# 4. Cordon (mark unschedulable, existing pods stay)
kubectl cordon $NODE
# 5. Watch memory pressure forecast on remaining nodes BEFORE evicting
kubectl top nodes # baseline
# Expected addition: ~ (sum of pod memory requests on $NODE) / (N - 1) per other node
# 6. Drain (respects PDBs; --delete-emptydir-data needed for tmp volumes)
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=15m
# Expected blips during drain (~30s-2min each for PVC reattach):
# any singleton on $NODE (Deployment replicas=1 or StatefulSet with no peers)
# Multi-replica services with PDB just roll without downtime.
# 7. Verify everything rescheduled cleanly
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE
# Should show only DaemonSet pods + Completed jobs
# 8. Remove from cluster
kubectl delete node $NODE
# 9. Shut down + (optional) destroy the VM
VMID=205
ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300; qm status $VMID"
# To fully destroy (frees thin-pool space):
# ssh root@192.168.1.127 "qm destroy $VMID --purge"
# 10. Verify post-drain shape
kubectl get volumeattachment -o json \
| jq -r '.items[] | select(.spec.attacher == "csi.proxmox.sinextra.dev") | .spec.nodeName' \
| sort | uniq -c
# 11. Update this runbook's "Current shape" table
```
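The forecast in step 5 is a plain division: the memory requests currently on the node, spread evenly over the workers that remain. A rough sketch under that even-spread assumption (the scheduler will not actually distribute pods evenly); `drain_delta_mib` is a hypothetical helper.

```bash
# Hypothetical estimate: MiB of extra memory requests each remaining worker
# absorbs when a node carrying $1 MiB of requests is drained out of $2 workers.
drain_delta_mib() {
  if [ "$2" -le 1 ]; then
    echo "need at least 2 workers" >&2
    return 1
  fi
  echo $(( $1 / ($2 - 1) ))
}

drain_delta_mib 20480 6   # 4096
```

Compare the result against the free headroom per node reported by `kubectl top nodes` before draining.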
**Cold-spare option:** instead of `qm destroy`, keep the VM stopped. The 256 GiB disk stays allocated on thin pool but the VM consumes no CPU/RAM. Re-add via `qm start <VMID>` + `kubeadm join` (the snippet still lives at `/var/lib/vz/snippets/k8s_cloud_init.yaml`).
## Special cases
### Critical singletons that blip during drain
These services are single-replica and incur ~30s-2min outages while their PVC reattaches to the new node:
- **Stateful databases**: `mysql-standalone-0`, `pg-cluster-*` members (CNPG handles failover gracefully)
- **Mail**: `mailserver`, `roundcubemail` (Dovecot maildir locking — defer if mid-incident)
- **Browser-trust services**: `nextcloud` (sessions reset), `vaultwarden` (active sessions blip)
- **Observability**: `prometheus-server` (scrape data gap), `claude-memory`
- **Self-hosted apps with SQLite**: hackmd, n8n, paperless-ngx, freshrss, navidrome, audiobookshelf
Coordinate the drain timing with users if any of these is on the node being drained. Single-pod Postgres/MySQL DBs are the most painful — schedule during low-traffic windows.
### GPU pods
GPU pods are scheduled via the `nvidia.com/gpu.present=true` node selector. They **cannot** drain off node1; if node1 itself needs maintenance, scale GPU workloads to 0 first or defer the drain. See `docs/runbooks/k8s-node-auto-upgrades.md` for the kured-driven reboot path.
### Active sessions
Check `~/code/scripts/presence list` before any drain. If another session holds a claim on a service hosted on the target node, defer or coordinate.
### Force-clean stuck VolumeAttachments
If a drained node has lingering VolumeAttachment entries after `kubectl delete node`:
```bash
kubectl get volumeattachment -o json \
| jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
| xargs -r kubectl patch volumeattachment -p '{"metadata":{"finalizers":null}}' --type=merge
kubectl get volumeattachment -o json \
| jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
| xargs -r kubectl delete volumeattachment
```
## Related
- `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap" — root-cause explanation of the PVC ceiling
- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md` — IO contention on sdc
- `docs/runbooks/k8s-node-auto-upgrades.md` — kured-driven rolling reboots (separate from scale)
- `docs/runbooks/restore-full-cluster.md` — disaster scenarios
- `scripts/provision-k8s-worker` — the actual cloning/join script
- Beads `code-oflt` — IO isolation (long-term fix for sdc contention)
- Remote memory id=2788 — `proxmox-csi-plugin hardcodes a per-VM SCSI-LUN ceiling`

@ -0,0 +1,191 @@
# Security Incident Response
What to do when a wave-1 security alert fires. Each alert links to a Loki query for investigation and concrete remediation steps.
**Status: planned, not yet implemented.** Beads epic: `code-8ywc`. This runbook is the response playbook for when wave 1 ships.
## General workflow
1. **Acknowledge in Alertmanager.** Silence only after triage starts.
2. **Pull context from Loki** (queries below). Get the actor, source IP, timestamp.
3. **Decide: real or false-positive?** Use the "false-positive cases" notes below.
4. **If real:** revoke credentials (Vault token revoke, K8s SA token rotate, SSH key remove, OIDC session invalidate), then post-mortem.
5. **If false-positive:** tune the alert (extend allowlist, refine LogQL query).
## Allowlist CIDRs
All source-IP-based alerts (K2, K9, V7, S1) reference this list. Update in one place: Terraform variable `security_source_ip_allowlist` in `stacks/monitoring`.
- `10.0.20.0/22` — VLAN 20 (cluster + main LAN)
- `192.168.1.0/24` — Proxmox + Sofia LAN
- K8s pod CIDR (verify at implementation time)
- K8s service CIDR
- Headscale tailnet
**Anything outside = alert.** No public-IP exceptions.
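
During triage it can help to test a source IP against these ranges by hand. The sketch below does the CIDR arithmetic in plain shell. The helper names are hypothetical, and only the two fixed ranges from the list are included, because the pod, service and Headscale CIDRs are still to be verified.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if IP ($1) falls inside CIDR ($2), non-zero otherwise.
ip_in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Hypothetical check against the two fixed ranges from the allowlist.
in_allowlist() {
  local cidr
  for cidr in 10.0.20.0/22 192.168.1.0/24; do
    ip_in_cidr "$1" "$cidr" && return 0
  done
  return 1
}

# Usage sketch:
# in_allowlist 10.0.21.5 && echo allowed || echo "outside allowlist"
```

Note that `10.0.20.0/22` spans `10.0.20.0` through `10.0.23.255`, so any regex standing in for it has to cover all four third-octet values.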
## Viktor's identity
`me@viktorbarzin.me` is the ONLY allowlisted human identity. NOT `viktor@viktorbarzin.me`. NOT `emo@viktorbarzin.me`. emo's identity scheme is separate and must be added explicitly if/when needed.
---
## K-alerts (K8s API audit)
### K2 — ServiceAccount token used from outside cluster
**Meaning:** A K8s ServiceAccount token authenticated a request whose `sourceIPs[0]` is not in the pod CIDR or trusted LAN. Stolen SA token used externally.
```logql
{job="kube-audit"} | json | json source_ip="sourceIPs[0]" | user_username =~ "system:serviceaccount:.*" | source_ip !~ "10\\.0\\.2[0-3]\\..*|192\\.168\\.1\\..*"
```
**Action:** Identify the SA. Rotate its token (`kubectl delete secret <sa-token-name>` if old-style, or recreate the SA if projected token). Audit the SA's permissions and tighten.
**False positives:** Pod-to-apiserver traffic that egresses and re-enters via NodePort/LB (rare). Investigate the originating workload.
### K3 — Secret read in sensitive namespace by unexpected actor
**Meaning:** A Secret in `vault`, `sealed-secrets`, or `external-secrets` namespace was read by an SA NOT in the allowlist (ESO controller, sealed-secrets controller, Vault SA, `me@viktorbarzin.me`).
```logql
{job="kube-audit"} | json | verb =~ "get|list" | objectRef_resource = "secrets" | objectRef_namespace =~ "vault|sealed-secrets|external-secrets" | user_username !~ "(me@viktorbarzin.me|system:serviceaccount:external-secrets:.*|system:serviceaccount:sealed-secrets:.*|system:serviceaccount:vault:.*)"
```
**Action:** Identify the actor. If a service account, audit its bindings — it shouldn't have RBAC to read those secrets. Revoke the binding. Rotate any secrets that were read.
### K4 — Exec into sensitive pod
**Meaning:** Someone `kubectl exec`'d into a pod in `vault`, `kube-system`, `dbaas`, or `cnpg-system`.
```logql
{job="kube-audit"} | json | verb = "create" | objectRef_resource = "pods" | objectRef_subresource = "exec" | objectRef_namespace =~ "vault|kube-system|dbaas|cnpg-system" | user_username != "me@viktorbarzin.me"
```
**Action:** Determine if Viktor authorized the exec. If unrecognized actor, revoke their access and rotate any credentials they could have read inside the pod.
**False positives:** Break-glass SAs used during incident response — extend the allowlist to include them by SA name.
### K5 — Mass delete
**Meaning:** Single actor deleted >5 Pods, Secrets, or ConfigMaps in 60 seconds. Either a script gone wrong or destructive intrusion.
```logql
sum by (user_username) (count_over_time({job="kube-audit"} | json | verb = "delete" | objectRef_resource =~ "pods|secrets|configmaps" [1m])) > 5
```
**Action:** Identify actor. If a Terraform apply or known cleanup job, false positive. If unrecognized, suspend the actor's credentials immediately and audit what was deleted.
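The K5 threshold logic can be illustrated outside Loki. The sketch below approximates it with a per-actor sliding window; the real alert is evaluated by `count_over_time` at rule intervals, so boundaries may differ slightly. The event tuple shape and the function name are assumptions made for illustration.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 5  # alert when an actor exceeds this many deletes in the window
WATCHED = {"pods", "secrets", "configmaps"}


def mass_delete_actors(events):
    """Return the set of actors with more than THRESHOLD watched deletes in 60s.

    events: iterable of (timestamp_seconds, username, verb, resource),
    sorted by timestamp (hypothetical shape, not the raw audit format).
    """
    recent = defaultdict(deque)
    flagged = set()
    for ts, user, verb, resource in events:
        if verb != "delete" or resource not in WATCHED:
            continue
        window = recent[user]
        window.append(ts)
        # Drop deletes that fell out of the 60-second window.
        while ts - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) > THRESHOLD:
            flagged.add(user)
    return flagged
```

Replaying exported audit events through it helps separate a slow, legitimate cleanup (deletes spread out over minutes) from a burst that would trip the alert.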
### K6 — Audit policy modified
**Meaning:** Someone changed the kube-apiserver audit policy. Should only happen via Terraform.
**Action:** Verify the change came from a planned Terraform apply (check recent commits to `stacks/infra`). If not, treat as critical compromise — attacker disabling visibility.
### K7 — New ClusterRole with full wildcards
**Meaning:** A new ClusterRole was created with `verbs: ["*"]` and `resources: ["*"]`. Privilege escalation primitive.
```logql
{job="kube-audit"} | json | json rule_verb="requestObject.rules[0].verbs[0]", rule_resource="requestObject.rules[0].resources[0]" | verb = "create" | objectRef_resource = "clusterroles" | rule_verb = "*" | rule_resource = "*"
```
**Action:** Verify the change is intentional (some operators install such roles — calico, kyverno). If unrecognized, delete the ClusterRole and audit the creator.
### K8 — Anonymous binding
**Meaning:** A RoleBinding or ClusterRoleBinding was created referencing `system:anonymous` or `system:unauthenticated`. Catastrophic — allows unauthenticated cluster access.
**Action:** Delete the binding immediately. Audit who created it. Treat as full cluster compromise — rotate all secrets, force kubeconfig re-issue.
### K9 — Viktor's identity from unexpected source IP
**Meaning:** A request authenticated as `me@viktorbarzin.me` arrived from a source IP outside the allowlist. Stolen OIDC token / kubeconfig.
```logql
{job="kube-audit"} | json | json source_ip="sourceIPs[0]" | user_username = "me@viktorbarzin.me" | source_ip !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<pod-cidr>|<headscale-cidr>"
```
**Action:** Revoke Viktor's OIDC session in Authentik. Rotate Vault OIDC tokens. Audit recent activity from that IP. Verify Viktor's devices for compromise.
**False positives:** Viktor's machine on a new network without VPN — should not happen per the "no public IP access" policy. If it does, the policy needs revisiting, not the alert.
---
## V-alerts (Vault audit)
### V1 — Root token created
```logql
{job="vault-audit"} | json | request_path = "auth/token/create" | response_auth_policies = "root"
```
**Action:** Verify against Terraform / planned operation. Root tokens should ONLY be created during initial Vault setup or break-glass.
### V2 — Audit device disabled/modified
**Action:** Attacker silencing visibility. Re-enable immediately. Treat as critical compromise.
### V3 — Seal status changed
**Action:** Verify whether this is a planned operation (unseal during upgrade). If unplanned, treat as critical.
### V4 — Policy modified
**Action:** Confirm change came from a Terraform apply. Allowlist Terraform's source IP / token role. Otherwise: review the policy diff, revert if malicious.
### V5 — Auth failure spike
**Action:** Identify the auth method and source. If CI token rotation, false positive. If unknown source brute-forcing, block the source IP at pfSense.
### V6 — Token with policies different from parent
**Action:** Privilege escalation attempt. Revoke the new token. Audit the parent token's policies.
### V7 — Viktor's Vault identity from unexpected source IP
**Meaning:** A Vault operation authenticated as Viktor's entity_id arrived from an IP not in the allowlist. Requires `x_forwarded_for_authorized_addrs` to be configured (Vault sits behind Traefik so `remote_addr` is Traefik's pod IP without XFF trust).
**Action:** Revoke Viktor's Vault OIDC tokens. Force OIDC re-auth. Audit Vault access from that IP.
---
## S-alerts (Host)
### S1 — PVE sshd auth success from unexpected IP
```logql
{job="sshd-pve"} |= "Accepted" | regexp "Accepted (?P<method>\\S+) for (?P<user>\\S+) from (?P<ip>\\S+)" | ip !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<headscale-cidr>"
```
**Action:** Remove the user's SSH key from `/root/.ssh/authorized_keys` if it's still there. Audit recent sudo/login history (`last`, `sudo -i; journalctl _COMM=sudo`). Consider PVE as compromised — rotate root password, audit `/root/.luks-backup-key`, audit `/usr/local/bin/lvm-pvc-snapshot` and backup scripts for tampering.
---
## False-positive triage decision tree
```
Did the alert fire from a known operational event?
├─ Terraform apply at the same time? → likely V4 (policy modified)
├─ Keel auto-roll? → not a security path
├─ CI/CD pipeline running? → check V5 / K5
└─ Viktor doing recovery work? → K4, K9, S1 candidates
Extend allowlist if persistent
```
## Escalation
For SEV1 (multiple alerts, cluster-admin grants, anonymous bindings, mass deletes):
1. Cordon all nodes (`kubectl cordon`) to prevent further pod scheduling — but be aware this also stops legitimate recovery work
2. Revoke all OIDC sessions in Authentik
3. Rotate Vault root keys + reseal
4. Restore from a pre-incident backup if data integrity is questionable
5. Post-mortem per `incident-response.md`
## Related
- [Security architecture](../architecture/security.md)
- [Monitoring architecture](../architecture/monitoring.md)
- [Incident response (general)](../architecture/incident-response.md)
- Beads epic: `code-8ywc`


@ -0,0 +1,127 @@
# Runbook: Synology NAS storage — navigate, assess, clean
**Target:** Synology DS218 (`NAS_Barzini`), `192.168.1.13`, `/volume1`
(5.3 TiB btrfs). This is the **offsite backup target** (Copy 3 of the
3-2-1 strategy) **and a shared family volume** — homelab data is only
under `Backup/Viki/`; `Anca/`, `Emo/`, `Common/`, `music`, `video`,
`photo` etc. are family data.
Related: [storage architecture](../architecture/storage.md) ·
[backup & DR](../architecture/backup-dr.md)
## Access
- SSH: `ssh Administrator@192.168.1.13` (capital `A`; key-auth works
from devvm and the PVE host). `Administrator` can `sudo`.
- sudo password: Vault `secret/viktor`, key `synology_admin_password`
(`VAULT_ADDR=https://vault.viktorbarzin.me`). DSM Web API has 2FA, so
**SSH+sudo is the only unattended path** (`read -r PW; printf '%s\n'
"$PW" | sudo -S -p '' <cmd>` to keep the secret out of `argv`).
## ⚠️ NEVER run `du` / `find` / `ncdu` on this NAS
Recursive walks over the multi-TB `Backup` share take 10+ min (often
never finish) and burn disk/IO on the NAS. Use Synology's own
pre-indexed data instead:
| Need | Instant, non-walking source |
|---|---|
| Volume fill | `df -h /volume1` |
| btrfs real usage | `btrfs filesystem df /volume1` |
| Per-subvolume | `sudo btrfs qgroup show -prce --raw /volume1` |
| **Per-share / per-owner / per-type / largest / oldest / dupes** | **Storage Analyzer weekly report** (below) |
### Storage Analyzer weekly report
Storage Analyzer is installed and writes a report every **Monday
~00:00** to:
```
/volume1/Backup/Viki/synoreport/weekly storage report/<YYYY-MM-DD_..>/
```
Data is up to ~7 days stale. The useful files are zipped CSVs in
`csv/`. **Content is UTF-16, and there is no `unzip` on the box**, so
read them with Python:
```python
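import os
import zipfile

R = ".../<date>/csv"  # the report's csv/ directory


def readcsv(n):
    """Return the decoded text of the first member of zipped CSV `n`."""
    with zipfile.ZipFile(os.path.join(R, n)) as z:
        raw = z.read(z.namelist()[0])
    # Content is normally UTF-16; fall back to UTF-8 variants.
    for enc in ("utf-16", "utf-8-sig", "utf-8"):
        try:
            return raw.decode(enc)
        except UnicodeError:
            pass
    raise ValueError(f"could not decode {n}")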
```
Key CSVs: `volume_usage`, `share_list` (per-share, incl/excl recycle),
`quota_usage.share` (**per-owner within a share**), `file_group`
(per-file-type), `large_file`, `least_modify` (oldest), `duplicate_file`.
The `*.db` files (`folder.db` etc.) are a **custom Synology format —
NOT sqlite**; `report.html` does not embed clean folder totals.
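Once decoded, the report text still has to be split into rows. A minimal sketch, assuming the CSVs are comma-delimited (not verified here; pass another `delimiter` if they are not) and using a hypothetical helper name `read_report_rows` that takes the zip file's bytes:

```python
import csv
import io
import zipfile


def read_report_rows(zip_bytes: bytes, delimiter: str = ","):
    """Decode the first member of a zipped report CSV and return its rows."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        raw = z.read(z.namelist()[0])
    for enc in ("utf-16", "utf-8-sig", "utf-8"):
        try:
            text = raw.decode(enc)
            break
        except UnicodeError:
            continue
    else:
        raise ValueError("unrecognised encoding")
    # newline="" lets the csv module handle embedded line breaks itself.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
```

The first row is normally the header, so `rows[0]` shows which columns a given CSV actually carries before any totals are computed from it.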
## btrfs space-reclaim is ASYNCHRONOUS — and snapshot-pinned
- Deleting files/snapshots returns instantly but `df` lags minutes
while the btrfs cleaner reclaims extents (~30 GB/min on the DS218).
- Data deleted from the live share **stays on disk until the share
snapshots that still reference it also rotate out.** There are 4
daily `Backup` share snapshots (`GMT-*-21.00.02`), so **expect up to
~4 days of lag** before a delete fully frees space.
- Snapshot CLI (sudo, full path): `/usr/syno/sbin/synosharesnapshot
{list|delete} Backup <snap>...`. Retention:
`/usr/syno/etc/sharesnap/sharesnap.conf`.
## Capacity alert
The Synology mount surfaces to Prometheus as the PVE host NFS mount
`/mnt/synology-backup` (`job="proxmox-host"`, `fstype=nfs4`), caught by
the **global `NodeFilesystemFull`** rule in
`stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`.
- **2026-06-05:** threshold changed **90% → 95%** (`* 100 < 5`) at
user request — a backup target legitimately runs hot, so 90% was
noisy. NOTE: this rule is **global**, so the looser 95% now applies to
all node/system disks too. `BackupDiskFull` (the sda `/mnt/backup`
disk, separate alert) stays at 85%.
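As a worked example of the threshold change, the sketch below applies a "free percent below N" test to the figures from this runbook's 2026-06-05 assessment (324 GiB free of 5.3 TiB). The function names are illustrative, and the comparison is assumed to match the rule's `* 100 < 5` form; the full PromQL expression is not reproduced here.

```python
GIB_PER_TIB = 1024


def free_percent(free_gib: float, size_tib: float) -> float:
    """Percentage of the volume that is still free."""
    return free_gib / (size_tib * GIB_PER_TIB) * 100


def fires(free_gib: float, size_tib: float, min_free_percent: float) -> bool:
    """True when free space drops below the alert's minimum free percentage."""
    return free_percent(free_gib, size_tib) < min_free_percent


# Figures from the 2026-06-05 assessment in this runbook.
old_rule = fires(324, 5.3, 10)  # 90% full threshold
new_rule = fires(324, 5.3, 5)   # 95% full threshold
```

With roughly 6% free, the old 90% rule would fire while the 95% rule stays quiet, which is the noise the change was meant to remove.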
## Current assessment — 2026-06-05
`/volume1` at **94% (5.0 TiB used / 5.3 TiB, 324 GiB free)**, down from
98% on 2026-05-24. The **`Backup` share is 4.42 TiB (86%)**:
Administrator/homelab **3.92 TiB**, Emo/family **504 GiB**. By type:
Other 1.76 TiB, Videos 1.33 TiB, Pictures 631 GiB, Zipped 495 GiB,
DiskImage 77 GiB. The ~1.9 TiB of media is mostly the **Immich offsite
backup** (`Viki/nfs/immich` + `nfs-ssd/immich`), which **grows daily —
the structural capacity driver now that one-off cleanups are spent.**
### Already reclaimed (verified gone)
`Anca/Elements` (770 GiB — dir now empty), `prometheus-backup` (63 GiB),
`ollama`/`llamacpp`/`audiblez`/`ebook2audiobook` — removed in the
2026-06-01 cleanup; nfs-mirror now excludes the regenerable services.
### Cleanup candidates — homelab (`Backup/Viki/`, Administrator-owned)
| Target | Size | Notes |
|---|---|---|
| `Photos/gphotos-1/` | **208 GiB** zips (+ extracted) | 2023 Google Takeout, **already imported to Immich** (`immich-go.exe` beside them; dupes confirmed). Redundant. |
| `laptop/` | ~167 GiB | old VM images (Kali/windows vdis, metasploitable, soton-rpi.img) |
| `All-in-one/` | ~95 GiB | 20152018 archives |
| `#recycle/` (Backup) | ~16 GiB | recycle bin (HA backup rotation) |
| loose `*.asc`/`*.mov` in `Viki/` root | ~8 GiB | old encrypted archives, phone videos |
| `sgs7/` | ~3.5 GiB | 2021 Galaxy S7 backup |
**~500 GiB** reclaimable without touching live backups or family data.
### Cleanup candidates — family (flag to Emo, do not delete)
- `Emo/D/` Windows 7 vmdks — **3 identical 39.5 GiB copies** (one live +
two under `_SYNCAPP/Versioning/`) → 79 GiB dedup.
- Emo-shared recycle bin: 12.6 GiB.
### Do NOT touch
`Viki/pve-backup/` (live structured backup), `Viki/nfs/immich` +
`nfs-ssd/immich` (irreplaceable), `HomeAssistant/` + `ha_backup_vermont/`
(~7 GiB, healthy 3-copy retention).


@ -0,0 +1,51 @@
# Runbook: Applying the Technitium Terraform stack
Last updated: 2026-04-19
The `stacks/technitium/` apply has a **post-apply readiness gate** that asserts all three DNS instances are healthy before the apply is allowed to finish. This runbook explains what it checks, how to interpret failures, and how to override it for emergency maintenance.
## What the gate checks
`stacks/technitium/modules/technitium/readiness.tf` defines `null_resource.technitium_readiness_gate`. It runs after the three Technitium deployments, the DNS LoadBalancer service, and the PDB are applied, and performs:
1. **Rollout status**: `kubectl rollout status deploy/<name> --timeout=180s` for `technitium`, `technitium-secondary`, `technitium-tertiary`. Fails if any deployment has not reached its desired pod count within 180s.
2. **Per-pod API health** — for every pod with label `dns-server=true`, executes `wget http://127.0.0.1:5380/api/stats/get` inside the pod and asserts the response contains `"status":"ok"`. Catches Technitium process hangs that TCP probes miss.
3. **Zone-count parity** — queries `technitium-web`, `technitium-secondary-web`, `technitium-tertiary-web` and counts the zones returned. Fails if the three counts differ, which would mean `technitium-zone-sync` has drifted or a replica has lost state.
The gate is re-run whenever any of the deployment container spec, the CoreDNS Corefile, or the apply timestamp changes (see `triggers` in `readiness.tf`).
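The pass/fail logic of checks 2 and 3 can be sketched in a few lines. The real gate lives in `readiness.tf`; the Python below is only an illustration, and the function names and input shapes are assumptions.

```python
def api_healthy(body: str) -> bool:
    """Check 2: the stats response must contain the literal "status":"ok"."""
    return '"status":"ok"' in body


def zones_in_parity(zone_counts: dict) -> bool:
    """Check 3: every instance must report the same number of zones."""
    return len(set(zone_counts.values())) == 1


def gate_passes(api_bodies: dict, zone_counts: dict) -> bool:
    """Combine both checks; api_bodies maps pod name to response body."""
    return all(api_healthy(b) for b in api_bodies.values()) and zones_in_parity(zone_counts)
```

A single replica reporting a different zone count is enough to fail the whole gate, which is why a drifted `technitium-zone-sync` blocks an otherwise healthy apply.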
## Emergency override
Set `skip_readiness=true` via terragrunt inputs or pass it directly to the Terraform apply:
```bash
cd infra/stacks/technitium
scripts/tg apply -var skip_readiness=true
```
Only use this when you need to land a Terraform change while one Technitium instance is intentionally offline (e.g., you are replacing its PVC, migrating storage, or recovering a corrupted config DB). Re-apply without the flag once the instance is back.
You can also target around the gate during emergency work:
```bash
scripts/tg apply -target=kubernetes_config_map.coredns
```
`-target` bypasses the `depends_on` chain feeding the gate, so a single-resource push does not need the gate to pass.
## Failure modes and responses
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `rollout status` times out on one deployment | Pod stuck `Pending` (node pressure / anti-affinity with other dns-server pods) or `ImagePullBackOff` | `kubectl describe pod` for events. If anti-affinity is blocking, confirm 3 nodes are Ready. |
| API check fails on a pod but readiness probe passes | Technitium process hung but port 53 still accepting TCP (liveness probe is `tcp_socket` on :53) | `kubectl delete pod <name>` — deployment will recreate it. |
| Zone count differs between instances | `technitium-zone-sync` CronJob is failing or AXFR is blocked | `kubectl logs -n technitium -l job-name=<latest-zone-sync-job>`. Check `TechnitiumZoneSyncFailed` alert. |
| Gate passes but external clients still cannot resolve | Gate only checks in-pod API and intra-cluster zone parity — external path (LoadBalancer → Technitium pod) is not tested | Run the LAN-client drill in `docs/architecture/dns.md` troubleshooting section. |
## What the gate does NOT check
- External reachability through the LoadBalancer IP `10.0.20.201` (that would require a LAN-side probe).
- CoreDNS health (CoreDNS is patched by `coredns.tf`, not this module's deployments — alerts `CoreDNSErrors` / `CoreDNSForwardFailureRate` catch regressions post-apply).
- Upstream resolver health (covered by `CoreDNSForwardFailureRate`).
For broader end-to-end verification, see `docs/architecture/dns.md` → "Verification" section, or run the Uptime Kuma external DNS probe.


@ -0,0 +1,217 @@
# Runbook: Vault Raft Leader Deadlock + Safe Pod Restart
Captures the 2026-04-22 incident pattern. When a Vault raft leader enters a
stuck goroutine state (port 8201 accepts TCP but RPCs never return), the
recovery is *not* `kubectl delete --force`. Force-deleting a Vault pod that
holds a stuck NFS mount leaves kernel NFS client state corrupted, which
blocks all subsequent NFS mounts from the node and usually requires a VM
hard-reset to clear.
**Related**: [post-mortems/2026-04-22-vault-raft-leader-deadlock.md](../post-mortems/2026-04-22-vault-raft-leader-deadlock.md).
## Symptoms
- `https://vault.viktorbarzin.me/v1/sys/health` returns HTTP 503.
- Standbys log `msgpack decode error [pos 0]: i/o timeout` every 2s.
- `kubectl exec` into a standby shows raft thinks the leader is alive
(peers list all `Voter`, leader address populated) but `vault operator
raft autopilot state` stalls or errors.
- The "leader" pod's logs go silent — no heartbeats, no audit writes,
nothing. TCP on 8201 still accepts connections.
- ESO-backed secrets stop refreshing (ExternalSecret `SecretSyncedError`).
- Woodpecker CI pipelines that read from Vault at plan time hang.
## 0. Confirm the diagnosis (before touching anything)
Don't jump to force-delete. Verify the leader is actually stuck, not just
slow:
```sh
# 1. Who does raft think the leader is?
kubectl exec -n vault vault-0 -c vault -- vault status 2>&1 | \
grep -E 'HA Mode|Active Node|Leader|Raft'
# 2. Is the leader's port open but unresponsive?
LEADER_POD=vault-2 # or whichever vault status reports
kubectl exec -n vault $LEADER_POD -c vault -- sh -c \
'timeout 3 nc -zv 127.0.0.1 8200 2>&1; echo; timeout 3 vault status'
# 3. Is the active vault service pointing at a real pod?
kubectl get endpoints -n vault vault-active -o yaml | \
grep -E 'addresses|notReadyAddresses' -A2
# 4. What do standby logs say?
kubectl logs -n vault vault-0 -c vault --tail=40 | grep -iE 'msgpack|decode|rpc'
```
If (2) hangs and (4) shows repeated msgpack errors → stuck leader.
## 1. Identify the stuck pod precisely
```sh
# No telemetry yet, so use log silence as a proxy for vault_core_active:
# count each pod's log lines from the last 5 minutes.
for p in vault-0 vault-1 vault-2; do
  n=$(kubectl logs -n vault "$p" -c vault --since=5m 2>/dev/null | wc -l)
  echo "$p: $n log lines in the last 5m"
done
```
The pod whose logs have been silent for minutes while the others are
actively erroring is the stuck leader.
## 2. The safe restart sequence (avoids zombie containers)
**DO NOT** `kubectl delete pod --force --grace-period=0` as the first
step. On NFS-backed Vault that's the exact move that leaves the kernel
NFS client corrupted on the node where the stuck pod ran.
Instead:
### 2a. Graceful delete first (30s grace)
```sh
kubectl delete pod -n vault vault-2
```
Wait 30 seconds. Most of the time the TERM → SIGKILL path works and the
new pod schedules cleanly. The remaining standbys elect a new leader and
the external endpoint recovers.
### 2b. If the pod is Terminating after 60s, find the stuck process
```sh
NODE=$(kubectl get pod -n vault vault-2 -o jsonpath='{.spec.nodeName}')
POD_UID=$(kubectl get pod -n vault vault-2 -o jsonpath='{.metadata.uid}')
ssh $NODE "sudo ps auxf | grep -A2 $POD_UID | head -20"
# Look for: mount.nfs (D-state), vault (Z-state), or the sh wrapper in do_wait
```
### 2c. Unmount stale NFS before force-deleting
If the old pod's NFS mount is still present, lazy-unmount it FIRST so
the kernel can release NFS session state cleanly:
```sh
ssh $NODE "sudo mount | grep $POD_UID | awk '{print \$3}' | xargs -I{} sudo umount -l {}"
```
Verify no mount.nfs processes are in D-state on the node:
```sh
ssh $NODE "ps -eo state,pid,comm | grep '^D' | head -5"
```
### 2d. Only NOW force-delete if needed
```sh
kubectl delete pod -n vault vault-2 --force --grace-period=0
```
## 3. Recovery when the node is already stuck
If you force-deleted before reading this runbook and NFS is now broken
on the node:
**Diagnostic — confirm NFS client state is corrupted:**
```sh
NODE=k8s-node2 # node where the force-delete happened
ssh $NODE "sudo mkdir -p /tmp/nfstest && sudo timeout 30 \
mount -t nfs 192.168.1.127:/srv/nfs /tmp/nfstest && echo MOUNT_OK"
```
If the mount times out at 30-110s, kernel NFS client state is stuck.
No userspace recovery exists — only a VM reboot clears it.
**Workaround before rebooting**: mounting with `nfsvers=4.1` succeeds
on broken nodes (the corruption is NFSv4.2 session-state specific).
This is useful for diagnostic mounts, but does NOT fix CSI pods —
their mount options come from the `nfs-proxmox` StorageClass and can't
be overridden per-pod.
**Reboot the affected node VM:**
```sh
# Find PVE VM ID — nodes numbered 201-204 for k8s-node1..4
ssh root@192.168.1.127 "qm reset 20<N>"
# If qm reset leaves the VM PID unchanged (it didn't actually reboot),
# use qm stop/start:
ssh root@192.168.1.127 "qm stop 20<N> && qm start 20<N>"
```
Wait for the node to become Ready (`kubectl get node k8s-node<N> -w`)
and CSI driver to register (`kubectl get pods -n nfs-csi -o wide`).
**Gotcha — `qm reset` can be a no-op.** On the 2026-04-22 incident,
`qm reset 201` returned exit 0 but did NOT restart the VM (same QEMU PID
before and after). `qm status` reported "running" throughout. Always
verify by checking the QEMU PID or VM uptime post-reset. If uptime is
unchanged, escalate to `qm stop && qm start`.
**Gotcha — check boot order before stop/start.** Long-running VMs
(630+ day uptime) may have stale `bootdisk:` config that's been hidden
by never rebooting. On 2026-04-22, k8s-node1's config had `bootdisk:
scsi0` but the actual OS disk was on `scsi1`, so the first boot after
stop attempted iPXE and failed. Before stopping, verify:
```sh
ssh root@192.168.1.127 "grep -E 'boot|scsi[0-9]+:' /etc/pve/qemu-server/20<N>.conf"
```
If `bootdisk` references a disk ID that doesn't exist, fix it first
with `qm set 20<N> --boot "order=scsi<ID>"` (use the ID of the main
OS disk).
## 4. Prevent re-infection — the chown loop
After the node comes back, the vault pod's PV chown walk can still
peg kubelet. The durable fix is in `stacks/vault/main.tf`:
```hcl
statefulSet = {
securityContext = {
pod = {
fsGroupChangePolicy = "OnRootMismatch"
}
}
}
```
This was applied in commit `2f1f9107` (2026-04-22). If you find
yourself editing this in a kubectl patch for live recovery, follow
up with a Terraform apply in the same session. Leaving the cluster
ahead of Terraform state is technical debt that re-triggers on the
next apply.
## 5. Verify end-to-end
```sh
# External endpoint — the user-facing health check
curl -sk -o /dev/null -w "%{http_code}\n" https://vault.viktorbarzin.me/v1/sys/health
# expect: 200
# Raft peers (needs VAULT_TOKEN with operator capability)
kubectl exec -n vault vault-0 -c vault -- vault operator raft list-peers
# All pods 2/2
kubectl get pods -n vault -l app.kubernetes.io/name=vault -o wide
# No alerts fired (once VaultRaftLeaderStuck + VaultHAStatusUnavailable are live)
curl -s https://alertmanager.viktorbarzin.me/api/v2/alerts | \
jq '.[] | select(.labels.alertname | test("Vault"))'
```
## Known limitations
- **No alert for stuck leaders yet.** `VaultRaftLeaderStuck` and
`VaultHAStatusUnavailable` require Vault telemetry enabled
(`telemetry { unauthenticated_metrics_access = true }`) and a
scrape job. Alerts are defined in `prometheus_chart_values.tpl`
but stay silent until telemetry lands — tracked as a beads task.
- **Vault on NFS violates the documented rule.** `infra/.claude/CLAUDE.md`
says critical services must use `proxmox-lvm-encrypted`. The
`dataStorage`/`auditStorage` still use `nfs-proxmox`. Migration
tracked as an epic-level beads task.


@ -0,0 +1,114 @@
# Runbook: devvm Vault token auto-renewal
**Host:** `devvm` (10.0.10.10), user `wizard`
**Source of truth:** `infra/scripts/vault-token-renew.{sh,service,timer}`
**Live paths:** `~/.local/bin/vault-token-renew`, `~/.config/systemd/user/vault-token-renew.{service,timer}`
## What this is
`wizard@devvm` authenticates to Vault with a **periodic, orphan** token stored
in `~/.vault-token`, instead of a 7-day OIDC login that needed weekly
re-auth. A systemd **user** timer renews it daily so it never expires.
| Property | Value |
|---|---|
| `display_name` | `token-devvm-wizard` |
| `period` | `768h` (32 days) |
| `explicit_max_ttl` | `0` (no hard cap) |
| `policies` | `default`, `sops-admin`, `vault-admin` |
| `orphan` | `true` (not revoked when any parent expires) |
Periodic tokens have no max-TTL; they only need renewing once per `period`.
Daily renewal leaves a 32× margin. **If devvm is decommissioned and the timer
stops, the token self-expires within ~32 days** — deliberately, unlike a root
token which would live forever (this is the security trade-off Viktor chose:
periodic + renewer over a never-expiring root token).
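The margin arithmetic can be checked directly. A small sketch using only the values from the table above; the variable names are illustrative.

```python
from datetime import timedelta

PERIOD = timedelta(hours=768)    # token period from the table
RENEW_EVERY = timedelta(days=1)  # the timer renews daily

period_seconds = int(PERIOD.total_seconds())
period_days = PERIOD.days
margin = PERIOD / RENEW_EVERY    # how many renewals fit in one period
```

`period_seconds` is the TTL value that should appear in a healthy renewer log line, and `margin` is the number of consecutive missed daily renewals the token can absorb before it expires.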
## Deploy on a fresh devvm
The renewer is a host-side script + user systemd units, deployed manually (same
model as the other `infra/scripts/` host scripts). From a checkout of the repo
**as user `wizard` on devvm**:
```bash
cd ~/code/infra/scripts
install -m 0755 vault-token-renew.sh ~/.local/bin/vault-token-renew # strip .sh
install -m 0644 vault-token-renew.service vault-token-renew.timer ~/.config/systemd/user/
# user manager must survive logout, so the daily timer fires headless
loginctl enable-linger "$USER"
systemctl --user daemon-reload
systemctl --user enable --now vault-token-renew.timer
```
Then mint the token (one-time, interactive — see below). The script and units
carry no secret; only the token itself is sensitive and stays out of git.
## Mint / re-mint the token
Requires an interactive OIDC login (browser), so it can't run unattended:
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
vault login -method=oidc
vault token create -orphan -period=768h \
-policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
-field=token > ~/.vault-token
chmod 600 ~/.vault-token
```
Vault prefixes the display name, so it becomes `token-devvm-wizard` (which is
what the drift guard checks for). `-orphan` is essential: a child of the 7-day
OIDC token would be revoked when that parent expired.
## Health check
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
vault token lookup | grep -E 'display_name|period|explicit_max_ttl|policies'
# expect: display_name token-devvm-wizard, period 768h, explicit_max_ttl 0s,
# policies [default sops-admin vault-admin]
# authoritative write-capability check (do NOT trust the policies field alone —
# an OIDC token shows policies=[default] but carries vault-admin via identity):
vault token capabilities secret/data/viktor # expect create/update/.../sudo
# renewer health
systemctl --user list-timers | grep vault-token-renew # next/last run
tail -5 ~/.local/state/vault-token-renew.log # recent results
```
A healthy log line looks like:
`<ts> OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h).
## Drift guard & recovery
`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
overwrites it. Two confirmed clobber vectors:
1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
can't push past the OIDC role's 7-day `token_max_ttl`).
2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
**cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for
two days — reads worked, writes silently 403'd.
To stop the renewer from silently keeping a foreign token alive, it runs a
**drift guard** first: it refuses to renew unless the token is
`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and
exits non-zero (the systemd unit goes `failed`) rather than renewing someone
else's token. Symptom in the log:
`<ts> DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...`
**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the
[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does
**not** auto-recover (a deliberate scope choice — version-only, no self-heal);
recovery is the manual re-mint above.
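The drift-guard decision reduces to two checks on the token lookup. The actual guard is part of the shell renewer; the Python below is only a sketch of the same decision, assuming the JSON shape printed by `vault token lookup -format=json` (fields under `data`). `is_drifted` is a hypothetical name.

```python
import json

EXPECTED_DISPLAY_NAME = "token-devvm-wizard"
REQUIRED_POLICY = "vault-admin"


def is_drifted(lookup_json: str) -> bool:
    """Return True when the token is not the expected periodic devvm token."""
    data = json.loads(lookup_json).get("data") or {}
    name_ok = data.get("display_name") == EXPECTED_DISPLAY_NAME
    policy_ok = REQUIRED_POLICY in (data.get("policies") or [])
    return not (name_ok and policy_ok)
```

Both conditions must hold, so a token with the right display name but without `vault-admin` still counts as drift and is not renewed.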
## Tests
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision
and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
case). Run: `bash infra/scripts/test-vault-token-renew.sh`.


@ -0,0 +1,86 @@
# Runbook: Onboarding a new Forgejo repo to Woodpecker
Last updated: 2026-05-07
## Programmatic (preferred)
```bash
infra/scripts/woodpecker-register-forgejo-repo.sh viktor/<repo-name>
```
The script:
1. Pulls the `viktor` (Forgejo-OAuth'd) user's `hash` from the
Woodpecker PG `users` table.
2. Mints a session JWT (HS256, signed with that hash) — Woodpecker
per-user session JWTs have payload
`{"type":"user","user-id":"<id>"}` and the signing key is the
user's `hash` column. (Confirmed against a known-good admin
token: same payload shape, signature reproducible from the user's
stored hash via `openssl dgst -sha256 -hmac "$HASH"`.)
3. Looks up the Forgejo repo id and POSTs to
`https://ci.viktorbarzin.me/api/repos?forge_remote_id=<id>` as
that user. Woodpecker server creates the per-repo webhook +
per-repo signing key on the Forgejo side automatically (uses
the user's stored Forgejo OAuth `access_token` to do so — that's
why this only works with viktor's user, not the GitHub admin's).
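Step 2 can be sketched with the standard library alone. This is an illustration of HS256 signing with the payload shape described above, not a copy of the script; the header contents and the helper names are assumptions, and the hash in any real call comes from the `users` table.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_session_jwt(user_id: str, user_hash: str) -> str:
    """Build an HS256 JWT signed with the user's stored hash."""
    header = {"alg": "HS256", "typ": "JWT"}          # assumed header
    payload = {"type": "user", "user-id": user_id}   # shape from this runbook
    parts = [
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(parts)
    signature = hmac.new(
        user_hash.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

The resulting token is what the script would present to the Woodpecker API as that user; treat it like the hash itself and keep it out of logs and shell history.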
Pre-requisites:
- `vault login -method=oidc` with read access to
`database/static-creds/pg-woodpecker`.
- `kubectl` cluster access (the script spawns a 5-min psql pod in
the `woodpecker` namespace to query the DB).
- A Forgejo PAT in `secret/viktor/forgejo_admin_token` (or pass
`FORGEJO_TOKEN=…` env), used to look up the repo's numeric ID.
- The `viktor` Woodpecker user must already exist (i.e., they've
logged in via Forgejo OAuth at least once on the Web UI).
If user_id=2 / forge_id=2 doesn't exist in `users`, the OAuth
bootstrap is unavoidable — but it only needs to happen once for
the lifetime of the Woodpecker DB.
## Why the GitHub admin token can't do this
The earlier 500 from `POST /api/repos?forge_remote_id=N` was
because my admin session token authenticates as `ViktorBarzin`
(GitHub user, forge_id=1). Woodpecker tries to call Forgejo as
that user (using their stored Forgejo OAuth token) — which doesn't
exist for the GitHub user, hence the lookup error. There's no way
around this without acting as the Forgejo user.
## Why the previous "JWT for the webhook" approach didn't work
I tried generating a webhook JWT signed with `WOODPECKER_AGENT_SECRET`
(the global agent secret) and registering it directly on Forgejo.
That fails because the webhook JWT verification path runs through a
DB-backed `keyfunc` — Woodpecker stores a per-repo signing key when
the repo is activated, and rejects any JWT signed with a different
key. POST /api/repos is what creates that per-repo key.
## After registration
Pipelines fire automatically on push. The `WOODPECKER_FORGE_TIMEOUT`
default of 3s was too tight for our cluster (Forgejo response time
spikes to 1-2s under load) — bumped to 30s in
`infra/stacks/woodpecker/values.yaml` 2026-05-07. Without that bump,
config-loader hits the deadline and every pipeline errors with
`could not load config from forge: context deadline exceeded`.
## When the v3.13 → v3.14 server upgrade matters
`v3.14.0` doesn't fix this on its own — the timeout default is the
same. Set `WOODPECKER_FORGE_TIMEOUT` regardless of version. The
v3.14 upgrade was useful for unrelated forge-API changes (smarter
config-loader, fewer redundant calls per trigger).
## Troubleshooting
- Pipeline status `error` with `could not load config from forge`:
bump `WOODPECKER_FORGE_TIMEOUT`. 30s is plenty.
- Pipeline status `error` with `secret "registry-password" not found`:
the repo's `.woodpecker.yml` still references registry-private
credentials. Drop the `registry.viktorbarzin.me` block — Forgejo
is the only registry now.
- Pipeline status `failure` with `"/vault": not found` (or any
other COPY of a binary): the gitignored binary wasn't pushed to
Forgejo. Switch the Dockerfile to `curl … && unzip` from the
HashiCorp/upstream release URL. See `claude-agent-service/Dockerfile`
commit bab6dd2 for the pattern.