fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions
74
docs/runbooks/apiserver-audit-logging.md
Normal file
@@ -0,0 +1,74 @@
# Runbook: kube-apiserver Audit Logging

**Status:** enabled 2026-06-06 on `k8s-master` (10.0.20.100, the single
control-plane node). Motivated by the novelapp incident — a workload was
deleted with no way to attribute it, because apiserver audit logging had never
been on (see post-incident note below).

## What is configured

- **Audit policy:** `infra/scripts/k8s-apiserver-audit-policy.yaml` (source of
  truth), deployed to `/etc/kubernetes/audit-policy.yaml` on k8s-master.
  Low-write by design: drops reads (get/list/watch), high-churn resources
  (events, leases, endpointslices, token/subjectaccess reviews), and probe
  URLs; logs everything else (create/update/patch/delete) at **Metadata**
  level (who/verb/resource/namespace/name/time/sourceIP — no bodies).
  `omitStages: [RequestReceived]` → one line per mutating request.
- **kube-apiserver static-pod manifest** (`/etc/kubernetes/manifests/kube-apiserver.yaml`):
  `--audit-policy-file=/etc/kubernetes/audit-policy.yaml`,
  `--audit-log-path=/var/log/kubernetes/audit/audit.log`,
  `--audit-log-maxage=30 --audit-log-maxbackup=10 --audit-log-maxsize=100`
  (roughly 1 GB on disk: up to 10 rotated files of 100 MB each plus the active
  file; 30-day retention), plus the `audit-policy` (File, RO) and
  `audit-logs` (DirectoryOrCreate) hostPath volumes/mounts.
- **Persistence across `kubeadm upgrade`:** the same flags + volumes are in the
  `kubeadm-config` ConfigMap (`kube-system`), `ClusterConfiguration.apiServer.{extraArgs,extraVolumes}`
  (v1beta4). Without this, a control-plane upgrade regenerates the manifest and
  silently drops audit (and OIDC). The OIDC flags are recorded there too (see
  below).
- **Shipping to Loki:** the Alloy DaemonSet
  (`infra/stacks/monitoring/modules/monitoring/alloy.yaml`) tails
  `/var/log/kubernetes/audit/audit.log` (it schedules on the control-plane node
  and mounts host `/var/log`). Query in Loki/Grafana with
  `{job="kubernetes-audit"}`.

## How to attribute a change ("who deleted X, when")

```
# In Loki (Grafana Explore or logcli), last 24h:
{job="kubernetes-audit"} |= "delete" |= "<resource-name>"
```
Each entry is a JSON `audit.k8s.io/v1` Event: `user.username`, `verb`,
`objectRef.{resource,namespace,name}`, `requestReceivedTimestamp`,
`sourceIPs`, `userAgent`. On-node fallback (Loki down):
`sudo grep <name> /var/log/kubernetes/audit/audit.log` on k8s-master.

Note: direct `kubectl`/dashboard calls now show the real identity (user SA or
OIDC email). Pre-2026-06-06 deletions are NOT recoverable (audit was off).

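As a hedged sketch of the on-node fallback, the filter below narrows raw audit lines to delete events for one object name and prints the attribution fields listed above. The function name `audit_deletes_of` is hypothetical, and it assumes `jq` is installed on the node; the JSON field names are the `audit.k8s.io/v1` ones already named in this section.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): print who deleted objects
# named "$1". Reads audit.k8s.io/v1 Event JSON lines on stdin; needs jq.
audit_deletes_of() {
  local name="$1"
  jq -r --arg n "$name" '
    select(.verb == "delete" and .objectRef.name == $n)
    | [ .requestReceivedTimestamp,
        .user.username,
        .objectRef.resource,
        (.objectRef.namespace // "-"),   # cluster-scoped objects have no namespace
        .objectRef.name ]
    | @tsv'
}
```

Possible usage on k8s-master: `sudo cat /var/log/kubernetes/audit/audit.log | audit_deletes_of <name>`. Matching on the parsed `verb` field avoids false positives from lines that merely contain the word "delete".
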
## CRITICAL gotcha that blocked this (and OIDC) for weeks

`kubelet` runs **every** non-dotfile in its `staticPodPath`
(`/etc/kubernetes/manifests`) as a static pod. A stray
`kube-apiserver.yaml.bak.<epoch>` left in that directory (from an earlier manual
edit) was a **second** manifest defining pod `kube-apiserver`. kubelet ran the
older `.bak` copy and ignored edits to the real `kube-apiserver.yaml` — so newly
added flags (the OIDC flags, then these audit flags) never reached the running
process even though the file clearly had them. Symptom: the running apiserver's
`/proc/<pid>/cmdline` (or `crictl inspect` args) is SHORTER than the manifest's
`command:` list. Fix: move any `*.bak`/backup OUT of `/etc/kubernetes/manifests/`.
**Always back up control-plane manifests to a sibling dir (e.g.
`/etc/kubernetes/`), never inside `manifests/`.** This also un-blocked OIDC
(memory id=4042) as a side effect.

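A quick way to catch this class of problem is to list anything in the static-pod directory that is not an expected manifest. The sketch below is illustrative only: `stray_manifests` is a hypothetical name, and the expected file list passed in the usage line assumes a stock kubeadm control plane.

```bash
#!/usr/bin/env bash
# Hypothetical check (not part of the repo): list files in a static-pod
# directory that are not among the expected manifest names.
# The shell glob skips dotfiles, matching the kubelet behaviour described above.
stray_manifests() {
  local dir="$1"; shift
  local f base expected found
  for f in "$dir"/*; do
    [ -f "$f" ] || continue            # unmatched glob or non-regular file
    base=${f##*/}
    found=0
    for expected in "$@"; do
      [ "$base" = "$expected" ] && found=1
    done
    [ "$found" -eq 1 ] || printf '%s\n' "$base"
  done
}
```

Possible usage: `stray_manifests /etc/kubernetes/manifests etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml`. Any output is a file kubelet may be treating as a second manifest.
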
## Rollback

Backups live in `/etc/kubernetes/apiserver-manifest-archive/` on k8s-master
(the 27-arg pre-audit known-good, and the 36-arg desired). To disable audit:
remove the `--audit-*` flags + audit volumes from the manifest (kubelet
restarts the apiserver in ~30-40s), and remove them from `kubeadm-config`. A bad
manifest edit only needs the known-good copied back over
`/etc/kubernetes/manifests/kube-apiserver.yaml`.

Editing the apiserver manifest restarts the apiserver → ~30-40s API blip on this
single-control-plane cluster. Always edit from a backup + watch
`curl -sk https://10.0.20.100:6443/livez` before declaring success.

188
docs/runbooks/beads-auto-dispatch.md
Normal file
@@ -0,0 +1,188 @@
# Beads Auto-Dispatch Runbook

Users can hand work to the headless `beads-task-runner` agent by assigning a
bead to the sentinel user `agent`. Two CronJobs in the `beads-server`
namespace drive the pipeline:

- **`beads-dispatcher`** — every 2 min: picks up the highest-priority
  `assignee=agent`/`status=open` bead with non-empty acceptance criteria,
  claims it by flipping to `in_progress`, and POSTs it to BeadBoard's
  `/api/agent-dispatch`. BeadBoard forwards to `claude-agent-service` with
  the existing bearer-token flow.
- **`beads-reaper`** — every 10 min: flips any `assignee=agent` +
  `status=in_progress` bead whose `updated_at` is older than 30 min to
  `status=blocked` with an explanatory note. Catches pod crashes mid-run.

The manual BeadBoard Dispatch button continues to work in parallel.

## Flow diagram

```
user: bd assign <id> agent
         │
         ▼
Dolt @ dolt.beads-server.svc:3306   ◄──── every 2 min ────┐
         │                                                │
         ▼                                                │
CronJob: beads-dispatcher                                 │
  1. GET beadboard/api/agent-status (busy?)               │
  2. bd query 'assignee=agent AND status=open'            │
  3. bd update -s in_progress (claim)                     │
  4. POST beadboard/api/agent-dispatch                    │
  5. bd note "dispatched: job=…"                          │
         │                                                │
         ▼                                                │
claude-agent-service /execute                             │
  beads-task-runner agent runs; notes/closes bead         │
         │                                                │
         ▼                                                │
done ──► next tick picks up the next bead ───────────────┘

CronJob: beads-reaper (every 10 min)
  for bead (assignee=agent, status=in_progress, updated_at older than 30 min):
    bd note "reaper: no progress for Nm — blocking"
    bd update -s blocked
```

## Usage

### Hand a bead to the agent

```
bd create "Title" \
  -d "Full context — files, services, error messages. Any agent with no prior context must be able to execute this." \
  --acceptance "Concrete, verifiable criteria" \
  -p 2
bd assign <new-id> agent
```

**Acceptance criteria are required.** Beads without them are skipped by the
dispatcher and stay in `open` forever. This is intentional — the
`beads-task-runner` agent expects clear done conditions.

### Take a bead back (unassign)

```
bd assign <id> ""
```

If the bead is already `in_progress`, also reset it:

```
bd update <id> -s open
```

### Pause auto-dispatch

```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
```

This sets `spec.suspend: true` on both CronJobs. Existing running jobs
continue; no new ticks fire. Re-enable by re-applying with
`beads_dispatcher_enabled=true` (the default). Manual BeadBoard Dispatch
remains available while paused.

### Read the logs

```
# Recent dispatcher runs
kubectl -n beads-server get jobs --selector=job-name --sort-by=.metadata.creationTimestamp | grep beads-dispatcher | tail
kubectl -n beads-server logs job/<dispatcher-job-name>

# Tail the underlying agent once a bead dispatches
kubectl -n claude-agent logs -l app=claude-agent-service -f

# Inspect reaper decisions
kubectl -n beads-server get jobs | grep beads-reaper | tail
kubectl -n beads-server logs job/<reaper-job-name>
```

### Inspect a specific bead's dispatch history

```
bd show <id> --json | jq '{status, assignee, notes, updated_at}'
```

Both the dispatcher and reaper write dated notes (`auto-dispatcher claimed
at…`, `dispatched: job=…`, `reaper: no progress for…`) so the audit trail
lives on the bead itself.

## Reaper semantics — when a bead becomes `blocked`

The reaper flips a bead to `blocked` if:
- `assignee = agent`, AND
- `status = in_progress`, AND
- `updated_at` is more than **30 minutes** in the past.

Every `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner`
agent never trips the reaper — it notes progress as it works. A `blocked`
bead is a signal that one of the following happened:
- the agent pod crashed mid-run (reproducible by deleting it with
  `kubectl -n claude-agent delete pod`),
- the job hit its 15-minute budget timeout inside `claude-agent-service`
  without notes (rare — the agent usually notes failure before exiting),
- `claude-agent-service` was restarted during the run (in-memory job state
  is lost; see [known risks](#known-risks)).

Recovery: read the reaper note, reopen manually if appropriate:

```
bd update <id> -s open
bd assign <id> agent # re-arm for next dispatcher tick
```

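The three reaper conditions can be sketched as a single predicate. This is an illustration under stated assumptions, not the reaper's actual code: `should_reap` and `REAPER_MAX_AGE_SECS` are hypothetical names, and timestamps are taken as epoch seconds for simplicity.

```bash
#!/usr/bin/env bash
# Hypothetical predicate mirroring the three conditions listed above.
# Args: assignee, status, updated_at (epoch seconds), now (epoch seconds).
# Exit status 0 means "the reaper would block this bead".
REAPER_MAX_AGE_SECS=$((30 * 60))

should_reap() {
  local assignee="$1" status="$2" updated_at="$3" now="$4"
  [ "$assignee" = "agent" ] &&
    [ "$status" = "in_progress" ] &&
    [ $((now - updated_at)) -gt "$REAPER_MAX_AGE_SECS" ]
}
```

Because every `bd note` resets `updated_at`, an agent that notes progress at least once every 30 minutes never satisfies the third test.
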
## Design choices

- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
  client can set it (`bd assign <id> agent`).
- **One-bead-per-tick dispatch** — the dispatcher submits at most one bead
  per 2-min tick, gating on `claude-agent-service`'s `/health` `busy` flag.
  `busy` now means `active >= capacity` (bounded semaphore, default 10) — the
  service no longer single-flight-locks via `asyncio.Lock`. So up to
  ~`capacity` beads can run concurrently; the 2-min poll cadence (not
  single-slot execution) now bounds ramp-up.
- **Fixed agent (`beads-task-runner`)** — read-only rails, matches BeadBoard's
  manual Dispatch button. Broader-privilege agents stay manual.
- **CronJob (not in-service polling, not n8n)** — matches existing infra
  pattern (OpenClaw task-processor, certbot-renewal, backups), TF-managed,
  easy to pause.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
  the image-seeded file. The CronJob's init step copies it into `/tmp/.beads/`
  because `bd` may touch the parent directory and ConfigMap mounts are
  read-only.

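To make the ramp-up bound concrete, here is a small arithmetic sketch. It assumes the dispatcher is the only source of work (manual dispatches are ignored) and that every job ends within the 15-minute budget mentioned under reaper semantics; `max_concurrent` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical estimate: with one dispatch per tick and jobs lasting at most
# job_min minutes, at most ceil(job_min / tick_min) jobs overlap,
# and never more than the service capacity.
max_concurrent() {
  local job_min="$1" tick_min="$2" capacity="$3"
  local by_cadence=$(( (job_min + tick_min - 1) / tick_min ))
  if [ "$by_cadence" -lt "$capacity" ]; then
    echo "$by_cadence"
  else
    echo "$capacity"
  fi
}

max_concurrent 15 2 10   # prints 8
```

Under these assumptions the 2-minute cadence, not the semaphore, is the effective limit: eight dispatcher-driven jobs overlap at most, below the default capacity of 10.
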
## Known risks

- **In-memory job state in `claude-agent-service`** — if the pod restarts
  mid-run, the job record is lost. The reaper catches this after 30 min.
  Persistent job store is deferred.
- **Prompt injection via bead fields** — a malicious bead description could
  try to steer the agent. The `beads-task-runner` rails + token budget +
  timeout are the defense. Same exposure as the manual Dispatch button.
- **Image tag drift** — `claude_agent_service_image_tag` in
  `stacks/beads-server/main.tf` mirrors `local.image_tag` in
  `stacks/claude-agent-service/main.tf`. Bump both when the image rebuilds,
  or the dispatcher/reaper will run on an older layer. (They only need
  `bd`, `curl`, `jq` — stable across rebuilds — so the drift is low-risk.)
- **`bd` JSON schema changes** — the reaper's `jq` reads `.id` and
  `.updated_at`. If a future `bd` upgrade renames these, the reaper breaks
  silently (no reaping, no alert). `BD_VERSION` is pinned in the image
  Dockerfile.

## Verification after change

```
# Both CronJobs exist with the right schedule / SUSPEND state
kubectl -n beads-server get cronjob

# End-to-end smoke test
bd create "auto-dispatch smoke test" \
  -d "Read /etc/hostname inside the agent sandbox and close." \
  --acceptance "bd note includes 'hostname=' and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '.notes'
# → contains 'auto-dispatcher claimed' + 'dispatched: job=<uuid>'
```

211
docs/runbooks/chrome-service-snapshot.md
Normal file
@@ -0,0 +1,211 @@
# Runbook — chrome-service snapshot pipeline

Operational playbook for the hourly cookie-snapshot pipeline that warms
external Claude Code sessions on the dev box. Architecture in
`architecture/chrome-service.md`.

## At a glance

| Component | Where | When | What |
|---|---|---|---|
| chrome-service Deployment | `chrome-service` ns | always-on | headed chromium, CDP :9222, persistent /profile/chromium-data |
| snapshot-server sidecar | same pod | always-on | serves `/api/snapshot`, bearer-gated, port 8088 |
| snapshot-harvester CronJob | `chrome-service` ns | `23 * * * *` | dumps `storage_state()` via CDP → `/profile/snapshots/storage-state.json` |
| dev-box refresh timer | each dev box | hourly | curls `chrome.viktorbarzin.me/api/snapshot` → `~/.cache/playwright-shared-storage-state.json` |
| dev-box `playwright-mcp.service` | each dev box | always-on | `@playwright/mcp --isolated --storage-state=…` per-MCP-connection contexts |

## Day-to-day

### Log into a new site (warm the profile)

1. Open `https://chrome.viktorbarzin.me/` (Authentik will gate).
2. The noVNC view of the in-cluster headed chromium loads. Click on the
   browser window, navigate, log in.
3. Cookies land in `/profile/chromium-data/Default/Cookies` on the PVC.
4. Within ≤60 min, the snapshot-harvester CronJob picks them up and
   writes the snapshot. Within ≤60 min after that, dev boxes pull the
   new file. New Claude Code sessions see the new cookies.
5. To skip the wait: trigger the harvester now (next section).

### Trigger snapshot harvester manually

```bash
kubectl -n chrome-service create job \
  --from=cronjob/chrome-service-snapshot-harvester \
  snapshot-harvest-$(date +%s)

# Watch logs of the most recently created job
kubectl -n chrome-service logs -f -l "job-name=$(kubectl -n chrome-service get jobs --sort-by=.metadata.creationTimestamp -o name | tail -1 | cut -d/ -f2)"
```

Expected: `wrote snapshot (… bytes) to /profile/snapshots/storage-state.json`.

### Trigger dev-box refresh manually

```bash
# On the dev box, as the user whose Claude Code sessions need the new state:
systemctl --user start playwright-snapshot-refresh.service

# Or directly:
/usr/local/bin/playwright-snapshot-refresh

# Verify
ls -la ~/.cache/playwright-shared-storage-state.json
```

### Inspect the current snapshot

```bash
# In-cluster (via kubectl exec into the chrome-service pod):
kubectl -n chrome-service exec deploy/chrome-service -c snapshot-server -- \
  cat /profile/snapshots/storage-state.json | jq '.cookies | length'

# Externally (via the bearer-gated endpoint):
TOKEN=$(vault kv get -field=api_bearer_token secret/chrome-service)
curl -fsSL -H "Authorization: Bearer $TOKEN" \
  https://chrome.viktorbarzin.me/api/snapshot | jq '.cookies | length'
```

## Failure modes

### "no browser contexts found"

The harvester reports `no browser contexts found — chrome-service may
not have launched a persistent context yet` and exits non-zero.

**Cause**: chromium just started and hasn't created its default context
yet, or it crashed.

**Fix**: check chrome-service pod logs (`kubectl -n chrome-service logs
deploy/chrome-service -c chrome-service`). The next hourly run will
retry. If chromium is wedged: `kubectl -n chrome-service rollout restart
deploy/chrome-service` (strategy = Recreate, brief downtime).

### "connect_over_cdp failed"

Harvester or any in-cluster caller can't reach the CDP endpoint.

**Cause**: chrome-service pod not Ready, NetworkPolicy doesn't admit
the caller's namespace, or chromium isn't listening on :9222.

**Diagnose**:
```bash
kubectl -n chrome-service get pods
kubectl -n chrome-service describe networkpolicy chrome-service-ws-ingress

# From inside the cluster (e.g. a debug pod in chrome-service ns):
nc -zv chrome-service.chrome-service.svc.cluster.local 9222
curl -fsSL http://chrome-service.chrome-service.svc.cluster.local:9222/json/version
```

**Fix**: depends on the diagnosis. NetworkPolicy needs the caller's
namespace label or an explicit name-fallback. If chromium isn't
binding, check the container logs.

### Dev-box `playwright-snapshot-refresh` returns 401

The bearer token in `~/.config/playwright/token` doesn't match the
server's. Almost always means the Vault secret was rotated and the
local cache is stale.

**Fix**:
```bash
vault login -method=oidc # if needed
vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
chmod 600 ~/.config/playwright/token
systemctl --user start playwright-snapshot-refresh.service
```

### Dev-box `playwright-snapshot-refresh` returns 404 with "snapshot not yet available"

The harvester hasn't run successfully yet (fresh cluster, or all
recent runs failed). Trigger it manually (see "Trigger snapshot
harvester manually").

### Claude Code sessions still see old cookies

The MCP server reads the snapshot file at process start and seeds each
new context with it. **Existing MCP sessions don't hot-reload** — they
keep the cookies they were seeded with at session start. New sessions
get the fresh snapshot.

**Fix**: restart the MCP server on the dev box to pick up the new file:
```bash
systemctl --user restart playwright-mcp.service
```

### Snapshot file is suspiciously small or empty cookies array

The persistent chromium context isn't holding any cookies. Probably
means the user hasn't logged into anything via noVNC, or chromium was
relaunched without preserving `/profile/chromium-data`.

**Diagnose**:
```bash
kubectl -n chrome-service exec deploy/chrome-service -c chrome-service -- \
  ls -la /profile/chromium-data/Default/Cookies
```

A populated `Cookies` SQLite file should be several hundred KB once
real logins exist. If it's missing or empty, log in via noVNC.

## Token rotation

```bash
# Rotate Vault secret (32-byte URL-safe random).
vault kv put secret/chrome-service \
  api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')

# Reloader auto-restarts chrome-service pod (snapshot-server picks up new token).

# On EVERY dev box that pulls the snapshot:
vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
chmod 600 ~/.config/playwright/token

# Verify the next refresh succeeds:
systemctl --user start playwright-snapshot-refresh.service
journalctl --user -u playwright-snapshot-refresh.service -n 20
```

## Restore from a backup tarball

The 6-hourly backup CronJob writes `tar -czf /backup/YYYY_MM_DD_HH.tar.gz
-C /profile .` to NFS at `/srv/nfs/chrome-service-backup/`. To restore
the entire profile:

```bash
# 1. Scale chrome-service down so its lock is released.
kubectl -n chrome-service scale deploy/chrome-service --replicas=0

# 2. Mount the PVC in a helper pod and restore.
kubectl -n chrome-service apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata: {name: restore-helper, namespace: chrome-service}
spec:
  containers:
  - name: helper
    image: alpine:3.20
    command: [sleep, infinity]
    volumeMounts:
    - {name: profile, mountPath: /profile}
    - {name: backup, mountPath: /backup, readOnly: true}
  volumes:
  - name: profile
    persistentVolumeClaim: {claimName: chrome-service-profile-encrypted}
  - name: backup
    persistentVolumeClaim: {claimName: chrome-service-backup-host}
  restartPolicy: Never
EOF

# Allow time for the image pull; the default wait timeout is 30s.
kubectl -n chrome-service wait --for=condition=ready --timeout=120s pod/restore-helper

# Replace the tarball name below with the backup you want to restore.
kubectl -n chrome-service exec restore-helper -- sh -c '
  rm -rf /profile/chromium-data /profile/snapshots &&
  tar -xzf /backup/2026_06_04_18.tar.gz -C /profile
'

# 3. Cleanup helper, scale chrome-service back up.
kubectl -n chrome-service delete pod restore-helper
kubectl -n chrome-service scale deploy/chrome-service --replicas=1
```

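Because the tarballs are named `YYYY_MM_DD_HH.tar.gz`, lexical order matches chronological order, so the newest backup can be picked by name. The following is a minimal sketch of that idea; `latest_backup` is a hypothetical helper, not something shipped with the backup job.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest YYYY_MM_DD_HH.tar.gz in a directory.
# Glob results come back sorted, so the last match is the most recent backup.
# Returns non-zero when the directory holds no matching tarball.
latest_backup() {
  local dir="$1" f newest=""
  for f in "$dir"/[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9].tar.gz; do
    [ -f "$f" ] || continue
    newest="$f"
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

Possible usage: run `latest_backup /srv/nfs/chrome-service-backup` on the NFS host to choose the file name to substitute into the `tar -xzf` step above.
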
122
docs/runbooks/fan-control.md
Normal file
@@ -0,0 +1,122 @@
# Runbook — PVE R730 fan-control daemon

Presence-aware IPMI fan controller on the PVE host (192.168.1.127). Runs the
CPU cool when the garage is empty, quiet when someone's in the garage. Design:
`infra/docs/plans/2026-06-04-pve-fan-control-design.md`.

## What it is

- `/usr/local/bin/fan-control` — bash daemon (source: `infra/scripts/fan-control.sh`).
- `fan-control.service` — systemd unit (`Type=simple`, restarts on failure).
- `/etc/fan-control.env` — config incl. the ha-sofia token (chmod 600, not in git).

## HA control (Home Assistant)

Drive the fans from **dashboard-it → "Server" view → Fans**. The view is
deliberately minimal — it shows the current **fan speed** (% of capacity +
absolute RPM) and two controls:

- **Override %** (`input_number.r730_fan_manual_pct`) — the fan % to hold. While
  **unlocked** it continuously mirrors the live commanded fan %, so it always
  shows the actual *absolute* speed and updates as the fan moves (NOT a stale
  value or a delta) — `automation.r730_fan_override_track_live_speed_while_unlocked`
  syncs it to `sensor.r730_fan_control_target` (guarded to ignore
  unavailable/unknown). While **locked** it stops tracking and becomes your
  editable setpoint. A readout under the slider shows the live `% · rpm`.
- **Lock — freeze speed** (`input_boolean.r730_fan_lock`) — turn the algorithm
  off and hold a fixed speed. Toggling it **ON** snapshots the *current*
  commanded % into Override and switches the daemon to `manual`
  (`automation.r730_fan_lock_freeze_current_speed_resume_algo`); toggling it
  **OFF** switches back to `auto`, resuming the presence curve. Fine-tune the
  held % with Override while locked. A 🔒 reminder appears on the view while
  locked.

Under the hood the daemon still reads `input_select.r730_fan_mode`
(auto/cool/quiet/manual) + `input_number.r730_fan_manual_pct` each loop; the Lock
toggle just drives `mode` between `manual` (locked) and `auto` (unlocked).
`cool`/`quiet` remain valid modes if set directly (via the entity) but are no
longer surfaced on the simplified dashboard. `CEILING` (83 °C) still overrides
everything → Dell auto, **even when locked**. A stale non-`auto` mode left while
*unlocked* still auto-reverts to `auto` after 60 min
(`automation.r730_fan_mode_auto_revert`, now a dormant safety net). An HA change
is applied within one daemon loop (~15 s).

Monitoring sensors on the same view: `sensor.r730_fan_speed` (redfish exporter),
`sensor.r730_fan_control_target` + `sensor.r730_fan_control_mode` +
`sensor.r730_fan_power_est` (Pushgateway). Fan **% and RPM are merged into one
"Fan speed" card** (the two had identical trend shapes) — the % trend comes from
the stable Pushgateway sensor, while RPM reads `sensor.r730_fan_speed` but **falls
back to a calibrated estimate (shown with a `~` prefix) whenever the Redfish
sensor is `unavailable`** (it blips out intermittently), so the readout never goes
blank. `r730_fan_power_est` is an ESTIMATE of total fan power (the iDRAC reports
no per-fan power) — modelled from RPM via the fan affinity law (∝ RPM³),
calibrated to the power sweep (~2 W floor → ~99 W full).

The HA objects (helpers, the auto-revert automation, the REST sensors in
`rest_resources/{idrac_redfish_exporter,fan_control}.yaml`, and the dashboard
cards) live on **ha-sofia** and are auto-git-tracked there by the version-control
add-on — they are NOT in this repo.

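For intuition about the cube-law power estimate, here is a rough sketch. The exact formula the HA sensor uses is not shown in this runbook, so the functional form below (floor plus a cubic term scaled to the full-speed figure) is an assumption, as are the helper name `fan_power_est` and the `rpm_max` argument.

```bash
#!/usr/bin/env bash
# Hypothetical cube-law estimate (assumed form, not the real sensor template):
#   power = floor_w + (full_w - floor_w) * (rpm / rpm_max)^3
# Defaults use the ~2 W floor and ~99 W full-speed figures quoted above.
fan_power_est() {
  local rpm="$1" rpm_max="$2" floor_w="${3:-2}" full_w="${4:-99}"
  awk -v r="$rpm" -v m="$rpm_max" -v lo="$floor_w" -v hi="$full_w" \
    'BEGIN { f = r / m; printf "%.1f\n", lo + (hi - lo) * f * f * f }'
}
```

The cubic term is why the last stretch of fan speed is expensive: under this assumed form, running at half of `rpm_max` costs only about an eighth of the span between floor and full power.
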
## Quick status

```bash
ssh root@192.168.1.127 systemctl status fan-control
ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager'
ssh root@192.168.1.127 'ipmitool sdr type fan | grep ^Fan1; ipmitool sdr type temperature | grep "^Temp "'
```
Log lines look like `temp=60C ha_mode=auto eff=cool fan=50% (was 70%)`
(`ha_mode` = the HA setpoint; `eff` = the effective curve applied).

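When scanning the journal it can help to pull a single field out of these `key=value` log lines. A minimal sketch, assuming the line format shown above; `log_field` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract one key=value field from a fan-control log line.
# The key is anchored at the start of a token, so "mode" does not match "ha_mode".
log_field() {
  local key="$1" line="$2"
  printf '%s\n' "$line" | tr ' ' '\n' | sed -n "s/^${key}=//p" | head -n 1
}

line='temp=60C ha_mode=auto eff=cool fan=50% (was 70%)'
log_field eff "$line"    # prints cool
log_field fan "$line"    # prints 50%
```

Comparing `ha_mode` with `eff` this way shows quickly whether the daemon is applying the curve that HA requested.
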
## Disable / roll back to stock firmware control

```bash
ssh root@192.168.1.127 'systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01'
```
The unit's `ExecStopPost` already restores Dell auto on stop, so the explicit
`raw ... 0x01` is belt-and-suspenders. The box is back to its stock curve.

## Tune

Edit `/etc/fan-control.env` on the host, then `systemctl restart fan-control`.
Common knobs:
- `HOLD_SECS` — how long to stay quiet after the garage door last moved (default 900 = 15 min).
- `CEILING` — temp at which we abandon manual control and let the firmware take over (default 83).
- Curve shape: **linear anchors** near the top of the script — `COOL_T_LO/COOL_P_LO/COOL_T_HI/COOL_P_HI` (default 50°C/30% → 83°C/100%) and `QUIET_*` (68°C/20% → 83°C/100%); fan% interpolates linearly between them (replaced the old discrete step-bands). `MIN_STEP` (default 3%) = smallest fan-% change worth an IPMI write (anti-jitter); `DEADBAND` (3°C) = ease-down hysteresis. Lower `COOL_P_HI` or raise `COOL_T_HI` to run the top end quieter; steepen by raising `COOL_P_LO` / lowering `COOL_T_LO`.

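The linear-anchor curve can be illustrated with a few lines of integer arithmetic. This is a sketch of the interpolation idea only, not the code in `fan-control.sh`: `curve_pct` is a hypothetical name, clamping outside the anchors is an assumption, and `MIN_STEP`/`DEADBAND` handling is left out.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: integer linear interpolation between the anchors
# (t_lo, p_lo) and (t_hi, p_hi). Clamping at both ends is an assumption.
curve_pct() {
  local t="$1" t_lo="$2" p_lo="$3" t_hi="$4" p_hi="$5"
  if [ "$t" -le "$t_lo" ]; then echo "$p_lo"; return; fi
  if [ "$t" -ge "$t_hi" ]; then echo "$p_hi"; return; fi
  echo $(( p_lo + (t - t_lo) * (p_hi - p_lo) / (t_hi - t_lo) ))
}

curve_pct 60 50 30 83 100   # default COOL anchors at 60 °C: prints 51
curve_pct 75 68 20 83 100   # default QUIET anchors at 75 °C: prints 57
```

Raising `COOL_T_HI` or lowering `COOL_P_HI` flattens the slope `(p_hi - p_lo) / (t_hi - t_lo)`, which is why those knobs make the top end quieter.
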
## Deploy / update

```bash
cd infra
scp scripts/fan-control.sh root@192.168.1.127:/usr/local/bin/fan-control
ssh root@192.168.1.127 chmod +x /usr/local/bin/fan-control
scp scripts/fan-control.service root@192.168.1.127:/etc/systemd/system/fan-control.service
# first install only — create /etc/fan-control.env from fan-control.env.example with the HA token
ssh root@192.168.1.127 'systemctl daemon-reload && systemctl restart fan-control'
```

## HA token

`/etc/fan-control.env` holds a long-lived ha-sofia token used to read
`sensor.garage_door_state_bg`. Mint via Home Assistant → Profile → Security →
Long-lived access tokens, or reuse the existing ha-sofia token. If the token is
missing/empty, the daemon still runs but **COOL-only** (no quiet mode) and logs
`ha_reachable=0`.

## Symptoms & checks

| Symptom | Check |
|---------|-------|
| Fans stuck loud | `journalctl -u fan-control` — is `mode=fallback`? (ceiling breach or IPMI fail). Check CPU temp. |
| Never goes quiet | Token valid? `curl -H "Authorization: Bearer $TOKEN" http://192.168.1.8:8123/api/states/sensor.garage_door_state_bg`. Garage door reporting? |
| Fans flapping | Increase `DEADBAND`. |
| Service won't start | `systemctl status fan-control`; check `ipmitool` works: `ipmitool sdr type temperature`. |
| Box left in manual after crash | `ipmitool raw 0x30 0x30 0x01 0x01` to force Dell auto. |

## Verify presence wiring

```bash
# one iteration, real IPMI + HA, no daemon loop:
ssh root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control'
```
With the garage closed for >15 min you should see `mode=cool`; within 15 min of
the door moving, `mode=quiet`.

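The presence rule being verified here reduces to a comparison against `HOLD_SECS`. A hedged sketch, assuming the daemon works from the time of the last door movement; `effective_mode` is a hypothetical name and the exact boundary behaviour is a guess.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the presence hold: stay quiet while the garage door
# moved less than hold_secs ago, otherwise run the cool curve.
# Args: now (epoch s), last door movement (epoch s), hold_secs (default 900).
effective_mode() {
  local now="$1" last_move="$2" hold_secs="${3:-900}"
  if [ $((now - last_move)) -lt "$hold_secs" ]; then
    echo quiet
  else
    echo cool
  fi
}

effective_mode 1000 500    # 500 s since the door moved: prints quiet
effective_mode 2000 500    # 1500 s since the door moved: prints cool
```
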
126
docs/runbooks/forgejo-registry-breakglass.md
Normal file
@@ -0,0 +1,126 @@
# Runbook: Forgejo registry break-glass — recovering infra-ci

Last updated: 2026-05-07

## When to use this runbook

When **all** of the following are true:

1. Forgejo (`forgejo.viktorbarzin.me`) is unreachable.
2. `registry-private` is also gone (post-Phase 4 of the consolidation),
   so you can't fall back to `registry.viktorbarzin.me:5050/infra-ci`.
3. You need to run an infra Woodpecker pipeline (apply, build-cli,
   drift-detection, etc.) — but those pipelines pull `infra-ci` and
   crash because the registry is down.

If only Forgejo is down but `registry-private` is still alive, the
pipelines work — `image:` references in `infra/.woodpecker/*.yml`
still hit `registry.viktorbarzin.me:5050/infra-ci` until Phase 3
flips them. Skip this runbook entirely.

## What's available

The `build-ci-image.yml` Woodpecker pipeline saves a tarball after
each successful push:

| Location | Path |
|---|---|
| Registry VM disk (10.0.20.10) | `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` |
| Registry VM disk (latest symlink) | `/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz` |
| Synology NAS (offsite copy via daily-backup sync) | `/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/` |

The registry VM keeps the last 5 tarballs. Synology mirrors them
through the existing offsite-sync-backup job
(`/usr/local/bin/offsite-sync-backup`).

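The "keep the last 5" retention could look roughly like the sketch below. How the real pipeline prunes is not shown in this runbook, so the approach (newest-first by modification time, symlink left untouched) and the name `prune_breakglass` are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical retention sketch: keep the newest $keep infra-ci tarballs in a
# directory and delete the rest. Symlinks such as infra-ci-latest.tar.gz are
# skipped. File names are assumed to contain no newlines.
prune_breakglass() {
  local dir="$1" keep="${2:-5}" f n=0
  # ls -t lists newest first.
  ls -1t "$dir"/infra-ci-*.tar.gz 2>/dev/null | while IFS= read -r f; do
    [ -L "$f" ] && continue          # leave the "latest" symlink alone
    n=$((n + 1))
    [ "$n" -le "$keep" ] || rm -f -- "$f"
  done
}
```

Possible usage on the registry VM: `prune_breakglass /opt/registry/data/private/_breakglass 5`. Checking that directory before an incident confirms that a recent tarball and the `latest` symlink are actually present.
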
## Recovery procedure
|
||||
|
||||
The goal is to get a working `infra-ci` image onto a k8s node so
|
||||
Woodpecker pods can run it. Then run a Woodpecker pipeline that
|
||||
restores Forgejo from PVC backup or rebuilds it.
|
||||
|
||||
### Step 1 — copy the tarball to a node
|
||||
|
||||
From your workstation (the registry VM is reachable but Forgejo is
|
||||
not — the rest of the cluster might be in a similar partial state):
|
||||
|
||||
```bash
|
||||
ssh wizard@10.0.20.103 # any responsive k8s node
|
||||
sudo mkdir -p /var/breakglass
|
||||
sudo scp root@10.0.20.10:/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz \
|
||||
/var/breakglass/
|
||||
```
|
||||
|
||||
If the registry VM is also down, fall back to Synology:
|
||||
|
||||
```bash
|
||||
sudo scp 192.168.1.13:/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/infra-ci-latest.tar.gz \
|
||||
/var/breakglass/
|
||||
```
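
Before moving on to Step 2, it may be worth confirming that the copied tarball is intact, since an interrupted `scp` leaves a truncated file behind. The sketch below is a minimal check under stated assumptions: `check_tarball` is a hypothetical helper (not part of the existing tooling), and because the runbook does not mention a published checksum, it only verifies gzip and tar integrity, not authenticity.

```bash
#!/usr/bin/env bash
# Sketch: sanity-check a break-glass tarball before importing it.
# check_tarball is a hypothetical helper name, not an existing script.
check_tarball() {
  local path="$1"
  # The file must exist and be non-empty.
  [ -s "$path" ] || { echo "missing or empty: $path" >&2; return 1; }
  # gzip -t reads the whole compressed stream, so it catches truncated copies.
  gzip -t "$path" 2>/dev/null || { echo "corrupt gzip: $path" >&2; return 1; }
  # Listing the archive confirms the tar structure itself is readable.
  tar -tzf "$path" >/dev/null 2>&1 || { echo "unreadable tar: $path" >&2; return 1; }
  echo "ok: $path"
}
```

A possible use on the node is `check_tarball /var/breakglass/infra-ci-latest.tar.gz` before running the import; a non-zero exit suggests re-copying or switching to the Synology copy.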
|
||||
|
||||
### Step 2 — load into containerd
|
||||
|
||||
`docker load` won't help on a k8s node — it loads into the docker
|
||||
daemon, which kubelet/containerd doesn't see. Use `ctr`:
|
||||
|
||||
```bash
|
||||
sudo ctr -n k8s.io images import /var/breakglass/infra-ci-latest.tar.gz
|
||||
sudo ctr -n k8s.io images list | grep infra-ci
|
||||
```
|
||||
|
||||
Confirm the image is tagged with the original repository name
|
||||
(`registry.viktorbarzin.me:5050/infra-ci:<sha>` — the tarball was
|
||||
saved with that tag, NOT the Forgejo name).
|
||||
|
||||
### Step 3 — pin pods to this node
|
||||
|
||||
Add a node selector or taint-toleration to whatever pipeline you
|
||||
need to run. Simplest: cordon the other nodes briefly so Woodpecker
|
||||
schedules onto this one.
|
||||
|
||||
```bash
|
||||
for n in $(kubectl get nodes -o name | grep -v $(hostname)); do
|
||||
kubectl cordon ${n#node/}
|
||||
done
|
||||
```
|
||||
|
||||
Run the pipeline. After it completes:
|
||||
|
||||
```bash
|
||||
for n in $(kubectl get nodes -o name); do
|
||||
kubectl uncordon ${n#node/}
|
||||
done
|
||||
```
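
The cordon and uncordon loops above have two rough edges: `grep -v $(hostname)` matches substrings (excluding `node1` also excludes `node10`), and the uncordon loop re-enables every node, including any that were deliberately cordoned before the incident. A hedged alternative is sketched below. The helper names and the state-file path are hypothetical, and it assumes, as the original loop does, that the node name equals the local hostname.

```bash
#!/usr/bin/env bash
# other_nodes KEEP: read node names on stdin, print every name except KEEP.
# -F (fixed string) and -x (whole line) avoid partial matches.
other_nodes() {
  local keep="$1"
  grep -Fxv -- "$keep" || true   # grep exits 1 when it prints nothing
}

# cordon_others KEEP STATE_FILE (hypothetical wrapper around kubectl):
# cordon only nodes that are currently schedulable and record them.
cordon_others() {
  local keep="$1" state="$2"
  kubectl get nodes --no-headers \
    | awk '$2 !~ /SchedulingDisabled/ { print $1 }' \
    | other_nodes "$keep" > "$state"
  xargs -r -n1 kubectl cordon < "$state"
}

# uncordon_recorded STATE_FILE: undo exactly what cordon_others did.
uncordon_recorded() {
  local state="$1"
  xargs -r -n1 kubectl uncordon < "$state"
}
```

For example, `cordon_others "$(hostname)" /var/breakglass/cordoned.txt` before the pipeline and `uncordon_recorded /var/breakglass/cordoned.txt` afterwards would leave previously cordoned nodes untouched.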
|
||||
|
||||
### Step 4 — fix the underlying problem
|
||||
|
||||
The pipeline you just ran was meant to restore Forgejo. Common
|
||||
options:
|
||||
|
||||
- **Forgejo PVC corrupt** — `docs/runbooks/forgejo-registry-rebuild-image.md`
|
||||
walks through PVC restore from LVM snapshot or PVE backup.
|
||||
- **Forgejo OOM-loop** — bump memory request+limit in
|
||||
`infra/stacks/forgejo/main.tf` and apply.
|
||||
- **Forgejo unreachable due to network** — check Traefik, MetalLB,
|
||||
pfSense.
|
||||
|
||||
Once Forgejo is back, run `build-ci-image.yml` manually so the
|
||||
tarball regenerates with the latest commit.
|
||||
|
||||
## Why this exists
|
||||
|
||||
The 2026-04-19 post-mortem on the registry-orphan-index incident
|
||||
showed that a single registry going corrupt could block ALL infra
|
||||
pipelines (because every pipeline pulls `infra-ci` from that
|
||||
registry). The dual-push to Forgejo + registry-private removes that
|
||||
single-point-of-failure during the bake. After Phase 4
|
||||
decommissions registry-private, the tarball is the last line of
|
||||
defense.
|
||||
|
||||
## Why on the registry VM and not in-cluster
|
||||
|
||||
The Forgejo pod and registry-private pod both depend on cluster
|
||||
networking + storage. The registry VM is an independent
|
||||
non-clustered VM with local storage. If the cluster is in a bad
|
||||
state, the VM's disk is still readable from any other host on the
|
||||
LAN.
|
||||
128
docs/runbooks/forgejo-registry-rebuild-image.md
Normal file
|
|
@ -0,0 +1,128 @@
|
|||
# Runbook: Rebuild an Image on the Forgejo OCI Registry
|
||||
|
||||
Last updated: 2026-05-07
|
||||
|
||||
## When to use this
|
||||
|
||||
Pipelines pulling from `forgejo.viktorbarzin.me/viktor/<image>` fail with:
|
||||
|
||||
- `failed to resolve reference … : not found`
|
||||
- `manifest unknown`
|
||||
- HEAD on a manifest/blob digest returns 404
|
||||
- `forgejo-integrity-probe` CronJob in `monitoring` reports
|
||||
`registry_manifest_integrity_failures > 0` for
|
||||
`instance="forgejo.viktorbarzin.me"`
|
||||
|
||||
This is the Forgejo equivalent of the registry-private orphan-index
|
||||
failure mode (`docs/post-mortems/2026-04-19-registry-orphan-index.md`).
|
||||
The usual cause is a package-version delete racing an in-flight pull,
or PVC corruption. The fix is to rebuild the image from source and
re-push, so Forgejo receives a complete, fresh upload.
|
||||
|
||||
If the symptom is different (Forgejo unreachable, PVC OOM,
|
||||
authentication failure), use:
|
||||
- `docs/runbooks/forgejo-registry-setup.md` for auth + token issues
|
||||
- `docs/runbooks/forgejo-registry-breakglass.md` if Forgejo + the
|
||||
cluster are both unreachable
|
||||
- `docs/runbooks/restore-pvc-from-backup.md` for PVC corruption
|
||||
|
||||
## Phase 1 — Confirm the diagnosis
|
||||
|
||||
From any host:
|
||||
|
||||
```sh
|
||||
REG=forgejo.viktorbarzin.me
|
||||
USER=cluster-puller
|
||||
PASS="$(vault kv get -field=forgejo_pull_token secret/viktor)"
|
||||
IMAGE=viktor/payslip-ingest
|
||||
TAG=latest
|
||||
|
||||
# 1. Confirm the manifest exists at all.
|
||||
curl -sk -u "$USER:$PASS" \
|
||||
-H 'Accept: application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json' \
|
||||
"https://$REG/v2/$IMAGE/manifests/$TAG" | jq '.mediaType, .manifests[].digest // .config.digest'
|
||||
|
||||
# 2. HEAD each child manifest digest listed in the index. Any non-200 = confirmed.
|
||||
for d in $(curl -sk -u "$USER:$PASS" -H 'Accept: application/vnd.oci.image.index.v1+json' \
|
||||
"https://$REG/v2/$IMAGE/manifests/$TAG" | jq -r '.manifests[].digest // empty'); do
|
||||
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
|
||||
-I "https://$REG/v2/$IMAGE/manifests/$d")
|
||||
echo "$d → $code"
|
||||
done
|
||||
```
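
The loop above only covers the child manifests of an index. To extend the check to config and layer blobs, one option is to pull the digests out of each image manifest and HEAD them on the blobs endpoint. The sketch below is an assumption-laden helper, not part of the probe: `blob_digests` is a hypothetical name, it requires `jq`, and it expects a single image manifest (not an index) on stdin.

```bash
#!/usr/bin/env bash
# blob_digests: read one OCI/Docker image manifest on stdin and print the
# config digest followed by each layer digest, one per line.
blob_digests() {
  jq -r '(.config.digest // empty), (.layers[]? | .digest // empty)'
}

# Possible use with the variables from the block above (not executed here):
#   curl -sk -u "$USER:$PASS" \
#     -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
#     "https://$REG/v2/$IMAGE/manifests/$d" | blob_digests |
#   while read -r b; do
#     curl -sk -u "$USER:$PASS" -o /dev/null -w "$b %{http_code}\n" \
#       -I "https://$REG/v2/$IMAGE/blobs/$b"
#   done
```

`HEAD /v2/<name>/blobs/<digest>` is the standard OCI distribution endpoint for blob existence; any non-200 there points at the same failure mode.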
|
||||
|
||||
The log of the probe's most recent run is also a fast way to see what's affected:
|
||||
|
||||
```sh
|
||||
kubectl -n monitoring logs \
  $(kubectl -n monitoring get pods -l job-name \
      --sort-by=.metadata.creationTimestamp -o name \
    | grep forgejo-integrity-probe | tail -1)
|
||||
```
|
||||
|
||||
## Phase 2 — Rebuild and re-push
|
||||
|
||||
Forgejo lets you delete a specific package version through the API.
|
||||
Doing this **before** the rebuild ensures the new push doesn't
|
||||
collide with the half-broken existing entry.
|
||||
|
||||
```sh
|
||||
# Delete the broken version (replace TAG with the actual tag).
|
||||
curl -X DELETE -H "Authorization: token $(vault kv get -field=forgejo_cleanup_token secret/viktor)" \
|
||||
"https://$REG/api/v1/packages/viktor/container/$(basename $IMAGE)/$TAG"
|
||||
```
|
||||
|
||||
Rebuild via Woodpecker (manual run if the pipeline isn't triggered
|
||||
by a code change):
|
||||
|
||||
1. Open `https://ci.viktorbarzin.me/repos/<repo>/manual` for the
|
||||
project.
|
||||
2. Click **Run pipeline** with `branch=master`.
|
||||
3. Wait for the build-and-push step to complete.
|
||||
4. Confirm the new version is visible in Forgejo Web UI under
|
||||
`viktor/<image>` → Packages → Container.
|
||||
|
||||
## Phase 3 — Restart consumers
|
||||
|
||||
Pods that already cached the broken digest may continue using it.
|
||||
Force a fresh pull:
|
||||
|
||||
```sh
|
||||
kubectl rollout restart deploy/<service> -n <ns>
|
||||
```
|
||||
|
||||
If the pod still fails, the new manifest digest may not have
|
||||
propagated through containerd's cache. Drain + restart containerd on
|
||||
the affected node:
|
||||
|
||||
```sh
|
||||
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
|
||||
ssh wizard@<node> sudo systemctl restart containerd
|
||||
kubectl uncordon <node>
|
||||
```
|
||||
|
||||
## Phase 4 — Verify integrity recovery
|
||||
|
||||
The next probe run (every 15 min) will report:
|
||||
|
||||
```
|
||||
registry_manifest_integrity_failures{instance="forgejo.viktorbarzin.me"} 0
|
||||
```
|
||||
|
||||
The `RegistryManifestIntegrityFailure` alert resolves automatically
|
||||
30 minutes after the metric goes back to 0.
|
||||
|
||||
## Why this happens
|
||||
|
||||
Forgejo's OCI registry stores blobs in its own DB+filesystem. Unlike
|
||||
`registry:2` + `distribution`, it doesn't have the
|
||||
[`distribution#3324`](https://github.com/distribution/distribution/issues/3324)
|
||||
GC-vs-tag-delete race. But it can still reach a broken state if:
|
||||
|
||||
- The retention CronJob deletes a version while a pull is in flight
|
||||
on the same digest.
|
||||
- The PVC fills up mid-push (`docs/runbooks/restore-pvc-from-backup.md`).
|
||||
- A Forgejo upgrade migrates the package schema and a row is dropped.
|
||||
|
||||
In all cases the recovery procedure is identical: delete the broken
|
||||
version through the API, rebuild from source, force consumers to
|
||||
re-pull.
|
||||
163
docs/runbooks/forgejo-registry-setup.md
Normal file
|
|
@ -0,0 +1,163 @@
|
|||
# Runbook: Forgejo OCI registry — initial setup
|
||||
|
||||
Last updated: 2026-05-07
|
||||
|
||||
This runbook covers the **one-time** bootstrap of Forgejo's container
|
||||
registry, executed during Phase 0 of the registry consolidation plan
|
||||
(`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md`).
|
||||
|
||||
After this runbook is complete, the Forgejo OCI registry at
|
||||
`forgejo.viktorbarzin.me` accepts pushes from CI and pulls from the
|
||||
cluster, with retention and integrity monitoring in place.
|
||||
|
||||
## Order of operations
|
||||
|
||||
The Terraform stacks reference Vault keys that don't exist on a fresh
|
||||
cluster. Create the keys **before** running `scripts/tg apply`.
|
||||
|
||||
1. Apply the resource bumps (memory, PVC, ingress body size,
|
||||
packages env vars) — these don't depend on the new Vault keys.
|
||||
2. Create the service-account users + PATs in Forgejo.
|
||||
3. Push the PATs to Vault.
|
||||
4. Apply the rest of Phase 0 (registry-credentials extension,
|
||||
monitoring probe, retention CronJob).
|
||||
|
||||
### Step 1 — apply Forgejo deployment bumps
|
||||
|
||||
```bash
|
||||
cd infra/stacks/forgejo
|
||||
scripts/tg apply
|
||||
```
|
||||
|
||||
Wait for the new pod to come up at the bumped 1Gi memory request and
|
||||
the resized 15Gi PVC. Verify packages are enabled:
|
||||
|
||||
```bash
|
||||
kubectl exec -n forgejo deploy/forgejo -- forgejo manager flush-queues
|
||||
kubectl exec -n forgejo deploy/forgejo -- env | grep PACKAGES
|
||||
```
|
||||
|
||||
### Step 2 — create service-account users
|
||||
|
||||
`forgejo admin user create` is not idempotent: re-running it for an
existing user errors out. That's fine; skip it on rerun. Pass
`--must-change-password=false` so the account is usable without a
forced password change at first login.
|
||||
|
||||
```bash
|
||||
# cluster-puller: read:package PAT for in-cluster pulls.
# Capture the generated password; Step 3 needs it for the Web UI login.
PULLER_PW="$(openssl rand -base64 24)"
kubectl exec -n forgejo deploy/forgejo -- \
  forgejo admin user create \
    --username cluster-puller \
    --email cluster-puller@viktorbarzin.me \
    --password "$PULLER_PW" \
    --must-change-password=false
echo "cluster-puller temporary password: $PULLER_PW"

# ci-pusher: write:package PAT for CI dual-push. The same user also
# holds the cleanup CronJob PAT (write:package includes delete).
PUSHER_PW="$(openssl rand -base64 24)"
kubectl exec -n forgejo deploy/forgejo -- \
  forgejo admin user create \
    --username ci-pusher \
    --email ci-pusher@viktorbarzin.me \
    --password "$PUSHER_PW" \
    --must-change-password=false
echo "ci-pusher temporary password: $PUSHER_PW"
|
||||
```
|
||||
|
||||
The user passwords are throwaway — we only ever auth via PAT. Forgejo
|
||||
admin can reset them at any time from the Web UI.
|
||||
|
||||
### Step 3 — generate the PATs
|
||||
|
||||
PATs **must** be generated through the Web UI logged in as the
|
||||
respective user (the CLI doesn't expose token creation). To log in
|
||||
without OAuth (registration is disabled for everyone except `viktor`,
|
||||
the admin), use the per-user temporary password from step 2.
|
||||
|
||||
For each of `cluster-puller` and `ci-pusher`:
|
||||
|
||||
1. Sign out of `viktor`.
|
||||
2. Go to `https://forgejo.viktorbarzin.me/user/login` and sign in
|
||||
with the throwaway password.
|
||||
3. Settings → Applications → Generate new token.
|
||||
4. Name: `cluster-pull` / `ci-push`. **Expiration: never.**
|
||||
5. Scopes:
|
||||
- `cluster-puller`: `read:package`
|
||||
- `ci-pusher`: `write:package` (covers read+write+delete)
|
||||
6. Save the token shown on the next page — it is **not** displayed again.
|
||||
|
||||
For the cleanup CronJob, generate a third PAT on `ci-pusher`:
|
||||
|
||||
7. Repeat steps 4-6 with name `cleanup`, scope `write:package`.
|
||||
|
||||
### Step 4 — push PATs to Vault
|
||||
|
||||
```bash
|
||||
vault login -method=oidc
|
||||
|
||||
# Read-only, used by the cluster-wide registry-credentials Secret and
|
||||
# by the Forgejo integrity probe.
|
||||
vault kv patch secret/viktor \
|
||||
forgejo_pull_token=<paste cluster-puller PAT>
|
||||
|
||||
# Write+delete, used by the retention CronJob inside Forgejo's
|
||||
# namespace.
|
||||
vault kv patch secret/viktor \
|
||||
forgejo_cleanup_token=<paste ci-pusher cleanup PAT>
|
||||
|
||||
# Write, propagated by vault-woodpecker-sync to all Woodpecker repos.
|
||||
vault kv patch secret/ci/global \
|
||||
forgejo_user=ci-pusher \
|
||||
forgejo_push_token=<paste ci-pusher push PAT>
|
||||
```
|
||||
|
||||
### Step 5 — apply the rest of Phase 0
|
||||
|
||||
```bash
|
||||
# Registry credential Secret (now reads forgejo_pull_token).
|
||||
cd infra/stacks/kyverno && scripts/tg apply
|
||||
|
||||
# Monitoring probe + retention CronJob.
|
||||
cd infra/stacks/monitoring && scripts/tg apply
|
||||
cd infra/stacks/forgejo && scripts/tg apply
|
||||
|
||||
# Containerd hosts.toml on each existing k8s node — VM cloud-init
|
||||
# only fires on first boot.
|
||||
infra/scripts/setup-forgejo-containerd-mirror.sh
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Login from a workstation with docker.
|
||||
echo "<ci-pusher PAT>" | docker login forgejo.viktorbarzin.me -u ci-pusher --password-stdin
|
||||
|
||||
# Push a smoketest image.
|
||||
docker pull alpine:3.20
|
||||
docker tag alpine:3.20 forgejo.viktorbarzin.me/viktor/smoketest:1
|
||||
docker push forgejo.viktorbarzin.me/viktor/smoketest:1
|
||||
|
||||
# Pull from a k8s node.
|
||||
ssh wizard@<node> sudo crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1
|
||||
|
||||
# Confirm the cluster-wide Secret was synced into a fresh namespace.
|
||||
kubectl create namespace forgejo-smoketest
|
||||
kubectl get secret -n forgejo-smoketest registry-credentials \
|
||||
-o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths | keys'
|
||||
# Expect: ["10.0.20.10:5050", "forgejo.viktorbarzin.me",
|
||||
# "registry.viktorbarzin.me", "registry.viktorbarzin.me:5050"]
|
||||
kubectl delete namespace forgejo-smoketest
|
||||
|
||||
# Delete the smoketest package via API.
|
||||
curl -X DELETE -H "Authorization: token <ci-pusher cleanup PAT>" \
|
||||
https://forgejo.viktorbarzin.me/api/v1/packages/viktor/container/smoketest/1
|
||||
```
|
||||
|
||||
## When to revisit
|
||||
|
||||
- **PAT rotation**: PATs created here have no expiry by design. If a
  PAT leaks, regenerate it via the Web UI and `vault kv patch` the new
  value into the same key. The next `terragrunt apply` syncs the
  cluster Secret within minutes (the Kyverno ClusterPolicy clones it);
  Woodpecker repos pick it up on the next vault-woodpecker-sync run
  (every 6h).
|
||||
- **New service account**: if a future workload needs different
|
||||
scopes, add a parallel user/PAT here rather than expanding existing
|
||||
PAT scope. Principle of least privilege.
|
||||
47
docs/runbooks/grow-pve-nfs-lv.md
Normal file
|
|
@ -0,0 +1,47 @@
|
|||
# Runbook: Grow `/srv/nfs` LV (`pve/nfs-data`)
|
||||
|
||||
Use when `/srv/nfs` on the PVE host is filling up and the workloads writing to it cannot be slimmed down. The LV sits on the LVM-thin pool `pve/data` (10.54 TB total). Thin-pool free space is the real gate — confirm before extending.
|
||||
|
||||
## When to use
|
||||
|
||||
- `df -h /srv/nfs` shows usage > ~85 % and projected growth exceeds free space within a backup retention window.
|
||||
- An upcoming bulk write (media import, restore) needs headroom that the current free space won't absorb.
|
||||
|
||||
## Steps
|
||||
|
||||
1. **Check thin-pool headroom on PVE host:**
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 'lvs pve/data; lvs pve/nfs-data; df -h /srv/nfs'
|
||||
```
|
||||
|
||||
The `pve/data` thin pool's `Data%` should leave room for the extension (target `Data%` after extend < 90 %).
|
||||
|
||||
2. **Extend the LV and online-resize ext4:**
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 '
|
||||
lvextend -L +1T pve/nfs-data &&
|
||||
resize2fs /dev/pve/nfs-data
|
||||
'
|
||||
```
|
||||
|
||||
Both commands are safe online: `lvextend` only grows allocation, `resize2fs` extends ext4 while mounted.
|
||||
|
||||
3. **Verify:**
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 'lvs pve/nfs-data; df -h /srv/nfs'
|
||||
```
|
||||
|
||||
`df` should show the new size; `Use%` should drop proportionally.
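
The headroom check in step 1 can be turned into a rough number. The sketch below is a worst-case estimate under a stated assumption: it treats the whole extension as eventually filled, even though a thin LV only consumes pool space as data is written. `projected_data_pct` is a hypothetical helper; the pool size and current `Data%` are read off `lvs pve/data` by hand.

```bash
#!/usr/bin/env bash
# projected_data_pct POOL_SIZE_TIB CURRENT_DATA_PCT EXTEND_TIB
# Prints the pool Data% if the whole extension were written to.
projected_data_pct() {
  awk -v size="$1" -v pct="$2" -v add="$3" 'BEGIN {
    if (size <= 0) { exit 1 }            # guard against a bad pool size
    printf "%.1f\n", pct + add / size * 100
  }'
}
```

With a 10 TiB pool at 70 % and a 1 TiB extension, `projected_data_pct 10 70 1` reports 80.0, which stays under the 90 % target.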
|
||||
|
||||
## Notes
|
||||
|
||||
- **Not Terraform-managed.** PVE host LVs live outside the IaC tree (no `infra/stacks/pve-host/`). Record the new size in `docs/architecture/storage.md` (the "HDD NFS" line and the diagram label) in the same commit.
|
||||
- **Thin-pool overcommit warning** from `lvextend` is informational — it reports the sum of all thin volume virtual sizes (currently ~12 TiB) vs. the physical pool (10.7 TiB). Real fill is `pve/data` `Data%`; ignore the overcommit warning unless `Data%` itself is climbing toward 100 %.
|
||||
- **`/srv/nfs-ssd`** lives on a separate LV (`ssd/nfs-ssd-data`) backed by SSDs — the same `lvextend`/`resize2fs` pattern applies, but the source pool is `ssd/data`.
|
||||
|
||||
## Backout
|
||||
|
||||
ext4 cannot be shrunk while mounted, and shrinking under active workloads risks data loss. Don't try to shrink `pve/nfs-data` in place; restore from snapshot or migrate data out and rebuild the LV instead.
|
||||
83
docs/runbooks/immich-transcode-bitrate.md
Normal file
|
|
@ -0,0 +1,83 @@
|
|||
# Runbook: Immich 4K video stutters on playback/download
|
||||
|
||||
## Symptom
|
||||
High-resolution (4K) videos stutter when streamed in the Immich mobile app or
|
||||
downloaded — for **both** local-LAN and remote-internet clients.
|
||||
|
||||
## Root cause (diagnosed 2026-06-01)
|
||||
Immich's transcoding was set to `ffmpeg.targetResolution=original` with
|
||||
`maxBitrate=0` (no cap) and `preset=ultrafast`. The GPU (NVENC) faithfully
|
||||
re-encoded 4K sources to **4K H.264**, and `ultrafast` is so inefficient it
|
||||
produced **77–264 Mbps** "optimized" files — often larger than the originals.
|
||||
|
||||
The mobile app streams that `encoded-video` copy. A 100 Mbps stream needs
|
||||
~12.5 MB/s sustained. All Immich video lives on `/srv/nfs/immich/{library,encoded-video}`
|
||||
→ `pve-nfs-data` LV → the **shared 7200rpm `sdc` thin pool** (same pool as every
|
||||
VM disk + etcd), reached over inter-VLAN NFS. Measured: a single cold read got
|
||||
42–54 MB/s, but under 3 concurrent reads it collapsed to 17–24 MB/s each — and
|
||||
real seeky multi-user playback drops below the needed bitrate → buffer underrun.
|
||||
Remotely, 100 Mbps simply exceeds typical home **upload** bandwidth.
|
||||
|
||||
So the "transcode" was making streaming *worse*, not better.
|
||||
|
||||
## Fix
|
||||
Transcode config is **DB-managed** (`system_metadata` key `system-config`, JSONB —
|
||||
NOT Terraform). Apply via the system-config API (broadcasts a live reload — no pod
|
||||
restart). Keep 4K, cap the bitrate, use an efficient preset:
|
||||
|
||||
```
|
||||
ffmpeg.maxBitrate : "0" -> "20000k" # ~20 Mbps cap (2.5 MB/s)
|
||||
ffmpeg.preset : "ultrafast"-> "medium" # ~2-3x more efficient
|
||||
ffmpeg.transcode : "required" -> "bitrate" # transcode anything >maxBitrate or non-h264
|
||||
ffmpeg.targetResolution : "original" # unchanged — 4K preserved
|
||||
ffmpeg.accel=nvenc, accelDecode=true # unchanged
|
||||
```
|
||||
|
||||
GET the full config, change only these keys, PUT it back (preserves SMTP/OAuth
|
||||
secrets). Admin API key works; `me@viktorbarzin.me`'s homepage-widget token in
|
||||
`immich-secrets.homepage_credentials.immich.token` has admin write.
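
The GET, edit, PUT round trip can be sketched with `jq` so that only the three keys change and everything else (including secrets) passes through untouched. This is an illustration under assumptions: `patch_transcode_config` is a hypothetical helper, `IMMICH_URL` and `IMMICH_API_KEY` are placeholders, and the `/api/system-config` path with an `x-api-key` header should be checked against the running Immich version.

```bash
#!/usr/bin/env bash
# patch_transcode_config: read the full system config JSON on stdin and
# print it with only the three transcode keys changed. Requires jq.
patch_transcode_config() {
  jq '.ffmpeg.maxBitrate  = "20000k"
      | .ffmpeg.preset    = "medium"
      | .ffmpeg.transcode = "bitrate"'
}

# Hypothetical round trip (not executed here; endpoint is an assumption):
#   curl -fsS -H "x-api-key: $IMMICH_API_KEY" "$IMMICH_URL/api/system-config" \
#     | patch_transcode_config \
#     | curl -fsS -X PUT -H "x-api-key: $IMMICH_API_KEY" \
#         -H 'Content-Type: application/json' --data-binary @- \
#         "$IMMICH_URL/api/system-config"
```

Saving the GET response to a file first gives a rollback copy if the new settings need to be reverted.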
|
||||
|
||||
**Originals are never touched** — only the `encoded-video/` streaming copy changes.
|
||||
|
||||
## Apply the new policy to EXISTING videos
|
||||
Config changes only affect new/missing transcodes. `videoConversion force=false`
|
||||
("Missing") only fills assets lacking a transcode row; it does NOT re-touch existing
|
||||
oversized ones. `force=true` ("All") re-does all ~11k (wasteful). To regenerate only
|
||||
the **non-conforming** subset:
|
||||
|
||||
1. Identify offenders: existing `encoded_video` files whose bitrate > 20 Mbps.
|
||||
Bitrate = filesize×8 ÷ `asset.duration` (codec/bitrate are NOT in the DB; size is
|
||||
on disk, filename = `<assetId>.mp4`). ~3296 offenders / 268 GB on 2026-06-01.
|
||||
2. Delete their derived rows (regenerable; never originals):
|
||||
`DELETE FROM asset_file WHERE type='encoded_video' AND "assetId" = ANY(:offenders);`
|
||||
This makes them "missing." The deterministic `<assetId>.mp4` path is overwritten on
|
||||
regen (reclaims space).
|
||||
3. Trigger `PUT /api/jobs/videoConversion {"command":"start","force":false}`.
|
||||
**Gotcha (seen 2026-06-02):** the enqueue is an async background scan. If a prior
|
||||
scan is still in-flight when you delete the rows, the freshly-missing assets get
|
||||
MISSED and the queue drains early (only 11/3296 offenders were picked up on the
|
||||
first pass). After the queue first reaches `waiting:0`, **re-trigger `force=false`
|
||||
once while the queue is idle** and confirm the still-missing/offender count actually
|
||||
dropped — a fresh scan enqueues anything missed.
|
||||
4. Per-asset API (`POST /api/assets/jobs`) is owner-scoped (admin can't drive other
|
||||
users' assets) — hence the delete-then-missing approach via the admin global job.
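
The offender test in step 1 is plain arithmetic: bitrate = filesize × 8 ÷ duration. A minimal sketch follows; the helper names are hypothetical, Mbps means 10^6 bit/s, and it assumes the duration is available either in seconds or as an `HH:MM:SS[.fff]` string (check how `asset.duration` is actually stored before relying on the converter).

```bash
#!/usr/bin/env bash
# hms_to_seconds "HH:MM:SS[.fff]" -> seconds
hms_to_seconds() {
  awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }' <<< "$1"
}

# bitrate_mbps SIZE_BYTES DURATION_SECONDS -> average bitrate in Mbps
bitrate_mbps() {
  awk -v bytes="$1" -v secs="$2" 'BEGIN {
    if (secs <= 0) { exit 1 }          # unknown duration: cannot compute
    printf "%.2f\n", bytes * 8 / secs / 1000000
  }'
}

# is_offender SIZE_BYTES DURATION_SECONDS [CAP_MBPS]
# Exit 0 when the file's average bitrate is strictly above the cap (default 20).
is_offender() {
  awk -v bytes="$1" -v secs="$2" -v cap="${3:-20}" 'BEGIN {
    if (secs <= 0) { exit 1 }
    exit !(bytes * 8 / secs / 1000000 > cap)
  }'
}
```

For instance, a 125,000,000-byte file lasting 10 seconds works out to 100.00 Mbps, so `is_offender 125000000 10` succeeds and that asset would go on the regeneration list.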
|
||||
|
||||
## Verify
|
||||
- New output bitrate: `ffprobe -show_entries format=bit_rate` on a freshly-written
|
||||
`encoded-video/*.mp4` → should be ≤ ~20 Mbps (was 77–264).
|
||||
- Progress: `SELECT count(*) FROM asset_file WHERE type='encoded_video';` rises as
|
||||
regeneration proceeds.
|
||||
|
||||
## Monitor while it runs (concurrency 1, can take 1–3 days)
|
||||
- `videoConversion` runs at concurrency **1** (Immich default; gentle — do NOT raise,
|
||||
protects sdc). Thumbnail/metadata/library are capped to 2 for the same reason.
|
||||
- Watch sdc (`iostat -x` on 192.168.1.127) and apiserver latency
|
||||
(`kubectl get --raw=/healthz`). The risk is sdc saturation → etcd starvation →
|
||||
apiserver down (precedent: `post-mortems/2026-05-25-immich-anca-elements-io-storm.md`).
|
||||
Healthy baseline during this job: sdc ~70% util, apiserver <100 ms.
|
||||
- Pause if it suffers: `PUT /api/jobs/videoConversion {"command":"pause"}`; resume with
|
||||
`{"command":"resume"}`.
|
||||
|
||||
## Real fix for the root contention
|
||||
This is mitigation. The durable fix is moving Immich video storage (or the VM disks)
|
||||
off the shared `sdc` 7200rpm pool — tracked in beads `code-oflt` (IO isolation).
|
||||
317
docs/runbooks/job-hunter.md
Normal file
|
|
@ -0,0 +1,317 @@
|
|||
# Runbook: job-hunter — passive job + comp scraper
|
||||
|
||||
Last updated: 2026-06-02
|
||||
|
||||
`job-hunter` is a passive job-market + compensation scraper in the `job-hunter`
|
||||
namespace. It pulls open roles from ATS boards (Greenhouse / Lever / Ashby),
|
||||
HN "Who is hiring", and levels.fyi comp medians into a CNPG Postgres DB, and
|
||||
serves agent-friendly CLI queries (used by the `job-hunter` Claude skill). As
|
||||
of 2026-06-02 it also accumulates **dated snapshots** so comp and hiring-volume
|
||||
trends can be tracked over time.
|
||||
|
||||
## Where things live
|
||||
|
||||
| Thing | Location |
|---|---|
| Source code | Forgejo `https://forgejo.viktorbarzin.me/viktor/job-hunter` (NOT in the monorepo) |
| Image | `forgejo.viktorbarzin.me/viktor/job-hunter:latest` (CI builds on push; Keel rolls the Deployment) |
| Terraform stack | `infra/stacks/job-hunter/` (`main.tf` = Deployment/Service/ESO; `cronjob.tf` = weekly refresh) |
| Database | `pg-cluster-rw.dbaas.svc.cluster.local:5432/job_hunter`, role `job_hunter` (Vault `static-creds/pg-job-hunter`, 7d rotation) |
| App secrets | Vault `secret/job-hunter` → `webhook_bearer_token`, `cdio_api_key`, `smtp_username/password`, `digest_to/from_address` |
| Grafana | `https://grafana.viktorbarzin.me` → datasource **Job Hunter** (PG, read-only) |
| Claude skill | `~/.claude/skills/job-hunter/SKILL.md` |
| Weekly scrape | CronJob `job-hunter-refresh`, **Sundays 04:00 UTC** |
|
||||
|
||||
## Architecture
|
||||
|
||||
- **Sources** (`job_hunter/sources/`): `ats` (Greenhouse/Lever/Ashby JSON APIs, ~35 companies in `config/companies.yaml`), `hn` (Algolia), `levels_fyi` (comp medians), `linkedin_guest` (opt-in), `changedetection` (`/webhook/cdio` for non-ATS careers pages in `config/cdio_watches.yaml`).
|
||||
- **Tables**: `companies`, `roles`, `comp_points`, `levels`, `fx_rates` (upsert-in-place, "current state"); `comp_snapshots`, `roles_snapshots` (append-only, one row per source-row per `snapshot_date` — the dated series). Snapshots are written as a side-effect of every upsert during a refresh.
|
||||
- **The ATS fetch is resilient**: a board returning a permanent 4xx (404/410/403) is skipped with a warning; 5xx/network errors retry once then skip. One dead board cannot abort the whole run (regression fixed 2026-06-02 — Elastic's 404 had been taking down every refresh). Boards are fetched concurrently (bounded semaphore, default 8 in-flight).
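
The skip/retry policy in the last bullet can be illustrated in shell. This is not the project's Python implementation, only a sketch of the same decision table: `classify_status` and `fetch_board` are hypothetical names, the 30-second timeout is illustrative, and treating every other 4xx as "skip" is an assumption the runbook does not spell out.

```bash
#!/usr/bin/env bash
# classify_status CODE -> prints ok | skip | retry
classify_status() {
  case "$1" in
    2??)          echo ok ;;
    403|404|410)  echo skip ;;    # permanent: board renamed or removed
    5??|000)      echo retry ;;   # server error, or curl's 000 for network failure
    *)            echo skip ;;    # assumption: anything else is not worth retrying
  esac
}

# fetch_board URL: print the body on success; return 1 when the board is skipped.
fetch_board() {
  local url="$1" attempt code body
  body="$(mktemp)"
  for attempt in 1 2; do          # first try plus a single retry
    code="$(curl -s -o "$body" -w '%{http_code}' --max-time 30 "$url")" || code=000
    case "$(classify_status "$code")" in
      ok)    cat "$body"; rm -f "$body"; return 0 ;;
      skip)  break ;;
      retry) ;;                   # loop once more, then fall through to skip
    esac
  done
  rm -f "$body"
  echo "skipping $url (HTTP $code)" >&2
  return 1
}
```

Because `fetch_board` returns non-zero instead of aborting, a caller looping over many boards can log the skip and continue with the next one.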
|
||||
|
||||
---
|
||||
|
||||
## OPS
|
||||
|
||||
### Is it healthy?
|
||||
|
||||
```bash
|
||||
# CronJob exists + last schedule/success
|
||||
kubectl -n job-hunter get cronjob job-hunter-refresh
|
||||
# Most recent run's pods + logs
|
||||
kubectl -n job-hunter get jobs -l app=job-hunter --sort-by=.metadata.creationTimestamp
|
||||
kubectl -n job-hunter logs -l job-name=$(kubectl -n job-hunter get jobs --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].metadata.name}')
|
||||
# Deployment (serves the CLI / webhook) is up
|
||||
kubectl -n job-hunter get deploy job-hunter
|
||||
# Data freshness — newest snapshot date should advance weekly
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter report --days 7 | jq '.source_mix'
|
||||
```
|
||||
|
||||
Row-count sanity (via the read-only Grafana datasource or a direct exec):
|
||||
|
||||
```bash
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -c "import job_hunter" # smoke
|
||||
```
|
||||
|
||||
### Manual refresh (off-schedule)
|
||||
|
||||
```bash
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- \
|
||||
python -m job_hunter refresh --source ats --source hn --source levels_fyi
|
||||
```
|
||||
|
||||
Or trigger the CronJob immediately:
|
||||
|
||||
```bash
|
||||
kubectl -n job-hunter create job --from=cronjob/job-hunter-refresh jh-manual-$(date +%s)
|
||||
```
|
||||
|
||||
### Seed / re-snapshot the dated series
|
||||
|
||||
Snapshots are written automatically on every refresh. To seed a baseline from
|
||||
the current tables (idempotent — one row per source-row per day):
|
||||
|
||||
```bash
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot
|
||||
# back-date a snapshot if needed:
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot --date 2026-06-01
|
||||
```
|
||||
|
||||
### Add an ATS company
|
||||
|
||||
ATS companies are scraped from `config/companies.yaml` in the **Forgejo repo**
|
||||
(not the monorepo). To add one:
|
||||
|
||||
1. Live-probe the slug returns HTTP 200 with London roles before adding it:
|
||||
```bash
|
||||
curl -s "https://boards-api.greenhouse.io/v1/boards/<slug>/jobs?content=true" -o /dev/null -w '%{http_code}\n'
|
||||
# Lever: https://api.lever.co/v0/postings/<slug>?mode=json
|
||||
# Ashby: https://api.ashbyhq.com/posting-api/job-board/<slug>?includeCompensation=true
|
||||
```
|
||||
2. Add a `{slug, display_name, ats_type, ats_id, careers_url}` block to `config/companies.yaml`, commit, push.
|
||||
3. CI builds the image; Keel rolls the Deployment. The next refresh picks it up. (No Terraform change — config ships in the image.)
|
||||
|
||||
A board that later starts 404ing is skipped automatically; remove its entry
|
||||
when the 404 is permanent (keeps logs clean).
|
||||
|
||||
### Add a changedetection.io watch (non-ATS firms)
|
||||
|
||||
Firms without a public ATS JSON API (Citadel, Two Sigma, G-Research, HRT, xAI,
|
||||
Wise, Revolut, …) are diff-monitored via CDIO. Add to `config/cdio_watches.yaml`
|
||||
in the Forgejo repo, then reconcile:
|
||||
|
||||
```bash
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed --dry-run # preview
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed # create
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-reconcile # list
|
||||
```
|
||||
|
||||
Changes hit `/webhook/cdio`; comp/role extraction from the diff is manual or
|
||||
LLM-side (CDIO only captures the changed text).
|
||||
|
||||
### Deploying (build triggers the rollout)
|
||||
|
||||
Deploys are **automatic on push to master** — we build the image, so CI also
|
||||
drives the rollout (`.woodpecker.yml`: `build-and-push` tags `latest` +
|
||||
`${CI_COMMIT_SHA:0:8}`, then a `deploy` step runs
|
||||
`kubectl set image deployment/job-hunter ...:${SHA}` + `rollout status`). The
|
||||
woodpecker-agent SA is cluster-admin, so no kubeconfig/RBAC is wired into the
|
||||
step. Keel stays enrolled in parallel as a redundant net (finds the SHA already
|
||||
running → no-op). So to ship code:
|
||||
|
||||
```bash
|
||||
# in the job-hunter source repo (forgejo viktor/job-hunter)
|
||||
git push origin master # → lint+test → build (latest + :<sha>) → set image → rollout
|
||||
```
|
||||
|
||||
The **Deployment** rolls to the just-built `:<sha>`. The **CronJob** runs
|
||||
`:latest` with `imagePullPolicy: Always`, so its next scheduled pod pulls the
|
||||
newest image (no rollout needed for a CronJob). `image_tag = "latest"` in
|
||||
`terragrunt.hcl` is just the TF baseline; the running Deployment image is
|
||||
whatever CI last set (`kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'`).
|
||||
|
||||
**Versioning** is still semver — bump `pyproject.toml` and cut a `git tag
|
||||
vX.Y.Z` to mark a release; that's the human version record, independent of the
|
||||
`:<sha>` deploy tag (map a running SHA back to a version with `git describe`).
|
||||
|
||||
**Rollback**: `kubectl -n job-hunter rollout undo deployment/job-hunter` (last
|
||||
ReplicaSet), or push a revert commit (CI redeploys the reverted SHA).
|
||||
|
||||
### Applying the Terraform stack
|
||||
|
||||
```bash
|
||||
cd infra/stacks/job-hunter
|
||||
scripts/tg plan # vault login -method=oidc first
|
||||
scripts/tg apply
|
||||
```
|
||||
|
||||
The DB password rotates every 7 days (Vault static role `pg-job-hunter`);
|
||||
Reloader restarts the Deployment when the ESO-synced secret changes. The
|
||||
Grafana datasource password is mirrored via a second ExternalSecret in the
|
||||
`monitoring` namespace.
|
||||
|
||||
### Common failures
|
||||
|
||||
| Symptom | Cause | Fix |
|---|---|---|
| Refresh job `Error`, log shows `ats: skipping company=X — HTTP 404` | A board slug was renamed/removed | Expected — the run continues. Remove the dead slug from `companies.yaml` if permanent. |
| Refresh aborts with a traceback before any company | Pre-2026-06-02 image (no skip-on-404) | Confirm Keel rolled the new image: `kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'`. |
| `snapshot` / refresh fails: `relation "job_hunter.comp_snapshots" does not exist` | Migration 0004 not applied | The CronJob + Deployment run `migrate` on start. Run `kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter migrate`. |
| `/webhook/cdio` returns 401 | `webhook_bearer_token` mismatch between Vault and the CDIO notification URL | Re-run `cdio-seed` after rotating the token; it rebuilds the `jsons://...?+Authorization=` URL. |
| Non-GBP comp looks wrong / NULL | `fx_rates` gap for the role's `posted_at` date | `kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter backfill-fx --days 30` |
| Job OOMKilled | levels.fyi HTML parse spike across many companies | Bump the CronJob container memory limit in `cronjob.tf` (currently 1Gi). |
|
||||
|
||||
---
|
||||
|
||||
## ANALYST
|
||||
|
||||
### Weekly above-target Slack alert
|
||||
|
||||
The `job-hunter-alert` CronJob (Sundays 05:00 UTC, an hour after the refresh)
|
||||
posts to Slack the companies whose London p50 total comp **≥ £500k**, flagging
|
||||
any that **newly crossed** since last week's snapshot. Threshold is the
|
||||
`--threshold` arg in `cronjob.tf` (default 500000 — well above the ~£267k move
|
||||
floor, so only clearly-exceptional comp pings). Slack webhook comes from Vault
|
||||
`secret/job-hunter` → `slack_webhook_url` (seeded from the shared workspace
|
||||
webhook → currently posts to the same channel as Keel; repoint to a dedicated
|
||||
channel by `vault kv patch secret/job-hunter slack_webhook_url=<url>`).
|
||||
|
||||
```bash
|
||||
# Preview the message without posting
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter alert --stdout
|
||||
# Different bar / location
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- \
|
||||
python -m job_hunter alert --threshold 350000 --location london --stdout
|
||||
# Fire it now (posts to Slack)
|
||||
kubectl -n job-hunter create job --from=cronjob/job-hunter-alert jh-alert-manual
|
||||
```
|
||||
|
||||
`newly_crossed` needs ≥2 snapshot dates — it's empty until the second weekly
|
||||
run accumulates. To change the standing threshold, edit `--threshold` in
|
||||
`infra/stacks/job-hunter/cronjob.tf` and apply.
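
The threshold and "newly crossed" rules above can be sketched in a few lines. This is a minimal illustration, not `job_hunter`'s actual implementation: the function name and input shape are hypothetical, and treating a company that was absent last week as newly crossed is an assumption.

```python
from decimal import Decimal


def above_and_newly_crossed(previous, current, threshold=Decimal("500000")):
    """Return (above, newly_crossed) company slugs for one location.

    previous/current map company slug -> p50 total comp in GBP for last
    week's and this week's snapshot. Hypothetical helper for illustration.
    """
    above = sorted(c for c, p50 in current.items() if p50 >= threshold)
    if not previous:
        # Only one snapshot date exists, so nothing can be "newly" crossed.
        return above, []
    # Assumption: a company missing from last week's snapshot counts as new.
    newly = [c for c in above if previous.get(c, Decimal(0)) < threshold]
    return above, newly
```

With an empty `previous` mapping the second list is always empty, which mirrors the "needs ≥2 snapshot dates" behaviour described above.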
|
||||
|
||||
### The periodic "market leaders in comp" report
|
||||
|
||||
This is the headline command — current leaders by p50 total comp, week-over-week
|
||||
movers, new entrants, open-role counts, and sample-size caveats:
|
||||
|
||||
```bash
|
||||
# London senior leaders, human-readable
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- \
|
||||
python -m job_hunter analyze --level senior --top-n 10
|
||||
# All levels, JSON for downstream tools
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- \
|
||||
python -m job_hunter analyze --format json
|
||||
```
|
||||
|
||||
`--trend-weeks N` sets the movers comparison window (default 12). Movers report
|
||||
`available: false` until at least two snapshot dates spanning the window exist —
|
||||
the series starts accumulating from the first refresh after 2026-06-02, so
|
||||
12-week movers become meaningful around late August 2026.
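
As a quick check of the "late August 2026" estimate, the date arithmetic can be worked through directly. The helper name is hypothetical, and 2026-06-02 is used as a lower bound for the first snapshot date (the first refresh happens on or after it).

```python
from datetime import date, timedelta


def movers_window_covered_on(first_snapshot: date, trend_weeks: int = 12) -> date:
    """Earliest date on which a full trend window of history can exist."""
    return first_snapshot + timedelta(weeks=trend_weeks)


print(movers_window_covered_on(date(2026, 6, 2)))  # 2026-08-25
```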
|
||||
|
||||
### Query recipes
|
||||
|
||||
```bash
|
||||
# Salary band for a slice
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter bands --title 'staff'
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --level senior
|
||||
# Per-(company, level) comp table
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-table --location london
|
||||
# Open roles, highest-confidence comp first
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter query --title sre --with-salary --limit 20
|
||||
# Compare two firms
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company janestreet
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company optiver
|
||||
```
|
||||
|
||||
### Trend queries (Grafana or psql against the snapshot tables)
|
||||
|
||||
The dated series lives in `comp_snapshots` / `roles_snapshots`. Examples (run in
|
||||
Grafana's "Job Hunter" datasource, or `psql` as the `job_hunter` role):
|
||||
|
||||
```sql
|
||||
-- Comp trend: median total comp per company over time (London)
|
||||
SELECT s.snapshot_date, c.display_name,
|
||||
percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50_gbp
|
||||
FROM job_hunter.comp_snapshots s
|
||||
JOIN job_hunter.companies c ON c.id = s.company_id
|
||||
WHERE s.location_bucket = 'london'
|
||||
GROUP BY s.snapshot_date, c.display_name
|
||||
ORDER BY s.snapshot_date, p50_gbp DESC;
|
||||
|
||||
-- Hiring-volume trend: open London roles per company per snapshot
|
||||
SELECT s.snapshot_date, c.display_name, COUNT(*) AS open_roles
|
||||
FROM job_hunter.roles_snapshots s
|
||||
JOIN job_hunter.companies c ON c.id = s.company_id
|
||||
WHERE s.primary_location = 'london'
|
||||
GROUP BY s.snapshot_date, c.display_name
|
||||
ORDER BY s.snapshot_date, open_roles DESC;
|
||||
|
||||
-- Two-snapshot diff: p50 change for one company between two dates
|
||||
SELECT c.display_name, s.snapshot_date,
|
||||
percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50
|
||||
FROM job_hunter.comp_snapshots s
|
||||
JOIN job_hunter.companies c ON c.id = s.company_id
|
||||
WHERE c.slug = 'janestreet' AND s.snapshot_date IN ('2026-06-02', '2026-08-30')
|
||||
GROUP BY c.display_name, s.snapshot_date;
|
||||
```
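
For sanity-checking exported rows outside Postgres, the p50 expression used in the queries above can be approximated in plain Python. This is a sketch under assumptions: the function name and the `(total_gbp, base_gbp)` tuple shape are illustrative, with `None` standing in for SQL `NULL`. `percentile_cont(0.5)` interpolates linearly between the two middle values, which matches `statistics.median`.

```python
from statistics import median


def p50_gbp(rows):
    """Median of COALESCE(total_gbp, base_gbp), ignoring rows where both are NULL.

    rows is an iterable of (total_gbp, base_gbp) tuples; None means SQL NULL.
    """
    # COALESCE(total_gbp, base_gbp): prefer total, fall back to base.
    values = [t if t is not None else b for t, b in rows]
    # Aggregates skip NULL inputs, so drop rows where both were NULL.
    values = [v for v in values if v is not None]
    return median(values) if values else None
```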
|
||||
|
||||
### "Your comp vs the market" dashboard panel + your baselines
|
||||
|
||||
The Job Hunter Grafana dashboard (`grafana.viktorbarzin.me` → Job Hunter) has a
|
||||
bar chart **"Your comp vs the market — London p50 total comp"** ranking every
|
||||
company's London median TC with your comp shown in line. Your figures are
|
||||
deliberately **not hardcoded in the committed dashboard JSON** — they live in
|
||||
the DB as labeled comp_points with `source='self'` (the panel tags any
|
||||
`source='self'` row as "You" and renders one bar each). There are **two**, by
|
||||
design:
|
||||
|
||||
- `self-realized` — **"Me - realized gross" ≈ £409k**: your actual P60 gross
|
||||
for the current tax year. **Source = `SUM(payslip_ingest.payslip.taxable_pay)`**
|
||||
for the tax year (this equals the P60 "pay for tax"; do NOT use
|
||||
`salary+bonus+rsu_vest`, where `rsu_vest` is net/partial and understates RSU
|
||||
income by ~half). Inflated by concurrent stacked RSU vests + META price.
|
||||
- `self-current` — **"Me - package (grant TC)" ≈ £267k**: base + bonus +
|
||||
current-year RSU refresher *grant face* (£117,927). This is the basis
|
||||
**levels.fyi uses for the company bars**, so it's the apples-to-apples figure
|
||||
for comparing a job *offer*.
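
The ≈ £267k package figure is the sum of its three components. A small worked check, using the component values from this runbook's grant-value seed (variable names are illustrative):

```python
from decimal import Decimal

base = Decimal("123682")
bonus = Decimal("25734")
rsu_grant_face = Decimal("117927")  # current-year refresher grant face value

package_tc = base + bonus + rsu_grant_face
print(package_tc)  # 267343

ALERT_THRESHOLD = Decimal("500000")  # the weekly alert bar
print(package_tc < ALERT_THRESHOLD)  # True
```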
|
||||
|
||||
Both sit below the £500k alert bar (never ping Slack). Re-seed when comp changes
|
||||
(realized: re-pull `taxable_pay`; grant-value: from the YE letter). The
|
||||
grant-value seed (run the realized one the same way with `company_slug='self-realized'`,
|
||||
`company_display_name='Me - realized gross'`, `total_value=<taxable_pay sum>`):
|
||||
|
||||
```bash
|
||||
kubectl -n job-hunter exec deploy/job-hunter -- python -c "
|
||||
import asyncio; from decimal import Decimal; from datetime import date
|
||||
from job_hunter.db import create_engine_from_env, make_session_factory
|
||||
from job_hunter.sources.comp.base import CompPoint
|
||||
from job_hunter.storage_comp import upsert_comp_point
|
||||
async def m():
|
||||
e=create_engine_from_env(); sf=make_session_factory(e)
|
||||
async with sf() as s:
|
||||
# total_value is what the comparison/bar uses — it MUST be full TC
|
||||
# (base + bonus + RSU). Store the components too for transparency.
|
||||
await upsert_comp_point(s, CompPoint(source='self', external_id='self-current',
|
||||
company_slug='self-current', company_display_name='Me (Meta IC5)',
|
||||
level_slug='senior', location_bucket='london',
|
||||
base_value=Decimal('123682'), bonus_value=Decimal('25734'),
|
||||
rsu_grant_value=Decimal('117927'), rsu_vesting_years=1,
|
||||
total_value=Decimal('267343'), currency='GBP', effective_date=date.today()))
|
||||
await s.commit()
|
||||
await e.dispose()
|
||||
asyncio.run(m())"
|
||||
```
|
||||
|
||||
### Interpreting the numbers — caveats
|
||||
|
||||
- **Sample size**: `analyze` flags companies with `n < 3` as `low_confidence`. A single self-reported datapoint is anecdote, not a band — chase the p50 only where n is healthy.
|
||||
- **levels.fyi bias**: comp_points are self-reported medians; they skew toward people who report (often higher earners) and lag the market by a quarter or two.
|
||||
- **HFT/quant**: base comp is the disclosed figure; bonus (often the larger half) is variable and usually absent from postings. Treat HFT base as a floor, not total.
|
||||
- **Currency**: all figures are GBP-normalised via ECB rates looked up by `posted_at` (7-day fallback). An FX gap shows as NULL comp, not a wrong number.
|
||||
- **Movers need history**: a delta is only as good as the two snapshot dates behind it; early deltas (< full `trend_weeks` of data) compare against the earliest available snapshot and are noted as such.
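
The currency caveat can be illustrated with a small lookup sketch. It is not the project's code: the function names and the `rates` mapping are hypothetical, and walking *backwards* from `posted_at` is an assumption about how the 7-day fallback works.

```python
from datetime import date, timedelta
from decimal import Decimal


def gbp_rate(rates, currency, posted_at, max_fallback_days=7):
    """Find a GBP rate for posted_at, walking back up to max_fallback_days.

    rates maps (currency, date) -> Decimal GBP per unit. Returns None on a gap.
    """
    for offset in range(max_fallback_days + 1):
        rate = rates.get((currency, posted_at - timedelta(days=offset)))
        if rate is not None:
            return rate
    return None


def to_gbp(amount, currency, posted_at, rates):
    """Convert to GBP, or return None (NULL) rather than guess when no rate exists."""
    if currency == "GBP":
        return amount
    rate = gbp_rate(rates, currency, posted_at)
    return None if rate is None else amount * rate
```

Returning `None` instead of a stale or default rate is what makes a gap visible as NULL comp rather than a silently wrong number.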
|
||||
|
||||
## Related
|
||||
|
||||
- Skill: `~/.claude/skills/job-hunter/SKILL.md` (agent invocation patterns)
|
||||
- Beads epic: `code-snp`
|
||||
- Storage / backup context: this DB is on the shared CNPG cluster (`dbaas`), backed up by the per-db `postgresql-backup-per-db` CronJob.
|
||||
207
docs/runbooks/k8s-node-auto-upgrades.md
Normal file
|
|
@ -0,0 +1,207 @@
|
|||
# K8s Node Auto-Upgrades
|
||||
|
||||
## Overview
|
||||
|
||||
OS-level package upgrades on the 5 K8s VMs (master + 4 workers) are driven by `unattended-upgrades` and rebooted by `kured`, with multiple safety gates layered on top to prevent the failure mode that caused the March 2026 26h cluster outage.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
apt-daily.timer (random within window)
|
||||
│ apt-get update
|
||||
│
|
||||
▼
|
||||
apt-daily-upgrade.timer (random within window)
|
||||
│ unattended-upgrades runs
|
||||
│ - Allowed-Origins: -security, -updates, ESM
|
||||
│ - Package-Blacklist: containerd*, runc, calico-*, cni-plugins-*, docker-ce
|
||||
│ - apt-mark hold on kubelet, kubeadm, kubectl, containerd*, runc
|
||||
│ - Automatic-Reboot=false (kured handles reboots)
|
||||
│
|
||||
▼ if kernel/glibc/systemd updated
|
||||
/var/run/reboot-required appears on the host
|
||||
│
|
||||
▼ (sentinel-gate DaemonSet polls every 5min)
|
||||
kured-sentinel-gate checks:
|
||||
├── 1. Host has /var/run/reboot-required
|
||||
├── 2. ALL nodes Ready
|
||||
├── 3. ALL calico-node pods Running
|
||||
└── 4. NO node Ready-transition in last 24h (soak window)
|
||||
│
|
||||
▼ all pass
|
||||
touch /var/run/gated-reboot-required
|
||||
│
|
||||
▼ (kured polls every 1h within 02:00-06:00 London, any day of the week)
|
||||
kured checks Prometheus before draining:
|
||||
│ http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts
|
||||
│ ANY firing alert (except ignore-list) blocks the drain
|
||||
│ Ignore-list: ^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$
|
||||
│
|
||||
▼ no blockers
|
||||
kured drains the node (priority-ordered, 310s budget)
|
||||
kured runs /bin/systemctl reboot
|
||||
│
|
||||
▼ node returns
|
||||
kured uncordons + posts Slack notification (configuration.notifyUrl)
|
||||
│
|
||||
▼ 24h cool-down begins (sentinel-gate Check 4)
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
### unattended-upgrades (in-guest)
|
||||
- **Config**: `/etc/apt/apt.conf.d/52unattended-upgrades-k8s` + `/etc/apt/apt.conf.d/20auto-upgrades`
|
||||
- **Source of truth**: `infra/modules/create-template-vm/cloud_init.yaml` (lines for `is_k8s_template`)
|
||||
- **Day-2 push**: SSH-based — see "Restore / re-apply config" below
|
||||
|
||||
### kured (Helm release)
|
||||
- **Stack**: `infra/stacks/kured/main.tf`
|
||||
- **Helm chart**: `kured-5.11.0` (image `ghcr.io/kubereboot/kured:1.21.0`)
|
||||
- **Window**: 02:00-06:00 Europe/London, every day of the week (was Mon-Fri until 2026-05-16), period=1h, concurrency=1
|
||||
- **Sentinel**: `/sentinel/gated-reboot-required` (created by sentinel-gate DaemonSet)
|
||||
- **Slack hook**: Vault `secret/kured` → `slack_kured_webhook`
|
||||
|
||||
### kured-sentinel-gate (DaemonSet)
|
||||
- **Source**: `kubernetes_daemon_set_v1.kured_sentinel_gate` in `infra/stacks/kured/main.tf` (lines ~120-260)
|
||||
- **Image**: `bitnami/kubectl:latest`
|
||||
- **Loop period**: every 300s
|
||||
- **Gate logic**: 4 checks — see Architecture diagram
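
The four gate checks can be read as a single predicate. The sketch below only illustrates the decision logic; the real gate is the shell script embedded in `infra/stacks/kured/main.tf`, and the function name and argument shapes here are hypothetical.

```python
def gate_open(reboot_required, nodes_ready, calico_running,
              last_ready_transitions, now, soak_seconds=86400):
    """Return True when all four sentinel-gate checks pass.

    nodes_ready / calico_running: one boolean per node / calico-node pod.
    last_ready_transitions: epoch seconds of each node's latest Ready transition.
    """
    if not reboot_required:        # Check 1: host has /var/run/reboot-required
        return False
    if not all(nodes_ready):       # Check 2: all nodes Ready
        return False
    if not all(calico_running):    # Check 3: all calico-node pods Running
        return False
    # Check 4: no node changed Ready state inside the soak window.
    return all(now - t >= soak_seconds for t in last_ready_transitions)
```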
|
||||
|
||||
### Upgrade Gates Prometheus alerts
|
||||
- **Source**: `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` group `Upgrade Gates`
|
||||
- **10 alerts**: KubeAPIServerDown, KubeStateMetricsDown, PrometheusRuleEvaluationFailing, PVCStuckPending, RecentNodeReboot, MysqlStandaloneDown, ClusterPodReadyRatioDropped, NodeMemoryPressure, NodeDiskPressure, KubeQuotaAlmostFull
|
||||
- **Effect**: kured `--prometheus-url` polls Prometheus before each drain — any non-ignored firing alert halts the rollout
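
To reason about which alerts would block a drain, the ignore-list from the architecture diagram can be applied to the output of Prometheus' `/api/v1/alerts`. This is an approximation of the effect, not kured's own code; the function name is hypothetical and the input is assumed to be the `data.alerts` list from that endpoint.

```python
import re

# Ignore-list regex as documented in the architecture diagram.
IGNORE = re.compile(r"^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$")


def blocking_alerts(alerts):
    """Return sorted alert names that are firing and not on the ignore-list."""
    return sorted({
        a["labels"]["alertname"]
        for a in alerts
        if a["state"] == "firing" and not IGNORE.match(a["labels"]["alertname"])
    })
```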
|
||||
|
||||
## Common Operations
|
||||
|
||||
### Verify the system is healthy
|
||||
```bash
|
||||
# kured pods + sentinel-gate Running on all 5 nodes
|
||||
kubectl -n kured get pods
|
||||
|
||||
# kured binary exposes the Prometheus flags (flag check only; does not test connectivity)
|
||||
kubectl -n kured exec ds/kured -- /usr/bin/kured --help | grep prometheus
|
||||
|
||||
# Upgrade Gates rules loaded + state
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
|
||||
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
|
||||
|
||||
# Per-node unattended-upgrades status
|
||||
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
|
||||
echo "=== $n ==="
|
||||
ssh $n "systemctl is-active unattended-upgrades; apt list --upgradable 2>/dev/null | wc -l"
|
||||
done
|
||||
```
|
||||
|
||||
### Halt rollout in an emergency
|
||||
```bash
|
||||
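# Option 1: stop kured everywhere (most decisive). A DaemonSet has no replica
# count, so `kubectl scale` does not apply; pin it to a label no node carries.
kubectl -n kured patch ds kured --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kured-paused":"true"}}}}}'
# When ready:
# kubectl -n kured patch ds kured --type json \
#   -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector/kured-paused"}]'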
|
||||
|
||||
# Option 2: silence the gate via Alertmanager (allows kured to retry once silence expires)
|
||||
# Use Alertmanager UI at https://prometheus.viktorbarzin.me/alertmanager/
|
||||
```
|
||||
|
||||
### Force halt by adding a custom blocker alert
|
||||
- Add a PrometheusRule expression that's always-1 (e.g. `vector(1)`) to the `Upgrade Gates` group temporarily.
|
||||
- Apply, wait for sync (~120s), kured will block on the next poll.
|
||||
- Remove when ready.
|
||||
|
||||
### Pause apt upgrades on a single node
|
||||
```bash
|
||||
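# The periodic run is started by apt-daily-upgrade.timer; unattended-upgrades.service
# is only the shutdown-time helper, so stop the timer as well.
ssh <node> sudo systemctl disable --now apt-daily-upgrade.timer unattended-upgrades
# Re-enable when ready:
ssh <node> sudo systemctl enable --now apt-daily-upgrade.timer unattended-upgrades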
|
||||
```
|
||||
|
||||
### Restore / re-apply unattended-upgrades config to existing nodes
|
||||
Cloud-init only runs on first boot. To bring existing nodes into compliance with the IaC:
|
||||
|
||||
```bash
|
||||
# Per node — installs uu, drops apt config, holds k8s/runtime packages, enables service
|
||||
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
|
||||
ssh $n sudo bash -s <<'EOF'
|
||||
set -e
|
||||
systemctl unmask unattended-upgrades 2>/dev/null || true
|
||||
DEBIAN_FRONTEND=noninteractive apt-get install -y unattended-upgrades update-notifier-common
|
||||
cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'CONF'
|
||||
Unattended-Upgrade::Allowed-Origins {
|
||||
"${distro_id}:${distro_codename}";
|
||||
"${distro_id}:${distro_codename}-security";
|
||||
"${distro_id}:${distro_codename}-updates";
|
||||
"${distro_id}ESMApps:${distro_codename}-apps-security";
|
||||
"${distro_id}ESM:${distro_codename}-infra-security";
|
||||
};
|
||||
Unattended-Upgrade::Package-Blacklist {
|
||||
"^containerd(\.io)?$";
|
||||
"^runc$";
|
||||
"^cri-tools$";
|
||||
"^kubernetes-cni$";
|
||||
"^calico-.*";
|
||||
"^cni-plugins-.*";
|
||||
"^docker-ce$";
|
||||
};
|
||||
Unattended-Upgrade::DevRelease "false";
|
||||
Unattended-Upgrade::Automatic-Reboot "false";
|
||||
CONF
|
||||
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'CONF'
|
||||
APT::Periodic::Update-Package-Lists "1";
|
||||
APT::Periodic::Unattended-Upgrade "1";
|
||||
CONF
|
||||
apt-mark hold kubelet kubeadm kubectl
|
||||
apt-mark hold containerd containerd.io runc 2>/dev/null || true
|
||||
systemctl enable --now unattended-upgrades
|
||||
EOF
|
||||
done
|
||||
```
|
||||
|
||||
### Roll back a bad apt upgrade
|
||||
1. Identify the package(s) that broke things from `/var/log/apt/history.log` on the affected node.
|
||||
2. Hold them: `sudo apt-mark hold <pkg>`.
|
||||
3. Downgrade: `sudo apt-get install -y --allow-downgrades <pkg>=<previous-version>` (find versions via `apt-cache madison <pkg>`).
|
||||
4. Reboot the node manually if the package needs it.
|
||||
5. Add the package to the `Unattended-Upgrade::Package-Blacklist` in `cloud_init.yaml` AND drop the holds via the SSH push above so future apt runs skip it.
|
||||
|
||||
### kured halted — investigate which alert is blocking
|
||||
```bash
|
||||
# Show kured logs — it logs "blocking alerts" when halting
|
||||
kubectl -n kured logs ds/kured --tail=100 | grep -i alert
|
||||
|
||||
# List currently firing alerts (any of these blocks kured):
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -q -O- 'http://localhost:9090/api/v1/alerts' | \
|
||||
jq -r '.data.alerts[] | select(.state == "firing") | " \(.labels.alertname) (\(.labels.severity // "info"))"' | sort -u
|
||||
```
|
||||
|
||||
The alert is either:
|
||||
- One of the 10 `Upgrade Gates` (genuine cluster-health issue — fix it),
|
||||
- A pre-existing alert (any of the ~211 in the library — investigate),
|
||||
- Or `RecentNodeReboot` — expected for 24h after each node reboot. This is the soak window.
|
||||
|
||||
### Verify the 24h soak is enforcing
|
||||
```bash
|
||||
# Sentinel-gate logs Check 4 outcome
|
||||
kubectl -n kured logs ds/kured-sentinel-gate --tail=20 | grep -E "soak|cool-down|24"
|
||||
|
||||
# kured won't drain another node until the most recent Ready-transition is >24h ago.
|
||||
# If you need to override (e.g. emergency security patch), shorten the cool-down by
|
||||
# editing infra/stacks/kured/main.tf (sentinel script: 86400 → smaller) and applying.
|
||||
```
|
||||
|
||||
## Past Incidents
|
||||
|
||||
- **2026-03-16 SEV-1**: Kured + Containerd Cascade Outage (26h). See `docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html`. Root cause: unattended-upgrades pushed a kernel update → kured rebooted nodes → containerd's overlayfs snapshotter corrupted → image pulls failed → calico broke → cascading outage. Remediations now baked into this system: 24h soak, Prometheus halt-on-alert, Package-Blacklist for runtime components, sentinel-gate health checks.
|
||||
|
||||
## File Pointers
|
||||
|
||||
| What | Where |
|------|-------|
| kured Helm + sentinel-gate | `infra/stacks/kured/main.tf` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Cloud-init for new nodes | `infra/modules/create-template-vm/cloud_init.yaml` |
| Slack webhook | Vault `secret/kured` → `slack_kured_webhook` |
| Post-mortem | `infra/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (OS section) |
|
||||
345
docs/runbooks/k8s-version-upgrade.md
Normal file
|
|
@ -0,0 +1,345 @@
|
|||
# K8s Version Upgrade Pipeline
|
||||
|
||||
## Overview
|
||||
|
||||
Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
|
||||
VMs are upgraded automatically by a weekly detection CronJob that seeds a
|
||||
chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
|
||||
drain target** — so no pod in the chain can preempt itself.
|
||||
|
||||
The chain (Sun 12:00 UTC weekly):
|
||||
|
||||
```
|
||||
detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
|
||||
```
|
||||
|
||||
This is **independent** of the OS-side `unattended-upgrades + kured`
|
||||
pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
|
||||
Schedules can overlap (kured runs daily 02:00-06:00 London; detection
|
||||
here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
|
||||
Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
|
||||
group blocks the version-upgrade preflight, so the chain self-defers
|
||||
to the next Sunday rather than rolling on top of a half-fresh node.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
|
||||
│ kubectl get nodes → running version
|
||||
│ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
|
||||
│ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
|
||||
│ push k8s_upgrade_available{kind,running,target} → Pushgateway
|
||||
│
|
||||
▼ if a target is detected
|
||||
envsubst on /template/job-template.yaml | kubectl apply -f -
|
||||
│ creates k8s-upgrade-preflight-<target_version>
|
||||
▼
|
||||
|
||||
Job 0 — preflight (pinned: k8s-node1)
|
||||
├── All nodes Ready + no Mem/Disk pressure
|
||||
├── halt-on-alert (kured-style ignore-list)
|
||||
├── 24h-quiet baseline (no Ready transitions <24h ago)
|
||||
├── kubeadm upgrade plan matches target
|
||||
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
|
||||
├── Trigger backup-etcd Job, wait, verify snapshot byte count
|
||||
├── SSH master: containerd skew fix (if master < workers)
|
||||
├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor)
|
||||
└── spawn_next → k8s-upgrade-master-<target_version>
|
||||
▼
|
||||
|
||||
Job 1 — master upgrade (pinned: k8s-node1)
|
||||
├── halt-on-alert recheck (no firing alerts)
|
||||
├── drain k8s-master (predrain_unstick deletes PDB-blocked pods)
|
||||
├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z
|
||||
├── kubectl uncordon k8s-master; wait Ready + version match
|
||||
├── verify control-plane pods Running
|
||||
├── halt-on-alert recheck (allows RecentNodeReboot)
|
||||
└── spawn_next → k8s-upgrade-worker-<v>-k8s-node4
|
||||
▼
|
||||
|
||||
Job 2 — worker k8s-node4 (pinned: k8s-node1)
|
||||
Job 3 — worker k8s-node3 (pinned: k8s-node1)
|
||||
Job 4 — worker k8s-node2 (pinned: k8s-node1)
|
||||
(identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next)
|
||||
▼
|
||||
|
||||
Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration)
|
||||
└── spawn_next → k8s-upgrade-postflight-<target_version>
|
||||
▼
|
||||
|
||||
Job 6 — postflight (no pinning)
|
||||
├── Verify all 5 nodes at target version
|
||||
├── Verify no firing Upgrade Gates alerts
|
||||
├── Compute pod-ready ratio (should be ≥ 0.9)
|
||||
├── Clear k8s-upgrade-* annotations on namespace
|
||||
├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0
|
||||
└── Slack: ✅ K8s upgrade complete
|
||||
```
|
||||
|
||||
**Pin choices summarised:**
|
||||
- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
|
||||
itself is upgraded **last**.
|
||||
- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
|
||||
toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
|
||||
- If anyone reorders the worker sequence, the pin for Job 5 needs to track
|
||||
whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
|
||||
→ the `case "${PHASE}:${TARGET_NODE:-}"` block.
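
The pinning rule reduces to one invariant: a Job never runs on the node it drains. A minimal sketch of that mapping follows; the authoritative version is the `case` block in `scripts/upgrade-step.sh`, and the phase names and function here are assumptions for illustration.

```python
# Worker upgrade order from the architecture diagram; the last entry is the
# node that hosts every other Job, so it has to be upgraded last.
WORKER_ORDER = ["k8s-node4", "k8s-node3", "k8s-node2", "k8s-node1"]


def pin_for(phase, target_node=None):
    """Return the node a chain Job is pinned to, or None when unpinned."""
    host = WORKER_ORDER[-1]
    if phase == "postflight":
        return None              # postflight drains nothing, so no pinning
    if phase == "worker" and target_node == host:
        return "k8s-master"      # needs the control-plane toleration
    return host
```

Deriving the host from the end of `WORKER_ORDER` means a reordered sequence keeps the invariant without a separate edit, which is the coupling the last bullet above warns about.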
|
||||
|
||||
## Components
|
||||
|
||||
### Shared resources (one-time, Terraform-managed)
|
||||
|
||||
| Resource | Purpose |
|---|---|
| **ConfigMap `k8s-upgrade-scripts`** | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body — same script the old manual loop used) in every Job pod. |
| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst \| kubectl apply`. |
| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
|
||||
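
The "latest patch within the current minor" step of detection can be sketched as a small parser over `apt-cache madison kubeadm` output. This is illustrative only: the CronJob's real logic lives in its shell command, the function name is hypothetical, and the example assumes the usual `package | version | source` madison layout.

```python
import re


def latest_patch(madison_output, running):
    """Return the highest X.Y.Z newer than `running` within its minor, or None."""
    major, minor, patch = (int(p) for p in running.lstrip("v").split("."))
    best = None
    # The version is the field right after the first pipe, e.g. "| 1.34.7-1.1".
    for m in re.finditer(r"\|\s*(\d+)\.(\d+)\.(\d+)", madison_output):
        ver = tuple(int(g) for g in m.groups())
        if ver[:2] == (major, minor) and ver[2] > patch:
            best = max(best or ver, ver)
    return ".".join(map(str, best)) if best else None
```

Comparing integer tuples instead of strings avoids ranking `1.34.9` above `1.34.10`.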
|
||||
### Pushgateway metrics
|
||||
|
||||
Pushed by upgrade-step.sh during phase execution; observed by the
|
||||
`Upgrade Gates` alert group in `stacks/monitoring/.../prometheus_chart_values.tpl`:
|
||||
|
||||
| Metric | Pushed by | Cleared by |
|---|---|---|
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
|
||||
|
||||
### Upgrade Gates alerts (`Upgrade Gates` group in prometheus_chart_values.tpl)
|
||||
|
||||
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
|
||||
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
|
||||
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
|
||||
- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
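
The three alert conditions are simple predicates over the pushed metrics. The sketch below restates them in Python for clarity; it does not model the `for:` durations (30m, 10m, 5m), and the function names are illustrative rather than taken from the rules file.

```python
def version_skew(git_versions):
    """K8sVersionSkew: more than one distinct kubelet/apiserver gitVersion."""
    return len(set(git_versions)) > 1


def snapshot_missing(in_flight, snapshot_taken):
    """EtcdPreUpgradeSnapshotMissing: upgrade running but no etcd snapshot."""
    return in_flight == 1 and snapshot_taken == 0


def upgrade_stalled(in_flight, started_ts, now, limit_seconds=5400):
    """K8sUpgradeStalled: upgrade running for longer than 90 minutes."""
    return in_flight == 1 and (now - started_ts) > limit_seconds
```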
|
||||
|
||||
### Vault secrets
|
||||
|
||||
- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
|
||||
- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys`
|
||||
- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL
|
||||
|
||||
Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE — the chain does not POST to `claude-agent-service`.
|
||||
|
||||
## Common Operations
|
||||
|
||||
### Post-upgrade: restore apiserver OIDC (REQUIRED after any control-plane bump)
|
||||
|
||||
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||
and drops the `--authentication-config` flag**, silently disabling apiserver
|
||||
OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
|
||||
401). This is not auto-detected (the `rbac` stack's `null_resource` trigger is a
|
||||
content hash that doesn't change). After any control-plane upgrade, re-apply:
|
||||
|
||||
```bash
|
||||
cd stacks/rbac
|
||||
TF_VAR_ssh_private_key="$(cat ~/.ssh/id_ed25519)" \
|
||||
VAULT_ADDR=https://vault.viktorbarzin.me ../../scripts/tg apply \
|
||||
--non-interactive -target=module.rbac.null_resource.apiserver_oidc_config
|
||||
```
|
||||
|
||||
(`ssh_private_key` must be a key authorized for `wizard@<master>`; it is not yet
|
||||
wired from Vault.) The provisioner re-writes `/etc/kubernetes/pki/auth-config.yaml`
|
||||
(both `kubernetes` + `k8s-dashboard` issuers), re-adds the flag, and
|
||||
health-gates `/livez` with auto-rollback. Verify: `curl -sk
|
||||
https://localhost:6443/livez` on the master = `ok`, and the apiserver manifest
|
||||
contains `--authentication-config`. See `docs/plans/2026-06-04-k8s-dashboard-sso-design.md`.
|
||||
|
||||
### Verify the pipeline is healthy
|
||||
```bash
|
||||
# CronJob present + not suspended
|
||||
kubectl -n k8s-upgrade get cronjob k8s-version-check
|
||||
|
||||
# Latest detection run output
|
||||
kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade
|
||||
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200
|
||||
|
||||
# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished)
|
||||
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
|
||||
|
||||
# Pushgateway — running detection metric
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \
|
||||
grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)'
|
||||
|
||||
# Upgrade Gates rules loaded
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
|
||||
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
|
||||
```
|
||||
|
||||
### Manually trigger detection (no upgrade)
|
||||
Use `detection_dry_run=true` to short-circuit before spawning Job 0:
|
||||
|
||||
```bash
|
||||
# Toggle var in TF, apply, and trigger
|
||||
# (in stacks/k8s-version-upgrade/main.tf)
|
||||
# variable "detection_dry_run" { default = true }
|
||||
# scripts/tg apply
|
||||
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
|
||||
kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
|
||||
# When done, flip back to false.
|
||||
```
|
||||
|
||||
### Manually trigger the chain (skip detection)
|
||||
Useful for testing or to force a specific target. Render Job 0 directly:
|
||||
|
||||
```bash
|
||||
# envsubst only substitutes exported variables
export TARGET=1.34.7
export KIND=patch
export IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \
  -o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}')
|
||||
|
||||
cat <<EOF | envsubst | kubectl apply -f -
|
||||
$(kubectl -n k8s-upgrade get cm k8s-upgrade-job-template -o jsonpath='{.data.job-template\.yaml}')
|
||||
EOF
|
||||
# Note: export JOB_NAME, PHASE_NEXT, etc. first — see the CronJob's command for
|
||||
# the full env block. Easier: just trigger detection with the right inputs.
|
||||
```
|
||||
|
||||
### Kill a stuck Job (chain halted mid-flight)
|
||||
The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled`
|
||||
fires after 90 min. Recovery:
|
||||
|
||||
```bash
# 1. Identify the failed Job
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
kubectl -n k8s-upgrade describe job/<failed-job-name> | tail -50
kubectl -n k8s-upgrade logs job/<failed-job-name>

# 2. Diagnose. Common causes:
#    - drain stuck on PDB-violating pod (predrain_unstick should handle this;
#      but a brand-new PDB pattern could escape it; manually delete the pod)
#    - SSH from Job pod failing (node restarted? known_hosts mismatch?)
#    - kubeadm upgrade failed on a node (check journalctl + apt history on that node)

# 3. Fix the root cause first.

# 4. Delete the failed Job + re-spawn it. Naming is deterministic so
#    `kubectl apply` of the same name reconciles to a single Job.
kubectl -n k8s-upgrade delete job/<failed-job-name>

# 5. Manually render + apply the same Job. Pull the template + spec from the
#    next-Job-creation block in upgrade-step.sh; easiest is to copy from a
#    sibling Job's YAML:
kubectl -n k8s-upgrade get job/<sibling-job-name> -o yaml \
  | yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \
  | yq '.metadata.name = "<failed-job-name>"' \
  | yq '(.spec.template.spec.containers[0].env[] | select(.name == "PHASE")).value = "<right-phase>"' \
  | kubectl apply -f -

# The chain will continue from there. The next-Job-creation step in upgrade-step.sh
# is idempotent (deterministic name) so re-running won't duplicate downstream.
```
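The stall condition is simple enough to check offline. This is a minimal sketch, assuming the alert compares the `k8s_upgrade_in_flight` and `k8s_upgrade_started_timestamp` gauges against a 90-minute threshold. The real rule lives in the "Upgrade Gates" group and may differ, and `is_stalled` is a hypothetical helper, not part of the stack:

```bash
# Hypothetical helper (not in the repo): mirrors the described stall condition.
# Usage: is_stalled IN_FLIGHT STARTED_EPOCH NOW_EPOCH  -> exit 0 if stalled
is_stalled() {
  in_flight=$1
  started=$2
  now=$3
  threshold=$((90 * 60))   # 90 minutes, per the alert description above
  [ "$in_flight" -eq 1 ] && [ $((now - started)) -gt "$threshold" ]
}

# Example: chain marked in flight, started two hours before "now"
if is_stalled 1 1000 8200; then echo stalled; else echo ok; fi   # prints "stalled"
```

Feeding it the gauge values read from Pushgateway shows whether a halted chain has already crossed the threshold.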
|
||||
|
||||
### Skip a phase (advanced; use sparingly)
|
||||
If you've already done the work for a phase manually and want the chain to
|
||||
jump past it, manually create the NEXT phase's Job with the deterministic
|
||||
name. The previous phase's spawn-next will see the Job already exists and
|
||||
short-circuit. Example: master already on target; jump straight to worker:
|
||||
|
||||
```bash
TARGET=1.34.7
TGT_LBL=${TARGET//./-}
# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1)
```
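The deterministic name in the comment above can be composed mechanically. This is a small sketch, assuming every worker-phase Job follows the one pattern shown (`k8s-upgrade-worker-<target-with-dashes>-<node>`); `worker_job_name` is a hypothetical helper, and `upgrade-step.sh` remains the source of truth:

```bash
# Hypothetical helper: build the worker-phase Job name for a target version
# and node. Dots in the version become dashes, as with ${TARGET//./-}.
worker_job_name() {
  target=$1
  node=$2
  label=$(printf '%s' "$target" | tr '.' '-')
  printf 'k8s-upgrade-worker-%s-%s\n' "$label" "$node"
}

worker_job_name 1.34.7 k8s-node4   # prints k8s-upgrade-worker-1-34-7-k8s-node4
```

Using `tr` instead of the bash-only `${TARGET//./-}` keeps the helper usable from plain `sh`.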
|
||||
|
||||
### Halt the pipeline in an emergency
|
||||
|
||||
```bash
# Option 1: suspend the detection CronJob (won't stop an in-flight chain)
kubectl -n k8s-upgrade patch cronjob k8s-version-check \
  -p '{"spec":{"suspend":true}}' --type=merge
# Re-enable: -p '{"spec":{"suspend":false}}'

# Option 2: delete all in-flight chain Jobs
kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain
# This leaves the in-flight annotation + Pushgateway gauge intact;
# K8sUpgradeStalled will fire to surface the halt.

# Option 3: force a blocker alert (same regex kured uses)
#   (see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert")
```
|
||||
|
||||
### Clear orphaned in-flight state
|
||||
After deciding NOT to retry a halted chain:
|
||||
|
||||
```bash
kubectl annotate ns k8s-upgrade \
  viktorbarzin.me/k8s-upgrade-in-flight- \
  viktorbarzin.me/k8s-upgrade-target- \
  viktorbarzin.me/k8s-upgrade-snapshot-path-

# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear:
kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 &
sleep 2  # give the port-forward a moment to start listening
printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \
  | curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade
kill %1
```
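The long `printf` above is easier to audit when the payload is generated per metric. This is a sketch only, assuming the three gauges named above are the ones to clear; `reset_gauges_payload` is a hypothetical helper that emits the same text-format lines:

```bash
# Hypothetical helper: emit a Pushgateway text-format payload that sets each
# named gauge to 0 (one "# TYPE" line plus one sample line per metric).
reset_gauges_payload() {
  for metric in "$@"; do
    printf '# TYPE %s gauge\n%s 0\n' "$metric" "$metric"
  done
}

reset_gauges_payload \
  k8s_upgrade_in_flight \
  k8s_upgrade_snapshot_taken \
  k8s_upgrade_started_timestamp
```

Piping the output into the same `curl --data-binary @-` call pushes an identical reset.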
|
||||
|
||||
### Rollback paths
|
||||
`kubeadm` does **not** support in-place downgrade. If a run fails:
|
||||
|
||||
#### Master broke during/after kubeadm upgrade
|
||||
1. Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
|
||||
2. Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
|
||||
3. Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
|
||||
   ```bash
   ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
   # Pre-upgrade versions are in the most recent "Commandline: apt-get install"
   # Then, on the node itself:
   sudo apt-mark unhold kubeadm kubelet kubectl
   sudo apt-get install --allow-downgrades -y \
     kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
   sudo apt-mark hold kubeadm kubelet kubectl
   sudo systemctl daemon-reload && sudo systemctl restart kubelet
   ```
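Scanning 40 lines of apt history by eye is error-prone, so the relevant entry can be filtered out instead. This is a minimal sketch; `last_apt_install` is a hypothetical helper that only assumes the log uses the `Commandline: apt-get install` prefix quoted above:

```bash
# Hypothetical helper: read an apt history log on stdin and print the most
# recent "Commandline: apt-get install" entry.
last_apt_install() {
  grep '^Commandline: apt-get install' | tail -n 1
}

# Usage (runs against the node, so it is left commented out here):
#   ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log' | last_apt_install
```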
|
||||
|
||||
#### Worker broke
|
||||
1. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
|
||||
2. Downgrade apt packages on that node only (see above)
|
||||
3. `kubectl uncordon <node>`
|
||||
4. The cluster continues running on the master + remaining workers throughout
|
||||
|
||||
### One-shot SSH key rotation
|
||||
1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
|
||||
2. Update Vault:
|
||||
   ```bash
   vault kv patch secret/k8s-upgrade \
     ssh_key=@/tmp/k8s-upgrade \
     ssh_key_pub=@/tmp/k8s-upgrade.pub
   ```
|
||||
3. Push the new pubkey to every node:
|
||||
   ```bash
   for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
     ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
     # Double quotes so $(cat ...) expands locally: the key file only exists on this machine
     ssh wizard@$n "echo '$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key' >> ~/.ssh/authorized_keys"
   done
   ```
|
||||
4. ESO refreshes within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`
|
||||
|
||||
## Past Incidents
|
||||
|
||||
### 2026-05-11 — Self-preemption (agent → Job-chain rewrite)
|
||||
- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
|
||||
- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself.
|
||||
- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4.
|
||||
- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery.
|
||||
- **Mitigation**: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min).
|
||||
|
||||
## File Pointers
|
||||
|
||||
| What | Where |
|------|-------|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (K8s Version Upgrades section) |
| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |
| Deprecated agent prompt (reference) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` |
|
||||
249
docs/runbooks/kms-public-exposure.md
Normal file
|
|
@ -0,0 +1,249 @@
|
|||
# Runbook: KMS public exposure (vlmcs.viktorbarzin.me:1688)
|
||||
|
||||
`vlmcs.viktorbarzin.me:1688/TCP` is intentionally open to the internet so any
|
||||
visitor can activate Volume License Microsoft products. The webpage at
|
||||
`https://kms.viktorbarzin.me/` documents how to use it.
|
||||
|
||||
**Two hostnames, on purpose** (do not merge them):
|
||||
|
||||
- `kms.viktorbarzin.me` — the **website** (Traefik). Serves the docs and the
|
||||
`/scripts/*.ps1` activators. Internally resolves to the Traefik LB
|
||||
(`10.0.20.203`), which has **no** `:1688` listener.
|
||||
- `vlmcs.viktorbarzin.me` — the **KMS endpoint** (vlmcsd). A-only (no AAAA —
|
||||
the IPv6 tunnel doesn't forward 1688). Resolves to `10.0.20.202` on the LAN
|
||||
(Technitium split-horizon, set via API — `cloudflare_record.vlmcs` in
|
||||
`stacks/kms` owns the public A) and to `176.12.22.76` on the internet
|
||||
(Cloudflare → pfSense WAN NAT :1688). Every `slmgr` / `ospp` command on the
|
||||
page points here.
|
||||
|
||||
Pointing a client at `kms.viktorbarzin.me:1688` fails from the LAN with "KMS
|
||||
server cannot be reached" — that name is the website, not the KMS server.
|
||||
|
||||
This runbook covers operations on the public exposure: where to find logs,
|
||||
how to tune the rate limit, how to revoke if abused.
|
||||
|
||||
## Architecture
|
||||
|
||||
- **K8s service**: `windows-kms` in namespace `kms`, MetalLB **dedicated**
|
||||
LB IP `10.0.20.202:1688`. ETP=Local, so vlmcsd sees real WAN client IPs
|
||||
in its log (pfSense WAN forwards do DNAT-only, no SNAT; ETP=Local skips
|
||||
the kube-proxy SNAT too). Same pattern mailserver used pre-2026-04-19.
|
||||
Sharing `10.0.20.200` isn't an option — all 10 services there are
|
||||
ETP=Cluster and MetalLB requires a single ETP per shared IP.
|
||||
- **Native DNS auto-discovery for LAN clients**: any Windows client with
|
||||
DNS suffix `viktorbarzin.lan` activates with zero config — Windows
|
||||
queries `_vlmcs._tcp.viktorbarzin.lan` SRV by default, the SRV target
|
||||
resolves to `vlmcs.viktorbarzin.lan` → `10.0.20.202`, and `slmgr /ato`
|
||||
succeeds. Records:
|
||||
- `_vlmcs._tcp.viktorbarzin.lan` SRV 0 0 1688 vlmcs.viktorbarzin.lan
|
||||
- `vlmcs.viktorbarzin.lan` A `10.0.20.202`
|
||||
- `kms.viktorbarzin.lan` A `10.0.20.200` (Traefik — for the user-facing
|
||||
website at `https://kms.viktorbarzin.lan/`; **not** the KMS server)
|
||||
Manual override (e.g., for clients without the suffix or for clients
|
||||
on the public internet): `slmgr /skms vlmcs.viktorbarzin.me:1688` (works
|
||||
LAN + WAN) or `slmgr /skms 10.0.20.202:1688` (LAN, direct). Do **not** use
|
||||
`kms.viktorbarzin.me:1688` — that name is the website (Traefik), not the
|
||||
KMS server. To revert a manually-overridden client back to auto-discovery:
|
||||
`slmgr /ckms`.
|
||||
- **Pod fluidity**: deployment has `replicas=1` (notifier dedup state is
|
||||
per-pod) with no node affinity. TCP readiness/liveness probes on 1688
|
||||
gate Pod Ready on the listener actually being up, so MetalLB only
|
||||
advertises `10.0.20.202` from a node where vlmcsd is serving.
|
||||
- **pfSense WAN forward**: `WAN TCP/1688 → k8s_kms_lb:1688`
|
||||
(alias = `10.0.20.202`, dedicated to KMS). Description: `KMS public —
|
||||
kms.viktorbarzin.me`. Other forwards using `k8s_shared_lb` (WireGuard,
|
||||
HTTPS, shadowsocks, smtps, etc.) are unaffected.
|
||||
- **Filter rule** on the WAN interface, TCP/1688 destination
|
||||
`<k8s_kms_lb>`, with state-table per-source caps:
|
||||
- `max-src-conn 50` — concurrent connections per source IP
|
||||
- `max-src-conn-rate 10/60` — 10 new connections per 60 seconds per
|
||||
source
|
||||
- `overload <virusprot>` flush — sources that exceed either cap get added
|
||||
to pfSense's stock `virusprot` pf table and have their existing states
|
||||
flushed. (`virusprot` is the only table pfSense's filter generator
|
||||
targets for `overload`; see `/etc/inc/filter.inc`. Don't try to point
|
||||
it at a custom table — the schema doesn't expose that knob.)
|
||||
- **Probe filter in slack-notifier**: a bare TCP open/close (no
|
||||
Application/Activation block from vlmcsd) is treated as a probe — Uptime
|
||||
Kuma's port-type monitor on `windows-kms.kms.svc:1688` and the kubelet
|
||||
readiness/liveness probes both hit this path. Probes increment
|
||||
`kms_connection_probes_total{source}` (`source` ∈ `internal_pod`,
|
||||
`cluster_node`, `external`) and log to stdout, but never post to Slack.
|
||||
Real activations still post.
|
||||
- **Website `/scripts` + `/keys.json` carve-out**: the website is Anubis-fronted
|
||||
(PoW challenge). `/scripts/*` and `/keys.json` are carved out to the bare
|
||||
nginx backend (`module.ingress_scripts` in `stacks/kms`, `ingress_path`)
|
||||
because PowerShell `iwr | iex` / `ConvertFrom-Json` are non-JS clients that
|
||||
can't solve the PoW — without the carve-out they'd download the Anubis
|
||||
challenge HTML and choke. Everything else stays behind Anubis. Verify:
|
||||
`curl -A curl https://kms.viktorbarzin.me/scripts/setup-kms.ps1` and
|
||||
`.../keys.json` both return real content (not "Making sure you're not a bot!").
|
||||
- **Auto-key selection**: the scripts no longer require the user to pick a GVLK.
|
||||
`/keys.json` is `data/products.yaml` rendered to JSON (Hugo KEYS output format).
|
||||
When no Volume License key is installed, `setup-kms.ps1` / `kms-bootstrap.ps1`
|
||||
detect the edition — Windows via registry `EditionID` (+ `CurrentBuildNumber`
|
||||
for LTSC/Server, which share an EditionID across releases), Office via the
|
||||
Click-to-Run `ProductReleaseIds` — fetch `/keys.json`, and `slmgr /ipk` /
|
||||
`ospp /inpkey` the matching key before activating. Only fires when not already
|
||||
licensed (never clobbers a working retail key). Azure-Edition server SKUs are
|
||||
intentionally unmapped (they collide with Datacenter and KMS may fail there).
|
||||
- **Edition switch (kms-bootstrap.ps1, consent-gated)**: when the installed
|
||||
product *can't* KMS-activate (Windows Home/retail; no VL Office), the bootstrap
|
||||
shows the consequences and asks before changing anything (default No). Windows
|
||||
→ `changepk.exe /ProductKey <target GVLK>` (default Pro; `$env:KMS_EDITION`
|
||||
overrides) — in-place edition UPGRADE, **needs a reboot then re-run**, one-way
|
||||
(no in-place downgrade). Office → slim ODT `setup.exe /configure` to a VL
|
||||
product (default ProPlus2024Volume; `$env:KMS_OFFICE_PRODUCT` overrides) — ~3 GB
|
||||
download, closes Office. If an INCOMPATIBLE Click-to-Run Office is installed
|
||||
(retail/M365 — `ProductReleaseIds` not ending in `Volume`), it's named in the
|
||||
prompt and **uninstalled first** via ODT `<Remove>` of just those products (VL
|
||||
products of other families are kept), then the VL product installs. The ODT run
|
||||
is one shared `Invoke-Odt` for both `<Add>` and `<Remove>`. **Removing the bundled
|
||||
consumer Office leaves a pending reboot**, so a VL install in the same run — or a
|
||||
re-run before rebooting — fails with `setup.exe` exit **1603**. Two guards: a
|
||||
hard-reboot (CBS/WU) gate before the ~3 GB download, and a reboot-aware 1603
|
||||
message telling the user to reboot + re-run (idempotent — the incompatible Office
|
||||
is already gone). `Invoke-Odt` checks the setup.exe exit code and on failure
|
||||
captures the C2R log from `%TEMP%` into telemetry; `Wait-OfficeInstalled` polls
|
||||
on-disk state (ospp.vbs + ProductReleaseIds) because `setup.exe` can return before
|
||||
the C2R install finishes. Non-interactive runs only proceed with an explicit env
|
||||
override. setup-kms.ps1 stays minimal and points non-VL editions at the bootstrap.
|
||||
NOTE: real-hardware status (2026-06-01) — the incompatible-uninstall path DID run
|
||||
on a real M365/Office-Home box (`O365HomePremRetail` removed cleanly); the VL
|
||||
install then needs a reboot first (hit 1603, now guided). changepk edition-switch
|
||||
remains untested (no Home test box; the Pro test VM can't be switched reversibly).
|
||||
- **SXSMSI/1603 deep-repair + escalation (2026-06-02):** when the VL install fails
|
||||
`[Failing PreReq=SXSMSI]`/1603 with NO pending reboot (the C2R bootstrap MSI fails),
|
||||
the script offers a consent-gated deep repair (`Repair-OfficePrereq`: `msiexec
|
||||
/unregister`+`/regserver` and reset `SoftwareDistribution`+`catroot2` — the level
|
||||
past DISM/SFC; uninstalls nothing; `$env:KMS_DEEP_REPAIR=1` auto-consents). It
|
||||
persists `HKLM\SOFTWARE\kms-bootstrap\DeepRepairDone`; if 1603 recurs AFTER a deep
|
||||
repair it stops looping and shows the in-place-Windows-repair guidance
|
||||
(`Show-InPlaceRepairHint`, telemetry `sxsmsi-unrecoverable`). **Pilot on PVE VM 300
|
||||
(2026-06-02) proved SXSMSI is client-machine-specific, not the script:** the
|
||||
identical script + the exact user journey both reach `office/ok` on a healthy
|
||||
Win10 — CF1 = clean (Remove-All+reboot) → VL install; CF2 = retail
|
||||
`O365HomePremRetail` → script targeted-remove → reboot → VL install. So a
|
||||
persistent SXSMSI is the client's corrupted Windows servicing/Installer subsystem
|
||||
(below DISM/SFC), fixed only by an in-place Windows repair-install. Also learned:
|
||||
the targeted retail uninstall is itself flaky under low disk (exit -1) and the
|
||||
qemu guest-agent drops during heavy C2R installs (poll telemetry/state, not
|
||||
guest-exec, for results).
|
||||
- **Self-hosted ODT bootstrapper**: the Office reinstall path fetches the Office
|
||||
Deployment Tool from `https://kms.viktorbarzin.me/scripts/odt-setup.exe` (a
|
||||
committed copy in `kms-website/static/scripts/`), NOT from Microsoft —
|
||||
`download.microsoft.com`'s ODT URL is build-numbered and rotates every release
|
||||
(the old hardcoded one 404'd). `$env:KMS_ODT_URL` overrides. The bootstrapper
|
||||
self-updates the Office payload, so refresh the committed copy only occasionally.
|
||||
- **Client telemetry → Loki**: the scripts POST a small ANONYMOUS diagnostics
|
||||
event per run to `https://kms.viktorbarzin.me/diag` (action, outcome, error +
|
||||
exit codes, EditionID/build/locale, detected Office products, script version;
|
||||
NO hostname/user/keys). Fire-and-forget (3s, swallowed) — never affects
|
||||
activation. `$env:KMS_NO_TELEMETRY=1` opts out; `$env:KMS_DIAG_URL` overrides.
|
||||
Collector: standalone `kms-diag` Deployment (`stacks/kms`, python stdlib HTTP
|
||||
on :9102) reachable via the `/diag` ingress carve-out (bypasses Anubis like
|
||||
`/scripts`); it prints `KMSDIAG <json>` to stdout → Loki. Query in Grafana:
|
||||
`{namespace="kms",pod=~"kms-diag.*"} |= "KMSDIAG"`. Disclosed in the site FAQ.
|
||||
|
||||
## Where the logs are
|
||||
|
||||
### vlmcsd (kms namespace, k8s)
|
||||
|
||||
```bash
# Live tail
kubectl logs -n kms -l app=kms-service -c windows-kms --tail=50 -f

# All activations in the running pod
kubectl logs -n kms -l app=kms-service -c windows-kms | grep "Incoming KMS request"
```
|
||||
|
||||
Source IPs from the WAN are real client IPs (pfSense DNAT-only + ETP=Local
|
||||
preserve them through the chain). LAN clients hitting the LB IP directly
|
||||
appear as their own IP. Pod-source probes (Uptime Kuma) appear as a Calico
|
||||
pod IP in `10.10.0.0/16`. Kubelet readiness/liveness probes appear as the
|
||||
hosting node IP in `10.0.20.0/24`.
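When reading raw vlmcsd logs, the same bucketing can be applied by hand. This is a rough sketch, assuming the `source` label values `internal_pod`, `cluster_node` and `external` map onto the address ranges described above. `classify_source` is a hypothetical helper, not the notifier's actual code, and a prefix match cannot tell a node from any other host in `10.0.20.0/24`:

```bash
# Hypothetical helper: bucket a source IPv4 address the way the probe counter's
# `source` label is described. The range-to-label mapping is an assumption.
classify_source() {
  case $1 in
    10.10.*)   echo internal_pod ;;   # Calico pod network, 10.10.0.0/16
    10.0.20.*) echo cluster_node ;;   # node network, 10.0.20.0/24
    *)         echo external ;;       # everything else, including WAN clients
  esac
}

classify_source 10.10.4.17   # prints internal_pod
```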
|
||||
|
||||
### Slack notifier (kms namespace, k8s)
|
||||
|
||||
```bash
kubectl logs -n kms -l app=kms-service -c slack-notifier --tail=50 -f
```
|
||||
|
||||
Posts to `#alerts`, dedup window 1h per (source-IP, product). Activations
|
||||
also increment the Prometheus counter `kms_activations_total{product,status}`
|
||||
exposed on the same pod at `:9101/metrics` (scraped by the cluster-wide
|
||||
`kubernetes-pods` job; query via Prometheus or Grafana directly).
|
||||
|
||||
Probe-only TCP connections (open+close, no KMS RPC) are silently filtered
|
||||
out of Slack and counted in `kms_connection_probes_total{source}`. Useful
|
||||
queries:
|
||||
```promql
# Probe rate by source
rate(kms_connection_probes_total[5m])
# Probes from the public WAN (a non-zero rate here means real port-scans
# are reaching us, not just internal monitoring)
rate(kms_connection_probes_total{source="external"}[5m])
```
|
||||
|
||||
### pfSense — virusprot table and filter hits
|
||||
|
||||
```bash
# SSH to 10.0.20.1 as root
pfctl -t virusprot -T show          # who's currently in the virusprot table
pfctl -t virusprot -T expire 86400  # boot anyone added more than 24h ago
pfctl -t virusprot -T flush         # nuke the entire table

# Filter rule hit counts (find the KMS public rule, look at Evaluations / States)
pfctl -sr -v | grep -A 4 1688

# State table: current TCP/1688 connections, per source
pfctl -ss | grep ':1688 '
```
|
||||
|
||||
## Tightening or loosening the rate limit
|
||||
|
||||
The filter rule is configured via the pfSense web UI
|
||||
(`Firewall → Rules → WAN`, look for the `KMS public — kms.viktorbarzin.me`
|
||||
rule) under **Advanced Options → "Maximum new connections per source per
|
||||
seconds"** and **"Maximum state entries per source"**.
|
||||
|
||||
- **Default**: `max-src-conn 50`, `max-src-conn-rate 10/60`
|
||||
- To **tighten** (suspected abuse): drop to `max-src-conn 10`,
  `max-src-conn-rate 3/60`, then save and apply the rule. pfSense reloads pf
  and sources already in `virusprot` stay blocked, so a global state flush
  (`pfctl -k 0.0.0.0/0 -K 0.0.0.0/0`) is overkill.
|
||||
- To **loosen** (legitimate users blocked): bump to
|
||||
`max-src-conn-rate 30/60`. The `virusprot` table flush still applies on
|
||||
overload; reduce its lifetime via
|
||||
`Firewall → Advanced → State Timeouts` if entries linger.
|
||||
|
||||
The `overload` table entry survives pf reloads. Running
|
||||
`pfctl -t virusprot -T flush` after a tuning change clears the slate.
|
||||
|
||||
## Revoking the public exposure
|
||||
|
||||
If the activation surface needs to come down (abuse, legal, audit):
|
||||
|
||||
1. **pfSense web UI** → `Firewall → NAT → Port Forward` → find
|
||||
`WAN TCP/1688 → k8s_kms_lb` → **delete** (or disable). Apply.
|
||||
2. **pfSense web UI** → `Firewall → Rules → WAN` → find
|
||||
`KMS public — kms.viktorbarzin.me` → **delete** (or disable). Apply.
|
||||
3. Verify externally: from a phone tether, `nc -zw3 vlmcs.viktorbarzin.me 1688`
|
||||
should now fail.
|
||||
|
||||
The k8s service stays reachable on the LAN
|
||||
(`10.0.20.202:1688` directly, and the website at `kms.viktorbarzin.lan`
|
||||
via Traefik on `10.0.20.203:443`) — only the WAN port-forward is removed.
|
||||
|
||||
To put it back, recreate the NAT rule (target alias `k8s_kms_lb`,
|
||||
port `1688`) and the filter rule with the same per-source caps. The alias
|
||||
itself is independent of any forward and persists across delete/restore.
|
||||
|
||||
## Related
|
||||
|
||||
- Stack: `stacks/kms/` (Terraform; deployment, MetalLB Service, ingress,
|
||||
ExternalSecret for the Slack webhook)
|
||||
- Webpage source: `kms-website/` repo (Hugo + nginx; Woodpecker builds +
|
||||
pushes to forgejo, then `kubectl set image deployment/kms-web-page`)
|
||||
- Networking architecture footnote:
|
||||
`docs/architecture/networking.md` § "MetalLB & Load Balancing"
|
||||
222
docs/runbooks/mailserver-pfsense-haproxy.md
Normal file
|
|
@ -0,0 +1,222 @@
|
|||
# pfSense HAProxy for Mailserver — Runbook
|
||||
|
||||
Last updated: 2026-04-19 (Phase 6 complete)
|
||||
|
||||
## What & why
|
||||
|
||||
External mail traffic (SMTP/IMAP) requires **real client IP visibility** for
|
||||
CrowdSec + Postfix rate-limiting. MetalLB cannot inject PROXY-protocol
|
||||
headers (see [`mailserver-proxy-protocol.md`](./mailserver-proxy-protocol.md)),
|
||||
so pfSense runs a small HAProxy that:
|
||||
|
||||
1. Listens on the pfSense VLAN20 IP (`10.0.20.1`) on all 4 mail ports,
|
||||
2. Forwards each connection to a k8s node's NodePort with `send-proxy-v2`,
|
||||
3. Injects PROXY v2 framing so Postfix/Dovecot see the original client IP,
|
||||
4. TCP-checks every k8s worker via dedicated **non-PROXY healthcheck NodePorts**
|
||||
(30145/30146/30147 → pod stock 25/465/587 listeners, no PROXY required).
|
||||
This split path avoids the `smtpd_peer_hostaddr_to_sockaddr` fatal that
|
||||
used to fire on every PROXY-aware health probe and throttled real client
|
||||
connections.
|
||||
|
||||
Corresponding k8s-side setup (`stacks/mailserver/modules/mailserver/`):
|
||||
|
||||
- ConfigMap `mailserver-user-patches` → `user-patches.sh` appends 3 alt
|
||||
`master.cf` services to Postfix:
|
||||
- `:2525` postscreen (alt :25) with `postscreen_upstream_proxy_protocol=haproxy`
|
||||
- `:4465` smtpd (alt :465 SMTPS) with `smtpd_upstream_proxy_protocol=haproxy`
|
||||
- `:5587` smtpd (alt :587 submission) with `smtpd_upstream_proxy_protocol=haproxy`
|
||||
- ConfigMap `mailserver.config` adds Dovecot `inet_listener imaps_proxy` on
|
||||
port 10993 with `haproxy = yes` and `haproxy_trusted_networks = 10.0.20.0/24`.
|
||||
- Service `mailserver-proxy` (NodePort, ETP:Cluster) — 4 PROXY data ports +
|
||||
3 non-PROXY healthcheck ports:
|
||||
- Data (PROXY v2):
|
||||
- `port 25 → targetPort 2525 → nodePort 30125`
|
||||
- `port 465 → targetPort 4465 → nodePort 30126`
|
||||
- `port 587 → targetPort 5587 → nodePort 30127`
|
||||
- `port 993 → targetPort 10993 → nodePort 30128`
|
||||
- Healthcheck (no PROXY, stock SMTP/SMTPS/Submission listeners):
|
||||
- `port 2500 → targetPort 25 → nodePort 30145` (smtp-check)
|
||||
- `port 4650 → targetPort 465 → nodePort 30146` (smtps-check)
|
||||
- `port 5870 → targetPort 587 → nodePort 30147` (sub-check)
|
||||
- Service `mailserver` (ClusterIP) — unchanged stock ports 25/465/587/993
|
||||
for intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
|
||||
CronJob, book-search). These listeners are PROXY-free.
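The port plan above is easy to misread, so it can help to have it as a lookup. This is a sketch that only restates the mappings listed for `mailserver-proxy`; `data_nodeport` and `health_nodeport` are hypothetical helper names, not functions in the repo:

```sh
# Hypothetical helper: public mail port -> NodePort carrying PROXY v2 traffic.
data_nodeport() {
  case $1 in
    25)  echo 30125 ;;
    465) echo 30126 ;;
    587) echo 30127 ;;
    993) echo 30128 ;;
    *)   return 1 ;;
  esac
}

# Hypothetical helper: public mail port -> non-PROXY healthcheck NodePort.
# The Service lists no healthcheck port for 993, so that lookup fails.
health_nodeport() {
  case $1 in
    25)  echo 30145 ;;
    465) echo 30146 ;;
    587) echo 30147 ;;
    *)   return 1 ;;
  esac
}

data_nodeport 587     # prints 30127
health_nodeport 587   # prints 30147
```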
|
||||
|
||||
bd: `code-yiu`.
|
||||
|
||||
## Steady-state architecture
|
||||
|
||||
```
|
||||
External mail (WAN) path — PROXY v2
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ Client (real IP) │
|
||||
│ │ SMTP/SMTPS/Sub/IMAPS │
|
||||
│ ▼ │
|
||||
│ pfSense WAN:{25,465,587,993} │
|
||||
│ │ NAT rdr → 10.0.20.1:{same} │
|
||||
│ ▼ │
|
||||
│ pfSense HAProxy (mode tcp, 4 frontends, 4 backend pools) │
|
||||
│ │ data: send-proxy-v2 → :{30125..30128} (PROXY-aware pod) │
|
||||
│ │ health: TCP-check → :{30145..30147} (no-PROXY pod) │
|
||||
│ │ inter 5000 │
|
||||
│ ▼ │
|
||||
│ k8s-node<1-4>:{30125..30128} ← any node (ETP:Cluster) │
|
||||
│ │ kube-proxy SNAT (source IP lost on the wire) │
|
||||
│ ▼ │
|
||||
│ mailserver pod :{2525,4465,5587,10993} │
|
||||
│ │ postscreen / smtpd / Dovecot parse PROXY v2 header │
|
||||
│ │ → real client IP recovered despite kube-proxy SNAT │
|
||||
│ ▼ │
|
||||
│ CrowdSec + Postfix / Dovecot see the true source IP ✓ │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
|
||||
Intra-cluster path — no PROXY
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ Roundcube pod / email-roundtrip-monitor CronJob │
|
||||
│ │ SMTP/IMAP │
|
||||
│ ▼ │
|
||||
│ mailserver.mailserver.svc.cluster.local:{25,465,587,993} │
|
||||
│ │ ClusterIP — bypasses LoadBalancer/NodePort layer entirely │
|
||||
│ ▼ │
|
||||
│ mailserver pod stock :{25,465,587,993} (PROXY-free) │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Validation
|
||||
|
||||
```sh
# All HAProxy frontends listening
ssh admin@10.0.20.1 'sockstat -l | grep haproxy'
# Expect: *:25, *:465, *:587, *:993, *:2525 (test port)

# All backend pools healthy
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
  | awk 'NR>1 {print $3, $4, $6}'
# srv_op_state 2 = UP, 0 = DOWN

# Container listens on all 8 ports
kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \
  ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'

# pf rdr points at pfSense (10.0.20.1), not <mailserver> alias
ssh admin@10.0.20.1 'pfctl -sn' | grep -E 'port = (25|submission|imaps|smtps)'

# E2E probe: Brevo → external MX :25 → IMAP fetch
kubectl create job --from=cronjob/email-roundtrip-monitor probe-test -n mailserver
kubectl wait --for=condition=complete --timeout=90s job/probe-test -n mailserver
kubectl logs job/probe-test -n mailserver | grep SUCCESS
kubectl delete job probe-test -n mailserver

# Real client IP in maillog post-delivery
kubectl logs -c docker-mailserver deployment/mailserver -n mailserver \
  | grep 'smtpd-proxy25.*CONNECT from' | tail -5
# Expect external source IPs (e.g., Brevo 77.32.148.x), NOT 10.0.20.x
```
|
||||
|
||||
## Bootstrap / restore from scratch
|
||||
|
||||
pfSense HAProxy config lives in `/cf/conf/config.xml` under
|
||||
`<installedpackages><haproxy>`. That file is scp'd nightly to
|
||||
`/mnt/backup/pfsense/config-YYYYMMDD.xml` by `scripts/daily-backup.sh`, then
|
||||
synced to Synology. To rebuild from source of truth (git):
|
||||
|
||||
```sh
scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'
```
|
||||
|
||||
The script is idempotent — re-runs reset the mailserver frontends + backends
|
||||
to the declared state.
|
||||
|
||||
Expected output:
|
||||
```
haproxy_check_and_run rc=OK
```
|
||||
|
||||
## Operations
|
||||
|
||||
### Change backend k8s node IPs / NodePorts
|
||||
|
||||
Edit `infra/scripts/pfsense-haproxy-bootstrap.php` — `$NODES` array + the
|
||||
`build_pool()` port arguments. Re-run the bootstrap command above. Don't
|
||||
hand-edit `/var/etc/haproxy/haproxy.cfg` — it is regenerated from XML on
|
||||
every apply.
|
||||
|
||||
### Check health of backends
|
||||
|
||||
```sh
ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"
```
|
||||
`srv_op_state=2` means UP, `0` means DOWN.
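To list only the servers that are not UP, the state dump can be filtered. This is a minimal sketch, assuming the standard `show servers state` column order (`be_id be_name srv_id srv_name srv_addr srv_op_state ...`); `down_servers` is a hypothetical helper name:

```sh
# Hypothetical helper: read `show servers state` output on stdin and print
# "backend/server" for every row whose srv_op_state (column 6) is not 2 (UP).
# Data rows start with a numeric be_id, which skips the version and header lines.
down_servers() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 6 && $6 != 2 { print $2 "/" $4 }'
}

# Usage (needs SSH access to pfSense, so it is left commented out here):
#   ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" | down_servers
```

An empty result means every backend server is reporting UP.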
|
||||
|
||||
### View live HAProxy stats (WebUI)
|
||||
|
||||
`https://pfsense.viktorbarzin.me` → Services → HAProxy → Stats.
|
||||
|
||||
### Reload after config.xml edit
|
||||
|
||||
```sh
ssh admin@10.0.20.1 'pfSsh.php playback svc restart haproxy'
```
|
||||
|
||||
### Rollback (flip NAT back to MetalLB, post-Phase-6 only partial)
|
||||
|
||||
There is no Phase-6 rollback one-liner. Phase 6 removed the MetalLB
|
||||
LoadBalancer 10.0.20.202 entirely, so un-flipping NAT now would send
|
||||
traffic to a dead alias. To regress:
|
||||
|
||||
1. Re-add `metallb.io/loadBalancerIPs = "10.0.20.202"` + `type = "LoadBalancer"`
|
||||
+ `external_traffic_policy = "Local"` to `kubernetes_service.mailserver`,
|
||||
apply.
|
||||
2. Re-add the `mailserver` host alias in pfSense pointing at 10.0.20.202
|
||||
(Firewall → Aliases → Hosts).
|
||||
3. Run `infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php` on pfSense.
|
||||
|
||||
For rollback of just the NAT (Phase 4) without touching the Service, only
|
||||
the third step is needed — but only meaningful BEFORE Phase 6.
|
||||
|
||||
### Restore from backup
|
||||
|
||||
pfSense config backup is a plain XML file:
|
||||
```
/mnt/backup/pfsense/config-YYYYMMDD.xml       # sda host copy (1.1TB RAID1)
/volume1/Backup/Viki/pve-backup/pfsense/...   # Synology offsite
```
|
||||
|
||||
Full restore: pfSense WebUI → Diagnostics → Backup & Restore → Upload that
|
||||
`config.xml`. The `<installedpackages><haproxy>` section is included.
|
||||
|
||||
## Phase history (bd code-yiu)
|
||||
|
||||
| Phase | Status | Description |
|---|---|---|
| 1a | ✅ commit `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2 | ✅ 2026-04-19 | pfSense HAProxy pkg installed (`pfSense-pkg-haproxy-devel-0.63_2`, HAProxy 2.9-dev6) |
| 3 | ✅ commit `ba697b02` | HAProxy config persisted in pfSense XML (bootstrap script + this runbook) |
| 4+5 | ✅ commit `9806d515` | 4-port alt listeners + HAProxy frontends for 25/465/587/993 + NAT flip |
| 6 | ✅ this commit | Mailserver Service downgraded LoadBalancer → ClusterIP; `10.0.20.202` released back to MetalLB pool; orphan `mailserver` pfSense alias removed; monitors retargeted |
|
||||
|
||||
## Known warts
|
||||
|
||||
- ~~HAProxy TCP health-check with `send-proxy-v2` generates `getpeername:
|
||||
Transport endpoint not connected` warnings on postscreen every check cycle.~~
|
||||
**Resolved 2026-05-05**: dedicated non-PROXY healthcheck NodePorts
|
||||
(30145/30146/30147 → stock pod 25/465/587) added; HAProxy now checks
|
||||
those, eliminating both the `getpeername` postscreen warnings and the
|
||||
`smtpd_peer_hostaddr_to_sockaddr: ... Servname not supported` fatals
|
||||
that were throttling smtpd respawns and causing ~50% client timeouts on
|
||||
the public 587 path. `inter` dropped 120000 → 5000 (fast failover, no
|
||||
log-spam concern). `option smtpchk` was tried but flapped against
|
||||
postscreen (multi-line greet + DNSBL silence + anti-pre-greet detection
|
||||
trip HAProxy's parser → L7RSP). Plain TCP check on the no-PROXY ports
|
||||
is sufficient.
|
||||
- Frontend binds on all pfSense interfaces (`bind :25` instead of
|
||||
`10.0.20.1:25`). `<extaddr>` is set in XML but pfSense templates it
|
||||
port-only. Low concern in practice because WAN firewall rules plus the
|
||||
NAT rdr gate external access; internal VLAN clients SHOULD be able to
|
||||
reach HAProxy on any pfSense-local IP.
|
||||
- k8s-node5 doesn't exist — cluster has master + 4 workers. Backend pool
|
||||
capped at 4 servers.
|
||||
- Postscreen still logs `improper command pipelining` for legitimate
|
||||
clients that send `EHLO\r\nQUIT\r\n` as a single TCP write. This is
|
||||
unchanged pre/post-migration — postscreen's anti-bot heuristic.
|
||||
181
docs/runbooks/mailserver-proxy-protocol.md
Normal file
|
|
@ -0,0 +1,181 @@
|
|||
# Mailserver PROXY protocol — research & decision
|
||||
|
||||
Last updated: 2026-04-18 (original research). **Outcome implemented 2026-04-19 — see [UPDATE](#update-2026-04-19) below.**
|
||||
|
||||
> ## UPDATE (2026-04-19)
|
||||
>
|
||||
> This doc describes the research that led to the Phase-6 rollout. **Option C
|
||||
> (pfSense HAProxy + PROXY v2)** was chosen and is now live. Operational
|
||||
> state, cutover history, bootstrap, and rollback procedures live in
|
||||
> [`mailserver-pfsense-haproxy.md`](mailserver-pfsense-haproxy.md).
|
||||
>
|
||||
> This file is retained as a decision record — it explains *why* Option A
|
||||
> (pod-pinning via nodeSelector) was rejected mid-session in favour of
|
||||
> Option C, and documents the MetalLB upstream limitation (PROXY injection
|
||||
> is explicitly won't-implement). Future debates of "why don't we just pin
|
||||
> the pod?" should land here first.
|
||||
|
||||
## TL;DR
|
||||
|
||||
**MetalLB does not and will not inject PROXY protocol headers.** The original plan
|
||||
(`/home/wizard/.claude/plans/let-s-work-on-linking-temporal-valiant.md`, task
|
||||
`code-rtb`) assumed MetalLB could be configured to emit PROXY v1/v2 on behalf of
|
||||
the `mailserver` LoadBalancer Service. That assumption is wrong at the product
|
||||
level. MetalLB is a control-plane-only announcer (ARP/NDP for L2 mode, BGP for
|
||||
L3 mode); it never touches the L4 payload.
|
||||
|
||||
As a result, there is no single Terraform change that can flip
|
||||
`externalTrafficPolicy: Local` → `Cluster` on the `mailserver` Service while
|
||||
preserving the real client IP for Postfix/postscreen and Dovecot. Three
|
||||
alternative paths exist (see below); none is trivial.
|
||||
|
||||
## Environment (verified 2026-04-18)
|
||||
|
||||
- **MetalLB version**: `quay.io/metallb/controller:v0.15.3` /
|
||||
`quay.io/metallb/speaker:v0.15.3` (5 speakers).
|
||||
- **Advertisement type**: L2Advertisement `default` bound to IPAddressPool
|
||||
`default` (10.0.20.200–10.0.20.220). No BGPAdvertisements.
|
||||
- **Service**: `mailserver/mailserver` — type `LoadBalancer`, `loadBalancerIPs:
|
||||
10.0.20.202`, `externalTrafficPolicy: Local`,
|
||||
`healthCheckNodePort: 30234`, 5 ports (25, 465, 587, 993, 9166/dovecot-metrics).
|
||||
- **Pod**: single replica today, RWO PVCs prevent horizontal scale without
|
||||
further work (`mailserver-data-encrypted`, `mailserver-letsencrypt-encrypted`).
|
||||
|
||||
## Why the original plan fails
|
||||
|
||||
### MetalLB never touches packets
|
||||
|
||||
> *"MetalLB is controlplane only, making it part of the dataplane means we
|
||||
> would be responsible for the performance of the system, so more bugs to
|
||||
> fight, I personally don't see that happening."*
|
||||
> — MetalLB maintainer `champtar`, 2021-01-06
|
||||
> (issue [#797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797))
|
||||
|
||||
Issue #797 is closed as "won't implement". Repeat asks in 2022–2023 got the
|
||||
same answer. The v0.15.3 API surface confirms this: no
|
||||
`proxyProtocol`/`haproxy`/`protocol: proxy` field exists on `IPAddressPool`,
|
||||
`L2Advertisement`, `BGPAdvertisement`, or as a Service annotation.
|
||||
|
||||
Only managed-cloud LBs (AWS NLB, Azure LB, OCI, DO, OVH, Scaleway, etc.) offer
|
||||
PROXY protocol as a tick-box. MetalLB's equivalents are:
|
||||
|
||||
| MetalLB feature | Does it preserve client IP? | Comment |
|---|---|---|
| `externalTrafficPolicy: Local` (current) | Yes, via iptables DNAT on the speaker node | Forces pod↔speaker colocation on L2 mode. This is the pain we wanted to avoid. |
| `externalTrafficPolicy: Cluster` | No; kube-proxy SNATs to the node IP | The problem we would re-introduce if we flipped without PROXY injection. |
| PROXY protocol injection | N/A (not implemented) | Dead end. |
|
||||
|
||||
### The `Local` trap is real, but narrower than it seems
|
||||
|
||||
Today's `Local` policy means the ARP announcer node must also host the mailserver
|
||||
pod. MetalLB always picks a single speaker to advertise the VIP (leader
|
||||
election per IP), so in practice exactly one node matters at any moment. A pod
|
||||
rescheduled to a different node silently drops inbound SMTP/IMAP until a GARP
|
||||
flip or node cordon.
|
||||
|
||||
The only pods on our cluster that see this same class of risk are Traefik
|
||||
(3 replicas + PDB `minAvailable=2`, so 2 of 3 nodes always have a pod) and
|
||||
mailserver (1 replica). Traefik survives because the pods outnumber the nodes
|
||||
that could be the speaker at once; the mailserver cannot.
|
||||
|
||||
## Alternative paths (ranked by effort)
|
||||
|
||||
### Option A — Pin the mailserver pod to a specific node (SIMPLEST)
|
||||
|
||||
Add `nodeSelector` on the mailserver Deployment pointing at a label that's also
|
||||
stamped on the MetalLB speaker we want to advertise the VIP from, and use
|
||||
MetalLB's [node selector](https://metallb.io/configuration/_advanced_l2_configuration/#specify-network-interfaces-that-lb-ip-can-be-announced-from)
|
||||
on `L2Advertisement.spec.nodeSelectors` to pin the announcer to the same node.
|
||||
|
||||
Trade-offs:
|
||||
|
||||
- Zero changes to Postfix/Dovecot configs.
|
||||
- Keeps `externalTrafficPolicy: Local` — real client IP keeps arriving.
|
||||
- Loses HA (the whole point of the MetalLB layer) but reflects reality — one
|
||||
replica, one PVC, no HA today anyway.
|
||||
- Drain of that node requires a planned cutover, but that's no worse than
|
||||
today's silent failure mode.
|
||||
|
||||
Implementation (~10 lines of Terraform):
|
||||
|
||||
```hcl
# In stacks/mailserver/modules/mailserver/main.tf, on the Deployment:
node_selector = { "viktorbarzin.me/mailserver-anchor" = "true" }

# In stacks/platform (or wherever the MetalLB CRs live):
resource "kubernetes_manifest" "mailserver_l2ad" {
  manifest = {
    apiVersion = "metallb.io/v1beta1"
    kind       = "L2Advertisement"
    metadata   = { name = "mailserver", namespace = "metallb-system" }
    spec = {
      ipAddressPools = ["default"]
      nodeSelectors  = [{ matchLabels = { "viktorbarzin.me/mailserver-anchor" = "true" } }]
    }
  }
}
```
|
||||
|
||||
Plus a node label via `kubectl label node k8s-node3 viktorbarzin.me/mailserver-anchor=true`.
|
||||
|
||||
**Recommendation: this is the shortest path to eliminating the silent-drop
|
||||
failure mode** without taking on a new proxy tier.
|
||||
|
||||
### Option B — Put a HAProxy sidecar in front of Postfix/Dovecot
|
||||
|
||||
Stand up an in-cluster HAProxy with PROXY v2 enabled on the frontend and
|
||||
`send-proxy-v2` on the backend to `mailserver:25/465/587/993`. Expose HAProxy
|
||||
via a new MetalLB Service with `externalTrafficPolicy: Cluster` + kube-proxy
|
||||
DSR workaround (still loses client IP at that layer), or run HAProxy on the
|
||||
host-network of the same node (back to Option A's colocation).
|
||||
|
||||
Trade-offs:
|
||||
|
||||
- Introduces one more network hop and TLS-termination decision for every
|
||||
SMTP connect.
|
||||
- HAProxy needs its own cert rotation (or `tls-passthrough`) — adds moving
|
||||
parts to an already crowded mailserver module.
|
||||
- Doesn't actually solve the colocation problem on its own — HAProxy itself
|
||||
needs to receive the client IP, so we are back to externalTrafficPolicy
|
||||
constraints for HAProxy.
|
||||
|
||||
**Recommendation: avoid unless we also get HA for mailserver itself, which
|
||||
needs RWX storage + DB split-brain work — out of scope.**
|
||||
|
||||
### Option C — Replace MetalLB with a different LB for this Service
|
||||
|
||||
Candidates: [kube-vip](https://kube-vip.io/) (supports eBPF-based DSR but not
|
||||
PROXY injection either), [Cilium LB](https://docs.cilium.io/en/stable/network/lb-ipam/)
|
||||
(preserves client IP via DSR in hybrid mode), or a dedicated HAProxy running on
|
||||
pfSense and NAT-forwarding 25/465/587/993 with PROXY headers to a
|
||||
ClusterIP-exposed mailserver. Cilium requires a CNI migration (we run Calico
|
||||
today); pfSense HAProxy is genuinely feasible but belongs in a different bd
|
||||
task.
|
||||
|
||||
**Recommendation: track as P3 follow-up under a new bd task if Option A proves
|
||||
insufficient.**
|
||||
|
||||
## Decision
|
||||
|
||||
Do nothing in this session beyond this runbook + the bd note. The `code-rtb`
task as written is not executable: MetalLB cannot inject PROXY headers, so
the Postfix/Dovecot config changes the plan proposed would never receive the
header they expect. They would hang waiting for it and then time out (5s per
connection).
|
||||
|
||||
Follow-up work filed as bd child tasks (if user wants to pursue):
|
||||
|
||||
- **Option A — pin mailserver + L2Advertisement nodeSelectors** (new bd task)
|
||||
- **Option C — HAProxy on pfSense with PROXY v2 to a ClusterIP** (new bd task)
|
||||
|
||||
## References
|
||||
|
||||
- [MetalLB issue #797 — Feature Request: Supporting Proxy Protocol v2](https://github.com/metallb/metallb/issues/797) (closed, won't implement)
|
||||
- [MetalLB PR #796 — Source IP Preservation discussion](https://github.com/metallb/metallb/issues/796)
|
||||
- Postfix [postscreen_upstream_proxy_protocol](https://www.postfix.org/postconf.5.html#postscreen_upstream_proxy_protocol) — expects the PROXY header *on every incoming connection*; if absent, postscreen drops after `postscreen_upstream_proxy_timeout`.
|
||||
- Dovecot [haproxy_trusted_networks](https://doc.dovecot.org/settings/core/#core_setting-haproxy_trusted_networks) — treats the header as mandatory for listed source networks.
|
||||
- Cluster state verified against: `kubectl -n metallb-system get pods`,
|
||||
`kubectl get ipaddresspools.metallb.io -A`,
|
||||
`kubectl get l2advertisements.metallb.io -A`,
|
||||
`kubectl get bgpadvertisements.metallb.io -A`,
|
||||
`kubectl -n mailserver get svc mailserver -o yaml`.
|
||||
57
docs/runbooks/nextcloud-add-archive.md
Normal file
|
|
@ -0,0 +1,57 @@
|
|||
# Runbook: Add a new archive to Nextcloud / PVE NFS
|
||||
|
||||
Use this runbook when you need to surface a new directory under `/srv/nfs/` or `/srv/nfs-ssd/` to specific Nextcloud users as a dedicated External mount. Each archive gets its own NC mount; only the listed `applicableUsers` can see and access it.
|
||||
|
||||
## Steps
|
||||
|
||||
1. **Create the directory on PVE.**
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
mkdir -p /srv/nfs/<archive-name>
|
||||
# Use /srv/nfs-ssd/<archive-name> for the SSD pool instead.
|
||||
```
|
||||
|
||||
2. **Populate the directory.**
|
||||
|
||||
Rsync from a remote source, copy from another NFS path, or let the granted user upload via the NC web UI after step 5. Example rsync:
|
||||
|
||||
```bash
|
||||
rsync -avP --info=progress2 user@source:/path/ /srv/nfs/<archive-name>/
|
||||
```
|
||||
|
||||
3. **Add a manifest entry.**
|
||||
|
||||
Edit `infra/stacks/nextcloud/external_storage.tf`. In the `kubernetes_config_map_v1.nextcloud_external_storage_manifest` resource, append a new entry to `archiveMounts`:
|
||||
|
||||
```json
|
||||
{ "mountPoint": "/<archive-name>", "dataDir": "/mnt/pve-nfs/<archive-name>", "applicableUsers": ["<owner1>", "admin"], "applicableGroups": [], "enableSharing": false }
|
||||
```
|
||||
|
||||
Use `/mnt/pve-nfs-ssd/<archive-name>` for the SSD pool. NC usernames are `admin`, `anca`, `emo` — not display names (`admin` is Viktor). `admin` is included so the owner of the homelab can always assist with the archive. Set `enableSharing: true` only if you want recipients to re-share subfolders.
|
||||
|
||||
4. **Plan and apply.**
|
||||
|
||||
```bash
|
||||
cd infra/stacks/nextcloud
|
||||
scripts/tg plan
|
||||
scripts/tg apply
|
||||
```
|
||||
|
||||
The bootstrap Job re-runs and applies the new mount plus `applicable_users` idempotently via `occ files_external:*` and `occ files_external:applicable`. No manual `occ` invocation needed.
|
||||
|
||||
5. **Verify.**
|
||||
|
||||
Log in as a granted user — `/<archive-name>` must appear in their NC sidebar; read, upload, and delete must all work. Log in as a non-granted user and confirm the mount is not visible at all.
|
||||
|
||||
## Backout
|
||||
|
||||
Remove the entry from `archiveMounts` in the manifest ConfigMap, then `scripts/tg apply`. The bootstrap Job re-runs and removes the mount. The root mounts (`PVE NFS Pool`, `PVE NFS-SSD Pool`, visible to group `admin` only) are unaffected throughout.
|
||||
|
||||
After the mount is gone there is no NC trash to clean. The directory on PVE (`/srv/nfs/<archive-name>`) can be `rmdir`'d once you have confirmed the data is safe elsewhere.
|
||||
|
||||
## Related
|
||||
|
||||
- Architecture: `docs/architecture/storage.md` — "Nextcloud as PVE-NFS browser" section
|
||||
- Original design/plan: `infra/docs/plans/2026-05-23-anca-elements-{design,plan}.md` <!-- TODO: confirm path once orchestrator files the plan docs -->
|
||||
- Manifest source: `infra/stacks/nextcloud/external_storage.tf` (`kubernetes_config_map_v1.nextcloud_external_storage_manifest`)
|
||||
66
docs/runbooks/nfs-prerequisites.md
Normal file
|
|
@ -0,0 +1,66 @@
|
|||
# NFS Prerequisites for `modules/kubernetes/nfs_volume`
|
||||
|
||||
The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
|
||||
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
|
||||
underlying directory on the server.
|
||||
|
||||
If the path does not exist, the first pod that tries to mount the resulting
|
||||
PVC gets stuck in `ContainerCreating` with the kubelet event:
|
||||
|
||||
```
|
||||
MountVolume.SetUp failed for volume "<name>" : mount failed: exit status 32
|
||||
mount.nfs: mounting 192.168.1.127:/srv/nfs/<path> failed, reason given by
|
||||
server: No such file or directory
|
||||
```
|
||||
|
||||
## Bootstrap before first apply
|
||||
|
||||
Before adding a new `nfs_volume` consumer (backup CronJob, data PV, etc.),
|
||||
create the export root on the PVE host:
|
||||
|
||||
```sh
|
||||
# Replace <app> with the backup stack name, e.g. mailserver-backup,
|
||||
# roundcube-backup, immich-backup, etc.
|
||||
ssh root@192.168.1.127 'mkdir -p /srv/nfs/<app> && chmod 755 /srv/nfs/<app>'
|
||||
|
||||
# Confirm exports are live (no change to /etc/exports needed — `/srv/nfs`
|
||||
# is already exported via the root entry in pve-nfs-exports).
|
||||
ssh root@192.168.1.127 exportfs -v | grep '/srv/nfs\b'
|
||||
```
|
||||
|
||||
`/srv/nfs` is exported with the root entry. Subdirectories inherit the
|
||||
export automatically; they just have to exist on disk.
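
When several consumers are added at once, the `mkdir` step can be looped. A minimal sketch, assuming consumer names use only lowercase letters, digits, and hyphens (that restriction and the `DRY_RUN` switch are assumptions of this sketch, not existing conventions in the repo):

```sh
# Hypothetical wrapper around the mkdir step above. With DRY_RUN=1 it only
# prints the ssh command for each consumer instead of running it.
bootstrap_nfs_dirs() {
  for app in "$@"; do
    case "$app" in
      ''|*[!a-z0-9-]*) echo "skipping invalid name: $app" >&2; continue ;;
    esac
    cmd="mkdir -p /srv/nfs/$app && chmod 755 /srv/nfs/$app"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh root@192.168.1.127 '$cmd'"
    else
      ssh root@192.168.1.127 "$cmd"
    fi
  done
}
```

Running `DRY_RUN=1 bootstrap_nfs_dirs immich-backup` first shows exactly what would be executed on the PVE host.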
|
||||
|
||||
## Known consumers
|
||||
|
||||
| Consumer             | NFS path                      | Owning stack          |
|----------------------|-------------------------------|-----------------------|
| `mailserver-backup`  | `/srv/nfs/mailserver-backup`  | `stacks/mailserver/`  |
| `roundcube-backup`   | `/srv/nfs/roundcube-backup`   | `stacks/mailserver/`  |
| `mysql-backup`       | `/srv/nfs/mysql-backup`       | `stacks/dbaas/`       |
| `postgresql-backup`  | `/srv/nfs/postgresql-backup`  | `stacks/dbaas/`       |
| `vaultwarden-backup` | `/srv/nfs/vaultwarden-backup` | `stacks/vaultwarden/` |
|
||||
|
||||
Use `grep -rn 'nfs_volume' infra/stacks/` to find all active consumers.
|
||||
|
||||
## Why not auto-create?
|
||||
|
||||
Two options were considered for automating this:
|
||||
|
||||
1. `null_resource` + `local-exec` SSH `mkdir` in the `nfs_volume` module —
|
||||
works but adds an SSH dependency to every Terraform run, makes the
|
||||
module non-hermetic, and fails if the operator does not have SSH to
|
||||
the PVE host.
|
||||
2. `nfs-subdir-external-provisioner` — handles subdirs automatically but
|
||||
changes the PV/PVC shape and would require migrating all existing
|
||||
consumers.
|
||||
|
||||
Neither is worth the churn for a one-time operation per new backup stack.
|
||||
Document + checklist is the current call; re-evaluate if we start adding
|
||||
one NFS consumer per week.
|
||||
|
||||
## Related tasks
|
||||
|
||||
- `code-yo4` — this runbook
|
||||
- `code-z26` — mailserver backup CronJob (first-time setup hit this)
|
||||
- `code-1f6` — Roundcube backup CronJob (also hit this)
|
||||
72
docs/runbooks/offboard-user.md
Normal file
|
|
@ -0,0 +1,72 @@
|
|||
# Runbook: Offboard a User
|
||||
|
||||
Removing a user can span two surfaces — the **in-cluster** namespace-owner model
|
||||
(Vault `k8s_users` / RBAC / namespace) and the **devvm Workstation** (roster /
|
||||
OS account / t3 instance). Both are **staged**: a *reversible cut* (revoke access,
|
||||
delete nothing) first, then an explicit, gated *destructive removal*. Do the
|
||||
reversible cut immediately; only do the destructive step once you're sure.
|
||||
|
||||
> Architecture: `../architecture/multi-tenancy.md`. Workstation design:
|
||||
> `../plans/2026-06-07-multi-user-workstation-design.md`.
|
||||
|
||||
---
|
||||
|
||||
## Part A — DevVM Workstation offboarding
|
||||
|
||||
Driven by removing the user's entry from `infra/scripts/workstation/roster.yaml`.
|
||||
`roster_engine.py offboard_plan` computes the staged actions (reversible cut vs the
|
||||
gated `userdel_archive`, which is **never** auto-applied).
|
||||
|
||||
### A1. Reversible cut (revoke access; delete nothing)
|
||||
|
||||
1. **Delete the user's entry** from `roster.yaml`; commit + push.
|
||||
2. **Reconcile** (`sudo /usr/local/bin/t3-provision-users`, or wait for the hourly
|
||||
timer). This **regenerates** `/etc/ttyd-user-map` + `dispatch.json` *without* the
|
||||
user → `t3-dispatch` now returns **403** for them. *(Automated.)*
|
||||
3. **Disable their instance + lock login** *(manual today; Phase 7 will fold this into
|
||||
the reconcile):*
|
||||
```bash
|
||||
sudo systemctl disable --now t3-serve@<os_user>.service
|
||||
sudo passwd -l <os_user>
|
||||
```
|
||||
4. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302 → Authentik, then
|
||||
denied once removed from the `T3 Users` group — Part C) and cannot log in. Nothing
|
||||
is deleted; re-adding the roster entry + reconcile fully restores them.
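
The lock in step 3 can be checked mechanically as part of step 4. A minimal sketch that parses `passwd -S` output; it assumes the status is the second column and reads `L` (or `LK` on some distributions), and `is_locked_status` is a hypothetical name.

```sh
# Hypothetical helper: reads one line of `passwd -S <os_user>` output on stdin
# and succeeds only if the status field (second column) is "L" or "LK".
is_locked_status() {
  awk '{ status = $2 } END { exit !(status == "L" || status == "LK") }'
}
```

For example: `sudo passwd -S <os_user> | is_locked_status && echo "login locked"`. Pairing it with `systemctl is-enabled t3-serve@<os_user>.service` covers the instance half of the same step.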
|
||||
|
||||
### A2. Destructive removal (explicit, gated — NEVER automatic)
|
||||
|
||||
Only after the reversible cut and a deliberate decision:
|
||||
```bash
|
||||
sudo tar czf /mnt/backup/offboard/<os_user>-$(date +%Y%m%d).tar.gz /home/<os_user>
|
||||
sudo userdel -r <os_user> # removes home + mail spool — IRREVERSIBLE
|
||||
```
|
||||
Rollback before this step: re-add the roster entry + reconcile. After it: restore
|
||||
from the archive.
|
||||
|
||||
---
|
||||
|
||||
## Part B — In-cluster (namespace-owner) offboarding
|
||||
|
||||
1. **Reversible cut:** remove the user's Authentik group membership (edge/RBAC blocked)
|
||||
and their entry from the Vault `k8s_users` map (`secret/platform`).
|
||||
2. **Apply:** `scripts/tg apply` the `vault` → `platform` → `woodpecker` stacks (drops the
|
||||
RBAC binding, Vault identity/policy, and per-user CI). Their OIDC kubeconfig stops
|
||||
authorizing immediately.
|
||||
3. **Destructive (gated):** deleting their namespace(s) removes all their workloads +
|
||||
data — back up first (PVCs, DBs), then delete only on explicit decision.
|
||||
|
||||
---
|
||||
|
||||
## Part C — Authentik (both surfaces)
|
||||
|
||||
Remove the user from the relevant Authentik group(s) — `kubernetes-namespace-owners`
|
||||
(cluster) and/or `T3 Users` (workstation edge gate). This is the edge revocation; do
|
||||
it as part of the reversible cut so they're locked out at the front door.
|
||||
|
||||
---
|
||||
|
||||
## Order of operations
|
||||
|
||||
Reversible cut on **all** relevant surfaces first (Authentik group → roster removal +
|
||||
reconcile → `k8s_users` removal + apply) → verify access is gone → only then the gated
|
||||
destructive steps (`userdel -r`, namespace deletion), each after its own archive.
|
||||
281
docs/runbooks/pfsense-unbound.md
Normal file
|
|
@ -0,0 +1,281 @@
|
|||
# pfSense Unbound DNS Resolver
|
||||
|
||||
Last updated: 2026-04-19
|
||||
|
||||
## Overview
|
||||
|
||||
pfSense runs **Unbound** (DNS Resolver) as its sole DNS service, replacing
|
||||
dnsmasq (DNS Forwarder) as of 2026-04-19 (DNS hardening Workstream D,
|
||||
bd `code-k0d`).
|
||||
|
||||
Unbound AXFR-slaves the `viktorbarzin.lan` zone from the Technitium primary
|
||||
via the `10.0.20.201` LoadBalancer, so LAN-side `.lan` resolution survives
|
||||
a full Kubernetes outage. Public queries go to Cloudflare via DNS-over-TLS
|
||||
(`1.1.1.1` + `1.0.0.1` on port 853, SNI `cloudflare-dns.com`).
|
||||
|
||||
## Listeners
|
||||
|
||||
Unbound binds on:
|
||||
|
||||
| Interface | IP | Purpose |
|-----------|-----|---------|
| WAN | `192.168.1.2:53` | LAN (192.168.1.0/24) clients querying via pfSense WAN |
| LAN | `10.0.10.1:53` | Management VLAN clients |
| OPT1 | `10.0.20.1:53` | K8s VLAN clients (CoreDNS upstream) |
| lo0 | `127.0.0.1:53` | pfSense itself |
|
||||
|
||||
The prior WAN NAT `rdr` rule (`192.168.1.2:53 → 10.0.20.201`) was removed in
|
||||
the same change — Unbound now answers directly on WAN.
|
||||
|
||||
## Config Summary
|
||||
|
||||
Relevant `<unbound>` keys in `/cf/conf/config.xml`:
|
||||
|
||||
| Key | Value | Meaning |
|-----|-------|---------|
| `enable` | flag | Enable Unbound |
| `dnssec` | flag | DNSSEC validation on |
| `forwarding` | flag | Forwarding mode (send recursive queries to upstream) |
| `forward_tls_upstream` | flag | Use DoT for upstream forwarders |
| `prefetch` | flag | Prefetch records near expiry |
| `prefetchkey` | flag | Prefetch DNSKEY records |
| `dnsrecordcache` | flag | `serve-expired: yes` |
| `active_interface` | `lan,opt1,wan,lo0` | Listen interfaces |
| `msgcachesize` | `256` (MB) | Message cache (rrset-cache auto-doubles to 512MB) |
| `cache_max_ttl` | `604800` | 7 days |
| `cache_min_ttl` | `60` | 60 seconds |
| `custom_options` | base64 | Contains `serve-expired-ttl: 259200` + `auth-zone:` block |
|
||||
|
||||
Upstream DoT forwarders live in `<system>`:
|
||||
|
||||
- `dnsserver[0] = 1.1.1.1`
|
||||
- `dnsserver[1] = 1.0.0.1`
|
||||
- `dns1host = cloudflare-dns.com`
|
||||
- `dns2host = cloudflare-dns.com`
|
||||
|
||||
## Auth-Zone for viktorbarzin.lan
|
||||
|
||||
The custom_options block declares:
|
||||
|
||||
```
server:
    serve-expired-ttl: 259200

auth-zone:
    name: "viktorbarzin.lan"
    master: 10.0.20.201
    fallback-enabled: yes
    for-downstream: yes
    for-upstream: yes
    zonefile: "viktorbarzin.lan.zone"
    allow-notify: 10.0.20.201
```
|
||||
|
||||
- `master: 10.0.20.201` — AXFR source (Technitium LoadBalancer)
|
||||
- `fallback-enabled: yes` — if the zone can't refresh from master, fall back to normal recursion for this name (prevents hard-fail if AXFR breaks)
|
||||
- `for-downstream: yes` — answer queries for this zone with AA flag
|
||||
- `for-upstream: yes` — Unbound's internal iterator also uses this zone
|
||||
- `zonefile` is relative to the chroot (`/var/unbound/viktorbarzin.lan.zone`)
|
||||
- `allow-notify: 10.0.20.201` — accept NOTIFY from Technitium
|
||||
|
||||
## Technitium-side ACL
|
||||
|
||||
Zone `viktorbarzin.lan` on Technitium has `zoneTransfer = UseSpecifiedNetworkACL`
|
||||
with ACL entries:
|
||||
|
||||
- `10.0.20.1` (pfSense OPT1)
|
||||
- `10.0.10.1` (pfSense LAN)
|
||||
- `192.168.1.2` (pfSense WAN)
|
||||
|
||||
Verify via the Technitium API:
|
||||
|
||||
```
|
||||
curl -sk "http://127.0.0.1:5380/api/zones/options/get?token=$TOK&zone=viktorbarzin.lan" | jq .response.zoneTransfer
|
||||
```
|
||||
|
||||
## Operational Checks
|
||||
|
||||
```bash
|
||||
# Is Unbound listening?
|
||||
ssh admin@10.0.20.1 "sockstat -l -4 -p 53"
|
||||
|
||||
# Auth-zone loaded?
|
||||
ssh admin@10.0.20.1 "unbound-control -c /var/unbound/unbound.conf list_auth_zones"
|
||||
# Expected: viktorbarzin.lan. serial NNNNN
|
||||
|
||||
# LAN record via auth-zone? (aa flag = authoritative / from auth-zone)
|
||||
dig @192.168.1.2 idrac.viktorbarzin.lan +norec
|
||||
|
||||
# Public record via DoT? (ad flag = DNSSEC validated, via 1.1.1.1/1.0.0.1)
|
||||
dig @192.168.1.2 example.com +dnssec
|
||||
|
||||
# Zonefile has all records?
|
||||
ssh admin@10.0.20.1 "wc -l /var/unbound/viktorbarzin.lan.zone"
|
||||
```
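
The `aa` and `ad` checks above can be scripted instead of read by eye. A minimal sketch, assuming `dig` is run without `+short` so the header `flags:` line is present; `dig_has_flag` is a hypothetical helper name.

```sh
# Hypothetical helper: reads full `dig` output on stdin and succeeds if the
# header flags line contains the given flag (for example aa or ad).
dig_has_flag() {
  grep -Eq "^;; flags:[^;]* $1[ ;]"
}
```

For example: `dig @192.168.1.2 idrac.viktorbarzin.lan +norec | dig_has_flag aa && echo "served from auth-zone"`.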
|
||||
|
||||
## K8s Outage Drill
|
||||
|
||||
Tests that `.lan` resolution survives a full Technitium outage:
|
||||
|
||||
```bash
|
||||
# Scale Technitium primary to 0
|
||||
kubectl -n technitium scale deploy/technitium --replicas=0
|
||||
|
||||
# Wait ~5 seconds, then test from a LAN client
|
||||
ssh devvm.viktorbarzin.lan "dig @192.168.1.2 idrac.viktorbarzin.lan +short"
|
||||
# Expected: 192.168.1.4 (served from Unbound's cached auth-zone)
|
||||
|
||||
# Restore immediately
|
||||
kubectl -n technitium scale deploy/technitium --replicas=1
|
||||
```
|
||||
|
||||
Completed successfully during the initial deployment on 2026-04-19.
|
||||
|
||||
Note: secondary/tertiary Technitium pods remain up and continue to serve
|
||||
queries via the `10.0.20.201` LoadBalancer even when the primary is down —
|
||||
so the strongest proof that Unbound's auth-zone serves locally is to also
|
||||
scale those down (optional, not part of the routine drill).
|
||||
|
||||
## Backup & Rollback
|
||||
|
||||
### Backups
|
||||
|
||||
- **On-box**: `/cf/conf/config.xml.2026-04-19-pre-unbound` (created before this
|
||||
workstream ran — keep for 30 days, then delete)
|
||||
- **Daily**: PVE `daily-backup` script copies `/cf/conf/config.xml` and a full
|
||||
pfSense config tar to `/mnt/backup/pfsense/` on the Proxmox host at 05:00
|
||||
- **Offsite**: Synology `pve-backup/pfsense/` (synced daily by
|
||||
`offsite-sync-backup`)
|
||||
|
||||
### Rollback to dnsmasq
|
||||
|
||||
If Unbound misbehaves, revert to dnsmasq + NAT rdr:
|
||||
|
||||
```bash
# On pfSense
cp /cf/conf/config.xml.2026-04-19-pre-unbound /cf/conf/config.xml

# Tell pfSense to re-read config and reload services
php -r 'require_once("config.inc"); require_once("config.lib.inc"); disable_path_cache();'
/etc/rc.restart_webgui  # reloads PHP config caches
# Restart services
php -r 'require_once("config.inc"); require_once("services.inc"); services_dnsmasq_configure(); services_unbound_configure(); filter_configure();'
/etc/rc.filter_configure  # re-applies NAT rules (brings back rdr)
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
sockstat -l -4 -p 53 | grep dnsmasq # expect dnsmasq on 10.0.10.1 and 10.0.20.1
|
||||
pfctl -sn | grep '53' # expect rdr on wan UDP 53 → 10.0.20.201
|
||||
```
|
||||
|
||||
### Rollback without wiping new changes
|
||||
|
||||
If you only want to stop Unbound without restoring the whole config, edit
|
||||
config.xml and remove `<enable/>` from `<unbound>` + add it back to `<dnsmasq>`,
|
||||
then re-run `services_unbound_configure()` + `services_dnsmasq_configure()`.
|
||||
You also need to re-add the WAN NAT rdr in `<nat><rule>` (see the backup XML
|
||||
for the exact shape — tracker `1775670025`).
|
||||
|
||||
## Known Gotchas
|
||||
|
||||
1. **pfSense regenerates `/var/unbound/unbound.conf`** on every service reload
|
||||
from `<unbound>` in `config.xml`. Edits to unbound.conf are NOT durable.
|
||||
2. **`unbound-control` default config path is wrong**. Always use
|
||||
`unbound-control -c /var/unbound/unbound.conf <cmd>`.
|
||||
3. **`custom_options` is base64-encoded** in config.xml. Use `base64 -d` to
|
||||
decode in a shell, or `base64_decode()` in PHP.
|
||||
4. **`interface-automatic: yes` is NOT used** when `active_interface` is
|
||||
explicitly set to a list — pfSense emits explicit `interface: <ip>` lines.
|
||||
5. **`auth-zone`'s `zonefile` path is relative to the Unbound chroot**
|
||||
(`/var/unbound`), NOT absolute. Using an absolute path silently fails.
|
||||
6. **DoT forwarders need `forward_tls_upstream`** flag AND `dns1host` /
|
||||
`dns2host` set in `<system>` for SNI — without the hostname, pfSense emits
|
||||
`forward-addr: 1.1.1.1@853` (no `#`) which Cloudflare rejects with
|
||||
certificate hostname mismatch.
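
For gotcha 3, a minimal sketch of decoding `custom_options` in a shell. It assumes the `<custom_options>` element sits on a single line inside the `<unbound>` section; `decode_custom_options` is a hypothetical name.

```sh
# Hypothetical helper: reads config.xml on stdin and prints the decoded
# <custom_options> payload found between <unbound> and </unbound>.
decode_custom_options() {
  sed -n '/<unbound>/,/<\/unbound>/s:.*<custom_options>\(.*\)</custom_options>.*:\1:p' | base64 -d
}
```

For example, on pfSense: `decode_custom_options < /cf/conf/config.xml`. Restricting the match to the `<unbound>` section avoids picking up a `custom_options` element that belongs to another service.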
|
||||
|
||||
## Kea DHCP-DDNS TSIG (WS E, 2026-04-19)
|
||||
|
||||
Kea DHCP-DDNS on pfSense signs its RFC 2136 dynamic updates with an
|
||||
HMAC-SHA256 TSIG key (`kea-ddns`). Technitium's `viktorbarzin.lan` zone
|
||||
and reverse zones (`10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`,
|
||||
`1.168.192.in-addr.arpa`) require both a pfSense-source IP (10.0.20.1 /
|
||||
10.0.10.1 / 192.168.1.2) AND a valid TSIG signature.
|
||||
|
||||
### Config locations
|
||||
|
||||
| Side | File | Notes |
|------|------|-------|
| pfSense | `/usr/local/etc/kea/kea-dhcp-ddns.conf` | Hand-managed. Pre-WS-E backup: `.2026-04-19-pre-tsig`. Daemon: `kea-dhcp-ddns` (`pkill -x kea-dhcp-ddns && /usr/local/sbin/kea-dhcp-ddns -c /usr/local/etc/kea/kea-dhcp-ddns.conf -d &`) |
| Technitium | Zone options API: `POST /api/zones/options/set?zone=<z>&updateSecurityPolicies=kea-ddns\|*.<z>\|ANY&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&update=UseSpecifiedNetworkACL` | Set on primary; replicates to secondary/tertiary via AXFR |
| Technitium settings | TSIG keys array: `POST /api/settings/set` with `tsigKeys: [{keyName: "kea-ddns", sharedSecret: <b64>, algorithmName: "hmac-sha256"}]` | Must be set on all 3 Technitium instances (primary, secondary, tertiary) |
| Vault | `secret/viktor/kea_ddns_tsig_secret` | Authoritative copy of the base64 secret |
|
||||
|
||||
### Rotating the TSIG key
|
||||
|
||||
1. Generate a new 32-byte secret, base64-encoded: `openssl rand -base64 32`. Any base64-encoded blob of reasonable length works: HMAC hashes keys longer than the hash block size (64 bytes for SHA-256) and zero-pads shorter ones.
|
||||
2. Write it to Vault: `vault kv patch secret/viktor kea_ddns_tsig_secret=<new-secret>`.
|
||||
3. Add the new key under a **new name** (e.g., `kea-ddns-v2`) via the Technitium settings API on all 3 instances. Do NOT overwrite `kea-ddns` while Kea still uses it — you'd orphan in-flight updates.
|
||||
4. Update `/usr/local/etc/kea/kea-dhcp-ddns.conf` on pfSense to reference both keys in `tsig-keys`, set `key-name: kea-ddns-v2` on each `forward-ddns` / `reverse-ddns` domain, restart `kea-dhcp-ddns`.
|
||||
5. Update each affected zone's `updateSecurityPolicies` to use the new key name.
|
||||
6. After a lease-renewal cycle (default Kea lease = 7200s / 2h), verify with `kubectl -n technitium exec <primary-pod> -- grep "TSIG KeyName: kea-ddns-v2" /etc/dns/logs/<today>.log`.
|
||||
7. Remove the old `kea-ddns` key from Technitium settings + Kea config.
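
Before writing a freshly generated secret to Vault (steps 1 and 2), it can be worth confirming that it decodes to the expected number of raw bytes. This is a small sketch, assuming `openssl` and `base64` are on `PATH`; both function names are hypothetical helpers, not part of any tool.

```bash
# Generate a candidate TSIG secret (same command as step 1).
new_tsig_secret() {
  openssl rand -base64 32
}

# Print how many raw bytes a base64 string decodes to.
secret_byte_length() {
  printf '%s' "$1" | base64 -d | wc -c | tr -d ' '
}

SECRET=$(new_tsig_secret)
echo "decoded length: $(secret_byte_length "$SECRET") bytes"
```

A length other than 32 would suggest the value was truncated or mangled while being copied.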
|
||||
|
||||
### Emergency TSIG bypass (if rotation breaks DDNS)
|
||||
|
||||
If DDNS updates are failing and you cannot quickly fix the key, temporarily
|
||||
downgrade the zone policy to IP-ACL only (pfSense source IPs) without
|
||||
TSIG:
|
||||
|
||||
```bash
|
||||
kubectl -n technitium port-forward pod/<primary-pod> 5380:5380 &
|
||||
TOKEN=$(curl -s -X POST http://127.0.0.1:5380/api/user/login \
|
||||
-d "user=admin&pass=$(vault kv get -field=technitium_password secret/platform)&includeInfo=false" | jq -r .token)
|
||||
|
||||
for Z in viktorbarzin.lan 10.0.10.in-addr.arpa 20.0.10.in-addr.arpa 1.168.192.in-addr.arpa; do
|
||||
curl -s -X POST "http://127.0.0.1:5380/api/zones/options/set?token=$TOKEN&zone=$Z&update=UseSpecifiedNetworkACL&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&updateSecurityPolicies="
|
||||
done
|
||||
```
|
||||
|
||||
This clears `updateSecurityPolicies` while keeping the IP ACL. Updates
|
||||
now flow unsigned from pfSense IPs — **weaker** than TSIG but restores
|
||||
service. Re-enable TSIG as soon as the key issue is resolved.
|
||||
|
||||
### Verify TSIG is enforced
|
||||
|
||||
```bash
|
||||
# Unsigned update should fail
|
||||
nsupdate <<EOF
|
||||
server 10.0.20.201 53
|
||||
zone viktorbarzin.lan
|
||||
update delete tsig-test.viktorbarzin.lan.
|
||||
update add tsig-test.viktorbarzin.lan. 300 A 10.99.99.99
|
||||
send
|
||||
EOF
|
||||
# Expected: "update failed: REFUSED"
|
||||
|
||||
# Signed update should succeed
|
||||
cat > /tmp/kea-ddns.key <<EOF
|
||||
key "kea-ddns" {
|
||||
algorithm hmac-sha256;
|
||||
secret "$(vault kv get -field=kea_ddns_tsig_secret secret/viktor)";
|
||||
};
|
||||
EOF
|
||||
nsupdate -k /tmp/kea-ddns.key <<EOF
|
||||
server 10.0.20.201 53
|
||||
zone viktorbarzin.lan
|
||||
update delete tsig-test.viktorbarzin.lan.
|
||||
update add tsig-test.viktorbarzin.lan. 300 A 10.99.99.99
|
||||
send
|
||||
EOF
|
||||
dig @10.0.20.201 +short tsig-test.viktorbarzin.lan
|
||||
# Expected: 10.99.99.99
|
||||
rm -f /tmp/kea-ddns.key
|
||||
```
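
The signed-update check writes the TSIG secret to a file under `/tmp`. A slightly more careful variant, sketched here with a hypothetical `write_tsig_keyfile` helper and a placeholder secret, creates the file with owner-only permissions and an unpredictable name:

```bash
# Write a TSIG key file readable only by the current user and print its path.
# write_tsig_keyfile is a hypothetical helper; the caller supplies the secret,
# e.g. from `vault kv get -field=kea_ddns_tsig_secret secret/viktor`.
write_tsig_keyfile() {
  key_name=$1
  secret=$2
  keyfile=$(umask 077 && mktemp) || return 1
  cat > "$keyfile" <<EOF
key "$key_name" {
  algorithm hmac-sha256;
  secret "$secret";
};
EOF
  printf '%s\n' "$keyfile"
}

KEYFILE=$(write_tsig_keyfile kea-ddns "ZHVtbXk=")   # placeholder secret
ls -l "$KEYFILE"
# A real session would run: nsupdate -k "$KEYFILE" ...
rm -f "$KEYFILE"
```

In an interactive session, `trap 'rm -f "$KEYFILE"' EXIT` can replace the final `rm` so the file is removed even if an earlier command fails.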
|
||||
|
||||
## Related Docs
|
||||
|
||||
- `docs/architecture/dns.md` — overall DNS architecture (K8s side, Technitium, CoreDNS)
|
||||
- `docs/architecture/networking.md` — VLAN layout, pfSense interface mapping
|
||||
- `.claude/skills/pfsense/skill.md` — SSH / CLI patterns for pfSense management
|
||||
103
docs/runbooks/proxmox-host.md
Normal file
|
|
@ -0,0 +1,103 @@
|
|||
# Runbook: Proxmox host (pve, 192.168.1.127)
|
||||
|
||||
Last updated: 2026-04-19
|
||||
|
||||
The Proxmox host is a baremetal hypervisor on the storage LAN
|
||||
(192.168.1.0/24) with a single IP `192.168.1.127`. It hosts every
|
||||
Kubernetes node VM and the NFS exports that back PVCs. It does **not**
|
||||
receive DHCP — its network config is static in
|
||||
`/etc/network/interfaces` (ifupdown). Because of that, DNS must be
|
||||
configured manually and stays out of the scope of Kea/DHCP-DDNS.
|
||||
|
||||
## DNS configuration
|
||||
|
||||
The host uses a plain `/etc/resolv.conf` with two nameservers. No
|
||||
`systemd-resolved`, no `resolvconf`, no NetworkManager — nothing
|
||||
manages `/etc/resolv.conf`; it is a regular file owned by root.
|
||||
|
||||
### Why plain `/etc/resolv.conf` and not systemd-resolved
|
||||
|
||||
1. Installing `systemd-resolved` on an active Proxmox node during
|
||||
business hours is the kind of change that risks breaking the NFS
|
||||
server or VM networking. PVE's Debian base does not ship
|
||||
`systemd-resolved` by default.
|
||||
2. The ifupdown `/etc/network/interfaces` file does not manage
|
||||
`/etc/resolv.conf` here — ifupdown's resolvconf integration is
|
||||
only active if the `resolvconf` package is installed, which it is
|
||||
not (`dpkg -l resolvconf` returns `un`).
|
||||
3. A plain file is the simplest mental model and avoids a second
|
||||
layer of "which tool is running now" confusion during an incident.
|
||||
|
||||
If you ever want to migrate to `systemd-resolved`, install the
|
||||
package, enable the service, symlink `/etc/resolv.conf` to
|
||||
`/run/systemd/resolve/stub-resolv.conf`, and drop the config in
|
||||
`/etc/systemd/resolved.conf.d/10-internal-dns.conf` — but do this
|
||||
during a maintenance window, not reactively.
|
||||
|
||||
### Current state
|
||||
|
||||
```
|
||||
# /etc/resolv.conf
|
||||
search viktorbarzin.lan
|
||||
nameserver 192.168.1.2
|
||||
nameserver 94.140.14.14
|
||||
options timeout:2 attempts:2
|
||||
```
|
||||
|
||||
| Field | Value | Purpose |
|---|---|---|
| Primary | `192.168.1.2` | pfSense LAN interface (dnsmasq forwarder → Technitium LB); resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS; recursive only, used if the pfSense LAN IP is unreachable |
| `search` | `viktorbarzin.lan` | Unqualified names (`technitium`, `idrac`, etc.) resolve against the internal zone |
| `timeout:2 attempts:2` | n/a | Cap glibc resolver at 2s per server, 2 tries; reasonable fallback latency |
|
||||
|
||||
### Verification commands
|
||||
|
||||
```sh
|
||||
ssh root@192.168.1.127 '
|
||||
cat /etc/resolv.conf # should show the two nameservers
|
||||
dig +short idrac.viktorbarzin.lan # expect an A record (192.168.1.4)
|
||||
dig +short github.com # expect an A record
|
||||
'
|
||||
```
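
The `cat /etc/resolv.conf` check above can also be scripted. The sketch below uses two hypothetical helpers (`list_nameservers`, `check_resolv_conf`) and runs against a scratch copy rather than the live file; the expected pair is the primary and fallback from this runbook.

```sh
# Print the nameservers from a resolv.conf-style file, in order, space-separated.
list_nameservers() {
  awk '$1 == "nameserver" { printf "%s%s", sep, $2; sep = " " } END { print "" }' "$1"
}

# Compare against the expected primary/fallback pair.
check_resolv_conf() {
  actual=$(list_nameservers "$1")
  if [ "$actual" = "192.168.1.2 94.140.14.14" ]; then
    echo "OK: $actual"
  else
    echo "MISMATCH: got '$actual'" >&2
    return 1
  fi
}

# Demo against a scratch copy of the file.
tmp=$(mktemp)
printf 'search viktorbarzin.lan\nnameserver 192.168.1.2\nnameserver 94.140.14.14\noptions timeout:2 attempts:2\n' > "$tmp"
check_resolv_conf "$tmp"
rm -f "$tmp"
```

On the host itself the call would be `check_resolv_conf /etc/resolv.conf`.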
|
||||
|
||||
Simulated failover — force the primary unreachable and verify the
|
||||
fallback answers:
|
||||
|
||||
```sh
|
||||
ssh root@192.168.1.127 '
|
||||
ip route add blackhole 192.168.1.2
|
||||
dig +short +time=3 github.com       # dig cannot reach the primary, falls through to 94.140.14.14 → A record returned
|
||||
ip route del blackhole 192.168.1.2 # cleanup
|
||||
'
|
||||
```
|
||||
|
||||
Expected behaviour: `dig` prints a warning about the UDP
setup failing for 192.168.1.2 and then prints the GitHub A record
(answered by 94.140.14.14).
|
||||
|
||||
## Rollback
|
||||
|
||||
A pre-change backup of `/etc/resolv.conf`, `/etc/network/interfaces`,
|
||||
and `/etc/network/interfaces.d/` lives at
|
||||
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
|
||||
host. To roll back:
|
||||
|
||||
```sh
|
||||
ssh root@192.168.1.127 '
|
||||
# pick the backup you want (there may be multiple if this runbook has been applied more than once)
|
||||
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
|
||||
tar -xzf "$BACKUP" -C /
|
||||
cat /etc/resolv.conf
|
||||
'
|
||||
```
|
||||
|
||||
No service restart is needed — glibc re-reads `/etc/resolv.conf` per
|
||||
lookup.
|
||||
|
||||
## Related docs
|
||||
|
||||
- `docs/architecture/dns.md` — where each resolver IP lives and which
|
||||
subnet it serves.
|
||||
- `docs/runbooks/nfs-prerequisites.md` — other operations on this
|
||||
host; read before adding new NFS exports.
|
||||
188
docs/runbooks/r730-ram-upgrade-272gb.md
Normal file
|
|
@ -0,0 +1,188 @@
|
|||
# RAM Upgrade — Dell R730 Proxmox Host (Completed 2026-04-01)
|
||||
|
||||
**Host**: Dell R730 @ 192.168.1.127 (Proxmox)
|
||||
**CPU**: Single Xeon E5-2699 v4 (CPU2 unpopulated — B-side slots unavailable)
|
||||
**Before**: 144 GB (4x32G Samsung BB1 + 2x8G SK Hynix) @ 2400 MHz
|
||||
**After**: 272 GB (4x32G Samsung BB1 + 4x32G Samsung CB1 + 2x8G SK Hynix) @ 2400 MHz
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
1. **3 DPC downclock**: Adding DIMMs to the 3rd slot per channel (A11/A12) caused automatic downclocking to 1866 MHz. Dell R730 BIOS allows manual override back to 2400 MHz via **System BIOS > Memory Settings > Memory Frequency > Max Performance**.
|
||||
2. **MySQL InnoDB Cluster CR recreation**: Deleting and recreating the InnoDBCluster CR generates new admin secrets that don't match the existing data on PVCs. Fix: manually create the new admin user in MySQL and configure GR recovery channel credentials.
|
||||
3. **CNPG primary label**: After restarting the CNPG operator, it may not immediately label the primary pod with `role=primary`. Deleting the pod forces the operator to recreate it with the correct labels.
|
||||
4. **LimitRange blocks MySQL**: The `dbaas` namespace LimitRange (4Gi max) blocks MySQL pods that need 5Gi. Kyverno policy resets LimitRange patches. Fix: reduce MySQL memory limit in CR to 4Gi.
|
||||
|
||||
## Physical DIMM Slot Map (looking down at motherboard, front of server at bottom)
|
||||
|
||||
```
|
||||
╔══════════════════════════════════════════════════════════════════════════════╗
|
||||
║ CPU1 DIMM SLOTS ║
|
||||
║ ║
|
||||
║ ┌─── WHITE (1st per channel) ───┐ ║
|
||||
║ │ │ ║
|
||||
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
|
||||
║ │ │ A1 │ │ A2 │ │ A3 │ │ A4 │ ║
|
||||
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ ◄── KEEP (existing Samsung 32G) ║
|
||||
║ │ │██████│ │██████│ │██████│ │██████│ ║
|
||||
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
|
||||
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
|
||||
║ └────────────────────────────────┘ ║
|
||||
║ ║
|
||||
║ ┌─── BLACK (2nd per channel) ───┐ ║
|
||||
║ │ │ ║
|
||||
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
|
||||
║ │ │ A5 │ │ A6 │ │ A7 │ │ A8 │ ║
|
||||
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ ◄── INSTALL NEW 32G Samsung ║
|
||||
║ │ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ (remove old 8G from A5/A6) ║
|
||||
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
|
||||
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
|
||||
║ └────────────────────────────────┘ ║
|
||||
║ ║
|
||||
║ ┌─── GREEN (3rd per channel) ───┐ ║
|
||||
║ │ │ ║
|
||||
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
|
||||
║ │ │ A9 │ │ A10 │ │ A11 │ │ A12 │ ║
|
||||
║ │ │ │ │ │ │ 8G │ │ 8G │ ◄── MOVE old 8G Hynix here ║
|
||||
║ │ │ empty│ │ empty│ │░░░░░░│ │░░░░░░│ (from A5 → A11, A6 → A12) ║
|
||||
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
|
||||
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
|
||||
║ └────────────────────────────────┘ ║
|
||||
║ ║
|
||||
║ Legend: ██ = existing 32G (keep in place) ║
|
||||
║ ▓▓ = NEW 32G Samsung M393A4K40BB1-CRC (install) ║
|
||||
║ ░░ = relocated 8G SK Hynix HMA81GR7AFR8N-UH (moved from A5/A6) ║
|
||||
║ ║
|
||||
╚══════════════════════════════════════════════════════════════════════════════╝
|
||||
```
|
||||
|
||||
## Channel Summary After Install
|
||||
|
||||
```
|
||||
Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched
|
||||
Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched
|
||||
Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus
|
||||
Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus
|
||||
───────── ───────── ──────────
|
||||
WHITE BLACK GREEN TOTAL: 272 GB
|
||||
(keep) (new 32G) (moved 8G)
|
||||
```
|
||||
|
||||
**Performance**: ~1-2% bandwidth penalty on Ch2/Ch3 due to mixed DIMM sizes. Ch0/Ch1 fully matched.
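
The totals in the channel summary can be re-derived with shell arithmetic. This is only a worked check of the numbers already given (DIMM sizes in GB, 0 for an empty slot):

```bash
# Before the upgrade: 4 x 32G + 2 x 8G.
before=$((4 * 32 + 2 * 8))

# After: per-channel capacity = slot 1 + slot 2 + slot 3.
ch0=$((32 + 32 + 0))
ch1=$((32 + 32 + 0))
ch2=$((32 + 32 + 8))
ch3=$((32 + 32 + 8))
total=$((ch0 + ch1 + ch2 + ch3))

echo "before=${before} GB"
echo "Ch0=$ch0 Ch1=$ch1 Ch2=$ch2 Ch3=$ch3 total=${total} GB"
```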
|
||||
|
||||
## Shutdown Sequence
|
||||
|
||||
### Phase 0: Gracefully Stop Stateful Services
|
||||
|
||||
Scale down databases, caches, and secrets engines before draining nodes to ensure clean shutdown with no data loss.
|
||||
|
||||
```bash
|
||||
export KUBECONFIG=/path/to/config
|
||||
|
||||
# 1. Vault — seal all instances (flushes WAL, closes connections)
|
||||
kubectl -n vault exec vault-0 -- vault operator step-down 2>/dev/null
|
||||
kubectl -n vault exec vault-0 -- vault operator seal
|
||||
kubectl -n vault exec vault-1 -- vault operator seal
|
||||
kubectl -n vault exec vault-2 -- vault operator seal
|
||||
|
||||
# 2. MySQL InnoDB Cluster — set super_read_only, scale router to 0
|
||||
kubectl -n dbaas scale deploy mysql-cluster-router --replicas=0
|
||||
kubectl -n dbaas exec mysql-cluster-0 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
|
||||
kubectl -n dbaas exec mysql-cluster-1 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
|
||||
kubectl -n dbaas exec mysql-cluster-2 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
|
||||
# innodb_fast_shutdown=0 forces full purge + change buffer merge on stop
|
||||
|
||||
# 3. PostgreSQL CNPG — trigger checkpoint on primaries
|
||||
kubectl -n dbaas exec pg-cluster-2 -- psql -U postgres -c "CHECKPOINT;"
|
||||
kubectl -n dbaas exec pg-cluster-4 -- psql -U postgres -c "CHECKPOINT;"
|
||||
kubectl -n immich exec deploy/immich-postgresql -- psql -U postgres -c "CHECKPOINT;"
|
||||
|
||||
# 4. Redis — trigger BGSAVE then scale down
|
||||
kubectl -n redis exec redis-node-0 -- redis-cli BGSAVE
|
||||
kubectl -n redis exec redis-node-1 -- redis-cli BGSAVE
|
||||
sleep 5 # wait for RDB flush
|
||||
kubectl -n redis scale deploy redis-haproxy --replicas=0
|
||||
|
||||
# 5. ClickHouse — flush
|
||||
kubectl -n rybbit exec deploy/clickhouse -- clickhouse-client --query "SYSTEM FLUSH LOGS"
|
||||
|
||||
# 6. Scale down stateful workloads
|
||||
kubectl -n dbaas scale sts mysql-cluster --replicas=0
|
||||
kubectl -n redis scale sts redis-node --replicas=0
|
||||
kubectl -n vault scale sts vault --replicas=0
|
||||
|
||||
# 7. Verify all stateful pods terminated
|
||||
kubectl get pods -A | grep -iE 'mysql-cluster-[0-9]|pg-cluster|redis-node|vault-[0-9]|clickhouse'
|
||||
```
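
Step 7 is a visual check. If a number is more convenient, the same pattern can be counted. The sketch below uses a hypothetical `count_stateful_pods` helper that reads `kubectl get pods -A` output on stdin; the sample listing is made up. Workloads that were only checkpointed or flushed in Phase 0 (not scaled to zero) will still be counted.

```bash
# Count pods matching the step 7 pattern in a pod listing read from stdin.
count_stateful_pods() {
  # grep -c prints the match count; "|| true" keeps a zero count from
  # being treated as a failure.
  grep -ciE 'mysql-cluster-[0-9]|pg-cluster|redis-node|vault-[0-9]|clickhouse' || true
}

# Made-up sample listing for illustration.
sample='dbaas        pg-cluster-2       1/1  Running  0  3d
rybbit       clickhouse-abc123  1/1  Running  0  3d
kube-system  coredns-xyz        1/1  Running  0  3d'

remaining=$(printf '%s\n' "$sample" | count_stateful_pods)
echo "stateful pods still present: $remaining"
```

Against the cluster this would be `kubectl get pods -A --no-headers | count_stateful_pods`.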
|
||||
|
||||
### Phase 1: Drain K8s Nodes
|
||||
|
||||
```bash
|
||||
# Drain workers (reverse order)
|
||||
kubectl drain k8s-node4 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
|
||||
kubectl drain k8s-node3 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
|
||||
kubectl drain k8s-node2 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
|
||||
kubectl drain k8s-node1 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
|
||||
|
||||
# Cordon master
|
||||
kubectl cordon k8s-master
|
||||
```
|
||||
|
||||
### Phase 2: Shutdown VMs (via Proxmox)
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# K8s workers
|
||||
for VMID in 201 202 203 204; do qm shutdown $VMID && echo "Shutdown VMID $VMID"; done
|
||||
sleep 30
|
||||
|
||||
# K8s master
|
||||
qm shutdown 200; sleep 15
|
||||
|
||||
# Docker registry
|
||||
qm shutdown 220; sleep 10
|
||||
|
||||
# Secondary VMs
|
||||
for VMID in 102 300 103; do qm shutdown $VMID; done
|
||||
sleep 20
|
||||
|
||||
# TrueNAS (decommissioned 2026-04-13 — VM 9000 should already be stopped; skip if absent)
|
||||
qm shutdown 9000 2>/dev/null || true
|
||||
|
||||
# pfSense (last — network gateway)
|
||||
qm shutdown 101; sleep 15
|
||||
|
||||
# Verify all VMs stopped
|
||||
qm list
|
||||
```
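
The final `qm list` can be reduced to the VMIDs that have not stopped yet. This sketch assumes the usual `qm list` column order (VMID, NAME, STATUS, ...) and VM names without spaces; `running_vms` is a hypothetical helper and the sample output is made up.

```bash
# Print the VMID of every VM whose STATUS column is not "stopped".
# Reads `qm list` output on stdin and skips the header row.
running_vms() {
  awk 'NR > 1 && $3 != "stopped" { print $1 }'
}

# Made-up sample output for illustration.
sample='      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       101 pfsense              running    4096              32.00 2211
       200 k8s-master           stopped    8192              64.00 0
       201 k8s-node1            stopped    32768            128.00 0'

printf '%s\n' "$sample" | running_vms
```

On the Proxmox host, `qm list | running_vms` printing nothing means every VM has stopped and Phase 3 can proceed.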
|
||||
|
||||
### Phase 3: Shutdown Proxmox Host
|
||||
|
||||
```bash
|
||||
shutdown -h now
|
||||
```
|
||||
|
||||
## Physical RAM Installation
|
||||
|
||||
| Step | Action | Slot(s) | DIMM / Notes |
|------|--------|---------|--------------|
| 1 | Power off host | n/a | Completed via Phase 3 above |
| 2 | **Remove** | A5 (black clip) | Take out 8G Hynix, set aside |
| 3 | **Remove** | A6 (black clip) | Take out 8G Hynix, set aside |
| 4 | **Install NEW** | A5 (black clip) | Insert 32G Samsung |
| 5 | **Install NEW** | A6 (black clip) | Insert 32G Samsung |
| 6 | **Install NEW** | A7 (black clip) | Insert 32G Samsung |
| 7 | **Install NEW** | A8 (black clip) | Insert 32G Samsung |
| 8 | **Install MOVED** | A11 (green clip) | Insert 8G Hynix (was in A5) |
| 9 | **Install MOVED** | A12 (green clip) | Insert 8G Hynix (was in A6) |
| 10 | Power on | n/a | n/a |
|
||||
|
||||
## Post-Boot Verification
|
||||
|
||||
```bash
|
||||
# Verify all 10 DIMMs detected
|
||||
ssh root@192.168.1.127 'dmidecode -t memory | grep -E "Locator:|Size:" | grep -v Bank'
|
||||
|
||||
# Verify total RAM (~268 GiB usable)
|
||||
ssh root@192.168.1.127 'free -h'
|
||||
```
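
Rather than adding up the `Size:` lines by eye, they can be summed. The sketch below is an assumption-laden helper (`sum_dimm_gb` is a hypothetical name): it expects `dmidecode -t memory` lines of the form `Size: 32 GB` or `Size: 8192 MB` and ignores empty slots; the sample input is made up to mirror the post-upgrade layout.

```bash
# Sum installed DIMM sizes (in GB) from `dmidecode -t memory` output on stdin.
# Only lines whose first field is exactly "Size:" are counted, so entries such
# as "Volatile Size:" or "No Module Installed" are skipped.
sum_dimm_gb() {
  awk '$1 == "Size:" && $3 == "GB" { total += $2 }
       $1 == "Size:" && $3 == "MB" { total += $2 / 1024 }
       END { print total + 0 }'
}

# Made-up sample: eight 32 GB DIMMs, two 8 GB DIMMs, two empty slots.
{
  for i in 1 2 3 4 5 6 7 8; do printf '\tSize: 32 GB\n'; done
  printf '\tSize: No Module Installed\n\tSize: No Module Installed\n'
  printf '\tSize: 8 GB\n\tSize: 8 GB\n'
} | sum_dimm_gb
```

Against the host this would be `ssh root@192.168.1.127 'dmidecode -t memory' | sum_dimm_gb`, with 272 as the value to look for.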
|
||||
170
docs/runbooks/registry-rebuild-image.md
Normal file
|
|
@ -0,0 +1,170 @@
|
|||
# Runbook: Rebuild an Image After a Registry Orphan-Index Incident
|
||||
|
||||
Last updated: 2026-04-19
|
||||
|
||||
## When to use this
|
||||
|
||||
Pipelines that pull from `registry.viktorbarzin.me:5050` are failing with
|
||||
messages like:
|
||||
|
||||
- `failed to resolve reference … : not found`
|
||||
- `manifest unknown`
|
||||
- `image can't be pulled` (Woodpecker exit 126)
|
||||
- `error pulling image`: HEAD on a child manifest digest returns 404
|
||||
|
||||
…and `skopeo inspect --tls-verify --creds "$USER:$PASS" docker://registry.viktorbarzin.me:5050/<image>:<tag>`
|
||||
returns an OCI image index whose `manifests[].digest` references are 404
|
||||
on the registry.
|
||||
|
||||
This is the **orphan OCI-index** failure mode documented in
|
||||
`docs/post-mortems/2026-04-19-registry-orphan-index.md`. The fix is to
|
||||
rebuild the affected image from source so the registry receives a fresh,
|
||||
complete push.
|
||||
|
||||
If the symptom is different (e.g., registry container down, TLS expiry,
|
||||
auth failure), use `docs/runbooks/registry-vm.md` instead.
|
||||
|
||||
## Phase 1 — Confirm the diagnosis
|
||||
|
||||
From any host with `skopeo`:
|
||||
|
||||
```sh
|
||||
REG=registry.viktorbarzin.me:5050
|
||||
IMAGE=infra-ci
|
||||
TAG=latest
|
||||
|
||||
# 1. Confirm the index exists.
|
||||
skopeo inspect --tls-verify --creds "$USER:$PASS" \
|
||||
--raw "docker://$REG/$IMAGE:$TAG" | jq '.mediaType, .manifests[].digest'
|
||||
|
||||
# 2. HEAD each child. Any non-200 = confirmed orphan.
|
||||
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
|
||||
"docker://$REG/$IMAGE:$TAG" | jq -r '.manifests[].digest'); do
|
||||
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
|
||||
-I "https://$REG/v2/$IMAGE/manifests/$d")
|
||||
echo "$d → $code"
|
||||
done
|
||||
```
|
||||
|
||||
If every child is 200, the problem is elsewhere — stop here and check
|
||||
the registry VM, TLS, or auth.
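
The loop in step 2 prints one `digest → code` line per child manifest. For an index with many children, a small filter can surface only the failures. This is a sketch with a hypothetical `report_orphans` helper; the digests in the demo are shortened, made-up values.

```sh
# Read "digest → http_code" lines on stdin, print the non-200 entries,
# and return non-zero if any were found.
report_orphans() {
  awk '$NF != "200" { print "ORPHAN: " $1 " (HTTP " $NF ")"; bad = 1 }
       END { exit bad + 0 }'
}

# Made-up sample of the loop output.
printf '%s\n' \
  "sha256:1111 → 200" \
  "sha256:2222 → 404" \
  | report_orphans || echo "at least one child manifest is missing"
```

The loop's output can be piped straight into `report_orphans` by appending `| report_orphans` after `done`.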
|
||||
|
||||
The `registry-integrity-probe` CronJob in the `monitoring` namespace
|
||||
runs this same check every 15 minutes across every tag in the catalog;
|
||||
its last run is also a fast way to see which image(s) are affected:
|
||||
|
||||
```sh
|
||||
kubectl -n monitoring logs \
|
||||
$(kubectl -n monitoring get pods -l job-name -o name \
|
||||
| grep registry-integrity-probe | head -1)
|
||||
```
|
||||
|
||||
## Phase 2 — Rebuild
|
||||
|
||||
### Option A (preferred): rebuild via CI
|
||||
|
||||
Find the `build-*.yml` pipeline that produces the image:
|
||||
|
||||
| Image | Pipeline | Repo ID |
|---|---|---|
| `infra-ci` | `.woodpecker/build-ci-image.yml` | 1 (infra) |
| `infra` (cli) | `.woodpecker/build-cli.yml` | 1 (infra) |
| `k8s-portal` | `.woodpecker/k8s-portal.yml` | 1 (infra) |
|
||||
|
||||
Trigger a manual build. The Woodpecker API expects a numeric repo ID
|
||||
(paths with `owner/name` return HTML):
|
||||
|
||||
```sh
|
||||
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_admin_token secret/viktor)
|
||||
|
||||
# Kick off a manual build against master.
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines" \
|
||||
-d '{"branch":"master"}' | jq .number
|
||||
|
||||
# Follow the pipeline at https://ci.viktorbarzin.me/repos/1/pipeline/<number>
|
||||
```
|
||||
|
||||
The pipeline's `verify-integrity` step walks every blob the push
|
||||
references. If it passes, the registry now has a clean index; pull
|
||||
consumers will recover on next attempt.
|
||||
|
||||
### Option B (fallback): build on the registry VM
|
||||
|
||||
Only use this if Woodpecker itself is broken (its own pipeline runs
|
||||
from the same `infra-ci` image, so a corrupted `infra-ci:latest` can
|
||||
prevent Option A from recovering).
|
||||
|
||||
```sh
|
||||
ssh root@10.0.20.10 '
|
||||
cd /tmp
|
||||
git clone --depth 1 https://github.com/ViktorBarzin/infra
|
||||
cd infra/ci
|
||||
docker build -t registry.viktorbarzin.me:5050/infra-ci:manual -t registry.viktorbarzin.me:5050/infra-ci:latest .
|
||||
docker login -u "$USER" -p "$PASS" registry.viktorbarzin.me:5050
|
||||
docker push registry.viktorbarzin.me:5050/infra-ci:manual
|
||||
docker push registry.viktorbarzin.me:5050/infra-ci:latest
|
||||
'
|
||||
```
|
||||
|
||||
Then re-run any pipelines that failed — Woodpecker UI → Restart, or:
|
||||
|
||||
```sh
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
|
||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines/<failed-pipeline-number>"
|
||||
```
|
||||
|
||||
## Phase 3 — Verify
|
||||
|
||||
```sh
|
||||
# 1. Pull the image fresh (bypassing containerd cache) and check its index.
|
||||
REG=registry.viktorbarzin.me:5050
|
||||
skopeo inspect --tls-verify --creds "$USER:$PASS" \
|
||||
--raw "docker://$REG/infra-ci:latest" \
|
||||
| jq '.manifests[] | {digest, platform}'
|
||||
|
||||
# 2. HEAD every child digest; all should be 200.
fail=0
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
  "docker://$REG/infra-ci:latest" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -I "https://$REG/v2/infra-ci/manifests/$d")
  [ "$code" = "200" ] || { echo "STILL BROKEN: $d → $code"; fail=1; }
done
# Only report success when no child returned a non-200 status.
[ "$fail" -eq 0 ] && echo "verified"
|
||||
|
||||
# 3. Trigger an ad-hoc probe run for good measure.
# Compute the job name once so that create and logs refer to the same Job.
JOB=registry-integrity-probe-verify-$(date +%s)
kubectl -n monitoring create job --from=cronjob/registry-integrity-probe "$JOB"
kubectl -n monitoring logs -f -l job-name="$JOB"
|
||||
```
|
||||
|
||||
The `RegistryManifestIntegrityFailure` alert clears automatically when
|
||||
the probe's next run returns zero failures.
|
||||
|
||||
## Phase 4 — Investigate orphans
|
||||
|
||||
Once the immediate fix is in, check whether any OTHER images on the
|
||||
registry have orphan children:
|
||||
|
||||
```sh
|
||||
ssh root@10.0.20.10 'python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | grep "ORPHAN INDEX"'
|
||||
```
|
||||
|
||||
Each hit is a separate image that will eventually fail to pull. Rebuild
|
||||
them in the same way (Option A preferred). If the list is long, open a
|
||||
beads task — do NOT batch-delete the indexes; that's a destructive
|
||||
registry operation outside this runbook's scope.
|
||||
|
||||
## Related
|
||||
|
||||
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — why this
|
||||
happens.
|
||||
- `docs/runbooks/registry-vm.md` — VM-level operations (DNS,
|
||||
`docker compose` restarts).
|
||||
- `modules/docker-registry/fix-broken-blobs.sh` — the scanner cron
|
||||
itself, runs nightly and after each GC.
|
||||
- `stacks/monitoring/modules/monitoring/main.tf` —
|
||||
`registry_integrity_probe` CronJob definition.
|
||||
227
docs/runbooks/registry-vm.md
Normal file
|
|
@ -0,0 +1,227 @@
|
|||
# Runbook: Registry VM (docker-registry, 10.0.20.10)
|
||||
|
||||
Last updated: 2026-05-07
|
||||
|
||||
The registry VM is an Ubuntu 24.04 VM on the cluster LAN subnet
|
||||
`10.0.20.0/24`, with a static netplan config (no DHCP). Because it
|
||||
sits on a subnet that only has pfSense as its gateway, its DNS must
|
||||
be statically configured.
|
||||
|
||||
**As of Phase 4 of forgejo-registry-consolidation 2026-05-07** the VM
|
||||
no longer hosts the private R/W registry. It hosts pull-through
|
||||
caches only:
|
||||
|
||||
| Port | Upstream |
|---|---|
| 5000 | docker.io (Docker Hub), auth via `dockerhub_registry_password` |
| 5010 | ghcr.io |
| 5020 | quay.io |
| 5030 | registry.k8s.io |
| 5040 | reg.kyverno.io |
|
||||
|
||||
The private registry formerly served on port 5050 is decommissioned; its images are now hosted on
Forgejo at `forgejo.viktorbarzin.me/viktor/<image>`. See
|
||||
`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md` for the
|
||||
migration. Break-glass tarballs of `infra-ci` are still produced on
|
||||
each build to `/opt/registry/data/private/_breakglass/` — see
|
||||
`docs/runbooks/forgejo-registry-breakglass.md`.
|
||||
|
||||
## DNS configuration
|
||||
|
||||
Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
|
||||
`nameservers`. Netplan writes systemd-networkd or NetworkManager
|
||||
configs that resolved reads at runtime. There is **no automatic
|
||||
merging** of netplan DNS with the `[Resolve]` section of
|
||||
`/etc/systemd/resolved.conf` — per-link settings override the global
|
||||
ones. So both layers must be in sync:
|
||||
|
||||
| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=`, also shown in `resolvectl status` |
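
Because the two layers are configured separately, it can help to read the global values straight from the drop-in and compare them with what the link reports. The sketch below is a minimal parser under stated assumptions: `resolved_value` is a hypothetical helper, it handles only simple `Key=Value` lines, and the demo runs on a scratch file containing the values this runbook expects.

```sh
# Print the value of a key (e.g. DNS, FallbackDNS) from the [Resolve]
# section of a systemd-resolved drop-in file.
resolved_value() {
  awk -F= -v key="$2" '
    /^\[/ { in_resolve = ($0 == "[Resolve]") }
    in_resolve && $1 == key { print $2 }
  ' "$1"
}

# Demo on a scratch copy of the expected drop-in.
tmp=$(mktemp)
printf '[Resolve]\nDNS=10.0.20.1\nFallbackDNS=94.140.14.14\nDomains=viktorbarzin.lan\n' > "$tmp"
echo "global DNS:      $(resolved_value "$tmp" DNS)"
echo "global fallback: $(resolved_value "$tmp" FallbackDNS)"
rm -f "$tmp"
```

On the VM, the file argument would be `/etc/systemd/resolved.conf.d/10-internal-dns.conf`, and the per-link servers to compare against appear under `Link 2 (eth0)` in `resolvectl status`.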
|
||||
|
||||
### Current state
|
||||
|
||||
`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:
|
||||
|
||||
```ini
|
||||
[Resolve]
|
||||
DNS=10.0.20.1
|
||||
FallbackDNS=94.140.14.14
|
||||
Domains=viktorbarzin.lan
|
||||
```
|
||||
|
||||
`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):
|
||||
|
||||
```yaml
|
||||
nameservers:
|
||||
addresses:
|
||||
- 10.0.20.1
|
||||
- 94.140.14.14
|
||||
search:
|
||||
- viktorbarzin.lan
|
||||
```
|
||||
|
||||
`resolvectl status` output after the change:
|
||||
|
||||
```
|
||||
Global
|
||||
resolv.conf mode: stub
|
||||
Current DNS Server: 10.0.20.1
|
||||
DNS Servers: 10.0.20.1
|
||||
Fallback DNS Servers: 94.140.14.14
|
||||
DNS Domain: viktorbarzin.lan
|
||||
|
||||
Link 2 (eth0)
|
||||
Current Scopes: DNS
|
||||
Current DNS Server: 10.0.20.1
|
||||
DNS Servers: 10.0.20.1 94.140.14.14
|
||||
DNS Domain: viktorbarzin.lan
|
||||
```
|
||||
|
||||
| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB); resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS; used if pfSense is unreachable (e.g., OPT1 flap) |
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |
|
||||
|
||||
### Why this matters for the registry
|
||||
|
||||
Container builds on this VM reference `.lan` hostnames (Technitium,
|
||||
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
|
||||
hardening the netplan had `1.1.1.1` / `8.8.8.8` only, which meant:
|
||||
|
||||
1. Internal hostname lookups silently failed (slow timeout) — the
|
||||
VM could not resolve `idrac.viktorbarzin.lan` or any internal
|
||||
helper.
|
||||
2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS
|
||||
entirely.
|
||||
|
||||
With the new config the VM can resolve both zones and keeps working
|
||||
if the primary DNS server is unreachable.
|
||||
|
||||
## Apply / re-apply
|
||||
|
||||
```sh
|
||||
ssh root@10.0.20.10 '
|
||||
netplan generate
|
||||
netplan apply
|
||||
systemctl restart systemd-resolved
|
||||
resolvectl status | head -20
|
||||
'
|
||||
```
|
||||
|
||||
`netplan apply` is not disruptive when only `nameservers` change — it
|
||||
does not bounce the link.
|
||||
|
||||
## Verification
|
||||
|
||||
```sh
|
||||
ssh root@10.0.20.10 '
|
||||
dig +short idrac.viktorbarzin.lan # 192.168.1.4
|
||||
dig +short github.com # GitHub A record
|
||||
dig +short registry.viktorbarzin.me # 10.0.20.10 + external A
|
||||
'
|
||||
```
|
||||
|
||||
Fallback test — blackhole the primary and confirm external lookups
|
||||
still succeed through 94.140.14.14:
|
||||
|
||||
```sh
|
||||
ssh root@10.0.20.10 '
|
||||
ip route add blackhole 10.0.20.1
|
||||
dig +short +time=5 +tries=2 github.com # should still answer
|
||||
ip route del blackhole 10.0.20.1
|
||||
'
|
||||
```
|
||||
|
||||
Internal lookups do fail during the blackhole (the fallback is a
|
||||
public resolver and does not know about the internal zone), which is
|
||||
expected — the fallback buys availability for external pulls, not
|
||||
internal hostnames.
|
||||
|
||||
## Rollback
|
||||
|
||||
A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
|
||||
and `/etc/netplan/` lives at
|
||||
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
|
||||
VM. To roll back:
|
||||
|
||||
```sh
|
||||
ssh root@10.0.20.10 '
|
||||
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
|
||||
tar -xzf "$BACKUP" -C /
|
||||
rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
|
||||
netplan apply
|
||||
systemctl restart systemd-resolved
|
||||
resolvectl status | head -10
|
||||
'
|
||||
```
|
||||
|
||||
## Auto-sync pipeline
|
||||
|
||||
Changes to `modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh,
|
||||
cleanup-tags.sh, nginx_registry.conf, config-private.yml}` deploy
|
||||
automatically via `.woodpecker/registry-config-sync.yml`:
|
||||
|
||||
- Fires on `push` to master touching any of those paths, or via `manual`
|
||||
event (Woodpecker UI / API).
|
||||
- SCPs every managed file to `/opt/registry/` on `10.0.20.10`.
|
||||
- Bounces containers + nginx when a compose-visible file changed; leaves
|
||||
them alone when only scripts changed (cron picks up automatically).
|
||||
- Runs a dry-run `fix-broken-blobs.sh` at the end to verify the registry
|
||||
is still coherent.
|
||||
|
||||
SSH credentials: Woodpecker repo-secret `registry_ssh_key` (ed25519,
|
||||
provisioned 2026-04-19). Public key at `/root/.ssh/authorized_keys` on
|
||||
`10.0.20.10`. Private key mirrored at `secret/woodpecker/registry_ssh_key`
|
||||
in Vault (subkeys `private_key` / `public_key` / `known_hosts_entry`).
|
||||
|
||||
Manual override if you need to sync right now:
|
||||
|
||||
```sh
|
||||
curl -sf -X POST \
|
||||
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
|
||||
"https://ci.viktorbarzin.me/api/repos/1/pipelines" \
|
||||
-d '{"branch":"master"}' | jq .number
|
||||
```
|
||||
|
||||
## Bouncing registry containers — the nginx DNS trap
|
||||
|
||||
`docker compose up -d` on `/opt/registry/docker-compose.yml` recreates
|
||||
`registry-*` containers when their image tag changes, which assigns them
|
||||
new IPs on the `registry` bridge network. **`registry-nginx` resolves its
|
||||
upstream DNS names (`registry-private`, `registry-dockerhub`, …) ONCE at
|
||||
startup and caches the results** — it does not re-resolve after a
|
||||
recreate.
|
||||
|
||||
Symptom if you forget: `/v2/_catalog` on `:5050` returns
|
||||
`{"repositories": []}`, `/v2/` returns 200 without auth, pulls return
|
||||
the wrong image. nginx is forwarding to a stale IP that now belongs to a
|
||||
different registry-* backend (commonly the pull-through ghcr or
|
||||
dockerhub cache, which have empty catalogs from the htpasswd-auth user's
|
||||
perspective).
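
The empty-catalog symptom is easy to test for mechanically. The helper below is a minimal sketch, assuming you already have the response body of `/v2/_catalog` from `:5050` in a shell variable; the name `catalog_is_empty` is hypothetical.

```sh
# Hypothetical helper: return 0 when a /v2/_catalog response body lists no
# repositories, which on :5050 is the stale-upstream symptom described above.
catalog_is_empty() {
  # Strip whitespace so '{"repositories": []}' and '{"repositories":[]}' match.
  body=$(printf '%s' "$1" | tr -d ' \t\r\n')
  [ "$body" = '{"repositories":[]}' ]
}
```

If it returns success right after a bounce, a stale upstream IP in `registry-nginx` is the first thing to rule out.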
|
||||
|
||||
**Always follow a registry-* bounce with `docker restart registry-nginx`.**
|
||||
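Or prevent the problem by setting a `resolver` directive in
`nginx_registry.conf` and passing the upstream name to `proxy_pass` through a variable, so nginx re-resolves it at runtime (when the DNS TTL expires) instead of only at startup.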
|
||||
|
||||
```sh
|
||||
ssh root@10.0.20.10 '
|
||||
cd /opt/registry && docker compose up -d
|
||||
docker restart registry-nginx
|
||||
sleep 3
|
||||
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
|
||||
| grep -E "registry-"
|
||||
'
|
||||
```
|
||||
|
||||
## Related docs
|
||||
|
||||
- `docs/architecture/dns.md` — resolver IP assignments per subnet.
|
||||
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
|
||||
and `containerd` `hosts.toml` redirects.
|
||||
- `docs/runbooks/registry-rebuild-image.md` — rebuild an image after an
|
||||
orphan OCI-index incident (different class of problem than DNS).
|
||||
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — root cause
|
||||
+ detection gaps behind the recurring missing-blob incidents.
|
||||
96
docs/runbooks/restore-etcd.md
Normal file
|
|
@ -0,0 +1,96 @@
|
|||
# Restore etcd
|
||||
|
||||
## Prerequisites
|
||||
- SSH access to `k8s-master` node
|
||||
- etcd snapshot available on NFS at `/mnt/main/etcd-backup/`
|
||||
- etcd PKI certs at `/etc/kubernetes/pki/etcd/` on master node
|
||||
|
||||
## Backup Location
|
||||
- NFS: `/mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db`
|
||||
- Replicated to Synology NAS (192.168.1.13) via Proxmox host offsite-sync-backup (inotify-driven rsync)
|
||||
- Retention: 30 days
|
||||
- Schedule: Daily at 00:00
|
||||
|
||||
## CRITICAL: etcd is the foundation of the cluster
|
||||
Restoring etcd will reset the entire Kubernetes state to the snapshot time. All objects created after the snapshot will be lost. This is a last-resort operation.
|
||||
|
||||
**Only restore etcd if the control plane is completely broken.**
|
||||
|
||||
## Restore Procedure
|
||||
|
||||
### 1. SSH to the master node
|
||||
```bash
|
||||
ssh k8s-master
|
||||
```
|
||||
|
||||
### 2. Identify the snapshot to restore
|
||||
```bash
|
||||
ls -lt /mnt/main/etcd-backup/etcd-snapshot-*.db | head -10
|
||||
```
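
`ls -lt` sorts by modification time, which a copy or rsync can change. As a hedged alternative, the sketch below picks the newest snapshot by name, relying on the `etcd-snapshot-YYYYMMDD-HHMMSS.db` pattern sorting chronologically as text. The function name `latest_snapshot` is hypothetical.

```bash
# Hypothetical helper: print the newest etcd snapshot in a directory.
# Sorts by file name (timestamp embedded in the name), not by mtime, and
# refuses to return an empty file.
latest_snapshot() {
  local dir=$1 newest
  newest=$(find "$dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' | sort | tail -n 1)
  [ -n "$newest" ] || { echo "no snapshot found in $dir" >&2; return 1; }
  [ -s "$newest" ] || { echo "snapshot $newest is empty" >&2; return 1; }
  printf '%s\n' "$newest"
}
```

Usage would be `SNAPSHOT=$(latest_snapshot /mnt/main/etcd-backup)`. If `etcdutl` is installed on the node, `etcdutl snapshot status "$SNAPSHOT"` is a further sanity check before restoring.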
|
||||
|
||||
### 3. Stop the API server and etcd
|
||||
```bash
|
||||
# Move static pod manifests to stop them
|
||||
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/
|
||||
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/
|
||||
|
||||
# Wait for pods to stop
|
||||
sudo crictl ps | grep -E "etcd|apiserver"
|
||||
```
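
The `crictl ps` check above runs once, so it has to be repeated by hand until the containers are gone. A polling wrapper could look like the following sketch. The helper names, the timeout and the interval are illustrative assumptions, not part of the existing tooling.

```bash
# Hypothetical helper: re-run a check command until it prints nothing,
# or give up after a timeout (both values in seconds).
wait_until_empty() {
  local timeout=$1 interval=$2; shift 2
  local waited=0
  while [ -n "$("$@" 2>/dev/null)" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "still running after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# The check from the step above, wrapped in a function so it can be polled.
control_plane_containers() {
  sudo crictl ps | grep -E "etcd|apiserver"
}
```

For example, `wait_until_empty 120 5 control_plane_containers` returns once neither container is listed, or fails after two minutes.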
|
||||
|
||||
### 4. Back up current etcd data
|
||||
```bash
|
||||
sudo mv /var/lib/etcd /var/lib/etcd.bak.$(date +%Y%m%d-%H%M%S)
|
||||
```
|
||||
|
||||
### 5. Restore the snapshot
|
||||
```bash
|
||||
sudo ETCDCTL_API=3 etcdctl snapshot restore /mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db \
|
||||
--data-dir=/var/lib/etcd \
|
||||
--name=k8s-master \
|
||||
--initial-cluster=k8s-master=https://127.0.0.1:2380 \
|
||||
--initial-advertise-peer-urls=https://127.0.0.1:2380
|
||||
```
|
||||
|
||||
### 6. Fix permissions
|
||||
```bash
|
||||
sudo chown -R root:root /var/lib/etcd
|
||||
```
|
||||
|
||||
### 7. Restart etcd and API server
|
||||
```bash
|
||||
sudo mv /etc/kubernetes/etcd.yaml /etc/kubernetes/manifests/
|
||||
# Wait for etcd to be ready
|
||||
sleep 30
|
||||
sudo mv /etc/kubernetes/kube-apiserver.yaml /etc/kubernetes/manifests/
|
||||
```
|
||||
|
||||
### 8. Verify restoration
|
||||
```bash
|
||||
# Check etcd health
|
||||
sudo ETCDCTL_API=3 etcdctl \
|
||||
--endpoints=https://127.0.0.1:2379 \
|
||||
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
|
||||
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
|
||||
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
|
||||
endpoint health
|
||||
|
||||
# Check cluster status
|
||||
kubectl get nodes
|
||||
kubectl get pods -A | head -20
|
||||
```
|
||||
|
||||
### 9. Reconcile state
|
||||
After etcd restore, some objects may be stale:
|
||||
```bash
|
||||
# Re-apply critical infrastructure
|
||||
cd /path/to/infra
|
||||
scripts/tg apply stacks/platform
|
||||
|
||||
# Check for orphaned resources
|
||||
kubectl get pods -A | grep -E "Terminating|Error|Unknown"
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
- Snapshot restore: ~10-15 minutes
|
||||
- Full reconciliation: ~30-60 minutes (depends on drift)
|
||||
173
docs/runbooks/restore-full-cluster.md
Normal file
|
|
@ -0,0 +1,173 @@
|
|||
# Full Cluster Rebuild
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## When to Use
|
||||
- Complete cluster failure (all VMs lost)
|
||||
- etcd corruption requiring full rebuild
|
||||
- Proxmox host failure requiring fresh VM provisioning
|
||||
|
||||
## Prerequisites
|
||||
- Proxmox host (192.168.1.127) accessible, with NFS exports on `/srv/nfs` and `/srv/nfs-ssd`
|
||||
- Synology NAS (192.168.1.13) accessible for offsite backup restore if the PVE host backup disk is also lost
|
||||
- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
|
||||
- Git repo with infra code
|
||||
- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)
|
||||
- Vault unseal keys (emergency kit)
|
||||
|
||||
## Rebuild Order
|
||||
|
||||
The rebuild must follow dependency order. Each layer depends on the one before it.
|
||||
|
||||
### Phase 1: Infrastructure (Proxmox VMs)
|
||||
```bash
|
||||
# 1. Provision VMs via Terraform
|
||||
cd infra
|
||||
scripts/tg apply stacks/infra
|
||||
|
||||
# 2. Wait for VMs to boot and be reachable
|
||||
# k8s-master, k8s-node3, k8s-node4, k8s-node5
|
||||
# (node1 has GPU workloads, node2 excluded from MySQL anti-affinity only — both are active cluster members)
|
||||
```
|
||||
|
||||
### Phase 2: Kubernetes Control Plane
|
||||
```bash
|
||||
# 3. Initialize kubeadm on master (if starting fresh)
|
||||
sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
|
||||
|
||||
# 4. Join worker nodes
|
||||
# Get join command from master, run on each node
|
||||
|
||||
# 5. OR restore etcd from snapshot (see restore-etcd.md)
|
||||
# This restores all K8s objects from the snapshot time
|
||||
```
|
||||
|
||||
### Phase 3: Storage Layer
|
||||
```bash
|
||||
# 6. Deploy CSI drivers (NFS + Proxmox)
|
||||
scripts/tg apply stacks/nfs-csi
|
||||
scripts/tg apply stacks/proxmox-csi
|
||||
|
||||
# 7. Verify PVs are accessible
|
||||
kubectl get pv
|
||||
kubectl get pvc -A | grep -v Bound
|
||||
```
|
||||
|
||||
### Phase 3.5: Restore PVC Data from sda Backup
|
||||
|
||||
After the storage layer is deployed, restore PVC data from the sda backup disk:
|
||||
|
||||
```bash
|
||||
# 8a. List available backup weeks
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# 8b. For each critical PVC, restore files:
|
||||
# Example: vaultwarden-data-proxmox
|
||||
WEEK="2026-14" # Use most recent week
|
||||
NAMESPACE="vaultwarden"
|
||||
PVC_NAME="vaultwarden-data-proxmox"
|
||||
|
||||
# Find the PV LV name
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep $PVC_NAME
|
||||
|
||||
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
|
||||
LV_NAME="vm-999-pvc-abc123"
|
||||
|
||||
# Mount the LV
|
||||
lvchange -ay pve/$LV_NAME
|
||||
mkdir -p /mnt/restore-temp
|
||||
mount /dev/pve/$LV_NAME /mnt/restore-temp
|
||||
|
||||
# Restore from backup
|
||||
rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/
|
||||
|
||||
# Unmount
|
||||
umount /mnt/restore-temp
|
||||
lvchange -an pve/$LV_NAME
|
||||
|
||||
# 8c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud)
|
||||
```
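
Copying the LV name out of the `HANDLE` column by hand is error-prone when repeating this for many PVCs. The sketch below derives it from the handle string, assuming the `<storage>:<lv-name>` form shown in the example above. The function name `lv_from_handle` is hypothetical.

```bash
# Hypothetical helper: extract the LV name from a CSI volumeHandle of the
# form "<storage>:<lv-name>", e.g. "local-lvm:vm-999-pvc-abc123".
lv_from_handle() {
  local handle=$1
  case "$handle" in
    *:?*) printf '%s\n' "${handle##*:}" ;;   # keep the text after the last colon
    *)    echo "unexpected volumeHandle: $handle" >&2; return 1 ;;
  esac
}
```

With the handle in `$HANDLE`, `LV_NAME=$(lv_from_handle "$HANDLE")` replaces the manual assignment. The function fails loudly if the handle does not match the assumed format.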
|
||||
|
||||
**Note on pfSense restore**: If pfSense needs restoration, restore `config.xml` from `/mnt/backup/pfsense/<week>/config.xml` via web UI, or full filesystem tar for custom scripts.
|
||||
|
||||
**Note on PVE config restore**: If custom scripts/timers are lost, restore from `/mnt/backup/pve-config/` (daily-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers).
|
||||
|
||||
### Phase 4: Vault (secrets foundation)
|
||||
```bash
|
||||
# 8. Deploy Vault (see restore-vault.md for full procedure)
|
||||
scripts/tg apply stacks/vault
|
||||
|
||||
# 9. Initialize/unseal/restore raft snapshot
|
||||
# 10. Verify ESO can connect
|
||||
scripts/tg apply stacks/external-secrets
|
||||
kubectl get externalsecrets -A
|
||||
```
|
||||
|
||||
### Phase 5: Platform Services
|
||||
```bash
|
||||
# 11. Deploy platform stack (Traefik, monitoring, Kyverno, etc.)
|
||||
scripts/tg apply stacks/platform
|
||||
|
||||
# 12. Verify ingress is working
|
||||
curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me/
|
||||
```
|
||||
|
||||
### Phase 6: Databases
|
||||
```bash
|
||||
# 13. Deploy database stack
|
||||
scripts/tg apply stacks/dbaas
|
||||
|
||||
# 14. Wait for CNPG and InnoDB clusters to initialize
|
||||
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=600s
|
||||
|
||||
# 15. Restore PostgreSQL from dump (see restore-postgresql.md)
|
||||
# 16. Restore MySQL from dump (see restore-mysql.md)
|
||||
```
|
||||
|
||||
### Phase 7: Application Services
|
||||
```bash
|
||||
# 17. Deploy remaining stacks in any order
|
||||
for stack in vaultwarden immich nextcloud linkwarden health; do
|
||||
scripts/tg apply stacks/$stack
|
||||
done
|
||||
|
||||
# 18. Restore Vaultwarden (see restore-vaultwarden.md)
|
||||
```
|
||||
|
||||
### Phase 8: Verification
|
||||
```bash
|
||||
# 19. Check all pods are running
|
||||
kubectl get pods -A | grep -v Running | grep -v Completed
|
||||
|
||||
# 20. Check all ingresses respond
|
||||
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[0].host}{"\n"}{end}' | while read host; do
|
||||
code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host/" 2>/dev/null)
|
||||
echo "$host: $code"
|
||||
done
|
||||
|
||||
# 21. Check monitoring
|
||||
# Verify Prometheus targets: https://prometheus.viktorbarzin.me/targets
|
||||
# Verify Alertmanager: https://alertmanager.viktorbarzin.me/
|
||||
|
||||
# 22. Run backup CronJobs manually to establish baseline
|
||||
kubectl create job --from=cronjob/backup-etcd manual-etcd-backup -n default
|
||||
kubectl create job --from=cronjob/postgresql-backup manual-pg-backup -n dbaas
|
||||
kubectl create job --from=cronjob/mysql-backup manual-mysql-backup -n dbaas
|
||||
kubectl create job --from=cronjob/vault-raft-backup manual-vault-backup -n vault
|
||||
kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwarden
|
||||
```
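
With many ingresses, the `host: code` output of step 20 is easier to read once the healthy lines are filtered out. The filter below is a sketch under a stated assumption: `000` (no connection) and `5xx` count as failures, while `2xx`, `3xx` and `4xx` mean the ingress answered. Adjust the pattern if a `404` should also count as broken. The name `failing_hosts` is hypothetical.

```bash
# Hypothetical filter: read "host: code" lines on stdin and print only the
# hosts that look unhealthy (connection failure or server error).
failing_hosts() {
  local line code
  while IFS= read -r line; do
    code=${line##*: }
    case "$code" in
      000|5??) printf '%s\n' "$line" ;;
    esac
  done
}
```

Piping the loop from step 20 into `failing_hosts` leaves only the entries that need attention.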
|
||||
|
||||
## Dependency Graph
|
||||
```
|
||||
etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps
|
||||
↓
|
||||
Restore DB dumps from
|
||||
/mnt/backup/nfs-mirror
|
||||
or Synology/pve-backup
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
- Full cluster rebuild from scratch: ~2-4 hours
|
||||
- With etcd restore (objects preserved): ~1-2 hours
|
||||
- Individual service restore: ~10-30 minutes each
|
||||
159
docs/runbooks/restore-lvm-snapshot.md
Normal file
|
|
@ -0,0 +1,159 @@
|
|||
# Runbook: Restore PVC from LVM Thin Snapshot
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## When to Use
|
||||
|
||||
- Rolling back a PVC to a previous state after a bad migration, data corruption, or accidental deletion
|
||||
- Pre-upgrade safety: snapshot before upgrade, restore if upgrade fails
|
||||
- Fast recovery for data changed within the last 7 days
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- SSH access to PVE host (192.168.1.127)
|
||||
- The `lvm-pvc-snapshot` script at `/usr/local/bin/lvm-pvc-snapshot`
|
||||
- kubectl configured on PVE host (`/root/.kube/config`)
|
||||
|
||||
## Snapshot Retention
|
||||
|
||||
- **Daily snapshots**: Created at 03:00 via systemd timer
|
||||
- **Retention**: 7 days (older snapshots automatically pruned)
|
||||
- **Coverage**: All proxmox-lvm PVCs except `dbaas` and `monitoring` namespaces
|
||||
|
||||
**If you need data older than 7 days**, see "Alternative: Restore from sda Backup" below.
|
||||
|
||||
## Procedure
|
||||
|
||||
### 1. List Available Snapshots
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 lvm-pvc-snapshot list
|
||||
```
|
||||
|
||||
Output shows all snapshots with their original LV, age, and data divergence percentage.
|
||||
|
||||
### 2. Identify the PVC LV Name
|
||||
|
||||
Find the LV name for your PVC:
|
||||
|
||||
```bash
|
||||
# From your workstation (with kubectl):
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle'
|
||||
|
||||
# The HANDLE column shows "local-lvm:<lv-name>"
|
||||
```
|
||||
|
||||
### 3. Run the Restore
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
lvm-pvc-snapshot restore <pvc-lv-name> <snapshot-lv-name>
|
||||
```
|
||||
|
||||
The script will:
|
||||
1. Look up the K8s PV/PVC/workload for the LV
|
||||
2. Show a dry-run of all actions
|
||||
3. Ask for confirmation (type `yes`)
|
||||
4. Scale down the workload (Deployment or StatefulSet)
|
||||
5. Rename the current LV to `<name>_pre_restore_<timestamp>`
|
||||
6. Rename the snapshot LV to the original name
|
||||
7. Scale the workload back up
|
||||
8. Wait for pod to become Ready
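
Steps 5 and 6 of the list above are the core of the restore: the live LV is renamed out of the way first, which frees its name for the snapshot and keeps the old data until cleanup. The dry-run sketch below only prints the two commands. It is not the `lvm-pvc-snapshot` script itself, and the name `plan_lv_swap` is hypothetical.

```bash
# Hypothetical dry-run sketch of steps 5 and 6: print the two renames that
# swap a snapshot in for the live LV. Nothing is executed.
# Assumption: the volume group is "pve", as in the manual procedure.
plan_lv_swap() {
  local original=$1 snapshot=$2
  local stamp=${3:-$(date +%Y%m%d_%H%M)}
  echo "lvrename pve $original ${original}_pre_restore_${stamp}"
  echo "lvrename pve $snapshot $original"
}
```

Reviewing the printed plan before running anything by hand makes it harder to swap the two arguments by mistake.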
|
||||
|
||||
### 4. Verify
|
||||
|
||||
```bash
|
||||
# Check pod is running
|
||||
kubectl get pods -n <namespace> -l app=<workload>
|
||||
|
||||
# Check the application is working correctly
|
||||
# (service-specific verification)
|
||||
```
|
||||
|
||||
### 5. Clean Up
|
||||
|
||||
Once you've verified the restore is correct, remove the pre-restore backup:
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 lvremove -f pve/<original-lv>_pre_restore_<timestamp>
|
||||
```
|
||||
|
||||
## Manual Restore (if script fails)
|
||||
|
||||
If the automated restore fails, perform these steps manually:
|
||||
|
||||
```bash
|
||||
# 1. Scale down the workload
|
||||
kubectl scale deployment/<name> -n <ns> --replicas=0
|
||||
# or for StatefulSets:
|
||||
kubectl scale statefulset/<name> -n <ns> --replicas=0
|
||||
|
||||
# 2. Wait for pods to terminate
|
||||
kubectl wait --for=delete pod -l app=<name> -n <ns> --timeout=120s
|
||||
|
||||
# 3. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 4. Verify LV is inactive
|
||||
lvs -o lv_name,lv_active pve | grep <lv-name>
|
||||
|
||||
# 5. Rename LVs
|
||||
lvrename pve <original-lv> <original-lv>_pre_restore_$(date +%Y%m%d_%H%M)
|
||||
lvrename pve <snapshot-lv> <original-lv>
|
||||
|
||||
# 6. Scale back up
|
||||
kubectl scale deployment/<name> -n <ns> --replicas=1
|
||||
```
|
||||
|
||||
## Database-Specific Notes
|
||||
|
||||
- **MySQL InnoDB**: After restore, InnoDB will replay redo logs automatically on startup. Check `SHOW ENGINE INNODB STATUS` for recovery progress.
|
||||
- **PostgreSQL**: WAL replay happens automatically. Check `pg_is_in_recovery()` and PostgreSQL logs.
|
||||
- **Redis**: Redis loads the RDB file on startup. Check `INFO persistence` for load status.
|
||||
|
||||
For databases, prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless you need a point-in-time more recent than the last dump.
|
||||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If the LVM snapshots have been pruned or do not reach back far enough (data lost more than 7 days ago), use the weekly file-level backup on sda:
|
||||
|
||||
**Location**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
|
||||
**Retention**: 4 weekly versions (weeks 0-3)
|
||||
|
||||
### Procedure
|
||||
|
||||
```bash
|
||||
# 1. List available backup weeks
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# 2. Identify the PVC backup directory
|
||||
ls -l /mnt/backup/pvc-data/2026-14/<namespace>/
|
||||
|
||||
# 3. Scale down the workload
|
||||
kubectl scale deployment/<name> -n <ns> --replicas=0
|
||||
|
||||
# 4. Mount the live PVC LV on PVE host
|
||||
lvchange -ay pve/<pvc-lv-name>
|
||||
mkdir -p /mnt/restore-temp
|
||||
mount /dev/pve/<pvc-lv-name> /mnt/restore-temp
|
||||
|
||||
# 5. Restore from backup
|
||||
rsync -avP --delete /mnt/backup/pvc-data/2026-14/<namespace>/<pvc-name>/ /mnt/restore-temp/
|
||||
|
||||
# 6. Unmount and scale up
|
||||
umount /mnt/restore-temp
|
||||
lvchange -an pve/<pvc-lv-name>
|
||||
kubectl scale deployment/<name> -n <ns> --replicas=1
|
||||
```
|
||||
|
||||
See `restore-pvc-from-backup.md` for detailed walkthrough.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Problem | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| "Another instance is running" | Concurrent snapshot/restore | Wait for timer to finish: `systemctl status lvm-pvc-snapshot.service` |
|
||||
| LV still active after scale-down | Proxmox CSI hasn't detached | Wait 30s, or `lvchange -an pve/<lv>` |
|
||||
| Pod stuck in ContainerCreating | Volume not attached to node | `kubectl describe pod` — check events for attach errors |
|
||||
| No PV found for volume handle | LV name doesn't match any PV | Check `kubectl get pv -o yaml` for the correct volumeHandle format |
|
||||
256
docs/runbooks/restore-mysql.md
Normal file
|
|
@ -0,0 +1,256 @@
|
|||
# Restore MySQL (Standalone)
|
||||
|
||||
Last updated: 2026-05-18 (after the 8.4.9 DD-upgrade disaster recovery)
|
||||
|
||||
Applies to the `mysql-standalone` StatefulSet in the `dbaas` namespace
|
||||
(raw `kubernetes_stateful_set_v1`, migrated from InnoDB Cluster on
|
||||
2026-04-16). The historic InnoDB-Cluster recovery flow is gone.
|
||||
|
||||
## Prerequisites
|
||||
- `kubectl` against the cluster
|
||||
- Root password: `kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d`
|
||||
- A backup dump on NFS at `/srv/nfs/mysql-backup/` (exported via
|
||||
`dbaas-mysql-backup-host` PVC inside the cluster)
|
||||
|
||||
## Backup Locations
|
||||
|
||||
| Location | Purpose | Retention |
|
||||
|---|---|---|
|
||||
| `/srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz` | Full daily dump (CronJob `mysql-backup`, daily 00:30 UTC) | 14 days |
|
||||
| `/srv/nfs/mysql-backup/per-db/<dbname>/dump_*.sql.gz` | Per-DB dumps (CronJob `mysql-backup-per-db`, daily 00:45 UTC) | 14 days |
|
||||
| Synology `Backup/Viki/nfs/mysql-backup/` | Offsite mirror via inotify-tracked rsync | unlimited |
|
||||
|
||||
Latest full dump is ~230MB compressed (~3GB uncompressed). Restore
|
||||
of a full dump into a fresh MySQL pod takes ~3 minutes.
|
||||
|
||||
## Scenario A — Single database restored alongside the others
|
||||
|
||||
When one DB is corrupted but MySQL is otherwise fine.
|
||||
|
||||
```bash
|
||||
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
|
||||
|
||||
# List per-db dumps for the affected database
|
||||
kubectl -n dbaas exec mysql-standalone-0 -- ls -lt /backup/per-db/<dbname>/
|
||||
|
||||
# Pipe a chosen dump into MySQL (REPLACE existing data in <dbname>):
|
||||
kubectl -n dbaas exec -i mysql-standalone-0 -- \
|
||||
sh -c "zcat /backup/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -uroot -p\"$ROOT_PWD\" <dbname>"
|
||||
|
||||
# Restart consumers
|
||||
kubectl -n <ns> rollout restart deployment
|
||||
```
|
||||
|
||||
## Scenario B — Full disaster: data dictionary corrupt or PVC unsalvageable
|
||||
|
||||
This is the path executed on 2026-05-18 when a Keel-driven bump to
|
||||
`mysql:8.4.9` left the data dictionary half-upgraded and 8.4.8 refused
|
||||
to start (`Server upgrade of version 80408 is still pending` —
|
||||
MY-013379). Wipes the PVC and rehydrates from the daily dump.
|
||||
|
||||
**Estimated downtime: 25 minutes.** Plan accordingly — Forgejo +
|
||||
registry + every MySQL app go offline during this.
|
||||
|
||||
### B.1 Stop the failing MySQL pod
|
||||
|
||||
```bash
|
||||
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
|
||||
```
|
||||
|
||||
### B.2 Verify the dump you intend to restore is healthy
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 'ls -la /srv/nfs/mysql-backup/dump_*.sql.gz | tail -5'
|
||||
# Sanity-check the header
|
||||
ssh root@192.168.1.127 'zcat /srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz | head -20'
|
||||
# Should show "MySQL dump 10.13 ... Server version 8.4.X"
|
||||
```
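
`head -20` only proves that the start of the file is readable. A more thorough check is sketched below, assuming the dump was written by `mysqldump` with comments enabled (the default), in which case the last comment line starts with `-- Dump completed`. The function name `check_dump` is hypothetical. On a ~3GB uncompressed dump the second check takes a little while because it decompresses the whole file.

```bash
# Hypothetical pre-restore check for a gzipped mysqldump file.
# 1) gzip -t reads the whole archive and fails on truncation or corruption.
# 2) A missing "-- Dump completed" trailer suggests the dump was aborted.
check_dump() {
  local dump=$1
  gzip -t "$dump" 2>/dev/null || { echo "corrupt or truncated gzip: $dump" >&2; return 1; }
  if ! gzip -dc "$dump" | tail -n 5 | grep -q -- '-- Dump completed'; then
    echo "no 'Dump completed' trailer: $dump" >&2
    return 1
  fi
  echo "ok: $dump"
}
```

Run it on the host that holds the dumps, for example `check_dump /srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`, before wiping the PVC in B.4.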
|
||||
|
||||
### B.3 Pin MySQL image in Terraform (if it auto-bumped)
|
||||
|
||||
If the upgrade was triggered by a Keel bump on a floating tag
|
||||
(`mysql:8.4`), edit `stacks/dbaas/modules/dbaas/main.tf` to pin to a
|
||||
known-good exact version (`mysql:8.4.8`). Commit but don't apply yet.
|
||||
|
||||
### B.4 Wipe the corrupted PVC
|
||||
|
||||
The PV reclaim policy defaults to **Retain** on
|
||||
`proxmox-lvm-encrypted` — `kubectl delete pvc` alone leaves the PV
|
||||
attached to the (corrupted) disk. Flip to `Delete` first so the CSI
|
||||
driver actually cleans up the underlying LV.
|
||||
|
||||
```bash
|
||||
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
|
||||
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
|
||||
kubectl -n dbaas delete pvc data-mysql-standalone-0
|
||||
```
|
||||
|
||||
The PV transitions to `Released` then gets cleaned up by the CSI
|
||||
controller; confirm with `kubectl get pv | grep <PV>` (eventually
|
||||
disappears).
|
||||
|
||||
### B.5 Scale MySQL back up via Terraform
|
||||
|
||||
```bash
|
||||
cd stacks/dbaas && /home/wizard/code/infra/scripts/tg apply
|
||||
```
|
||||
|
||||
This recreates the PVC fresh (5Gi initial; pvc-autoresizer grows it
|
||||
on demand) and starts a brand-new MySQL pod. The pod initializes an
|
||||
empty datadir using `MYSQL_ROOT_PASSWORD` from the `cluster-secret`
|
||||
K8s Secret — ~30s to ready.
|
||||
|
||||
### B.6 Restore the full dump via a one-shot Job
|
||||
|
||||
```bash
|
||||
cat <<'YAML' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  # Set the date by hand: the quoted heredoc does not expand $(date ...).
  name: mysql-restore-YYYY-MM-DD
  namespace: dbaas
spec:
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mysql:8.4.8
          command: ["bash","-c"]
          args:
            - |
              set -euo pipefail
              gunzip -c /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | \
                mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD"
              mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD }
          volumeMounts:
            - { name: backup, mountPath: /backup, readOnly: true }
      volumes:
        - name: backup
          persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
|
||||
```
|
||||
|
||||
Watch progress: `kubectl -n dbaas logs -f job/<name>`. Takes ~3 min
|
||||
for a 230MB compressed dump.
|
||||
|
||||
### B.7 Reset static MySQL users with passwords from Vault
|
||||
|
||||
**This step is mandatory.** `mysqldump` restores rows in `mysql.user`
|
||||
verbatim, including password hashes. But `null_resource.mysql_static_user`
|
||||
in Terraform writes the **current Vault password** to `forgejo` and
|
||||
`roundcubemail` — and that current password rarely matches the dump's
|
||||
hash. The apps will fail auth (forgejo logs `Error 1045 (28000): Access
|
||||
denied for user 'forgejo'@'...'`) until you reset them.
|
||||
|
||||
```bash
|
||||
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
|
||||
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
|
||||
|
||||
kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
|
||||
DROP USER IF EXISTS 'forgejo'@'%';
|
||||
DROP USER IF EXISTS 'roundcubemail'@'%';
|
||||
CREATE USER 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
|
||||
CREATE USER 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
|
||||
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
|
||||
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
|
||||
FLUSH PRIVILEGES;
|
||||
SQL
|
||||
```
|
||||
|
||||
`ALTER USER` sometimes hits `ERROR 1396 Operation ALTER USER failed`
|
||||
on freshly-restored DBs (stale grant-table cache); `DROP USER` +
|
||||
`CREATE USER` is the reliable form.
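
The heredoc in this step interpolates the Vault passwords straight into single-quoted SQL literals, so a password containing `'` or `\` would break the statement. The helper below is a minimal sketch of escaping such values first, assuming MySQL's default SQL mode, where backslash is an escape character inside string literals. The name `sql_quote` is hypothetical.

```bash
# Hypothetical helper: escape a value for use inside a single-quoted MySQL
# string literal by doubling backslashes and single quotes.
sql_quote() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/''/g"
}
```

Wrapping the two lookups, for example `FORGEJO_PW=$(sql_quote "$(vault kv get -field=mysql_forgejo_password secret/viktor)")`, leaves the rest of the step unchanged.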
|
||||
|
||||
Vault-rotated app users (nextcloud, codimd, grafana, paperless,
|
||||
phpipam, etc.) are managed by Vault DB engine and their dump password
|
||||
already matches the live K8s secret, so they need no manual fixup.
|
||||
|
||||
### B.8 Restart MySQL-dependent apps
|
||||
|
||||
The dump restore brings MySQL up, but app pods still hold stale
|
||||
connections (and forgejo has been crash-looping). Roll the
|
||||
deployments to force fresh connections:
|
||||
|
||||
```bash
|
||||
for ns_app in \
|
||||
"forgejo:deploy/forgejo" \
|
||||
"nextcloud:deploy/nextcloud" \
|
||||
"hackmd:deploy/hackmd" \
|
||||
"monitoring:deploy/grafana" \
|
||||
"paperless-ngx:deploy/paperless-ngx" \
|
||||
"uptime-kuma:deploy/uptime-kuma" \
|
||||
"url:deploy/shlink" \
|
||||
"realestate-crawler:deploy/realestate-crawler-api" \
|
||||
"realestate-crawler:deploy/realestate-crawler-celery" \
|
||||
"realestate-crawler:deploy/realestate-crawler-celery-beat" \
|
||||
"realestate-crawler:deploy/realestate-crawler-ui"; do
|
||||
ns=${ns_app%%:*}; app=${ns_app##*:}
|
||||
kubectl -n "$ns" rollout restart "$app" &
|
||||
done
|
||||
wait
|
||||
```
|
||||
|
||||
If any deployments stay stuck in `ImagePullBackOff` (e.g.
|
||||
`chrome-service`, `fire-planner`, `freedify`), those rely on the
|
||||
Forgejo registry — once forgejo is back, just delete their pods to
|
||||
force a fresh pull:
|
||||
|
||||
```bash
|
||||
kubectl -n chrome-service delete pod --all
|
||||
kubectl -n fire-planner delete pod --all
|
||||
kubectl -n freedify delete pod --all
|
||||
```
|
||||
|
||||
### B.9 Verify recovery
|
||||
|
||||
```bash
|
||||
# All workloads ready
|
||||
kubectl get deploy,sts -A -o json | jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | "\(.metadata.namespace)/\(.metadata.name)"'
|
||||
# (empty output = healthy)
|
||||
|
||||
# Database integrity: table counts per schema (ROOT_PWD must be set as in Scenario A)
|
||||
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
|
||||
-e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
|
||||
WHERE table_schema NOT IN ('information_schema','performance_schema','sys') \
|
||||
GROUP BY table_schema;"
|
||||
|
||||
# Forgejo's registry catalog (catches the cascade alert)
|
||||
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe manual-postrestore-$(date +%s)
|
||||
kubectl -n monitoring logs job/manual-postrestore-<timestamp> --tail=10
|
||||
# Expect "Probe complete: 0 failures across N repos / M tags / K indexes"
|
||||
|
||||
# Cluster-health re-run
|
||||
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
|
||||
```
|
||||
|
||||
### B.10 Clean up failed CronJob pods from the outage window
|
||||
|
||||
```bash
|
||||
kubectl delete pods -A --field-selector=status.phase=Failed
|
||||
```
|
||||
|
||||
## Why the 8.4.9 upgrade got us — and the version pin
|
||||
|
||||
The MySQL 8.4.9 data-dictionary upgrade from 80408 → 80409 stalls
|
||||
reliably on this hardware. ~24s of writes to `mysql.ibd` and the redo
|
||||
log, then no further progress, no CPU, no completion. We bumped the
|
||||
liveness probe to 600s (`initial_delay_seconds`) and still no
|
||||
progress. Hypothesised root cause: `innodb_io_capacity=100` combined
|
||||
with `innodb_page_cleaners=1` — the upgrade's spatial-reference-system
|
||||
flush phase is IO-starved. **Don't retry 8.4.9 without first bumping
|
||||
IO capacity and pinning a proper maintenance window.**
|
||||
|
||||
Until then, the StatefulSet pins to `mysql:8.4.8` exactly, not the
|
||||
floating `mysql:8.4` tag. Keel will not silently bump it.
|
||||
|
||||
## See also
|
||||
- `docs/runbooks/forgejo-registry-breakglass.md` — companion runbook
|
||||
for when the cascade has reached the registry layer.
|
||||
- Beads `code-eme8` / `code-k40p` — incident tracker entries (closed
|
||||
in commit ea475c3d).
|
||||
160
docs/runbooks/restore-postgresql.md
Normal file
|
|
@ -0,0 +1,160 @@
|
|||
# Restore PostgreSQL (CNPG)
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Prerequisites
|
||||
- `kubectl` access to the cluster
|
||||
- CNPG operator running in the cluster
|
||||
- Backup dump available on NFS at `/mnt/main/postgresql-backup/`
|
||||
- PostgreSQL superuser password (from `pg-cluster-superuser` secret in `dbaas` namespace)
|
||||
|
||||
## Backup Location
|
||||
- NFS: `/mnt/main/postgresql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
|
||||
- Mirrored to sda: `/mnt/backup/nfs-mirror/postgresql-backup/` (PVE host 192.168.1.127)
|
||||
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/`
|
||||
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
|
||||
|
||||
## Restore from pg_dumpall
|
||||
|
||||
### 1. Identify the backup to restore
|
||||
```bash
|
||||
# List available backups (from any node with NFS access)
|
||||
ls -lt /mnt/main/postgresql-backup/dump_*.sql.gz | head -20
|
||||
|
||||
# Or via a pod:
|
||||
kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["ls","-lt","/backup/"]}]}}' \
|
||||
-n dbaas
|
||||
```
|
||||
|
||||
### 2. Get the superuser password
|
||||
```bash
|
||||
kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d
|
||||
```
|
||||
|
||||
### 3. Option A: Restore into existing CNPG cluster
|
||||
```bash
|
||||
# Port-forward to the CNPG primary
|
||||
kubectl port-forward svc/pg-cluster-rw -n dbaas 5433:5432 &
|
||||
|
||||
# Restore (decompress and pipe to psql — this will overwrite existing data)
|
||||
export PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)  # exported so psql, not only zcat, sees it
|
||||
zcat /path/to/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h 127.0.0.1 -p 5433 -U postgres
|
||||
```
|
||||
|
||||
### 3. Option B: Rebuild CNPG cluster from scratch
|
||||
```bash
|
||||
# 1. Delete the existing cluster
|
||||
kubectl delete cluster pg-cluster -n dbaas
|
||||
|
||||
# 2. Wait for PVCs to be cleaned up
|
||||
kubectl get pvc -n dbaas -l cnpg.io/cluster=pg-cluster
|
||||
|
||||
# 3. Re-apply the cluster manifest (via terragrunt)
|
||||
cd infra && scripts/tg apply -target=null_resource.pg_cluster stacks/dbaas
|
||||
|
||||
# 4. Wait for cluster to be ready
|
||||
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=300s
|
||||
|
||||
# 5. Restore the dump
|
||||
export PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)  # set first: $PGPASSWORD below is expanded by the local shell
|
||||
kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","env":[{"name":"PGPASSWORD","value":"'$PGPASSWORD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h pg-cluster-rw.dbaas -U postgres"]}]}}' \
|
||||
-n dbaas
|
||||
```
|
||||
|
||||
### 4. Verify restoration
|
||||
```bash
|
||||
# Check databases exist
|
||||
PGPASSWORD=$PGPASSWORD psql -h 127.0.0.1 -p 5433 -U postgres -c "\l"
|
||||
|
||||
# Check table counts for critical databases
|
||||
for db in health linkwarden affine woodpecker claude_memory; do
|
||||
echo "=== $db ==="
|
||||
PGPASSWORD=$PGPASSWORD psql -h 127.0.0.1 -p 5433 -U postgres -d $db -c \
|
||||
"SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 5;"
|
||||
done
|
||||
```
|
||||
|
||||
### 5. Restart dependent services
|
||||
After restore, restart services that connect to PostgreSQL to pick up fresh connections:
|
||||
```bash
|
||||
kubectl rollout restart deployment -n health
|
||||
kubectl rollout restart deployment -n linkwarden
|
||||
# ... repeat for all PG-dependent services (excluding trading — disabled)
|
||||
```
|
||||
|
||||
## Restore Single Database (from per-db backup)
|
||||
|
||||
Per-database backups use `pg_dump -Fc` (custom format) and are stored at `/mnt/main/postgresql-backup/per-db/<dbname>/`.
|
||||
|
||||
### 1. List available per-db backups
|
||||
```bash
|
||||
ls -lt /mnt/main/postgresql-backup/per-db/<dbname>/
|
||||
|
||||
# Or via a pod:
|
||||
kubectl exec -n dbaas pg-cluster-1 -c postgres -- ls -lt /backup/per-db/<dbname>/ 2>/dev/null || \
|
||||
echo "Mount a backup pod — see Option A below"
|
||||
```
|
||||
|
||||
### 2. Restore a single database
|
||||
```bash
|
||||
# Port-forward to the CNPG primary
|
||||
kubectl port-forward svc/pg-cluster-rw -n dbaas 5433:5432 &
|
||||
|
||||
# Restore single database (drops and recreates objects in that DB only)
|
||||
export PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
|
||||
pg_restore -h 127.0.0.1 -p 5433 -U postgres -d <dbname> --clean --if-exists \
|
||||
/path/to/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.dump
|
||||
```
|
||||
|
||||
### 3. Verify
|
||||
```bash
|
||||
PGPASSWORD=$PGPASSWORD psql -h 127.0.0.1 -p 5433 -U postgres -d <dbname> -c \
|
||||
"SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"
|
||||
```
|
||||
|
||||
### 4. Restart the affected service only
|
||||
```bash
|
||||
kubectl rollout restart deployment -n <namespace>
|
||||
```
|
||||
|
||||
**Advantages over full restore**: Only the target database is affected. All other databases continue running with their current data.
|
||||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 2. Find the latest backup
|
||||
ls -lt /mnt/backup/nfs-mirror/postgresql-backup/
|
||||
|
||||
# 3. Mount sda backup on a pod
|
||||
PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
|
||||
|
||||
kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","env":[{"name":"PGPASSWORD","value":"'$PGPASSWORD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h pg-cluster-rw.dbaas -U postgres"]}],"nodeName":"k8s-master"}}' \
|
||||
-n dbaas
|
||||
```
|
||||
|
||||
## Alternative: Restore from Synology (if PVE host is down)
|
||||
|
||||
If the PVE host itself is unavailable:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/nfs/postgresql-backup/
|
||||
|
||||
# 3. Copy dump to a temporary location accessible from cluster
|
||||
# (e.g., via rsync to a surviving node, or restore PVE host first)
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
- Restore into existing cluster: ~10 minutes (depends on dump size)
|
||||
- Full rebuild: ~20-30 minutes
|
||||
231
docs/runbooks/restore-pvc-from-backup.md
Normal file
|
|
@ -0,0 +1,231 @@
|
|||
# Runbook: Restore PVC from sda File Backup
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## When to Use
|
||||
|
||||
- LVM snapshots are too old (>7 days) or missing
|
||||
- Need to restore data from a specific week (up to 4 weeks back)
|
||||
- LVM snapshot restore failed or snapshot is corrupt
|
||||
- Granular file-level restore (not full PVC)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- SSH access to PVE host (192.168.1.127)
|
||||
- kubectl configured (either on PVE host or your workstation)
|
||||
- sda backup disk mounted at `/mnt/backup` on PVE host
|
||||
|
||||
## Backup Location
|
||||
|
||||
**Path**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
|
||||
**Retention**: 4 weekly versions (weeks 0-3)
|
||||
**Deduplication**: `--link-dest` hardlink dedup (unchanged files share inodes across weeks)
|
||||
|
||||
## Procedure
|
||||
|
||||
### 1. List Available Backup Weeks
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# Output shows week directories like:
|
||||
# 2026-13
|
||||
# 2026-14
|
||||
# 2026-15
|
||||
# 2026-16
|
||||
```
|
||||
|
||||
### 2. Identify the PVC Backup Directory
|
||||
|
||||
```bash
|
||||
# List namespaces in a specific week
|
||||
ls -l /mnt/backup/pvc-data/2026-14/
|
||||
|
||||
# List PVCs in a namespace
|
||||
ls -l /mnt/backup/pvc-data/2026-14/vaultwarden/
|
||||
|
||||
# Example: vaultwarden-data-proxmox/
|
||||
```
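
When several weeks are available, the newest week that actually contains the PVC can be found mechanically rather than by eye. A small sketch, assuming the `<YYYY-WW>/<namespace>/<pvc-name>/` layout described under "Backup Location" and zero-padded week numbers; `latest_week_for` is a hypothetical helper, not part of the backup tooling:

```bash
# Hypothetical helper: print the newest week directory that contains a backup
# of the given namespace/PVC. Zero-padded YYYY-WW names sort chronologically
# as plain text, so the last match in glob order is the newest.
latest_week_for() {
  local root="$1" ns="$2" pvc="$3" week latest=""
  for week in "$root"/*/; do
    week="${week%/}"
    [ -d "$week/$ns/$pvc" ] && latest="${week##*/}"
  done
  [ -n "$latest" ] && echo "$latest"
}
```

On the PVE host this might be called as `latest_week_for /mnt/backup/pvc-data vaultwarden vaultwarden-data-proxmox`. It returns a non-zero status when no week holds that PVC, which matches the "PVC not backed up" case in Troubleshooting.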
|
||||
|
||||
### 3. Find the Live PVC LV Name
|
||||
|
||||
From your workstation (or PVE host with kubectl):
|
||||
|
||||
```bash
|
||||
# Get the PV volumeHandle (contains LV name)
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep <pvc-name>
|
||||
|
||||
# Example output:
|
||||
# pvc-abc123 vaultwarden-data-proxmox vaultwarden local-lvm:vm-999-pvc-abc123
|
||||
# ↑ this is the LV name
|
||||
```
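
The LV name is the part of the volumeHandle after the colon. A minimal sketch of extracting it, assuming the `<storage>:<lv-name>` shape shown in the example output above (if handles in your cluster look different, the pattern needs adjusting); `lv_from_handle` is a hypothetical helper name:

```bash
# Hypothetical helper: strip everything up to and including the last ':'.
# Assumes the "<storage>:<lv-name>" handle shape from the example output.
lv_from_handle() {
  printf '%s\n' "${1##*:}"
}

lv_from_handle "local-lvm:vm-999-pvc-abc123"   # prints vm-999-pvc-abc123
```

The result can be stored in a variable and reused in the `lvchange` and `mount` commands of the later steps, which avoids retyping the name by hand.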
|
||||
|
||||
### 4. Scale Down the Workload
|
||||
|
||||
```bash
|
||||
# Find the workload using the PVC
|
||||
kubectl get deployment,statefulset -n <namespace> -o json | jq '.items[] | select(.spec.template.spec.volumes[]?.persistentVolumeClaim.claimName == "<pvc-name>") | .metadata.name'
|
||||
|
||||
# Scale down (Deployment example)
|
||||
kubectl scale deployment/<workload-name> -n <namespace> --replicas=0
|
||||
|
||||
# Or StatefulSet:
|
||||
kubectl scale statefulset/<workload-name> -n <namespace> --replicas=0
|
||||
|
||||
# Wait for pod to terminate
|
||||
kubectl wait --for=delete pod -l app=<workload-name> -n <namespace> --timeout=120s
|
||||
```
|
||||
|
||||
### 5. Mount the Live PVC LV
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# Activate the LV (should already be inactive after pod termination)
|
||||
lvchange -ay pve/<lv-name>
|
||||
|
||||
# Create mount point
|
||||
mkdir -p /mnt/restore-temp
|
||||
|
||||
# Mount the LV
|
||||
mount /dev/pve/<lv-name> /mnt/restore-temp
|
||||
```
|
||||
|
||||
### 6. Restore from Backup
|
||||
|
||||
**Option A: Full PVC restore (replace all data)**
|
||||
|
||||
```bash
|
||||
# This will delete existing files in the PVC and replace with backup
|
||||
rsync -avP --delete /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/ /mnt/restore-temp/
|
||||
|
||||
# Example:
|
||||
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
|
||||
```
|
||||
|
||||
**Option B: Selective file restore (merge)**
|
||||
|
||||
```bash
|
||||
# Restore specific files or directories without deleting existing data
|
||||
rsync -avP /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/path/to/file /mnt/restore-temp/path/to/
|
||||
|
||||
# Example: Restore only db.sqlite3
|
||||
rsync -avP /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/db.sqlite3 /mnt/restore-temp/
|
||||
```
|
||||
|
||||
### 7. Unmount and Deactivate LV
|
||||
|
||||
```bash
|
||||
# Unmount
|
||||
umount /mnt/restore-temp
|
||||
|
||||
# Deactivate LV (optional, kubelet will activate it when pod starts)
|
||||
lvchange -an pve/<lv-name>
|
||||
```
|
||||
|
||||
### 8. Scale Up the Workload
|
||||
|
||||
```bash
|
||||
# From your workstation:
|
||||
kubectl scale deployment/<workload-name> -n <namespace> --replicas=1
|
||||
|
||||
# Or StatefulSet:
|
||||
kubectl scale statefulset/<workload-name> -n <namespace> --replicas=1
|
||||
|
||||
# Wait for pod to be ready
|
||||
kubectl wait --for=condition=Ready pod -l app=<workload-name> -n <namespace> --timeout=120s
|
||||
```
|
||||
|
||||
### 9. Verify
|
||||
|
||||
```bash
|
||||
# Check pod logs for startup errors
|
||||
kubectl logs -n <namespace> -l app=<workload-name> --tail=20
|
||||
|
||||
# Test application functionality (service-specific)
|
||||
curl -s -o /dev/null -w "%{http_code}" https://<service>.viktorbarzin.me/
|
||||
```
|
||||
|
||||
## Example: Full Vaultwarden Restore
|
||||
|
||||
```bash
|
||||
# 1. List backups
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# 2. Scale down
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
|
||||
kubectl wait --for=delete pod -l app=vaultwarden -n vaultwarden --timeout=120s
|
||||
|
||||
# 3. Find LV name
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
|
||||
# Output: pvc-xyz vaultwarden-data-proxmox local-lvm:vm-105-pvc-xyz456
|
||||
|
||||
# 4. Mount and restore
|
||||
ssh root@192.168.1.127
|
||||
lvchange -ay pve/vm-105-pvc-xyz456
|
||||
mkdir -p /mnt/restore-temp
|
||||
mount /dev/pve/vm-105-pvc-xyz456 /mnt/restore-temp
|
||||
|
||||
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
|
||||
|
||||
umount /mnt/restore-temp
|
||||
lvchange -an pve/vm-105-pvc-xyz456
|
||||
|
||||
# 5. Scale up
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
|
||||
kubectl wait --for=condition=Ready pod -l app=vaultwarden -n vaultwarden --timeout=120s
|
||||
|
||||
# 6. Test
|
||||
curl -s -o /dev/null -w "%{http_code}" https://vaultwarden.viktorbarzin.me/
|
||||
```
|
||||
|
||||
## Database-Specific Notes
|
||||
|
||||
For databases (MySQL, PostgreSQL), prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless:
|
||||
- You need a very recent point-in-time that predates the last dump
|
||||
- The database dump is corrupt or missing
|
||||
- You're restoring a non-SQL database (e.g., Redis RDB)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Problem | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| "LV is active" during mount | Workload pod still running or stuck | `kubectl get pods -A \| grep <pvc-name>`, delete pod if stuck |
|
||||
| "No such file or directory" in backup | PVC not backed up (in excluded namespace) | Check `daily-backup` script EXCLUDE_NAMESPACES |
|
||||
| rsync shows 0 files transferred | Wrong backup week or PVC name | Double-check paths: `ls /mnt/backup/pvc-data/<week>/<ns>/<pvc>/` |
|
||||
| Pod stuck in ContainerCreating after restore | LV still active on PVE host | `lvchange -an pve/<lv-name>`, wait 30s, check pod again |
|
||||
| Backup week missing | Daily backup hasn't run for that week | Check `systemctl status daily-backup.service`, verify retention |
|
||||
|
||||
## Restore from Synology (if PVE host sda is unavailable)
|
||||
|
||||
If the PVE host sda backup disk is unavailable or corrupt:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/pve-backup/pvc-data/
|
||||
|
||||
# 3. Find the PVC backup
|
||||
ls -l 2026-14/<namespace>/<pvc-name>/
|
||||
|
||||
# 4. Copy to a temporary location accessible from cluster
|
||||
# Option A: Restore sda on PVE host first
|
||||
# Option B: rsync to a surviving node's local disk
|
||||
# Option C: Mount Synology NFS share on a pod (if network accessible)
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
|
||||
- Small PVC (<1GB): ~5 minutes
|
||||
- Medium PVC (1-10GB): ~10-15 minutes
|
||||
- Large PVC (>10GB): ~30+ minutes (depends on size and network)
|
||||
|
||||
## Related
|
||||
|
||||
- **`restore-lvm-snapshot.md`** — Fast restore for recent changes (<7 days)
|
||||
- **`restore-full-cluster.md`** — Disaster recovery procedure (uses this runbook in Phase 3.5)
|
||||
- **`docs/architecture/backup-dr.md`** — Backup architecture overview
|
||||
146
docs/runbooks/restore-vault.md
Normal file
|
|
@ -0,0 +1,146 @@
|
|||
# Restore Vault (Raft)
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Prerequisites
|
||||
- `kubectl` access to the cluster
|
||||
- Vault root token (from `vault-root-token` secret in `vault` namespace — manually created, independent of automation)
|
||||
- Raft snapshot available on NFS at `/mnt/main/vault-backup/`
|
||||
- Unseal keys (stored securely — check `secret/viktor` in Vault or emergency kit)
|
||||
|
||||
## Backup Location
|
||||
- NFS: `/mnt/main/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db`
|
||||
- Mirrored to sda: `/mnt/backup/nfs-mirror/vault-backup/` (PVE host 192.168.1.127)
|
||||
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vault-backup/`
|
||||
- Retention: 30 days (on NFS), latest only (on sda), unlimited (on Synology)
|
||||
- Schedule: Weekly on Sundays at 02:00 (`0 2 * * 0`)
|
||||
|
||||
## CRITICAL: Vault is a dependency for many services
|
||||
Vault provides secrets to the entire cluster via ESO (External Secrets Operator). A Vault outage affects:
|
||||
- All ExternalSecrets (43 secrets + 9 DB-creds secrets)
|
||||
- Vault DB engine password rotation
|
||||
- K8s credentials engine
|
||||
- CI/CD secret sync
|
||||
|
||||
**Priority: Restore Vault before any other service (except etcd).**
|
||||
|
||||
## Restore Procedure
|
||||
|
||||
### 1. Identify the snapshot to restore
|
||||
```bash
|
||||
# List available snapshots
|
||||
ls -lt /mnt/main/vault-backup/vault-raft-*.db | head -10
|
||||
```
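
`ls -lt` orders by modification time, which can change when a snapshot is copied between hosts. As an alternative sketch, the newest snapshot can be chosen from the timestamp embedded in the file name, assuming the `vault-raft-YYYYMMDD-HHMMSS.db` naming listed under "Backup Location"; `newest_snapshot` is a hypothetical helper:

```bash
# Hypothetical helper: print the snapshot with the latest timestamp in its name.
# vault-raft-YYYYMMDD-HHMMSS.db sorts chronologically as plain text, so the
# last match in glob order is the newest.
newest_snapshot() {
  local dir="$1" f newest=""
  for f in "$dir"/vault-raft-*.db; do
    [ -e "$f" ] && newest="$f"
  done
  [ -n "$newest" ] && echo "$newest"
}
```

For example, `newest_snapshot /mnt/main/vault-backup` prints one path, or returns a non-zero status if the directory holds no snapshots.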
|
||||
|
||||
### 2. Restore Raft snapshot
|
||||
```bash
|
||||
# Get root token
|
||||
VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)
|
||||
|
||||
# Port-forward to Vault
|
||||
kubectl port-forward svc/vault-active -n vault 8200:8200 &
|
||||
|
||||
# Restore the snapshot (this will overwrite current state)
|
||||
export VAULT_ADDR=http://127.0.0.1:8200
|
||||
export VAULT_TOKEN
|
||||
vault operator raft snapshot restore -force /path/to/vault-raft-YYYYMMDD-HHMMSS.db
|
||||
```
|
||||
|
||||
### 3. Unseal Vault (if sealed after restore)
|
||||
|
||||
> **Note:** Vault now has an auto-unseal sidecar that automatically unseals pods
|
||||
> using the `vault-unseal-key` K8s Secret. The manual procedure below is a
|
||||
> fallback if auto-unseal fails.
|
||||
|
||||
```bash
|
||||
# Check seal status
|
||||
vault status
|
||||
|
||||
# If sealed, unseal with keys (need threshold number of keys)
|
||||
vault operator unseal <key1>
|
||||
vault operator unseal <key2>
|
||||
vault operator unseal <key3>
|
||||
```
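
To decide in a script whether the manual unseal is needed, the seal state can be read from the JSON form of `vault status`. A minimal sketch, assuming `vault status -format=json` output with a boolean `sealed` field; `is_sealed` is a hypothetical helper and the JSON below is sample data only:

```bash
# Hypothetical helper: read `vault status -format=json` output on stdin and
# succeed only if the "sealed" field is true.
is_sealed() {
  grep -Eq '"sealed"[[:space:]]*:[[:space:]]*true'
}

# Sample document only; real input would be piped from the Vault CLI.
if printf '{\n  "initialized": true,\n  "sealed": false\n}\n' | is_sealed; then
  echo "sealed"
else
  echo "unsealed"
fi
```

Against a live Vault this would look like `vault status -format=json | is_sealed && echo "manual unseal needed"`. Note that `vault status` itself exits non-zero when Vault is sealed, so the check relies on the helper's exit status rather than the CLI's.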
|
||||
|
||||
### 4. Verify restoration
|
||||
```bash
|
||||
# Check Vault health
|
||||
vault status
|
||||
|
||||
# Check raft peers
|
||||
vault operator raft list-peers
|
||||
|
||||
# Verify key secrets exist
|
||||
vault kv get secret/viktor
|
||||
vault kv list secret/
|
||||
|
||||
# Check DB engine
|
||||
vault list database/roles
|
||||
|
||||
# Check K8s engine
|
||||
vault list kubernetes/roles
|
||||
```
|
||||
|
||||
### 5. Trigger ESO refresh
|
||||
After Vault restore, ExternalSecrets may need a refresh:
|
||||
```bash
|
||||
# Restart ESO to force re-sync
|
||||
kubectl rollout restart deployment -n external-secrets
|
||||
|
||||
# Check ExternalSecret status
|
||||
kubectl get externalsecrets -A | grep -v "SecretSynced"
|
||||
```
|
||||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 2. Find the latest snapshot
|
||||
ls -lt /mnt/backup/nfs-mirror/vault-backup/
|
||||
|
||||
# 3. Copy snapshot to a location accessible from cluster
|
||||
# Port-forward to Vault and restore
|
||||
kubectl port-forward svc/vault-active -n vault 8200:8200 &
|
||||
export VAULT_ADDR=http://127.0.0.1:8200
|
||||
export VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)
|
||||
|
||||
# Copy snapshot from PVE host to local workstation, then restore
|
||||
scp root@192.168.1.127:/mnt/backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
|
||||
vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
|
||||
```
|
||||
|
||||
## Alternative: Restore from Synology (if PVE host is down)
|
||||
|
||||
If the PVE host itself is unavailable:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/nfs/vault-backup/
|
||||
|
||||
# 3. Copy snapshot to local workstation
|
||||
scp Administrator@192.168.1.13:/volume1/Backup/Viki/nfs/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
|
||||
|
||||
# 4. Restore via port-forward (same as above)
|
||||
```
|
||||
|
||||
## Full Vault Rebuild (from zero)
|
||||
If Vault needs to be rebuilt from scratch:
|
||||
1. Comment out data sources + OIDC config in `stacks/vault/main.tf`
|
||||
2. Apply Helm release: `scripts/tg apply -target=helm_release.vault stacks/vault`
|
||||
3. Initialize: `vault operator init`
|
||||
4. Unseal with generated keys
|
||||
5. Restore raft snapshot (step 2 above)
|
||||
6. Populate `secret/vault` with OIDC credentials
|
||||
7. Uncomment data sources + OIDC
|
||||
8. Re-apply: `scripts/tg apply stacks/vault`
|
||||
|
||||
## Estimated Time
|
||||
- Snapshot restore + unseal: ~10 minutes
|
||||
- Full rebuild: ~30-45 minutes
|
||||
128
docs/runbooks/restore-vaultwarden.md
Normal file
|
|
@ -0,0 +1,128 @@
|
|||
# Restore Vaultwarden
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Prerequisites
|
||||
- `kubectl` access to the cluster
|
||||
- Backup available on NFS at `/mnt/main/vaultwarden-backup/`
|
||||
|
||||
## Backup Location
|
||||
- NFS: `/mnt/main/vaultwarden-backup/YYYY_MM_DD_HH_MM/` (directory per backup)
|
||||
- Each backup contains: `db.sqlite3`, `rsa_key.pem`, `rsa_key.pub.pem`, `attachments/`, `sends/`, `config.json`
|
||||
- Mirrored to sda: `/mnt/backup/nfs-mirror/vaultwarden-backup/` (PVE host 192.168.1.127)
|
||||
- PVC file backup (alternative): `/mnt/backup/pvc-data/<YYYY-WW>/vaultwarden/vaultwarden-data-proxmox/`
|
||||
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vaultwarden-backup/`
|
||||
- Retention: 30 days (on NFS), latest only (on sda nfs-mirror), 4 weeks (on sda pvc-data), unlimited (on Synology)
|
||||
- Schedule: Every 6 hours (00:00, 06:00, 12:00, 18:00)
|
||||
- Integrity check: Both source and backup are verified before/after each backup
|
||||
|
||||
## Backup Contents
|
||||
| File | Purpose | Critical? |
|
||||
|------|---------|-----------|
|
||||
| `db.sqlite3` | All passwords, TOTP seeds, org data | Yes |
|
||||
| `rsa_key.pem` / `rsa_key.pub.pem` | JWT signing keys | Yes — without these, all sessions invalidate |
|
||||
| `attachments/` | File attachments on vault items | Yes |
|
||||
| `sends/` | Bitwarden Send files | No |
|
||||
| `config.json` | Server configuration | No — can be recreated |
|
||||
|
||||
## Restore Procedure
|
||||
|
||||
### 1. Identify the backup to restore
|
||||
```bash
|
||||
# List available backups (directories sorted by date)
|
||||
kubectl run vw-ls --rm -it --image=alpine \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"vaultwarden-backup"}}],"containers":[{"name":"vw-ls","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["ls","-lt","/backup/"]}]}}' \
|
||||
-n vaultwarden
|
||||
```
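
Before scaling anything down, it can help to confirm that the chosen backup directory holds the files the "Backup Contents" table marks as critical. A small sketch, assuming the directory is reachable as a local path (for instance inside a pod that mounts the backup PVC); `check_vw_backup` is a hypothetical helper:

```bash
# Hypothetical helper: verify that a backup directory holds the database and
# both RSA key files, and that each is non-empty.
check_vw_backup() {
  local dir="$1" f missing=0
  for f in db.sqlite3 rsa_key.pem rsa_key.pub.pem; do
    [ -s "$dir/$f" ] || { echo "missing: $f" >&2; missing=1; }
  done
  # attachments/ may legitimately be absent if no item has an attachment.
  [ -d "$dir/attachments" ] || echo "note: no attachments/ directory" >&2
  return "$missing"
}
```

A call such as `check_vw_backup /backup/YYYY_MM_DD_HH_MM` returns non-zero when any of the three files is missing or empty. It does not validate the SQLite file itself.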
|
||||
|
||||
### 2. Scale down Vaultwarden
|
||||
```bash
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
|
||||
```
|
||||
|
||||
### 3. Restore the backup
|
||||
```bash
|
||||
BACKUP_DIR="YYYY_MM_DD_HH_MM" # Set to desired backup
|
||||
|
||||
kubectl run vw-restore --rm -it --image=alpine \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"vaultwarden-backup"}},{"name":"data","persistentVolumeClaim":{"claimName":"vaultwarden-data-proxmox"}}],"containers":[{"name":"vw-restore","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"},{"name":"data","mountPath":"/data"}],"command":["/bin/sh","-c","cp /backup/'$BACKUP_DIR'/db.sqlite3 /data/db.sqlite3 && cp /backup/'$BACKUP_DIR'/rsa_key.pem /data/ && cp /backup/'$BACKUP_DIR'/rsa_key.pub.pem /data/ && { cp -a /backup/'$BACKUP_DIR'/attachments /data/ 2>/dev/null || true; } && echo Restore complete"]}]}}' \
|
||||
-n vaultwarden
|
||||
```
|
||||
|
||||
### 4. Scale up Vaultwarden
|
||||
```bash
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
|
||||
|
||||
# Wait for pod to be ready
|
||||
kubectl wait --for=condition=Ready pod -l app=vaultwarden -n vaultwarden --timeout=120s
|
||||
```
|
||||
|
||||
### 5. Verify restoration
|
||||
```bash
|
||||
# Check pod logs for startup errors
|
||||
kubectl logs -n vaultwarden -l app=vaultwarden --tail=20
|
||||
|
||||
# Test web UI access
|
||||
curl -s -o /dev/null -w "%{http_code}" https://vaultwarden.viktorbarzin.me/
|
||||
```
|
||||
|
||||
### 6. Test login
|
||||
Log in to the Vaultwarden web UI and verify:
|
||||
- [ ] Can log in with your account
|
||||
- [ ] Vault items are present and readable
|
||||
- [ ] Attachments are accessible
|
||||
- [ ] TOTP codes are generating correctly
|
||||
|
||||
## Alternative: Restore from PVC File Backup
|
||||
|
||||
If the NFS backup is unavailable or corrupt, restore from the weekly PVC file backup on sda:
|
||||
|
||||
```bash
|
||||
# 1. List available backup weeks
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# 2. Scale down Vaultwarden
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
|
||||
|
||||
# 3. Mount the live PVC LV on PVE host
|
||||
# Find the LV name first:
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
|
||||
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
|
||||
LV_NAME="vm-999-pvc-abc123"
|
||||
|
||||
lvchange -ay pve/$LV_NAME
|
||||
mkdir -p /mnt/restore-temp
|
||||
mount /dev/pve/$LV_NAME /mnt/restore-temp
|
||||
|
||||
# 4. Restore from backup (pick a week)
|
||||
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
|
||||
|
||||
# 5. Unmount and scale up
|
||||
umount /mnt/restore-temp
|
||||
lvchange -an pve/$LV_NAME
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
|
||||
```
|
||||
|
||||
## Alternative: Restore from sda Backup Mirror
|
||||
|
||||
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 2. Find the latest backup
|
||||
ls -lt /mnt/backup/nfs-mirror/vaultwarden-backup/
|
||||
|
||||
# 3. Mount sda backup on a pod
|
||||
BACKUP_DIR="YYYY_MM_DD_HH_MM" # Set to desired backup
|
||||
|
||||
kubectl run vw-restore --rm -it --image=alpine \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/vaultwarden-backup"}},{"name":"data","persistentVolumeClaim":{"claimName":"vaultwarden-data-proxmox"}}],"containers":[{"name":"vw-restore","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"},{"name":"data","mountPath":"/data"}],"command":["/bin/sh","-c","cp /backup/'$BACKUP_DIR'/db.sqlite3 /data/db.sqlite3 && cp /backup/'$BACKUP_DIR'/rsa_key.pem /data/ && cp /backup/'$BACKUP_DIR'/rsa_key.pub.pem /data/ && { cp -a /backup/'$BACKUP_DIR'/attachments /data/ 2>/dev/null || true; } && echo Restore complete"]}],"nodeName":"k8s-master"}}' \
|
||||
-n vaultwarden
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
- Restore: ~5 minutes
|
||||
- Verification: ~5 minutes
|
||||
196
docs/runbooks/scale-k8s-cluster.md
Normal file
|
|
@ -0,0 +1,196 @@
|
|||
# Runbook: Scale K8s worker count (PVC capacity headroom)
|
||||
|
||||
Use when block-PVC pressure, memory pressure, or planned workload growth requires adding or removing K8s worker VMs. The cluster currently runs **6 workers (k8s-node1..6) + 1 control plane (k8s-master)**, sized to absorb the 2026-05-26 proxmox-csi LUN-cap incident with sustained headroom.
|
||||
|
||||
## Current shape
|
||||
|
||||
| Node | VMID | Memory | Disk | Special |
|
||||
|------|------|--------|------|---------|
|
||||
| k8s-master | 200 | 32 GiB | 64G | Control plane, no worker workloads |
|
||||
| k8s-node1 | 201 | 48 GiB | 256G | GPU host (NVIDIA Tesla T4 passthrough), DNS primary |
|
||||
| k8s-node2 | 202 | 32 GiB | 256G | |
|
||||
| k8s-node3 | 203 | 32 GiB | 256G | |
|
||||
| k8s-node4 | 204 | 32 GiB | 256G | |
|
||||
| k8s-node5 | 205 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
|
||||
| k8s-node6 | 206 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
|
||||
|
||||
Capacity envelope (6 workers): **174 block-PVC slots**, ~208 GiB memory, ~96 vCPU, GPU on node1 only. Pod cap is kubelet-default 110/node.
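
The slot and memory totals can be recomputed from the "Current shape" table whenever a worker is added or resized. A short worked sketch, with the per-worker memory values copied from that table and the per-VM ceiling of 29 that this runbook quotes for the CSI plugin; the variable names are illustrative only:

```bash
# Worker memory in GiB, copied from the "Current shape" table (node1..node6).
worker_mem_gib="48 32 32 32 32 32"
lun_cap=29   # per-VM block-PVC ceiling quoted in this runbook

workers=0
total_mem=0
for m in $worker_mem_gib; do
  workers=$((workers + 1))
  total_mem=$((total_mem + m))
done

echo "block-PVC slots: $((workers * lun_cap))"   # 6 * 29 = 174
echo "worker memory:   ${total_mem} GiB"          # 48 + 5 * 32 = 208
```

Editing the `worker_mem_gib` list is enough to see how a seventh worker would change the envelope.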
|
||||
|
||||
## Binding constraints — read these first
|
||||
|
||||
The cluster has 6 capacity dimensions. The one that bites first depends on workload shape; check each before adding/removing nodes.
|
||||
|
||||
1. **Per-VM block-PVC ceiling = 29** — hardcoded by `sergelogvinov/proxmox-csi-plugin` at `pkg/csi/utils.go:394` (`for lun = 1; lun < 30; lun++`). Symptom: pods stuck `ContainerCreating` with `FailedAttachVolume … no free lun found`. `CSINode.allocatable.count` advertises `28`/node. Switching `scsihw` to `virtio-scsi-single` does NOT raise this — it's a plugin constraint, not a Proxmox/QEMU one. See `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap".
|
||||
|
||||
2. **Memory commitment** — node1 has historically run hot (was 117% of limits before the 2026-06 memory bump to 48 GiB). Treat memory as the next-binding constraint after PVC slots, especially since limits-vs-requests divergence isn't enforced by the scheduler.
|
||||
|
||||
3. **sdc IO contention** — every K8s VM disk + TrueNAS NFS LV live on the same Proxmox thin pool on sdc (10.7 TB RAID1 HDD). Three IO storms in 17 days (2026-05-09, 2026-05-16/17, 2026-05-25) — see `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. Adding workers redistributes block PVCs but does NOT relieve underlying disk contention; that's beads `code-oflt`.
|
||||
|
||||
4. **GPU concentration** — Tesla T4 is passthrough-only on node1. Frigate ML / Immich ML / Whisper / Piper / llama-cpp all schedule there via `nvidia.com/gpu.present` label. Cannot be spread without provisioning a second GPU node.
|
||||
|
||||
5. **PVE host memory** — total PVE RAM 320 GiB. K8s VMs claim 240 GiB; TrueNAS / pfsense / Windows VMs claim ~80 GiB more. Adding a 32-GiB worker requires verifying PVE has the headroom (`free -h`).
|
||||
|
||||
6. **Per-stack Terraform state** — adding/removing nodes does NOT live in any single Terragrunt stack today. VMs are created via `scripts/provision-k8s-worker` (which calls `qm clone`). They are *not* managed declaratively in TF. Consequence: removal is a manual `kubectl delete node` + `qm stop` + `qm destroy`, not `tg destroy`.
|
||||
|
||||
## When to scale UP (add a worker)
|
||||
|
||||
Add a worker when **any** of these is true for ≥7 days:
|
||||
|
||||
| Trigger | Threshold | How to observe |
|
||||
|---------|-----------|----------------|
|
||||
| PVC slots per node | `max(per-node VA count) ≥ 25` (~86% of 29 cap) | `kubectl get volumeattachment -o json \| jq -r '.items[].spec.nodeName' \| sort \| uniq -c` |
|
||||
| Cluster memory requests | `> 90%` | `kubectl describe nodes \| grep -A4 "Allocated resources"` or Goldilocks dashboard |
|
||||
| Planned PVC additions | ≥3 net-new block PVCs in next sprint AND current max VA ≥ 22 | Project-tracker / beads |
|
||||
| LUN-cap incident | Even one `no free lun found` event | Prometheus alert `ProxmoxCSILunsExhausted` (added 2026-05-31, commit `aded77d5`) |
|
||||
| Sustained pod-eviction churn | Eviction count > 20/day for ≥3 days | `kubectl get events -A --field-selector reason=Evicted` |
|
||||
|
||||
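
The PVC-slot trigger in the table above can be checked mechanically by filtering the per-node counts that the `uniq -c` pipeline emits. A minimal sketch, assuming input lines of the form `<count> <node>`; `nodes_over_threshold` is a hypothetical helper and the sample counts are made up for illustration:

```bash
# Hypothetical helper: read "<count> <node>" lines (the shape `uniq -c` emits)
# and print the nodes whose attachment count is at or above the threshold.
nodes_over_threshold() {
  local threshold="${1:-25}"
  awk -v t="$threshold" '$1 >= t { print $2 " " $1 }'
}

# Sample input only; real input comes from the kubectl pipeline in the table.
printf '  25 k8s-node2\n  12 k8s-node3\n  27 k8s-node1\n' | nodes_over_threshold 25
```

Appending `| nodes_over_threshold 25` to the VolumeAttachment pipeline from the table would list only the nodes that meet the scale-up threshold; empty output means no node has reached it.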
### Playbook — add a worker
|
||||
|
||||
```bash
|
||||
# 1. Choose VMID + IP (next free in 10.0.20.0/22 worker range, 10.0.20.105+ used)
|
||||
NEXT_VMID=207
|
||||
NEXT_IP=10.0.20.107
|
||||
NAME=k8s-node7
|
||||
|
||||
# 2. Verify PVE memory headroom (need ≥34 GiB free for a 32-GiB VM with overhead)
|
||||
ssh root@192.168.1.127 'free -h; pvesh get /nodes/pve/status --output-format=json | jq .memory'
|
||||
|
||||
# 3. Verify thin pool has space (the 256 GiB disk is thin-provisioned, so only actual growth consumes pool space)
|
||||
ssh root@192.168.1.127 'lvs pve/data'
|
||||
|
||||
# 4. Clone + cloud-init + auto-join (idempotent — aborts if VMID or IP exists)
|
||||
scp scripts/provision-k8s-worker root@192.168.1.127:/tmp/
|
||||
ssh root@192.168.1.127 'bash /tmp/provision-k8s-worker '"$NAME $NEXT_VMID $NEXT_IP"
|
||||
|
||||
# 5. Wait for node to appear Ready (3-5 min for cloud-init + kubeadm join)
|
||||
kubectl get nodes -w
|
||||
|
||||
# 6. Verify CSI registration (proxmox-csi + nfs-csi node pods)
|
||||
kubectl get pods -A -o wide --field-selector spec.nodeName=$NAME | grep -E "csi|calico"
|
||||
|
||||
# 7. Confirm Goldilocks / Kyverno / Prometheus targets it (DaemonSets populate within ~2 min)
|
||||
kubectl get ds -A | awk '{print $1, $2, $3, $5}' | head -20
|
||||
|
||||
# 8. Update this runbook's "Current shape" table
|
||||
```
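
The values in step 1 follow a visible pattern in this runbook (name `k8s-nodeN`, VMID `200+N`, IP `10.0.20.(100+N)`). A small sketch that derives all three from the worker number, assuming that pattern continues to hold; `worker_params` is a hypothetical helper and is not a substitute for checking that the VMID and IP are actually free:

```bash
# Hypothetical helper: derive name, VMID, and IP for worker number N, assuming
# the k8s-nodeN / 200+N / 10.0.20.(100+N) pattern seen in this runbook.
worker_params() {
  local n="$1"
  echo "k8s-node${n} $((200 + n)) 10.0.20.$((100 + n))"
}

worker_params 7   # prints: k8s-node7 207 10.0.20.107
```

The three fields can be read into the variables used above with `read -r NAME NEXT_VMID NEXT_IP <<< "$(worker_params 7)"` in bash.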
|
||||
|
||||
**Post-add validation:**
|
||||
- `kubectl top node $NAME` reports stats (kubelet metrics OK)
|
||||
- A test pod with a `proxmox-lvm` PVC schedules there and binds
|
||||
- No new alerts firing in monitoring
|
||||
|
||||
## When to scale DOWN (drain a worker)
|
||||
|
||||
Scale down when **all** of these hold for ≥30 days:
|
||||
|
||||
| Condition | Threshold |
|
||||
|-----------|-----------|
|
||||
| Max-node PVC count | `≤ 20` (≈70% of cap) |
|
||||
| Cluster memory requests | `< 70%` |
|
||||
| Cluster memory limits | `< 95%` (no over-committed node) |
|
||||
| No upcoming workload additions | Confirmed via beads / project tracker |
|
||||
|
||||
Scaling down is also reasonable as a deliberate trade-off (cost, IO reduction, consolidation) even if thresholds aren't met — but accept that the next scale-up cycle will incur the LUN-cap risk again.
|
||||
|
||||
### Playbook — drain + remove a worker
|
||||
|
||||
**Pick the lightest node first.** Survey before draining:
|
||||
|
||||
```bash
|
||||
NODE=k8s-node5
|
||||
|
||||
# 1. Inventory what's there
|
||||
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE \
|
||||
| awk 'NR>1 {print $1}' | sort | uniq -c # pods per namespace
|
||||
|
||||
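# 2. List drain blockers (local-path PVCs in use, GPU pods, single-replica services)
# Prints every Bound local-path PVC plus the node recorded in the
# volume.kubernetes.io/selected-node annotation ("unknown" if absent);
# the ones on $NODE block the drain.
kubectl get pvc -A -o json | jq -r '.items[]
  | select(.spec.storageClassName == "local-path")
  | select(.status.phase == "Bound")
  | "\(.metadata.namespace)/\(.metadata.name)\t\(.metadata.annotations["volume.kubernetes.io/selected-node"] // "unknown")"'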
|
||||
|
||||
# 3. Check presence board — is anyone mutating workloads on this node right now?
|
||||
~/code/scripts/presence list
|
||||
# If a `service:*` claim covers any pod on $NODE, DEFER until released.
|
||||
|
||||
# 4. Cordon (mark unschedulable, existing pods stay)
|
||||
kubectl cordon $NODE
|
||||
|
||||
# 5. Watch memory pressure forecast on remaining nodes BEFORE evicting
|
||||
kubectl top nodes # baseline
|
||||
# Expected addition: ~ (sum of pod memory requests on $NODE) / (N - 1) per other node
|
||||
|
||||
# 6. Drain (respects PDBs; --delete-emptydir-data needed for tmp volumes)
|
||||
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=15m
|
||||
|
||||
# Expected blips during drain (~30s-2min each for PVC reattach):
|
||||
# any singleton on $NODE (Deployment replicas=1 or StatefulSet with no peers)
|
||||
# Multi-replica services with PDB just roll without downtime.
|
||||
|
||||
# 7. Verify everything rescheduled cleanly
|
||||
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE
|
||||
# Should show only DaemonSet pods + Completed jobs
|
||||
|
||||
# 8. Remove from cluster
|
||||
kubectl delete node $NODE
|
||||
|
||||
# 9. Shut down + (optional) destroy the VM
|
||||
VMID=205
|
||||
ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300; qm status $VMID"
|
||||
# To fully destroy (frees thin-pool space):
|
||||
# ssh root@192.168.1.127 "qm destroy $VMID --purge"
|
||||
|
||||
# 10. Verify post-drain shape
|
||||
kubectl get volumeattachment -o json \
|
||||
| jq -r '.items[] | select(.spec.attacher == "csi.proxmox.sinextra.dev") | .spec.nodeName' \
|
||||
| sort | uniq -c
|
||||
|
||||
# 11. Update this runbook's "Current shape" table
|
||||
```
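
The memory forecast in step 5 of the playbook is simple arithmetic. A rough sketch follows, assuming the evicted pods spread evenly across the remaining nodes (the scheduler gives no such guarantee, so treat the result as an average, not a per-node bound). The function name and the sample figures are illustrative only.

```python
def forecast_added_gib_per_node(pod_requests_gib, total_nodes):
    """Average extra memory each remaining node absorbs after the drain."""
    if total_nodes < 2:
        raise ValueError("need at least one node left after the drain")
    return sum(pod_requests_gib) / (total_nodes - 1)


# Hypothetical example: 12 GiB of requests on the drained node, 5 nodes in total.
print(forecast_added_gib_per_node([2.0, 4.0, 6.0], total_nodes=5))  # 3.0
```

Compare the result with the free headroom shown by `kubectl top nodes` before starting the drain.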
|
||||
|
||||
**Cold-spare option:** instead of `qm destroy`, keep the VM stopped. The 256 GiB disk stays allocated on thin pool but the VM consumes no CPU/RAM. Re-add via `qm start <VMID>` + `kubeadm join` (the snippet still lives at `/var/lib/vz/snippets/k8s_cloud_init.yaml`).
|
||||
|
||||
## Special cases
|
||||
|
||||
### Critical singletons that blip during drain
|
||||
|
||||
These services are single-replica and incur ~30s-2min outages while their PVC reattaches to the new node:
|
||||
|
||||
- **Stateful databases**: `mysql-standalone-0`, `pg-cluster-*` members (CNPG handles failover gracefully)
|
||||
- **Mail**: `mailserver`, `roundcubemail` (Dovecot maildir locking — defer if mid-incident)
|
||||
- **Browser-trust services**: `nextcloud` (sessions reset), `vaultwarden` (active sessions blip)
|
||||
- **Observability**: `prometheus-server` (scrape data gap), `claude-memory`
|
||||
- **Self-hosted apps with SQLite**: hackmd, n8n, paperless-ngx, freshrss, navidrome, audiobookshelf
|
||||
|
||||
Coordinate the drain timing with users if any of these is on the node being drained. Single-pod Postgres/MySQL DBs are the most painful — schedule during low-traffic windows.
|
||||
|
||||
### GPU pods
|
||||
|
||||
GPU pods are scheduled via the `nvidia.com/gpu.present=true` node selector. They **cannot** drain off node1; if node1 itself needs maintenance, scale GPU workloads to 0 first or defer the drain. See `docs/runbooks/k8s-node-auto-upgrades.md` for the kured-driven reboot path.
|
||||
|
||||
### Active sessions
|
||||
|
||||
Check `~/code/scripts/presence list` before any drain. If another session holds a claim on a service hosted on the target node, defer or coordinate.
|
||||
|
||||
### Force-clean stuck VolumeAttachments
|
||||
|
||||
If a drained node has lingering VolumeAttachment entries after `kubectl delete node`:
|
||||
|
||||
```bash
|
||||
kubectl get volumeattachment -o json \
|
||||
| jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
|
||||
| xargs -r kubectl patch volumeattachment -p '{"metadata":{"finalizers":null}}' --type=merge
|
||||
kubectl get volumeattachment -o json \
|
||||
| jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
|
||||
| xargs -r kubectl delete volumeattachment
|
||||
```
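
Before patching finalizers, it can help to preview exactly which attachments the filters above select. The sketch below applies the same two selections (by attacher and by node) to the output of `kubectl get volumeattachment -o json`; the function names are hypothetical and the JSON is assumed to have the usual `items[].spec.nodeName` / `items[].spec.attacher` shape.

```python
import json
from collections import Counter


def attachments_per_node(volumeattachments_json, attacher):
    """Count attachments per node for one CSI attacher (like the step 10 check)."""
    items = json.loads(volumeattachments_json)["items"]
    return Counter(
        i["spec"]["nodeName"] for i in items if i["spec"]["attacher"] == attacher
    )


def names_on_node(volumeattachments_json, node):
    """Names of all VolumeAttachments still bound to `node`."""
    items = json.loads(volumeattachments_json)["items"]
    return [i["metadata"]["name"] for i in items if i["spec"]["nodeName"] == node]
```

An empty list from `names_on_node` for the removed node means there is nothing left to force-clean.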
|
||||
|
||||
## Related
|
||||
|
||||
- `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap" — root-cause explanation of the PVC ceiling
|
||||
- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md` — IO contention on sdc
|
||||
- `docs/runbooks/k8s-node-auto-upgrades.md` — kured-driven rolling reboots (separate from scale)
|
||||
- `docs/runbooks/restore-full-cluster.md` — disaster scenarios
|
||||
- `scripts/provision-k8s-worker` — the actual cloning/join script
|
||||
- Beads `code-oflt` — IO isolation (long-term fix for sdc contention)
|
||||
- Remote memory id=2788 — `proxmox-csi-plugin hardcodes a per-VM SCSI-LUN ceiling`
|
||||
191
docs/runbooks/security-incident.md
Normal file
|
|
@ -0,0 +1,191 @@
|
|||
# Security Incident Response
|
||||
|
||||
What to do when a wave-1 security alert fires. Each alert links to a Loki query for investigation and concrete remediation steps.
|
||||
|
||||
**Status: planned, not yet implemented.** Beads epic: `code-8ywc`. This runbook is the response playbook for when wave 1 ships.
|
||||
|
||||
## General workflow
|
||||
|
||||
1. **Acknowledge in Alertmanager.** Silence only after triage starts.
|
||||
2. **Pull context from Loki** (queries below). Get the actor, source IP, timestamp.
|
||||
3. **Decide: real or false-positive?** Use the "false-positive cases" notes below.
|
||||
4. **If real:** revoke credentials (Vault token revoke, K8s SA token rotate, SSH key remove, OIDC session invalidate), then post-mortem.
|
||||
5. **If false-positive:** tune the alert (extend allowlist, refine LogQL query).
|
||||
|
||||
## Allowlist CIDRs
|
||||
|
||||
All source-IP-based alerts (K2, K9, V7, S1) reference this list. Update in one place: Terraform variable `security_source_ip_allowlist` in `stacks/monitoring`.
|
||||
|
||||
- `10.0.20.0/22` — VLAN 20 (cluster + main LAN)
|
||||
- `192.168.1.0/24` — Proxmox + Sofia LAN
|
||||
- K8s pod CIDR (verify at implementation time)
|
||||
- K8s service CIDR
|
||||
- Headscale tailnet
|
||||
|
||||
**Anything outside = alert.** No public-IP exceptions.
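
For triage it is useful to test a source IP against the allowlist as CIDRs, not as string prefixes. The sketch below uses only the two ranges written out above; the pod, service, and tailnet ranges are not stated in this runbook, so they are left for the caller to pass in. Note that `10.0.20.0/22` spans `10.0.20.0` to `10.0.23.255`, which is wider than a `10\.0\.20\..*` prefix pattern, so it may be worth confirming which one the alert queries are meant to express.

```python
import ipaddress

# Only the two CIDRs spelled out in the list above; pod, service, and
# tailnet ranges are deliberately left for the caller to supply.
KNOWN_ALLOWLIST = ("10.0.20.0/22", "192.168.1.0/24")


def is_allowlisted(source_ip, cidrs=KNOWN_ALLOWLIST):
    """True if `source_ip` falls inside any of the allowlisted networks."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(c) for c in cidrs)
```

For example, `is_allowlisted("10.0.21.7")` is true under the /22, while `is_allowlisted("10.0.24.1")` is not.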
|
||||
|
||||
## Viktor's identity
|
||||
|
||||
`me@viktorbarzin.me` is the ONLY allowlisted human identity. NOT `viktor@viktorbarzin.me`. NOT `emo@viktorbarzin.me`. emo's identity scheme is separate and must be added explicitly if/when needed.
|
||||
|
||||
---
|
||||
|
||||
## K-alerts (K8s API audit)
|
||||
|
||||
### K2 — ServiceAccount token used from outside cluster
|
||||
|
||||
**Meaning:** A K8s ServiceAccount token authenticated a request whose `sourceIPs[0]` is not in the pod CIDR or trusted LAN. Stolen SA token used externally.
|
||||
|
||||
```logql
|
||||
{job="kube-audit"} | json | user_username =~ "system:serviceaccount:.*" | sourceIPs_0 !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*"
|
||||
```
|
||||
|
||||
**Action:** Identify the SA. Rotate its token (`kubectl delete secret <sa-token-name>` if old-style, or recreate the SA if projected token). Audit the SA's permissions and tighten.
|
||||
|
||||
**False positives:** Pod-to-apiserver traffic that egresses and re-enters via NodePort/LB (rare). Investigate the originating workload.
|
||||
|
||||
### K3 — Secret read in sensitive namespace by unexpected actor
|
||||
|
||||
**Meaning:** A Secret in `vault`, `sealed-secrets`, or `external-secrets` namespace was read by an SA NOT in the allowlist (ESO controller, sealed-secrets controller, Vault SA, `me@viktorbarzin.me`).
|
||||
|
||||
```logql
|
||||
{job="kube-audit"} | json | verb =~ "get|list" | objectRef_resource = "secrets" | objectRef_namespace =~ "vault|sealed-secrets|external-secrets" | user_username !~ "(me@viktorbarzin.me|system:serviceaccount:external-secrets:.*|system:serviceaccount:sealed-secrets:.*|system:serviceaccount:vault:.*)"
|
||||
```
|
||||
|
||||
**Action:** Identify the actor. If a service account, audit its bindings — it shouldn't have RBAC to read those secrets. Revoke the binding. Rotate any secrets that were read.
|
||||
|
||||
### K4 — Exec into sensitive pod
|
||||
|
||||
**Meaning:** Someone `kubectl exec`'d into a pod in `vault`, `kube-system`, `dbaas`, or `cnpg-system`.
|
||||
|
||||
```logql
|
||||
{job="kube-audit"} | json | verb = "create" | objectRef_resource = "pods" | objectRef_subresource = "exec" | objectRef_namespace =~ "vault|kube-system|dbaas|cnpg-system" | user_username != "me@viktorbarzin.me"
|
||||
```
|
||||
|
||||
**Action:** Determine if Viktor authorized the exec. If unrecognized actor, revoke their access and rotate any credentials they could have read inside the pod.
|
||||
|
||||
**False positives:** Break-glass SAs used during incident response — extend the allowlist to include them by SA name.
|
||||
|
||||
### K5 — Mass delete
|
||||
|
||||
**Meaning:** Single actor deleted >5 Pods, Secrets, or ConfigMaps in 60 seconds. Either a script gone wrong or destructive intrusion.
|
||||
|
||||
```logql
|
||||
sum by (user_username) (count_over_time({job="kube-audit"} | json | verb = "delete" | objectRef_resource =~ "pods|secrets|configmaps" [1m])) > 5
|
||||
```
|
||||
|
||||
**Action:** Identify actor. If a Terraform apply or known cleanup job, false positive. If unrecognized, suspend the actor's credentials immediately and audit what was deleted.
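
When auditing after the fact, the same "more than 5 deletes in 60 seconds" rule can be replayed over exported audit events. This is a sketch of a per-actor sliding window, assuming the events are already filtered to delete verbs on the relevant resources and sorted by time; the `(timestamp, username)` tuple shape is an assumption, not the audit log format.

```python
from collections import defaultdict, deque


def mass_deleters(events, window_s=60, threshold=5):
    """Return usernames with more than `threshold` deletes inside any window.

    `events` is an iterable of (unix_ts, username), sorted by timestamp.
    """
    recent = defaultdict(deque)
    flagged = set()
    for ts, user in events:
        q = recent[user]
        q.append(ts)
        # Drop deletes that fell out of the window ending at `ts`.
        while q and ts - q[0] >= window_s:
            q.popleft()
        if len(q) > threshold:
            flagged.add(user)
    return flagged
```

A sliding window can flag bursts that straddle a minute boundary, so its results may differ slightly from a fixed `[1m]` evaluation.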
|
||||
|
||||
### K6 — Audit policy modified
|
||||
|
||||
**Meaning:** Someone changed the kube-apiserver audit policy. Should only happen via Terraform.
|
||||
|
||||
**Action:** Verify the change came from a planned Terraform apply (check recent commits to `stacks/infra`). If not, treat as critical compromise — attacker disabling visibility.
|
||||
|
||||
### K7 — New ClusterRole with full wildcards
|
||||
|
||||
**Meaning:** A new ClusterRole was created with `verbs: ["*"]` and `resources: ["*"]`. Privilege escalation primitive.
|
||||
|
||||
```logql
|
||||
{job="kube-audit"} | json | verb = "create" | objectRef_resource = "clusterroles" | requestObject_rules_0_verbs_0 = "*" | requestObject_rules_0_resources_0 = "*"
|
||||
```
|
||||
|
||||
**Action:** Verify the change is intentional (some operators install such roles — calico, kyverno). If unrecognized, delete the ClusterRole and audit the creator.
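
The query above inspects only the first verb and first resource of the first rule. During investigation, a fuller check over the captured request object may be useful; the sketch below scans every rule of a ClusterRole given as a plain dict (for example parsed from `kubectl get clusterrole <name> -o json`). The function name is hypothetical.

```python
def has_full_wildcard_rule(cluster_role):
    """True if any rule grants verbs "*" on resources "*"."""
    for rule in cluster_role.get("rules") or []:
        if "*" in rule.get("verbs", []) and "*" in rule.get("resources", []):
            return True
    return False
```

Rules that carry only `nonResourceURLs` have no `resources` key and are treated as non-matching.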
|
||||
|
||||
### K8 — Anonymous binding
|
||||
|
||||
**Meaning:** A RoleBinding or ClusterRoleBinding was created referencing `system:anonymous` or `system:unauthenticated`. Catastrophic — allows unauthenticated cluster access.
|
||||
|
||||
**Action:** Delete the binding immediately. Audit who created it. Treat as full cluster compromise — rotate all secrets, force kubeconfig re-issue.
|
||||
|
||||
### K9 — Viktor's identity from unexpected source IP
|
||||
|
||||
**Meaning:** A request authenticated as `me@viktorbarzin.me` arrived from a source IP outside the allowlist. Stolen OIDC token / kubeconfig.
|
||||
|
||||
```logql
|
||||
{job="kube-audit"} | json | user_username = "me@viktorbarzin.me" | sourceIPs_0 !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<pod-cidr>|<headscale-cidr>"
|
||||
```
|
||||
|
||||
**Action:** Revoke Viktor's OIDC session in Authentik. Rotate Vault OIDC tokens. Audit recent activity from that IP. Verify Viktor's devices for compromise.
|
||||
|
||||
**False positives:** Viktor's machine on a new network without VPN — should not happen per the "no public IP access" policy. If it does, the policy needs revisiting, not the alert.
|
||||
|
||||
---
|
||||
|
||||
## V-alerts (Vault audit)
|
||||
|
||||
### V1 — Root token created
|
||||
|
||||
```logql
|
||||
{job="vault-audit"} | json | request_path = "auth/token/create" | response_auth_policies = "root"
|
||||
```
|
||||
|
||||
**Action:** Verify against Terraform / planned operation. Root tokens should ONLY be created during initial Vault setup or break-glass.
|
||||
|
||||
### V2 — Audit device disabled/modified
|
||||
|
||||
**Action:** Attacker silencing visibility. Re-enable immediately. Treat as critical compromise.
|
||||
|
||||
### V3 — Seal status changed
|
||||
|
||||
**Action:** Verify whether this is a planned operation (unseal during upgrade). If unplanned, treat as critical.
|
||||
|
||||
### V4 — Policy modified
|
||||
|
||||
**Action:** Confirm change came from a Terraform apply. Allowlist Terraform's source IP / token role. Otherwise: review the policy diff, revert if malicious.
|
||||
|
||||
### V5 — Auth failure spike
|
||||
|
||||
**Action:** Identify the auth method and source. If CI token rotation, false positive. If unknown source brute-forcing, block the source IP at pfSense.
|
||||
|
||||
### V6 — Token with policies different from parent
|
||||
|
||||
**Action:** Privilege escalation attempt. Revoke the new token. Audit the parent token's policies.
|
||||
|
||||
### V7 — Viktor's Vault identity from unexpected source IP
|
||||
|
||||
**Meaning:** A Vault operation authenticated as Viktor's entity_id arrived from an IP not in the allowlist. Requires `x_forwarded_for_authorized_addrs` to be configured (Vault sits behind Traefik so `remote_addr` is Traefik's pod IP without XFF trust).
|
||||
|
||||
**Action:** Revoke Viktor's Vault OIDC tokens. Force OIDC re-auth. Audit Vault access from that IP.
|
||||
|
||||
---
|
||||
|
||||
## S-alerts (Host)
|
||||
|
||||
### S1 — PVE sshd auth success from unexpected IP
|
||||
|
||||
```logql
|
||||
{job="sshd-pve"} |= "Accepted" | regexp "Accepted (?P<method>\\S+) for (?P<user>\\S+) from (?P<ip>\\S+)" | ip !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<headscale-cidr>"
|
||||
```
|
||||
|
||||
**Action:** Remove the user's SSH key from `/root/.ssh/authorized_keys` if it's still there. Audit recent sudo/login history (`last`, `sudo -i; journalctl _COMM=sudo`). Consider PVE as compromised — rotate root password, audit `/root/.luks-backup-key`, audit `/usr/local/bin/lvm-pvc-snapshot` and backup scripts for tampering.
|
||||
|
||||
---
|
||||
|
||||
## False-positive triage decision tree
|
||||
|
||||
```
|
||||
Did the alert fire from a known operational event?
|
||||
├─ Terraform apply at the same time? → likely V4 (policy modified)
|
||||
├─ Keel auto-roll? → not a security path
|
||||
├─ CI/CD pipeline running? → check V5 / K5
|
||||
└─ Viktor doing recovery work? → K4, K9, S1 candidates
   Extend allowlist if persistent
|
||||
```
|
||||
|
||||
## Escalation
|
||||
|
||||
For SEV1 (multiple alerts, cluster-admin grants, anonymous bindings, mass deletes):
|
||||
|
||||
1. Cordon all nodes (`kubectl cordon`) to prevent further pod scheduling — but be aware this also stops legitimate recovery work
|
||||
2. Revoke all OIDC sessions in Authentik
|
||||
3. Rotate Vault root keys + reseal
|
||||
4. Restore from a pre-incident backup if data integrity is questionable
|
||||
5. Post-mortem per `incident-response.md`
|
||||
|
||||
## Related
|
||||
|
||||
- [Security architecture](../architecture/security.md)
|
||||
- [Monitoring architecture](../architecture/monitoring.md)
|
||||
- [Incident response (general)](../architecture/incident-response.md)
|
||||
- Beads epic: `code-8ywc`
|
||||
127
docs/runbooks/synology-storage.md
Normal file
|
|
@ -0,0 +1,127 @@
|
|||
# Runbook: Synology NAS storage — navigate, assess, clean
|
||||
|
||||
**Target:** Synology DS218 (`NAS_Barzini`), `192.168.1.13`, `/volume1`
|
||||
(5.3 TiB btrfs). This is the **offsite backup target** (Copy 3 of the
|
||||
3-2-1 strategy) **and a shared family volume** — homelab data is only
|
||||
under `Backup/Viki/`; `Anca/`, `Emo/`, `Common/`, `music`, `video`,
|
||||
`photo` etc. are family data.
|
||||
|
||||
Related: [storage architecture](../architecture/storage.md) ·
|
||||
[backup & DR](../architecture/backup-dr.md)
|
||||
|
||||
## Access
|
||||
|
||||
- SSH: `ssh Administrator@192.168.1.13` (capital `A`; key-auth works
|
||||
from devvm and the PVE host). `Administrator` can `sudo`.
|
||||
- sudo password: Vault `secret/viktor` → `synology_admin_password`
|
||||
(`VAULT_ADDR=https://vault.viktorbarzin.me`). DSM Web API has 2FA, so
|
||||
**SSH+sudo is the only unattended path** (`read -r PW; printf '%s\n'
|
||||
"$PW" | sudo -S -p '' <cmd>` to keep the secret out of `argv`).
|
||||
|
||||
## ⚠️ NEVER run `du` / `find` / `ncdu` on this NAS
|
||||
|
||||
Recursive walks over the multi-TB `Backup` share take 10+ min (often
|
||||
never finish) and burn disk/IO on the NAS. Use Synology's own
|
||||
pre-indexed data instead:
|
||||
|
||||
| Need | Instant, non-walking source |
|
||||
|---|---|
|
||||
| Volume fill | `df -h /volume1` |
|
||||
| btrfs real usage | `btrfs filesystem df /volume1` |
|
||||
| Per-subvolume | `sudo btrfs qgroup show -prce --raw /volume1` |
|
||||
| **Per-share / per-owner / per-type / largest / oldest / dupes** | **Storage Analyzer weekly report** (below) |
|
||||
|
||||
### Storage Analyzer weekly report
|
||||
|
||||
Storage Analyzer is installed and writes a report every **Monday
|
||||
~00:00** to:
|
||||
|
||||
```
|
||||
/volume1/Backup/Viki/synoreport/weekly storage report/<YYYY-MM-DD_..>/
|
||||
```
|
||||
|
||||
Data is up to ~7 days stale. The useful files are zipped CSVs in
|
||||
`csv/` — **content is UTF-16, and there is no `unzip` on the box**, so
|
||||
read them with Python:
|
||||
|
||||
```python
|
||||
import os
import zipfile

R = ".../<date>/csv"  # report directory (path elided; fill in the report date)

def readcsv(n):
    """Return the text of the first CSV inside zip archive `n` under R."""
    with zipfile.ZipFile(os.path.join(R, n)) as z:
        raw = z.read(z.namelist()[0])
    for enc in ("utf-16", "utf-8-sig", "utf-8"):
        try:
            return raw.decode(enc)
        except UnicodeError:
            pass
|
||||
```
|
||||
|
||||
Key CSVs: `volume_usage`, `share_list` (per-share, incl/excl recycle),
|
||||
`quota_usage.share` (**per-owner within a share**), `file_group`
|
||||
(per-file-type), `large_file`, `least_modify` (oldest), `duplicate_file`.
|
||||
The `*.db` files (`folder.db` etc.) are a **custom Synology format —
|
||||
NOT sqlite**; `report.html` does not embed clean folder totals.
|
||||
|
||||
## btrfs space-reclaim is ASYNCHRONOUS — and snapshot-pinned
|
||||
|
||||
- Deleting files/snapshots returns instantly but `df` lags minutes
|
||||
while the btrfs cleaner reclaims extents (~30 GB/min on the DS218).
|
||||
- Data deleted from the live share **stays on disk until the share
|
||||
snapshots that still reference it also rotate out.** There are 4
|
||||
daily `Backup` share snapshots (`GMT-*-21.00.02`), so **expect up to
|
||||
~4 days of lag** before a delete fully frees space.
|
||||
- Snapshot CLI (sudo, full path): `/usr/syno/sbin/synosharesnapshot
|
||||
{list|delete} Backup <snap>...`. Retention:
|
||||
`/usr/syno/etc/sharesnap/sharesnap.conf`.
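
Both delays above are easy to estimate up front. The sketch below is a back-of-the-envelope calculation using the figures quoted in this section (about 30 GB/min for the cleaner, 4 daily snapshots); the function names are hypothetical and real reclaim rates will vary with load.

```python
from datetime import date, timedelta


def cleaner_minutes(freed_gb, rate_gb_per_min=30.0):
    """Rough time for the btrfs cleaner to reclaim `freed_gb` once unpinned."""
    return freed_gb / rate_gb_per_min


def latest_release_date(deleted_on, daily_snapshots=4):
    """Latest day on which the last snapshot pinning the data rotates out."""
    return deleted_on + timedelta(days=daily_snapshots)


print(cleaner_minutes(600))  # 20.0
```

So a delete made on a given day may not show up in `df` until up to four days later, plus the cleaner time.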
|
||||
|
||||
## Capacity alert
|
||||
|
||||
The Synology mount surfaces to Prometheus as the PVE host NFS mount
|
||||
`/mnt/synology-backup` (`job="proxmox-host"`, `fstype=nfs4`), caught by
|
||||
the **global `NodeFilesystemFull`** rule in
|
||||
`stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`.
|
||||
|
||||
- **2026-06-05:** threshold changed **90% → 95%** (`* 100 < 5`) at
|
||||
user request — a backup target legitimately runs hot, so 90% was
|
||||
noisy. NOTE: this rule is **global**, so the looser 95% now applies to
|
||||
all node/system disks too. `BackupDiskFull` (the sda `/mnt/backup`
|
||||
disk, separate alert) stays at 85%.
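
As a sanity check on the threshold change, the `* 100 < 5` comparison can be evaluated by hand. The sketch below assumes the rule compares free space as a percentage of filesystem size (only the `* 100 < 5` fragment is quoted here, so the exact expression is an assumption) and plugs in the 324 GiB free / 5.3 TiB figures from the current assessment.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def filesystem_full_fires(avail_bytes, size_bytes, min_free_pct=5.0):
    """True when free space, as a percentage of size, drops below the threshold."""
    return avail_bytes / size_bytes * 100 < min_free_pct


# 324 GiB free of 5.3 TiB is roughly 6% free.
assert not filesystem_full_fires(324 * GIB, 5.3 * TIB)                  # 95% threshold
assert filesystem_full_fires(324 * GIB, 5.3 * TIB, min_free_pct=10.0)   # old 90% threshold
```

With those figures the volume sits just under the new threshold and would have fired under the old one.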
|
||||
|
||||
## Current assessment — 2026-06-05
|
||||
|
||||
`/volume1` at **94% (5.0 TiB used / 5.3 TiB, 324 GiB free)**, down from
|
||||
98% on 2026-05-24. The **`Backup` share is 4.42 TiB (86%)**:
|
||||
Administrator/homelab **3.92 TiB**, Emo/family **504 GiB**. By type:
|
||||
Other 1.76 TiB, Videos 1.33 TiB, Pictures 631 GiB, Zipped 495 GiB,
|
||||
DiskImage 77 GiB. The ~1.9 TiB of media is mostly the **Immich offsite
|
||||
backup** (`Viki/nfs/immich` + `nfs-ssd/immich`), which **grows daily —
|
||||
the structural capacity driver now that one-off cleanups are spent.**
|
||||
|
||||
### Already reclaimed (verified gone)
|
||||
|
||||
`Anca/Elements` (770 GiB — dir now empty), `prometheus-backup` (63 GiB),
|
||||
`ollama`/`llamacpp`/`audiblez`/`ebook2audiobook` — removed in the
|
||||
2026-06-01 cleanup; nfs-mirror now excludes the regenerable services.
|
||||
|
||||
### Cleanup candidates — homelab (`Backup/Viki/`, Administrator-owned)
|
||||
|
||||
| Target | Size | Notes |
|
||||
|---|---|---|
|
||||
| `Photos/gphotos-1/` | **208 GiB** zips (+ extracted) | 2023 Google Takeout, **already imported to Immich** (`immich-go.exe` beside them; dupes confirmed). Redundant. |
|
||||
| `laptop/` | ~167 GiB | old VM images (Kali/windows vdis, metasploitable, soton-rpi.img) |
|
||||
| `All-in-one/` | ~95 GiB | 2015–2018 archives |
|
||||
| `#recycle/` (Backup) | ~16 GiB | recycle bin (HA backup rotation) |
|
||||
| loose `*.asc`/`*.mov` in `Viki/` root | ~8 GiB | old encrypted archives, phone videos |
|
||||
| `sgs7/` | ~3.5 GiB | 2021 Galaxy S7 backup |
|
||||
|
||||
**~500 GiB** reclaimable without touching live backups or family data.
|
||||
|
||||
### Cleanup candidates — family (flag to Emo, do not delete)
|
||||
|
||||
- `Emo/D/` Windows 7 vmdks — **3 identical 39.5 GiB copies** (one live +
|
||||
two under `_SYNCAPP/Versioning/`) → 79 GiB dedup.
|
||||
- Emo-shared recycle bin: 12.6 GiB.
|
||||
|
||||
### Do NOT touch
|
||||
|
||||
`Viki/pve-backup/` (live structured backup), `Viki/nfs/immich` +
|
||||
`nfs-ssd/immich` (irreplaceable), `HomeAssistant/` + `ha_backup_vermont/`
|
||||
(~7 GiB, healthy 3-copy retention).
|
||||
51
docs/runbooks/technitium-apply.md
Normal file
|
|
@ -0,0 +1,51 @@
|
|||
# Runbook: Applying the Technitium Terraform stack
|
||||
|
||||
Last updated: 2026-04-19
|
||||
|
||||
The `stacks/technitium/` apply has a **post-apply readiness gate** that asserts all three DNS instances are healthy before the apply is allowed to finish. This runbook explains what it checks, how to interpret failures, and how to override it for emergency maintenance.
|
||||
|
||||
## What the gate checks
|
||||
|
||||
`stacks/technitium/modules/technitium/readiness.tf` defines `null_resource.technitium_readiness_gate`. It runs after the three Technitium deployments, the DNS LoadBalancer service, and the PDB are applied, and performs:
|
||||
|
||||
1. **Rollout status** — `kubectl rollout status deploy/<name> --timeout=180s` for `technitium`, `technitium-secondary`, `technitium-tertiary`. Fails if any deployment has not reached its desired pod count within 180s.
|
||||
2. **Per-pod API health** — for every pod with label `dns-server=true`, executes `wget http://127.0.0.1:5380/api/stats/get` inside the pod and asserts the response contains `"status":"ok"`. Catches Technitium process hangs that TCP probes miss.
|
||||
3. **Zone-count parity** — queries `technitium-web`, `technitium-secondary-web`, `technitium-tertiary-web` and counts the zones returned. Fails if the three counts differ, which would mean `technitium-zone-sync` has drifted or a replica has lost state.
|
||||
|
||||
The gate is re-run whenever any of the deployment container spec, the CoreDNS Corefile, or the apply timestamp changes (see `triggers` in `readiness.tf`).
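
The pass/fail logic of checks 2 and 3 is small enough to sketch. This is not the code in `readiness.tf`; it only illustrates the two assertions described above, with hypothetical function names and with the API responses and zone counts passed in as plain values.

```python
def api_healthy(body):
    # Mirrors the assertion that the response contains "status":"ok".
    return '"status":"ok"' in body


def zone_counts_match(counts):
    # `counts` maps service name -> number of zones returned by that instance.
    return len(set(counts.values())) == 1


def gate_passes(api_bodies, zone_counts):
    """All pods healthy and all instances reporting the same zone count."""
    return all(api_healthy(b) for b in api_bodies) and zone_counts_match(zone_counts)
```

An empty `zone_counts` mapping fails the parity check, so a gate that could not query any instance does not pass by accident.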
|
||||
|
||||
## Emergency override
|
||||
|
||||
Set `skip_readiness=true` via terragrunt inputs or pass it directly to the Terraform apply:
|
||||
|
||||
```bash
|
||||
cd infra/stacks/technitium
|
||||
scripts/tg apply -var skip_readiness=true
|
||||
```
|
||||
|
||||
Only use this when you need to land a Terraform change while one Technitium instance is intentionally offline (e.g., you are replacing its PVC, migrating storage, or recovering a corrupted config DB). Re-apply without the flag once the instance is back.
|
||||
|
||||
You can also target around the gate during emergency work:
|
||||
|
||||
```bash
|
||||
scripts/tg apply -target=kubernetes_config_map.coredns
|
||||
```
|
||||
|
||||
`-target` bypasses the `depends_on` chain feeding the gate, so a single-resource push does not need the gate to pass.
|
||||
|
||||
## Failure modes and responses
|
||||
|
||||
| Symptom | Likely cause | Fix |
|
||||
|---------|--------------|-----|
|
||||
| `rollout status` times out on one deployment | Pod stuck `Pending` (node pressure / anti-affinity with other dns-server pods) or `ImagePullBackOff` | `kubectl describe pod` for events. If anti-affinity is blocking, confirm 3 nodes are Ready. |
|
||||
| API check fails on a pod but readiness probe passes | Technitium process hung but port 53 still accepting TCP (liveness probe is `tcp_socket` on :53) | `kubectl delete pod <name>` — deployment will recreate it. |
|
||||
| Zone count differs between instances | `technitium-zone-sync` CronJob is failing or AXFR is blocked | `kubectl logs -n technitium -l job-name=<latest-zone-sync-job>`. Check `TechnitiumZoneSyncFailed` alert. |
|
||||
| Gate passes but external clients still cannot resolve | Gate only checks in-pod API and intra-cluster zone parity — external path (LoadBalancer → Technitium pod) is not tested | Run the LAN-client drill in `docs/architecture/dns.md` troubleshooting section. |
|
||||
|
||||
## What the gate does NOT check
|
||||
|
||||
- External reachability through the LoadBalancer IP `10.0.20.201` (that would require a LAN-side probe).
|
||||
- CoreDNS health (CoreDNS is patched by `coredns.tf`, not this module's deployments — alerts `CoreDNSErrors` / `CoreDNSForwardFailureRate` catch regressions post-apply).
|
||||
- Upstream resolver health (covered by `CoreDNSForwardFailureRate`).
|
||||
|
||||
For broader end-to-end verification, see `docs/architecture/dns.md` → "Verification" section, or run the Uptime Kuma external DNS probe.
|
||||
217
docs/runbooks/vault-raft-leader-deadlock.md
Normal file
|
|
@ -0,0 +1,217 @@
|
|||
# Runbook: Vault Raft Leader Deadlock + Safe Pod Restart
|
||||
|
||||
Captures the 2026-04-22 incident pattern. When a Vault raft leader enters a
|
||||
stuck goroutine state (port 8201 accepts TCP but RPCs never return), the
|
||||
recovery is *not* `kubectl delete --force`. Force-deleting a Vault pod that
|
||||
holds a stuck NFS mount leaves kernel NFS client state corrupted, which
|
||||
blocks all subsequent NFS mounts from the node and usually requires a VM
|
||||
hard-reset to clear.
|
||||
|
||||
**Related**: [post-mortems/2026-04-22-vault-raft-leader-deadlock.md](../post-mortems/2026-04-22-vault-raft-leader-deadlock.md).
|
||||
|
||||
## Symptoms
|
||||
|
||||
- `https://vault.viktorbarzin.me/v1/sys/health` returns HTTP 503.
|
||||
- Standbys log `msgpack decode error [pos 0]: i/o timeout` every 2s.
|
||||
- `kubectl exec` into a standby shows raft thinks the leader is alive
|
||||
(peers list all `Voter`, leader address populated) but `vault operator
|
||||
raft autopilot state` stalls or errors.
|
||||
- The "leader" pod's logs go silent — no heartbeats, no audit writes,
|
||||
nothing. TCP on 8201 still accepts connections.
|
||||
- ESO-backed secrets stop refreshing (ExternalSecret `SecretSyncedError`).
|
||||
- Woodpecker CI pipelines that read from Vault at plan time hang.
|
||||
|
||||
## 0. Confirm the diagnosis (before touching anything)
|
||||
|
||||
Don't jump to force-delete. Verify the leader is actually stuck, not just
|
||||
slow:
|
||||
|
||||
```sh
|
||||
# 1. Who does raft think the leader is?
|
||||
kubectl exec -n vault vault-0 -c vault -- vault status 2>&1 | \
|
||||
grep -E 'HA Mode|Active Node|Leader|Raft'
|
||||
|
||||
# 2. Is the leader's port open but unresponsive?
|
||||
LEADER_POD=vault-2 # or whichever vault status reports
|
||||
kubectl exec -n vault $LEADER_POD -c vault -- sh -c \
|
||||
'timeout 3 nc -zv 127.0.0.1 8200 2>&1; echo; timeout 3 vault status'
|
||||
|
||||
# 3. Is the active vault service pointing at a real pod?
|
||||
kubectl get endpoints -n vault vault-active -o yaml | \
|
||||
grep -E 'addresses|notReadyAddresses' -A2
|
||||
|
||||
# 4. What do standby logs say?
|
||||
kubectl logs -n vault vault-0 -c vault --tail=40 | grep -iE 'msgpack|decode|rpc'
|
||||
```
|
||||
|
||||
If (2) hangs and (4) shows repeated msgpack errors → stuck leader.
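
That decision rule can be written down explicitly. The sketch below is only an illustration of the rule, with a hypothetical function name; the `min_errors=3` cut-off for "repeated" is an assumption, not a value taken from the incident.

```python
def looks_like_stuck_leader(status_timed_out, standby_log_lines, min_errors=3):
    """True when check (2) hung and check (4) shows repeated msgpack errors."""
    errors = sum("msgpack decode error" in line for line in standby_log_lines)
    return status_timed_out and errors >= min_errors
```

If `vault status` on the leader answers promptly, the function returns false regardless of the standby logs, which matches the "stuck, not just slow" distinction.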
|
||||
|
||||
## 1. Identify the stuck pod precisely
|
||||
|
||||
```sh
|
||||
# Find the pod whose vault_core_active would be 1 if it were scraping
# (currently no telemetry; use logs as a proxy until telemetry is enabled).
# The 5m window is an assumption; widen it if the cluster is normally quiet.
for p in vault-0 vault-1 vault-2; do
  echo "=== $p ==="
  out=$(kubectl logs -n vault "$p" -c vault --since=5m --tail=5 2>&1)
  echo "${out:-no recent output}"
done | grep -B1 'no recent output'
|
||||
```
|
||||
|
||||
The pod whose logs have been silent for minutes while the others are
|
||||
actively erroring is the stuck leader.
|
||||
|
||||
## 2. The safe restart sequence (avoids zombie containers)
|
||||
|
||||
**DO NOT** `kubectl delete pod --force --grace-period=0` as the first
|
||||
step. On NFS-backed Vault that's the exact move that leaves the kernel
|
||||
NFS client corrupted on the node where the stuck pod ran.
|
||||
|
||||
Instead:
|
||||
|
||||
### 2a. Graceful delete first (30s grace)
|
||||
|
||||
```sh
|
||||
kubectl delete pod -n vault vault-2
|
||||
```
|
||||
|
||||
Wait 30 seconds. Most of the time the SIGTERM → SIGKILL path works and
the new pod schedules cleanly. The remaining raft peers elect a new
leader and the external endpoint recovers.
|
||||
|
||||
### 2b. If the pod is Terminating after 60s, find the stuck process
|
||||
|
||||
```sh
|
||||
NODE=$(kubectl get pod -n vault vault-2 -o jsonpath='{.spec.nodeName}')
POD_UID=$(kubectl get pod -n vault vault-2 -o jsonpath='{.metadata.uid}')
|
||||
|
||||
ssh $NODE "sudo ps auxf | grep -A2 $POD_UID | head -20"
|
||||
# Look for: mount.nfs (D-state), vault (Z-state), or the sh wrapper in do_wait
|
||||
```
|
||||
|
||||
### 2c. Unmount stale NFS before force-deleting
|
||||
|
||||
If the old pod's NFS mount is still present, lazy-unmount it FIRST so
|
||||
the kernel can release NFS session state cleanly:
|
||||
|
||||
```sh
|
||||
ssh $NODE "sudo mount | grep $POD_UID | awk '{print \$3}' | xargs -I{} sudo umount -l {}"
|
||||
```
|
||||
|
||||
Verify no mount.nfs processes are in D-state on the node:
|
||||
|
||||
```sh
|
||||
ssh $NODE "ps -eo state,pid,comm | grep '^D' | head -5"
|
||||
```
|
||||
|
||||
### 2d. Only NOW force-delete if needed
|
||||
|
||||
```sh
|
||||
kubectl delete pod -n vault vault-2 --force --grace-period=0
|
||||
```
|
||||
|
||||
## 3. Recovery when the node is already stuck
|
||||
|
||||
If you force-deleted before reading this runbook and NFS is now broken
|
||||
on the node:
|
||||
|
||||
**Diagnostic — confirm NFS client state is corrupted:**
|
||||
|
||||
```sh
|
||||
NODE=k8s-node2 # node where the force-delete happened
|
||||
ssh $NODE "sudo mkdir -p /tmp/nfstest && sudo timeout 30 \
|
||||
mount -t nfs 192.168.1.127:/srv/nfs /tmp/nfstest && echo MOUNT_OK"
|
||||
```
|
||||
|
||||
If the mount times out at 30-110s, kernel NFS client state is stuck.
|
||||
No userspace recovery exists — only a VM reboot clears it.
|
||||
|
||||
**Workaround before rebooting**: mounting with `nfsvers=4.1` succeeds
|
||||
on broken nodes (the corruption is NFSv4.2 session-state specific).
|
||||
This is useful for diagnostic mounts, but does NOT fix CSI pods —
|
||||
their mount options come from the `nfs-proxmox` StorageClass and can't
|
||||
be overridden per-pod.
|
||||
|
||||
**Reboot the affected node VM:**
|
||||
|
||||
```sh
|
||||
# Find PVE VM ID — nodes numbered 201-204 for k8s-node1..4
|
||||
ssh root@192.168.1.127 "qm reset 20<N>"
|
||||
|
||||
# If qm reset leaves the VM PID unchanged (it didn't actually reboot),
|
||||
# use qm stop/start:
|
||||
ssh root@192.168.1.127 "qm stop 20<N> && qm start 20<N>"
|
||||
```
|
||||
|
||||
Wait for the node to become Ready (`kubectl get node k8s-node<N> -w`)
|
||||
and CSI driver to register (`kubectl get pods -n nfs-csi -o wide`).
|
||||
|
||||
**Gotcha — `qm reset` can be a no-op.** On the 2026-04-22 incident,
|
||||
`qm reset 201` returned exit 0 but did NOT restart the VM (same QEMU PID
|
||||
before and after). `qm status` reported "running" throughout. Always
|
||||
verify by checking the QEMU PID or VM uptime post-reset. If uptime is
|
||||
unchanged, escalate to `qm stop && qm start`.
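The uptime check can be scripted. This is a minimal sketch, not part of the repo: the helper names `node_uptime_s` and `booted_recently` and the 600-second threshold are assumptions. It relies on guest uptime rather than the QEMU PID, because a guest-level reset may not replace the QEMU process even when it works.

```sh
# Guest uptime in whole seconds, read over SSH from the node itself.
# $1 = node hostname (e.g. k8s-node1)
node_uptime_s() {
  ssh "$1" "cut -d. -f1 /proc/uptime"
}

# Succeeds if the uptime is below the threshold, i.e. the guest booted recently.
# $1 = uptime in seconds, $2 = threshold in seconds (assumed default: 600)
booted_recently() {
  [ "$1" -lt "${2:-600}" ]
}

# Usage sketch (run after `qm reset` and a short wait):
#   booted_recently "$(node_uptime_s k8s-node1)" \
#     || echo "uptime unchanged: reset was a no-op, use qm stop/start"
```

An empty uptime (for example, SSH failed because the node is still down) makes `booted_recently` fail, so an unreachable node is never reported as rebooted.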
|
||||
|
||||
**Gotcha — check boot order before stop/start.** Long-running VMs
|
||||
(630+ day uptime) may have stale `bootdisk:` config that's been hidden
|
||||
by never rebooting. On 2026-04-22, k8s-node1's config had `bootdisk:
|
||||
scsi0` but the actual OS disk was on `scsi1`, so the first boot after
|
||||
stop attempted iPXE and failed. Before stopping, verify:
|
||||
|
||||
```sh
|
||||
ssh root@192.168.1.127 "grep -E 'boot|scsi[0-9]+:' /etc/pve/qemu-server/20<N>.conf"
|
||||
```
|
||||
|
||||
If `bootdisk` references a disk ID that doesn't exist, fix it first
|
||||
with `qm set 20<N> --boot "order=scsi<ID>"` (use the ID of the main
|
||||
OS disk).
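The "disk ID that doesn't exist" case can be checked mechanically. The sketch below is illustrative only: `bootdisk_is_defined` is a hypothetical helper, and it assumes the config uses the legacy `bootdisk:` key (configs that only carry `boot: order=...` are not covered). It cannot tell whether the referenced disk actually holds the OS, so still compare against the disk list by eye.

```sh
# Succeeds if the disk named by `bootdisk:` has its own `<disk>:` line in the
# given VM config text. Fails if `bootdisk:` is absent or names an undefined disk.
# $1 = contents of /etc/pve/qemu-server/<vmid>.conf
bootdisk_is_defined() {
  disk=$(printf '%s\n' "$1" | sed -n 's/^bootdisk:[[:space:]]*//p' | head -n 1)
  [ -n "$disk" ] || return 1
  printf '%s\n' "$1" | grep -q "^${disk}:"
}

# Usage sketch:
#   conf=$(ssh root@192.168.1.127 "cat /etc/pve/qemu-server/201.conf")
#   bootdisk_is_defined "$conf" || echo "bootdisk is missing or stale: fix before qm stop"
```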
|
||||
|
||||
## 4. Prevent re-infection — the chown loop
|
||||
|
||||
After the node comes back, the vault pod's PV chown walk can still
|
||||
peg kubelet. The durable fix is in `stacks/vault/main.tf`:
|
||||
|
||||
```hcl
|
||||
statefulSet = {
|
||||
securityContext = {
|
||||
pod = {
|
||||
fsGroupChangePolicy = "OnRootMismatch"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This was applied in commit `2f1f9107` (2026-04-22). If you find
|
||||
yourself editing this in a kubectl patch for live recovery, follow
|
||||
up with a Terraform apply in the same session; leaving the cluster
|
||||
ahead of Terraform state is technical debt that re-triggers on the
|
||||
next apply.
|
||||
|
||||
## 5. Verify end-to-end
|
||||
|
||||
```sh
|
||||
# External endpoint — the user-facing health check
|
||||
curl -sk -o /dev/null -w "%{http_code}\n" https://vault.viktorbarzin.me/v1/sys/health
|
||||
# expect: 200
|
||||
|
||||
# Raft peers (needs VAULT_TOKEN with operator capability)
|
||||
kubectl exec -n vault vault-0 -c vault -- vault operator raft list-peers
|
||||
|
||||
# All pods 2/2
|
||||
kubectl get pods -n vault -l app.kubernetes.io/name=vault -o wide
|
||||
|
||||
# No alerts fired (once VaultRaftLeaderStuck + VaultHAStatusUnavailable are live)
|
||||
curl -s https://alertmanager.viktorbarzin.me/api/v2/alerts | \
|
||||
jq '.[] | select(.labels.alertname | test("Vault"))'
|
||||
```
|
||||
|
||||
## Known limitations
|
||||
|
||||
- **No alert for stuck leaders yet.** `VaultRaftLeaderStuck` and
|
||||
`VaultHAStatusUnavailable` require Vault telemetry enabled
|
||||
(`telemetry { unauthenticated_metrics_access = true }`) and a
|
||||
scrape job. Alerts are defined in `prometheus_chart_values.tpl`
|
||||
but stay silent until telemetry lands — tracked as a beads task.
|
||||
- **Vault on NFS violates the documented rule.** `infra/.claude/CLAUDE.md`
|
||||
says critical services must use `proxmox-lvm-encrypted`. The
|
||||
`dataStorage`/`auditStorage` still use `nfs-proxmox`. Migration
|
||||
tracked as an epic-level beads task.
|
||||
114
docs/runbooks/vault-token-renew-devvm.md
Normal file
|
|
@ -0,0 +1,114 @@
|
|||
# Runbook: devvm Vault token auto-renewal
|
||||
|
||||
**Host:** `devvm` (10.0.10.10), user `wizard`
|
||||
**Source of truth:** `infra/scripts/vault-token-renew.{sh,service,timer}`
|
||||
**Live paths:** `~/.local/bin/vault-token-renew`, `~/.config/systemd/user/vault-token-renew.{service,timer}`
|
||||
|
||||
## What this is
|
||||
|
||||
`wizard@devvm` authenticates to Vault with a **periodic, orphan** token stored
|
||||
in `~/.vault-token`, instead of a 7-day OIDC login that needed weekly
|
||||
re-auth. A systemd **user** timer renews it daily so it never expires.
|
||||
|
||||
| Property | Value |
|
||||
|---|---|
|
||||
| `display_name` | `token-devvm-wizard` |
|
||||
| `period` | `768h` (32 days) |
|
||||
| `explicit_max_ttl` | `0` (no hard cap) |
|
||||
| `policies` | `default`, `sops-admin`, `vault-admin` |
|
||||
| `orphan` | `true` (not revoked when any parent expires) |
|
||||
|
||||
Periodic tokens have no max-TTL; they only need renewing once per `period`.
|
||||
Daily renewal leaves a 32× margin. **If devvm is decommissioned and the timer
|
||||
stops, the token self-expires within ~32 days** — deliberately, unlike a root
|
||||
token which would live forever (this is the security trade-off Viktor chose:
|
||||
periodic + renewer over a never-expiring root token).
|
||||
|
||||
## Deploy on a fresh devvm
|
||||
|
||||
The renewer is a host-side script + user systemd units, deployed manually (same
|
||||
model as the other `infra/scripts/` host scripts). From a checkout of the repo
|
||||
**as user `wizard` on devvm**:
|
||||
|
||||
```bash
|
||||
cd ~/code/infra/scripts
|
||||
install -m 0755 vault-token-renew.sh ~/.local/bin/vault-token-renew # strip .sh
|
||||
install -m 0644 vault-token-renew.service vault-token-renew.timer ~/.config/systemd/user/
|
||||
|
||||
# user manager must survive logout, so the daily timer fires headless
|
||||
loginctl enable-linger "$USER"
|
||||
|
||||
systemctl --user daemon-reload
|
||||
systemctl --user enable --now vault-token-renew.timer
|
||||
```
|
||||
|
||||
Then mint the token (one-time, interactive — see below). The script and units
|
||||
carry no secret; only the token itself is sensitive and stays out of git.
|
||||
|
||||
## Mint / re-mint the token
|
||||
|
||||
Requires an interactive OIDC login (browser), so it can't run unattended:
|
||||
|
||||
```bash
|
||||
export VAULT_ADDR=https://vault.viktorbarzin.me
|
||||
vault login -method=oidc
|
||||
vault token create -orphan -period=768h \
|
||||
-policy=vault-admin -policy=sops-admin -display-name=devvm-wizard \
|
||||
-field=token > ~/.vault-token
|
||||
chmod 600 ~/.vault-token
|
||||
```
|
||||
|
||||
Vault prefixes the display name, so it becomes `token-devvm-wizard` (which is
|
||||
what the drift guard checks for). `-orphan` is essential: a child of the 7-day
|
||||
OIDC token would be revoked when that parent expired.
|
||||
|
||||
## Health check
|
||||
|
||||
```bash
|
||||
export VAULT_ADDR=https://vault.viktorbarzin.me
|
||||
vault token lookup | grep -E 'display_name|period|explicit_max_ttl|policies'
|
||||
# expect: display_name token-devvm-wizard, period 768h, explicit_max_ttl 0s,
|
||||
# policies [default sops-admin vault-admin]
|
||||
|
||||
# authoritative write-capability check (do NOT trust the policies field alone —
|
||||
# an OIDC token shows policies=[default] but carries vault-admin via identity):
|
||||
vault token capabilities secret/data/viktor # expect create/update/.../sudo
|
||||
|
||||
# renewer health
|
||||
systemctl --user list-timers | grep vault-token-renew # next/last run
|
||||
tail -5 ~/.local/state/vault-token-renew.log # recent results
|
||||
```
|
||||
|
||||
A healthy log line looks like:
|
||||
`<ts> OK renewed (dn=token-devvm-wizard ttl=2764800s)` (ttl 2764800s = 768h).
|
||||
|
||||
## Drift guard & recovery
|
||||
|
||||
`~/.vault-token` is the Vault CLI's default token sink, so **any** `vault login`
|
||||
overwrites it. Two confirmed clobber vectors:
|
||||
|
||||
1. `vault login -method=oidc` → replaces it with a 7-day OIDC token (the renewer
|
||||
can't push past the OIDC role's 7-day `token_max_ttl`).
|
||||
2. A stray `vault login -method=kubernetes` (e.g. a headless agent flow) →
|
||||
writes a read-only `kubernetes-woodpecker-default` token (can read Vault but
|
||||
**cannot** write `secret/*`). This happened 2026-06-05 and went unnoticed for
|
||||
two days — reads worked, writes silently 403'd.
|
||||
|
||||
To stop the renewer from silently keeping a foreign token alive, it runs a
|
||||
**drift guard** first: it refuses to renew unless the token is
|
||||
`token-devvm-wizard` **and** carries `vault-admin`. On drift it logs loudly and
|
||||
exits non-zero (the systemd unit goes `failed`) rather than renewing someone
|
||||
else's token. Symptom in the log:
|
||||
|
||||
`<ts> DRIFT: ~/.vault-token is dn=... policies=... Refusing to renew a foreign token. Re-mint: ...`
|
||||
|
||||
**Recovery: re-mint** (the DRIFT log line contains the exact command) — run the
|
||||
[mint/re-mint](#mint--re-mint-the-token) block. The drift guard detects but does
|
||||
**not** auto-recover (a deliberate scope choice — version-only, no self-heal);
|
||||
recovery is the manual re-mint above.
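The drift-guard decision described above can be sketched as follows. This is not the code in `vault-token-renew.sh` (not shown here). The function names `is_expected_token` and `lookup_fields` are hypothetical, and the sketch assumes `jq` is installed. It relies only on the `data.display_name` and `data.policies` fields of `vault token lookup -format=json`.

```sh
# Decide whether the token is the expected devvm token.
# $1 = display_name, $2 = space-separated policy list
# Returns 0 when it is safe to renew, 1 on drift.
is_expected_token() {
  [ "$1" = "token-devvm-wizard" ] || return 1
  case " $2 " in
    *" vault-admin "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Populate $dn and $policies from the token currently in ~/.vault-token.
lookup_fields() {
  json=$(vault token lookup -format=json) || return 1
  dn=$(printf '%s' "$json" | jq -r '.data.display_name')
  policies=$(printf '%s' "$json" | jq -r '.data.policies | join(" ")')
}

# Usage sketch:
#   lookup_fields && is_expected_token "$dn" "$policies" \
#     || { echo "DRIFT: refusing to renew a foreign token" >&2; exit 1; }
```

Keeping the decision in a function that takes plain strings is what makes it unit-testable without a live Vault.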
|
||||
|
||||
## Tests
|
||||
|
||||
`infra/scripts/test-vault-token-renew.sh` unit-tests the drift-guard decision
|
||||
and the lookup-JSON parsers (including the exact 2026-06-05 woodpecker-clobber
|
||||
case). Run: `bash infra/scripts/test-vault-token-renew.sh`.
|
||||
86
docs/runbooks/woodpecker-onboard-forgejo-repo.md
Normal file
|
|
@ -0,0 +1,86 @@
|
|||
# Runbook: Onboarding a new Forgejo repo to Woodpecker
|
||||
|
||||
Last updated: 2026-05-07
|
||||
|
||||
## Programmatic (preferred)
|
||||
|
||||
```bash
|
||||
infra/scripts/woodpecker-register-forgejo-repo.sh viktor/<repo-name>
|
||||
```
|
||||
|
||||
The script:
|
||||
1. Pulls the `viktor` (Forgejo-OAuth'd) user's `hash` from the
|
||||
Woodpecker PG `users` table.
|
||||
2. Mints a session JWT (HS256, signed with that hash) — Woodpecker
|
||||
per-user session JWTs have payload
|
||||
`{"type":"user","user-id":"<id>"}` and the signing key is the
|
||||
user's `hash` column. (Confirmed against a known-good admin
|
||||
token: same payload shape, signature reproducible from the user's
|
||||
stored hash via `openssl dgst -sha256 -hmac "$HASH"`.)
|
||||
3. Looks up the Forgejo repo id and POSTs to
|
||||
`https://ci.viktorbarzin.me/api/repos?forge_remote_id=<id>` as
|
||||
that user. Woodpecker server creates the per-repo webhook +
|
||||
per-repo signing key on the Forgejo side automatically (uses
|
||||
the user's stored Forgejo OAuth `access_token` to do so — that's
|
||||
why this only works with viktor's user, not the GitHub admin's).
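Step 2 can be sketched in plain shell with `openssl`. This is an illustration, not the code in `woodpecker-register-forgejo-repo.sh`: the function names are hypothetical and the standard `{"alg":"HS256","typ":"JWT"}` header is an assumption. The payload shape and the use of the `hash` column as the HMAC key come from the description above.

```sh
# base64url without padding, as used by JWTs (reads stdin).
b64url() {
  openssl base64 -A | tr '+/' '-_' | tr -d '='
}

# Print an HS256 JWT for a Woodpecker user.
# $1 = Woodpecker user id, $2 = the user's `hash` value (used as the HMAC key)
mint_session_jwt() {
  header=$(printf '%s' '{"alg":"HS256","typ":"JWT"}' | b64url)
  payload=$(printf '{"type":"user","user-id":"%s"}' "$1" | b64url)
  sig=$(printf '%s' "$header.$payload" |
    openssl dgst -sha256 -hmac "$2" -binary | b64url)
  printf '%s.%s.%s\n' "$header" "$payload" "$sig"
}

# Usage sketch ($HASH read from the Woodpecker `users` table):
#   TOKEN=$(mint_session_jwt 2 "$HASH")
```

Passing the key via `-hmac` exposes it in the process list for the duration of the call, which is worth keeping in mind on shared hosts.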
|
||||
|
||||
Pre-requisites:
|
||||
- `vault login -method=oidc` with read access to
|
||||
`database/static-creds/pg-woodpecker`.
|
||||
- `kubectl` cluster access (the script spawns a 5-min psql pod in
|
||||
the `woodpecker` namespace to query the DB).
|
||||
- A Forgejo PAT in `secret/viktor/forgejo_admin_token` (or pass
|
||||
`FORGEJO_TOKEN=…` env), used to look up the repo's numeric ID.
|
||||
- The `viktor` Woodpecker user must already exist (i.e., they've
|
||||
logged in via Forgejo OAuth at least once on the Web UI).
|
||||
If user_id=2 / forge_id=2 doesn't exist in `users`, the OAuth
|
||||
bootstrap is unavoidable — but it only needs to happen once for
|
||||
the lifetime of the Woodpecker DB.
|
||||
|
||||
## Why the GitHub admin token can't do this
|
||||
|
||||
The earlier 500 from `POST /api/repos?forge_remote_id=N` was
|
||||
because my admin session token authenticates as `ViktorBarzin`
|
||||
(GitHub user, forge_id=1). Woodpecker tries to call Forgejo as
|
||||
that user (using their stored Forgejo OAuth token) — which doesn't
|
||||
exist for the GitHub user, hence the lookup error. There's no way
|
||||
around this without acting as the Forgejo user.
|
||||
|
||||
## Why the previous "JWT for the webhook" approach didn't work
|
||||
|
||||
I tried generating a webhook JWT signed with `WOODPECKER_AGENT_SECRET`
|
||||
(the global agent secret) and registering it directly on Forgejo.
|
||||
That fails because the webhook JWT verification path runs through a
|
||||
DB-backed `keyfunc` — Woodpecker stores a per-repo signing key when
|
||||
the repo is activated, and rejects any JWT signed with a different
|
||||
key. POST /api/repos is what creates that per-repo key.
|
||||
|
||||
## After registration
|
||||
|
||||
Pipelines fire automatically on push. The `WOODPECKER_FORGE_TIMEOUT`
|
||||
default of 3s was too tight for our cluster (Forgejo response time
|
||||
spikes to 1-2s under load) — bumped to 30s in
|
||||
`infra/stacks/woodpecker/values.yaml` 2026-05-07. Without that bump,
|
||||
config-loader hits the deadline and every pipeline errors with
|
||||
`could not load config from forge: context deadline exceeded`.
|
||||
|
||||
## When the v3.13 → v3.14 server upgrade matters
|
||||
|
||||
`v3.14.0` doesn't fix the forge timeout on its own: the default is the
|
||||
same. Set `WOODPECKER_FORGE_TIMEOUT` regardless of version. The
|
||||
v3.14 upgrade was useful for unrelated forge-API changes (smarter
|
||||
config-loader, fewer redundant calls per trigger).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
- Pipeline status `error` with `could not load config from forge`:
|
||||
bump `WOODPECKER_FORGE_TIMEOUT`. 30s is plenty.
|
||||
- Pipeline status `error` with `secret "registry-password" not found`:
|
||||
the repo's `.woodpecker.yml` still references registry-private
|
||||
credentials. Drop the `registry.viktorbarzin.me` block — Forgejo
|
||||
is the only registry now.
|
||||
- Pipeline status `failure` with `"/vault": not found` (or any
|
||||
other COPY of a binary): the gitignored binary wasn't pushed to
|
||||
Forgejo. Switch the Dockerfile to `curl … && unzip` from the
|
||||
HashiCorp/upstream release URL. See `claude-agent-service/Dockerfile`
|
||||
commit bab6dd2 for the pattern.
|
||||