Commit graph

89 commits

Author SHA1 Message Date
Viktor Barzin
c948dc0dbe backup pipeline: flock manifest + cap + drop LAN -z
Three more audit fixes from the 2026-05-24 backup-pipeline review:

#5 (S1 race) — manifest flock
  daily-backup and nfs-mirror both append to /mnt/backup/.changed-files.
  If they overlap (nfs-mirror Mon 04:11 running long, daily-backup
  starting Mon 05:00), concurrent appends from `find | tee` and
  `find | sed >>` could interleave mid-line — partial paths would slip
  past rsync's --files-from. Both scripts now share a manifest_append()
  helper using `flock -x` on /mnt/backup/.changed-files.lock. The 4
  daily-backup call sites + the 1 nfs-mirror call site all pipe through
  it instead of redirecting directly.

#7 (S2 unbounded manifest)
  daily-backup gains check_manifest_size() invoked after the PVE-config
  append (the last manifest writer of the run). Above MANIFEST_MAX_LINES
  (500k) it touches /mnt/backup/.force-full-sync — offsite-sync's Step 1
  now treats that flag the same as day-of-month ≤ 7 (full sync with
  --delete) and clears it on success. Catches the "Synology unreachable
  for many days" edge case where the manifest would grow unbounded.

#9 (wear — drop -z on LAN hops)
  offsite-sync rsync calls to Synology over the same 192.168.1.0/24
  gigabit LAN had `-rltz`. Compression burns CPU on the PVE host (already
  IO-busy) and gives nothing on a saturated GigE link. Dropped to `-rlt`
  on all 5 offsite rsync invocations (Step 1 full + Step 1 incremental +
  Step 2 full nfs + Step 2 full nfs-ssd + Step 2 incremental).

Other adjustments:
- nfs-mirror's find-after-rsync now also excludes the new state files
  (.changed-files.lock, .force-full-sync) when populating the manifest.
- offsite-sync Step 1 full-sync excludes the same .force-full-sync flag
  so it doesn't ship to Synology.

Deployed to PVE host (/usr/local/bin/{daily-backup,nfs-mirror,
offsite-sync-backup}). Currently in-flight nfs-mirror run is unaffected
(bash loaded the old script into memory at start). Next runs use the
new behaviour.

Refs: 2026-05-24 audit Section 2 items #1 (manifest race), #4 (unbounded
manifest), #6 (LAN -z wear).
2026-05-24 16:27:42 +00:00
Viktor Barzin
4798583db7 backup pipeline: S1 fixes from 2026-05-24 audit
Three immediate fixes surfaced by the backup-pipeline audit:

1. **S1 silent-loss race fix** (daily-backup.sh:142): remove the
   `> "${MANIFEST}"` truncation at the start of daily-backup. Truncation
   already lives in offsite-sync-backup at line 159, gated on a successful
   sync. With both scripts truncating, an offsite-sync failure followed by
   the next morning's daily-backup would silently wipe yesterday's
   unconsumed manifest entries — those files would only reach Synology
   via the monthly full sync (1st-7th of month). Now only offsite-sync
   truncates, and only on success.

2. **Missing alert OffsiteBackupSyncFailing**: documented in backup-dr.md
   but was never added to prometheus_chart_values.tpl. Step 1 or Step 2
   failure pushes offsite_sync_last_status=1 but nothing read it. Added.

3. **wear: drop `-z` from local-only rsyncs** (daily-backup.sh:218 PVC
   snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda
   transfers — compression wastes CPU and yields nothing (gigabit local
   path, intermediate disk doesn't benefit).

Bonus cleanups (zero functional impact):
- "Weekly backup starting/complete" → "daily-backup starting/complete"
  (the timer is daily, not weekly — legacy from earlier monthly-rotation
  schedule).
- "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no
  Step 1 above).
- **wear: pfSense full filesystem tar now Sunday-only** instead of daily.
  config.xml stays daily (it's the primary restore artifact and tiny).
  Full tar is forensic recovery only — re-tarring ~100MB+ daily writes
  ~3G/month to sda + Synology for unchanged content. Weekly is plenty.

docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to
reflect today's two-leg architecture; added a "2026-05-24 session"
changelog summary at the top; added a "Synology snapshot management"
subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated
by 2FA so this is the only programmatic path); updated Key Files table
with nfs-mirror + the Synology SSH access notes.

Open follow-ups from the audit (S2 — file as beads if pursued):
- Factor two-leg invariant into /etc/backup-skip-list.conf sourced by
  both nfs-mirror.sh and offsite-sync-backup.sh.
- Manifest write-collision flock between nfs-mirror Mon 04:11 and
  daily-backup Mon 05:00.
- Unbounded manifest cap (force full sync if > 500k lines).
- Synology free-space scraper + alert.
- LVM thin pool meta-pool fill alert.
- nfs-change-tracker.service heartbeat to Pushgateway.
- Synology config drift TF surface (snap retention, share defs).
2026-05-24 16:18:44 +00:00
Viktor Barzin
9277d71d81 nfs-mirror: append transferred files to offsite-sync manifest
Some checks failed
ci/woodpecker/push/default Pipeline is running
ci/woodpecker/push/build-cli Pipeline failed
Step 1 of offsite-sync-backup is incremental on non-monthly days,
driven by /mnt/backup/.changed-files which only daily-backup wrote
to. nfs-mirror's writes were therefore invisible to Step 1 until the
next monthly --delete pass — which would *also* wipe data
pre-positioned on Synology pve-backup/ (e.g. the in-place btrfs
rename we just did to relocate ~160G of NFS subtrees from
/Backup/Viki/nfs/<svc>/ to /Backup/Viki/pve-backup/<svc>/).

Fix: snapshot a timestamp before rsync, then after rsync use
`find -newer $STAMP -type f -printf '%P\n'` to enumerate every file
nfs-mirror created/modified and append to the manifest. Paths are
relative to /mnt/backup/ (matches Step 1 --files-from expectation).
State files are excluded.

The current in-flight first run started before this patch was
deployed, so its writes won't auto-populate the manifest — a one-off
manual backfill will be done after it completes.
2026-05-24 15:32:22 +00:00
Viktor Barzin
15745eab2f backup: retire anca-elements-mirror + anca-elements-sync.sh
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Both subsumed by nfs-mirror (deployed earlier this session) — see
commit 4d756be4. anca-elements-sync.sh is now dead code because its
upstream (Synology /volume1/Backup/Anca/Elements) was deleted today
once the sda mirror was parity-verified (109,624 files /
827,480,937,976 bytes equal both sides). PVE NFS is the source of
truth for the archive from here on.

Final script inventory on the PVE host (down from 6 to 4):
- /usr/local/bin/daily-backup           (block PVCs + sqlite + pfsense)
- /usr/local/bin/lvm-pvc-snapshot       (snapshot management)
- /usr/local/bin/nfs-mirror             (NFS local mirror to sda)
- /usr/local/bin/offsite-sync-backup    (sda + bypass-list NFS to Synology)
2026-05-24 14:58:55 +00:00
Viktor Barzin
4d756be4f5 backup: consolidate to one local-mirror script + invert offsite filter
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline failed
Before this commit, the in-flight design split anca-elements (its own
mirror script + timer) from the rest of /srv/nfs (still going to
Synology via inotify-tracked offsite-sync). It also meant Synology
received some bytes via both paths (sda → Synology AND direct NFS →
Synology), which doubled consumption.

This commit collapses both into a clean 3-2-1:

  Copy 1 (sdc):       live /srv/nfs/* + cluster block PVCs
  Copy 2 (sda):       /mnt/backup/{pvc-data,sqlite-backup,pfsense,
                                   pve-config,<critical-nfs>/}
                      ← daily-backup + nfs-mirror (one script each)
  Copy 3 (Synology):  /Backup/Viki/{pve-backup,nfs,nfs-ssd}
                      ← offsite-sync-backup Step 1 (sda → Synology)
                        + Step 2 (sda-BYPASS paths only → Synology direct)

scripts/nfs-mirror.{sh,service,timer}:
  New consolidated weekly mirror. Replaces anca-elements-mirror (to be
  removed in a follow-up after the current in-flight rsync completes,
  parity-verified, and Synology source-of-truth is deleted). Single
  rsync /srv/nfs/ → /mnt/backup/ with an explicit EXCLUDES list that
  drops paths not worth a local 2nd copy: immich (1.2T — too big),
  frigate (14d ring), prometheus/loki (rebuildable), ollama/llamacpp/
  audiblez/ebook2audiobook (re-fetchable), *-backup (already backups),
  temp/alertmanager (transient). Nice=10, IOSchedulingClass=idle.

scripts/offsite-sync-backup.sh:
  Step 2 (NFS → Synology) filter inverted: instead of `--exclude=
  anca-elements/`, it now `--include`s only the sda-BYPASS paths
  (immich, frigate, prometheus, *-backup, …). The bypass-include
  regex MUST stay in lockstep with nfs-mirror's EXCLUDES — they are
  complementary and any drift creates either gaps or duplication on
  Synology. Comment in the script flags this.

monitoring alerts: renamed AncaElementsMirror{Stale,Failing} to
NfsMirror{Stale,Failing} matching the new metric job name
`nfs-mirror`. Thresholds unchanged.

docs/architecture/backup-dr.md: rewritten Step 1/Step 2 sections and
added the bypass-list rationale + cross-reference between scripts.

NOT YET DEPLOYED — gated on the in-flight anca-elements-mirror rsync
finishing + parity verification + Synology /volume1/Backup/Anca/
Elements deletion. The old scripts (anca-elements-{mirror,sync.sh})
remain on the PVE host until then, and will be removed in a cleanup
commit.
2026-05-24 12:49:20 +00:00
Viktor Barzin
6db64fe060 anca-elements: weekly local mirror sdc → sda (replaces Synology as 2nd copy)
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Synology is being removed as a host for the Anca/Elements archive
(770G). /srv/nfs/anca-elements on PVE becomes the source of truth;
sda /mnt/backup/anca-elements becomes the single-disk-failure mirror.
No offsite for this archive — by design.

- scripts/anca-elements-mirror.sh: rsync -rlt --delete -H, idempotent,
  pushes anca_elements_mirror_last_{run_timestamp,status,bytes} to
  Pushgateway, lockfile in /run, SIGTERM-safe (status=2 on abort).
- .service: oneshot, Nice=10, IOSchedulingClass=idle, 5h timeout.
- .timer: weekly Mon 04:00, Persistent=true, 15-min randomised delay.

Deployed to PVE host; timer enabled; initial 770G sync running in
background. Synology original to be deleted after first run completes
and parity is verified.

docs/architecture/backup-dr.md: documents Layer 3a + updated path
exclusion rationale (PVE is now upstream, not downstream).
2026-05-24 11:51:52 +00:00
Viktor Barzin
34f8c0f537 docs+scripts: lock in nextcloud-as-PVE-NFS-browser surface
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
- docs/architecture/storage.md: new "Nextcloud as PVE-NFS browser"
  section documenting mount-per-archive + applicable_users model,
  why mount-level ACL beats Files Access Control on NC 30/31, the
  manifest shape (with current applicableUsers + enableSharing
  fields), and the trade-off
- docs/runbooks/nextcloud-add-archive.md: 5-step runbook to surface
  a new directory under /srv/nfs/* to specific NC users via the
  bootstrap Job
- scripts/anca-elements-sync.sh: deployed at
  /usr/local/bin/anca-elements-sync.sh on the PVE host; fpsync from
  Synology Anca/Elements to /srv/nfs/anca-elements (idempotent +
  resumable). The PVE replica is what the NC /anca-elements mount
  serves; the offsite-sync pipeline excludes this path (committed
  earlier this session) so we don't write it back to Synology

NC usernames are admin/anca/emo (not display names — admin is
Viktor). Stale "viktor" references in the manifest example dropped.
2026-05-24 11:45:01 +00:00
Viktor Barzin
05f047f290 offsite-sync-backup + nfs-change-tracker: exclude /srv/nfs/anca-elements
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
The 771G under /srv/nfs/anca-elements is a downstream replica synced
FROM Synology (/volume1/Backup/Anca/Elements) by anca-elements-sync.sh.
The offsite-sync pipeline was copying it back to Synology under
/volume1/Backup/Viki/nfs/anca-elements, creating a self-duplicate
(~122G already partially copied during the last monthly full sync).

- nfs-change-tracker.service: drop anca-elements/ from inotify watch
  (incremental syncs no longer queue these paths)
- offsite-sync-backup.sh: --exclude='anca-elements/' on the monthly
  full rsync; grep -v on the incremental files-from list

Deployed to 192.168.1.127:/usr/local/bin/offsite-sync-backup +
/etc/systemd/system/nfs-change-tracker.service; service reloaded.
2026-05-24 11:03:09 +00:00
Viktor Barzin
4713c3a6d9 k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore
Three changes unblocking the autonomous chain for k8s patch upgrades:

1. **phase_master quiesces tigera-operator before drain, restores after.**
   Tigera crashes immediately if apiserver is unreachable (no retry logic)
   and crashlooping it during master static-pod swaps generates ~500MB/s
   disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
   limit. Quiesce removes the storm contributor; calico data plane keeps
   running unchanged (data plane is the DaemonSet+Typha, operator is just
   the reconciler).

2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
   For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
   writes an identical manifest, hash doesn't update, watch times out and
   rolls back forever. The skip-etcd retry sidesteps it for the legitimate
   no-change case while still doing a full etcd upgrade on the first
   attempt (correct for minor-version bumps).

3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
   Both are symptoms-not-causes: ingress latency spikes briefly during any
   pod-restart wave; high IOwait is exactly what upgrade activity causes
   (chicken-and-egg). The inline quiet-baseline check (Ready transition
   <10min) is the real cluster-churn gate.

RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.

These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:40:11 +00:00
Viktor Barzin
4830230984 cluster-health #43: tighten PVE thermal threshold to 65 C
Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a
signal a VM/workload is using too much CPU and warrants investigation.

Previous thresholds were calibrated to the hardware's TjMax (75/83 C) —
that was too lax, since cluster-load-driven elevation arrives a long
time before throttling. The 65 C cutoff matches the live Prometheus
baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the
session-observed correlation: above 65 C means the cluster is doing
sustained work that should be looked at, even if hardware is still
nowhere near its limit.

Updated:
  PASS  < 65 C   (within 55-65 baseline)
  WARN  65-82 C  (elevated; check top kvm processes for the culprit)
  FAIL  >= 83 C  (at/above TjMax — throttling imminent)

Verified live: 67 C now WARN (was PASS under the 75 C threshold).
2026-05-22 14:09:08 +00:00
Viktor Barzin
8228171104 cluster-health: add checks 43 + 44 (PVE host thermals + load)
Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL
via the standard healthcheck output + JSON. They run alongside the
existing 42 checks and surface the same alerts the 2026-05-20/21
optimization session had to gather by hand.

#43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps
  Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip.
  Thresholds tuned to the live TjMax=83 / Tcrit=93:
    PASS  < 75 °C package
    WARN  75-82 °C  (approaching max, action time)
    FAIL  >= 83 °C  (at/above TjMax, throttling imminent)
  Reports hottest core label too so a single hot core doesn't hide in
  the package average.

#44 PVE Host Load — load avg vs 44-thread capacity
  Reads /proc/loadavg, compares 5-min to thread count (44):
    PASS  load_5 < 30   (< 70% threads busy)
    WARN  30-37         (oversubscribed but not saturating)
    FAIL  >= 38         (~85%+ threads busy — scheduler saturation)
  Uses 5-min so brief work spikes don't false-fail.

Both gracefully WARN-degrade if SSH BatchMode fails, matching the
existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped
42 -> 44 and the dispatcher updated.
2026-05-22 09:55:11 +00:00
Viktor Barzin
2dc7e001bd k8s-version-upgrade: retry kubeadm apply on static-pod-hash timeout
kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap
to be picked up by the kubelet (it polls the pod's
`kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted
master with apiserver-to-kubelet status sync lagging, that 5min isn't
enough — kubeadm declares the upgrade failed and rolls back.

The thing is: the etcd container HAS already been swapped to the new
image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0
when this fires). kubeadm's check is just slow to notice. The 2nd
attempt sees etcd already on target, skips it, and proceeds cleanly.

Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between.
Worker phase doesn't need this — `kubeadm upgrade node` has no
static-pod-hash waits.

Today's autonomous-pipeline session: master phase Failed at 5m on
attempt #1 with this exact error, retried, hit same timeout, gave up
(backoffLimit=1). The wrapper turns this from a fatal pipeline halt
into a "wait a bit, try again" that usually completes on attempt #2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:32:29 +00:00
Viktor Barzin
7a1751a668 upgrade-state: filter transient registry digest-check errors
Keel polls ~175 image manifests hourly against public registries.
Transient i/o timeouts and registry 5xx responses are inherent at
that scale and auto-recover on the next poll, but they were tripping
the Apps row into ⚠ attn — pure noise.

Extend benign_re to cover:
  - failed to check digest + (i/o timeout | connection refused
    | connection reset | context deadline exceeded | TLS handshake
    timeout | no such host | EOF)
  - failed to check digest + non-successful response (status=5xx)

Real actionable digest-check failures (HTTP 401 auth, 404 removed
tag) still surface. Persistent registry-side 5xx is owned by the
registry's own monitoring (forgejo-integrity-probe +
RegistryCatalogInaccessible), not by Keel logs.

Tested locally: Apps row flips from ⚠ attn → ✓ healthy after the
filter is in place; remaining errors-line drops to "(none in last
24h)".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:06:21 +00:00
Viktor Barzin
34c1d64a88 upgrade-state: suppress known-benign Keel slack-bot-not-configured noise
Keel 1.2.0 registers a Slack socket-mode bot whenever SLACK_BOT_TOKEN is
set, then fails because we don't supply an `xapp-` app-level token:

    bot.slack.Configure(): SLACK_APP_TOKEN must have the prefix "xapp-".
    bot.Run(): can not get configuration for bot [slack]

We don't want the interactive bot — opt-out auto-update + no approval flow
(see stacks/keel/main.tf comment). The Slack NOTIFICATION sender works
independently and continues posting rollout messages to #general fine.

But /upgrade-state's broad `grep level=error` was counting these as real
errors → ⚠ on the Apps row every run. Add a small skip-pattern list so the
two recurring benign lines drop out; any new genuine Keel error still
shows. Reuses `bot.Run()` + `SLACK_APP_TOKEN must have the prev?if|prefix`
(typo in Keel's actual log message preserved as alternation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:45:40 +00:00
Viktor Barzin
9a06a76883 k8s-version-upgrade: switch detection cron from weekly to daily
Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days
before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC,
still outside kured's 02:00-06:00 London window). Concurrency is
bounded by Forbid + deterministic job-name idempotency (the detection
job exits early if a preflight Job for the same target already exists),
so back-to-back days can't pile up parallel runs.

- stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment
- scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc
  (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label
  to "(daily cron)"
- .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 18:29:08 +00:00
Viktor Barzin
9e045e2c16 upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit
Three autonomous-upgrade pipelines run independently — Keel for apps
(hourly registry polling), unattended-upgrades+kured for OS, and the
k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there
was no single place to see whether each was healthy, what's pending,
or whether anything's stuck. The /upgrade-state skill collapses the
state of all three into one table you can run before each Sunday's
k8s-version-check fires.

- stacks/keel/main.tf: add Prometheus pod-annotation scrape on
  container port 9300. Surfaces pending_approvals,
  poll_trigger_tracked_images, and registries_scanned_total{image}
  so the skill has a real timeseries (also opens the door to a
  future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table
  (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2.
  SSH fan-out (parallel subshells) to all five nodes for apt
  state + reboot-required + uu log; Prometheus query for Keel;
  Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to
  ~/.claude/skills/upgrade-state/SKILL.md so the skill is
  discoverable from both monorepo-rooted and global sessions.

Verification: ran the script, stress-tested the ✗ stalled path by
pushing in_flight=1 + started_timestamp=-100min to Pushgateway and
resetting after — script correctly raised ✗ and exit 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 10:50:43 +00:00
Viktor Barzin
545cd2b854 healthcheck: probe uptime-kuma via internal Service (port-forward), not public URL
The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which
sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO
handshake the uptime-kuma-api library uses, and the library can't
complete the OAuth flow, so every healthcheck reported "Connection
failed" even though the pod was healthy and serving 225 monitors.

Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the
uptime-kuma namespace for the duration of the check, connect the
library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the
port-forward on the way out. The disown is to suppress bash's "Killed"
job notification on stderr, which corrupted stdout when stderr was
merged for JSON parsing.

Verified end-to-end: healthcheck now reports the real signal —
"external down(3): www, xray-vless, hermes-agent" — the same 3
Cloudflare-facing endpoints flagging in the uptime-kuma logs.
2026-05-11 20:02:57 +00:00
Viktor Barzin
0712a1b659 infra/scripts/tg: enforce ingress_factory auth-comment convention
Every `tg plan/apply/destroy/refresh` now runs
`scripts/check-ingress-auth-comments.py` against the current stack
before invoking terragrunt. The check fails closed if any
`auth = "app"` or `auth = "none"` line in the stack's .tf files lacks
an immediately-preceding `# auth = "<tier>": ...` comment documenting
what gates the app (for "app") or why the endpoint is intentionally
public (for "none").

Why tg-level (not git pre-commit): tg is the universal entry point
for all infra changes. CI runs it, headless agents run it, humans
run it. A pre-commit hook only catches the human path. Wiring the
check into tg means the anti-exposure guard fires regardless of who
or what is invoking terragrunt.

Stack-scoped: each stack documents itself the next time it's edited.
The 30+ existing `auth = "none"` stacks that predate this guard are
not blocked from operating today; they'll need the comment added the
next time someone runs `tg plan` on them — at which point the gate
forces a conscious "yes, this is intentional" moment before any
state change can land.

Skipped on: init, fmt, validate, output, etc. — anything that doesn't
read or write infra state.
2026-05-11 19:18:27 +00:00
Viktor Barzin
2f0e8c88a9 healthcheck: tune noise filters + nvidia-exporter auth=none
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":

1. prometheus_alerts: only count severity=warning|critical. Info-level
   alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
   alert rule itself sets severity; the script should respect it.

2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
   auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
   Lets Encrypt wildcard renews weekly; <14d is the only window where
   human attention is genuinely useful.

3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
   event/image/update domains (transient by design), skip friendly
   names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
   only count entities whose last_changed > 24h. Was 431/1470,
   most of which were "phone in standby" noise.

4. ha_automations: only flag DISABLED automations as abandoned if
   they've also been untouched (last_changed) for >180 days; raise
   stale threshold 30d → 180d. Was flagging seasonal/holiday-only
   automations as broken.

5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
   CronJob retry leftovers (Error/Failed phase pods that K8s keeps
   around for log inspection) aren't problematic at the cluster level.

6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
   shot failures were a recurring false-positive even though the
   service was healthy.

Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 22:27:39 +00:00
Viktor Barzin
a58d777059 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-10 19:07:42 +00:00
Viktor Barzin
c647791774 scripts: timeout rsync + sqlite calls in daily-backup
Per-PVC rsync had no timeout, so any single hung PVC (e.g. on a
corrupted snapshot or a sqlite held open by a writer) blocked the
whole script until systemd's 4h TimeoutStartSec kicked in, leaving
every later PVC silently unbacked. Today's run hung on
mailserver/roundcubemail-enigma-encrypted at 05:09 and didn't recover
— hence WeeklyBackupFailing alert.

Now:
- rsync per PVC: timeout 30 min, exit 124 logged separately
- sqlite3 per database: timeout 5 min
- /etc/pve rsync: timeout 5 min

Each timed-out PVC bumps PVC_FAIL but the loop keeps moving.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:39:07 +00:00
Viktor Barzin
64c71615e8 scripts: cluster_healthcheck defaults to ~/.kube/config
The previous default of $(pwd)/config required running the script from
the infra/ directory or always passing --kubeconfig. From a parent
shell or any other working directory, the lookup hit a non-existent
file and kubectl returned a stale-token error, masking real check
results.

Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to
$(pwd)/config for backwards compatibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:12:40 +00:00
Viktor Barzin
cfe969fe43 backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile
daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr
30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount,
which blocked the next run from completing — root cause of the WeeklyBackupStale
alert going silent (the metric never reached its end-of-script push).

Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting
  the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as
  belt-and-braces for any inherited stuck state from a prior crashed run
- TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of
  the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only on success;
  any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert
  threshold expired)

Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node
OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup
that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to
/srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end:
3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql
image is stripped (no curl/wget/python) — switched to docker.io/library/postgres
matching the dbaas/postgresql-backup pattern with apt-installed curl.

Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed
backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp). Updated to match what's actually emitted, and
added a "default-covered" footnote to the Service Protection Matrix so the
~40 services with PVCs not enumerated in the table are no longer ambiguous.

Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the
  loop got SIGTERMed against repeatedly, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
  validate the fix end-to-end

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 17:41:04 +00:00
Viktor Barzin
02a6a955f5 [woodpecker] Programmatic Forgejo repo registration
Earlier I claimed the OAuth Web UI flow was the only way to onboard
new Forgejo repos in Woodpecker. That's wrong.

Two parts to the actual workaround:
1. Woodpecker session JWTs are HS256 signed with the user's per-user
   `hash` column from the PG `users` table (NOT the global agent
   secret). Mint a session JWT for the Forgejo viktor user (id=2,
   forge_id=2), and you're authenticated as that user.
2. POST /api/repos?forge_remote_id=N as viktor → Woodpecker calls
   Forgejo with viktor's stored OAuth access_token to create the
   webhook + per-repo signing key. Works.

The 500 I saw earlier was from POST'ing as ViktorBarzin (GitHub
admin), whose user row has no Forgejo OAuth token — Woodpecker's
forge-API call fails for that user, surfacing as a 500.

scripts/woodpecker-register-forgejo-repo.sh wraps the whole flow:
extract hash from PG → mint JWT → activate repo. Verified against
viktor/{broker-sync,claude-agent-service,freedify,hmrc-sync} in
this session — all activated cleanly.

Also updated the runbook with the actual mechanism + the
WOODPECKER_FORGE_TIMEOUT=30s tip (the real root cause of the
'context deadline exceeded' failures, NOT the v3.14 upgrade).
2026-05-07 23:33:26 +00:00
Viktor Barzin
c2f8a300d4 [forgejo] Migration script: exclude empty repos, all-images full mode
Updated to handle the actual situation: wealthfolio-sync and
fire-planner have registry repos but no tags (broken/abandoned
deployments). Skip those with a SKIP marker. Migrate everything
else as a stop-gap until Woodpecker pipelines start producing
Forgejo images on their own.

The image list now covers all private images currently in scope.
2026-05-07 17:21:39 +00:00
Viktor Barzin
8b53d99e9a [forgejo] Add chrome-service-novnc:v4 to orphan-image migrator 2026-05-07 16:52:05 +00:00
Viktor Barzin
5d22b449f9 [forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.

What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
  Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
  v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
  in after the nginx→Traefik migration. Now creates a per-ingress
  Buffering middleware when set; default null = no limit (preserves
  existing behavior). Forgejo ingress sets max_body_size=5g to allow
  multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
  forgejo.viktorbarzin.me, populated from Vault secret/viktor/
  forgejo_pull_token (cluster-puller PAT, read:package). Existing
  Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
  Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
  Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
  for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
  package + always :latest. First 7 days dry-run (DRY_RUN=true);
  flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
  existing registry-integrity-probe. Existing Prometheus alerts
  (RegistryManifestIntegrityFailure et al) made instance-aware so
  they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.

Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.

Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 15:51:34 +00:00
Viktor Barzin
4c8d12229f mailserver: split healthcheck path off PROXY-aware listeners + book-search uses ClusterIP
Two coordinated fixes for the same root cause: Postfix's smtpd_upstream_proxy_protocol
listener fatals on every HAProxy health probe with `smtpd_peer_hostaddr_to_sockaddr:
... Servname not supported for ai_socktype` — the daemon respawns get throttled by
postfix master, and real client connections that land mid-respawn time out. We saw
this as ~50% timeout rate on public 587 from inside the cluster.

Layer 1 (book-search) — stacks/ebooks/main.tf:
  SMTP_HOST mail.viktorbarzin.me → mailserver.mailserver.svc.cluster.local
  Internal services should use ClusterIP, not hairpin through pfSense+HAProxy.
  12/12 OK in <28ms vs ~6/12 timeouts on the public path.

Layer 2 (pfSense HAProxy) — stacks/mailserver + scripts/pfsense-haproxy-bootstrap.php:
  Add 3 non-PROXY healthcheck NodePorts to mailserver-proxy svc:
    30145 → pod 25  (stock postscreen)
    30146 → pod 465 (stock smtps)
    30147 → pod 587 (stock submission)
  HAProxy uses `port <healthcheck-nodeport>` (per-server in advanced field) to
  redirect L4 health probes to those ports while real client traffic keeps
  going to 30125-30128 with PROXY v2.
  Result: 0 fatals/min (was 96), 30/30 probes OK on 587, e2e roundtrip 20.4s.
  Inter dropped 120000 → 5000 since log-spam concern is gone.

`option smtpchk EHLO` was tried first but flapped against postscreen (multi-line
greet + DNSBL silence + anti-pre-greet detection trip HAProxy's parser → L7RSP).
Plain TCP accept-on-port check is sufficient for both submission and postscreen.

Updated docs/runbooks/mailserver-pfsense-haproxy.md to reflect the new healthcheck
path and mark the "Known warts" entry as resolved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 19:45:33 +00:00
Viktor Barzin
4315ed5c2a [backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count)
cmd_prune_count's `log "  Pruned: ..."` wrote to stdout, which the
caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward
(7d retention kicked in), pruned snapshots polluted the captured value
with multi-line log text, breaking the Prometheus exposition format
on the metric push (`lvm_snapshot_pruned_total ${pruned}` → 400 from
Pushgateway). Snapshots themselves were always fine; only the metric
push silently failed for ~9 nights, eventually triggering
LVMSnapshotNeverRun (alert has 48h `for:`).

Fix: redirect the inner log call to stderr so cmd_prune_count's stdout
contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh`
as the source-of-truth (was edited only on the PVE host) and updates
backup-dr.md to point at the .sh and document the scp deploy.

Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:30:58 +00:00
Viktor Barzin
3eb8b9a4ea ci: add vault CLI to infra-ci image + surface real errors in scripts/tg
The Woodpecker CI pipeline has been silently failing to apply Tier 1
stacks since the state-migration commit e80b2f02 because the Alpine
CI image never had the vault CLI. `scripts/tg` swallowed stderr with
`2>/dev/null` and surfaced a misleading "Cannot read PG credentials
from Vault" message — the real error was `sh: vault: not found`.

Verified with an in-cluster probe: woodpecker/default SA + role=ci
already gets the terraform-state policy and has read capability on
database/static-creds/pg-terraform-state. Auth was never the problem;
the vault binary just wasn't there.

- ci/Dockerfile: pin vault v1.18.1 (matches server) and install
- scripts/tg: pre-flight check + surface real vault output on failure
- Next build-ci-image.yml run rebuilds :latest with vault included;
  subsequent default.yml runs unblock monitoring apply (code-aoxk)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:46:50 +00:00
Viktor Barzin
4bedabb9e8 healthcheck: fix three false-positive WARNs (HA token, cert-manager, LVM snap grep)
- HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when
  HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset. Default URL =
  https://ha-sofia.viktorbarzin.me.
- cert-manager: add cert_manager_installed() probe (kubectl get crd
  certificates.cert-manager.io). When not installed — which is our current
  state — report PASS "N/A" instead of noisy WARN "CRDs unavailable".
- LVM snapshot freshness: grep pattern was `-- -snap` but actual LV names use
  underscore (`foo_snap_YYYY...`), so the grep matched nothing and the check
  always WARN'd. Fixed to `grep _snap`.

After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real
HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).
2026-04-19 22:13:32 +00:00
Viktor Barzin
a0d770d9a7 [cluster-health] Expand to 42 checks, remove pod CronJob path
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
  readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
  monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
  +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
  to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
  lines) and the openclaw cluster_healthcheck CronJob (local CLI is
  now the single authoritative runner). Keep the healthcheck SA +
  Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
  the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
  the script first, refreshes the 42-check table, drops stale
  CronJob/Slack/post-mortem sections, documents the monorepo-canonical
  + hardlink layout. File is hardlinked to
  /home/wizard/code/.claude/skills/cluster-health/SKILL.md for
  dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:13:03 +00:00
Viktor Barzin
9806d515dd [mailserver] Phase 4+5 — pfSense HAProxy cutover for all 4 mail ports [ci skip]
## Context (bd code-yiu)

Cutover of external mail traffic from the MetalLB LB IP path (ETP:Local,
pod-speaker colocation) to pfSense HAProxy + PROXY v2 (ETP:Cluster). Real
client IP now preserved end-to-end on ports 25/465/587/993, both for
postscreen anti-spam scoring and CrowdSec auth-failure bans.

## This change

### k8s (stacks/mailserver/modules/mailserver/main.tf)

- `mailserver-user-patches` ConfigMap's `user-patches.sh` now appends 3
  alt PROXY-speaking services to master.cf:
  - `:2525` postscreen (alt :25)
  - `:4465` smtpd (alt :465 SMTPS, wrappermode TLS)
  - `:5587` smtpd (alt :587 submission)
  All with `postscreen_upstream_proxy_protocol=haproxy` / `smtpd_upstream_proxy_protocol=haproxy`.
  Mirror stock submission/submissions options (SASL via Dovecot, TLS,
  client restrictions, mua_sender_restrictions). chroot=n so the SASL
  socket path `/dev/shm/sasl-auth.sock` resolves outside the chroot.
- `dovecot.cf` ConfigMap adds:
  ```
  haproxy_trusted_networks = 10.0.20.0/24
  service imap-login { inet_listener imaps_proxy { port=10993; ssl=yes; haproxy=yes } }
  ```
  Stock :993 stays PROXY-free for internal Roundcube/probe clients.
- Container ports: 4 new (4465, 5587, 10993, 2525 already there).
- `mailserver-proxy` NodePort Service now exposes all 4 ports:
  25→2525→30125, 465→4465→30126, 587→5587→30127, 993→10993→30128
  (ETP:Cluster).

### pfSense (scripts/pfsense-haproxy-bootstrap.php)

Rebuilt to declare 4 backend pools (one per NodePort) and 4 production
frontends on `10.0.20.1:{25,465,587,993}` TCP mode, plus the legacy
`:2525` test frontend. All pools: `send-proxy-v2 check inter 120000`.
Idempotent — re-runs converge on declared state.

### pfSense (scripts/pfsense-nat-mailserver-haproxy-{flip,unflip}.php)

Flip script: updates `<nat><rule>` entries for mail ports from target
`<mailserver>` alias (10.0.20.202 MetalLB) → `10.0.20.1` (pfSense
HAProxy). Runs `filter_configure()` to rebuild pf rules. Unflip is the
rollback. Both scripts are idempotent.

## What is NOT in this change

- Phase 6 (decommission MetalLB LB path, downgrade mailserver Service
  from LoadBalancer to ClusterIP, free 10.0.20.202) — USER-GATED. Do
  NOT run until explicit approval.
- Legacy MetalLB `mailserver` LB still live on 10.0.20.202 with stock
  ETP:Local ports — functional backup path + consumed by internal
  clients that hit `mailserver.mailserver.svc.cluster.local` (routes
  via ClusterIP layer of the LB Service, bypassing ETP).
- Port :143 (plain IMAP) — no HAProxy frontend; stays on MetalLB via
  unchanged NAT rule.

## Test Plan

### Automated (verified pre-commit 2026-04-19)
```
# k8s container listens on all 8 ports
$ kubectl exec -c docker-mailserver deployment/mailserver -n mailserver \
    -- ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'
... all 8 listening ...

# pfSense HAProxy listens on all 5 (production + legacy test)
$ ssh admin@10.0.20.1 'sockstat -l | grep haproxy'
www  haproxy  49418  5   tcp4  *:25
www  haproxy  49418  6   tcp4  *:2525
www  haproxy  49418  10  tcp4  *:465
www  haproxy  49418  11  tcp4  *:587
www  haproxy  49418  12  tcp4  *:993

# Post-flip: pf rdr rules point at pfSense, not <mailserver>
$ ssh admin@10.0.20.1 'pfctl -sn' | grep 'smtp\|sub\|imap\|:25'
rdr on vtnet0 ... port = submission -> 10.0.20.1
rdr on vtnet0 ... port = imaps -> 10.0.20.1
rdr on vtnet0 ... port = smtps -> 10.0.20.1
rdr on vtnet0 ... port = 25 -> 10.0.20.1

# 4 HAProxy frontends reachable + SMTP/IMAP banners
$ python3 <test script> → SMTP/SMTPS/Sub/IMAPS all respond correctly

# Real client IP in maillog for external delivery via Brevo → MX
postfix/smtpd-proxy25/postscreen: CONNECT from [77.32.148.26]:36334 to [10.0.20.1]:25
postfix/smtpd-proxy25/postscreen: PASS NEW [77.32.148.26]:36334

# E2E probe (Brevo HTTP → external SMTP delivery → IMAP fetch) succeeds
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-yiu-flip -n mailserver
... Round-trip SUCCESS in 20.3s ...

# Internal Roundcube path unchanged
$ curl -sI https://mail.viktorbarzin.me/  →  302 (Authentik gate intact)

# No mail alerts firing
$ kubectl exec prometheus-server ... /api/v1/alerts | grep Email  →  (empty)
```

### Rollback
```
scp infra/scripts/pfsense-nat-mailserver-haproxy-unflip.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 'php /tmp/pfsense-nat-mailserver-haproxy-unflip.php'
```
Immediate (<2s). Flips all 4 NAT rdrs back to `<mailserver>` alias.
Pre-flip config snapshot also saved at
`/tmp/config.xml.pre-yiu-flip.20260419-1222` on pfSense.

## Phase roadmap (bd code-yiu)

| Phase | Status |
|---|---|
| 1a |  commit ef75c02f  — alt :2525 listener + NodePort |
| 2  |  2026-04-19      — HAProxy pkg installed on pfSense |
| 3  |  commit ba697b02 — HAProxy config persisted in pfSense XML |
| 4+5|  **this commit** — 4-port alt listeners + HAProxy frontends + NAT flip |
| 6  | ⏸ USER-GATED      — MetalLB LB decommission after 48h observation |
2026-04-19 12:24:50 +00:00
Viktor Barzin
ba697b02a2 [mailserver] Phase 2-3 — pfSense HAProxy bootstrap + runbook [ci skip]
## Context (bd code-yiu)

Phase 2 (HAProxy on pfSense) and Phase 3 (persist config in pfSense XML so
it lives in the nightly backup) of the PROXY-v2 migration. Test path only —
listens on pfSense 10.0.20.1:2525 → k8s node NodePort :30125 → pod :2525
postscreen. Real client IP verified in maillog
(`postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:...`), Phase 1a
container plumbing is already live (commit ef75c02f).

pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is captured daily by
`scripts/daily-backup.sh` (scp → `/mnt/backup/pfsense/config-YYYYMMDD.xml`)
and synced offsite to Synology. No new backup wiring needed — this commit
documents the fact + adds the reproducer script.

## This change

Two files, both additive:

1. `scripts/pfsense-haproxy-bootstrap.php` — idempotent PHP script that
   edits pfSense config.xml to add:
   - Backend pool `mailserver_nodes` with 4 k8s workers on NodePort 30125,
     `send-proxy-v2`, TCP health-check every 120000 ms (2 min).
   - Frontend `mailserver_proxy_test` listening on pfSense 10.0.20.1:2525
     in TCP mode, forwarding to the pool.
   Uses `haproxy_check_and_run()` to regenerate `/var/etc/haproxy/haproxy.cfg`
   and reload HAProxy. Removes existing items with the same name before
   adding, so repeat runs converge on declared state.

2. `docs/runbooks/mailserver-pfsense-haproxy.md` — ops runbook covering
   current state, validation, bootstrap/restore, health checks, phase
   roadmap, and known warts (health-check noise + bind-address templating).

## What is NOT in this change

- Phase 4 (NAT rdr flip for :25 from `<mailserver>` → HAProxy) — deferred.
- Phase 5 (extend to 465/587/993 with alt listeners + Dovecot dual-
  inet_listener) — deferred.
- Terraform for pfSense HAProxy pkg install — not possible (no Terraform
  provider for pfSense pkg management). Runbook documents the manual
  `pkg install` command.

## Test Plan

### Automated
```
$ ssh admin@10.0.20.1 'pgrep -lf haproxy; sockstat -l | grep :2525'
64009 /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D
www  haproxy  64009 5 tcp4  *:2525  *:*

$ ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
    | awk 'NR>1 {print $4, $6}'
node1 2
node2 2
node3 2
node4 2        # all UP

$ python3 -c "
import socket; s=socket.socket(); s.settimeout(10)
s.connect(('10.0.20.1', 2525))
print(s.recv(200).decode())
s.send(b'EHLO persist-test.example.com\r\n')
print(s.recv(500).decode())
s.send(b'QUIT\r\n'); s.close()"
220-mail.viktorbarzin.me ESMTP
...
250-mail.viktorbarzin.me
250-SIZE 209715200
...
221 2.0.0 Bye

$ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver --tail=50 \
    | grep smtpd-proxy.*CONNECT
postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525
```

Real client IP `[10.0.10.10]` visible (not the k8s-node IP after kube-proxy
SNAT) → PROXY-v2 roundtrip confirmed.

### Manual Verification
Trigger a pfSense reboot; after boot, HAProxy should auto-restart from the
now-persisted config (`<enable>yes</enable>` in XML). Connection test above
should still work.

## Reproduce locally
1. `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/`
2. `ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'` → rc=OK
3. `python3 -c '...' ` SMTP roundtrip test above.
2026-04-19 12:07:47 +00:00
Viktor Barzin
42f1c3cf4f [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP
## Context

The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.

## This change

Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
  of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
  construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
  Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
  secret/n8n)

Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated

## What is NOT in this change

- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)

[ci skip]

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 10:12:02 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
f726d1c3fd fix: stash local changes before git pull in CI pipelines
DevVM may have unstaged changes from active sessions. Use git stash
before pull to avoid 'cannot pull with rebase: unstaged changes' errors.
Stash pop after to restore working state.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:37:10 +00:00
Viktor Barzin
de42acd68e fix: backup LUKS rsync tolerance, stale mapping cleanup, tier-4-aux quota bump
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
  noload mounts — in-flight writes have corrupt metadata from skipped
  journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
  runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)

Verified: backup now completes status=0 (96 PVCs OK, 0 failed)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:21:51 +00:00
Viktor Barzin
92495d0fc3 fix: start Claude from ~/code to load root CLAUDE.md correctly
Both issue-automation and postmortem pipelines were cd'ing into
~/code/infra before running Claude, missing the root CLAUDE.md
with beads config and project-wide instructions. Now cd to ~/code
and use relative agent paths from there.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:15:32 +00:00
Viktor Barzin
9baefa22ab fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape
- technitium-password-sync: remove RWO encrypted PVC mount that caused
  pods to stick in ContainerCreating on wrong nodes. Plugin install now
  warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
  using /root/.luks-backup-key. Uses noload mount option to skip ext4
  journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
  endpoint exists, causing ScrapeTargetDown alert).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:12:32 +00:00
Viktor Barzin
36454b87d1 feat: CI/CD performance overhaul
- New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4,
  git-crypt, sops, kubectl pre-installed. Pushed to private registry.
  Eliminates 17 apk add calls + binary downloads per pipeline run.

- Unified CI pipeline: merge default.yml + app-stacks.yml into one.
  Changed-stacks-only detection (git diff, with global-file fallback).
  Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4).
  Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR).

- Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale
  lock detection. Blocks concurrent applies to same stack.

- TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev.

- Daily drift detection pipeline (.woodpecker/drift-detection.yml).
  Runs terraform plan on all stacks, Slack alert on drift.

- CI image build pipeline (.woodpecker/build-ci-image.yml).

Expected speedup: ~5-10 min per pipeline run → ~2-4 min.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:22:26 +00:00
Viktor Barzin
bcad200a23 chore: add untracked stacks, scripts, and agent configs
- New stacks: beads-server, hermes-agent
- Terragrunt tiers.tf for infra, phpipam, status-page
- Secrets symlinks for vault, phpipam, hermes-agent
- Scripts: cluster_manager, image_pull, containerd pullthrough setup
- Frigate config, audiblez-web app source, n8n workflows dir
- Claude agent: service-upgrade, reference: upgrade-config.json
- Removed: claudeception skill, excalidraw empty submodule, temp listings

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:33:06 +00:00
Viktor Barzin
24a23709a5 fix: update healthcheck to report internal and external monitors separately
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:44:20 +00:00
Viktor Barzin
4498f61402 fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14]
- scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with
  detailed comments explaining fsid=0 danger and NFSv3 disable rationale.
  Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra
- scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts.
  Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running,
  no active exports. Warns but doesn't abort (block-storage PVC backups can still run).
- .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory,
  fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:08:24 +00:00
Viktor Barzin
b1b408ff0e fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
Viktor Barzin
f2e7367401 fix: use sh instead of bash in pipeline (Alpine compat) 2026-04-14 17:29:14 +00:00
Viktor Barzin
c742fa3dfb fix: scan all post-mortems for TODOs (no git diff needed) 2026-04-14 17:14:22 +00:00
Viktor Barzin
59367cc588 fix: handle Woodpecker shallow clone in postmortem pipeline 2026-04-14 17:12:02 +00:00
Viktor Barzin
8540f48a28 fix: move pipeline logic to shell script (avoid YAML quoting issues) 2026-04-14 16:46:42 +00:00
Viktor Barzin
8badb8181a feat: post-mortem automation pipeline
E2E workflow for incident post-mortems:
1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:
- .claude/skills/post-mortem/ — writer skill + template
- .claude/agents/postmortem-todo-resolver.md — headless agent
- .woodpecker/postmortem-todos.yml — CI pipeline
- scripts/parse-postmortem-todos.sh — TODO extractor
- cluster-health skill — auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types.
Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:34:42 +00:00