The pfSense NAT rdr rules added in f7cf9f07 hardcoded 10.0.20.203
(Traefik LB IP) as the redirect source. That couples mail's LAN
path to Traefik's IP choice — if Traefik moves again (it just
moved .200 → .203 on 2026-05-30), the mail path silently breaks.
Removing the script and the matching doc paragraph; keeping the
networking.md .200 → .203 staleness fix (separate correction).
Follow-up: give the mail HAProxy listener a dedicated pfSense
Virtual IP (IP Alias on opt1), update Technitium internal zone
+ WAN port-forwards to target the VIP, so mail's LAN-side path
is decoupled from any other service's LB IP.
Reconciles the tripit stack source with live state and adds the forward
flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a
rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded),
filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's
Authentik login identity). Adds an ingest-plans CronJob that polls spam@
filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target)
so forwarded bookings are extracted and attached to the matching trip;
IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Technitium's split-horizon rewrites *.viktorbarzin.me to 10.0.20.203
(Traefik LB) for the 192.168.1.0/24 Barzini WiFi (TP-Link router has
no hairpin NAT). The rule is name-agnostic so mail.viktorbarzin.me
(and imap./smtp.) get sent to .203 too — where Traefik does not
listen on 25/465/587/993. iOS Mail on Barzini WiFi silently hangs
while Roundcube (port 443 via Traefik) keeps working.
Adds pfSense NAT rdr rules so traffic to 10.0.20.203:{25,465,587,993}
gets redirected to 10.0.20.1 (the mail HAProxy listener already
serving the public path). Loaded on every incoming interface by
pfSense rule generation, so any LAN/VPN client falling into the
split-horizon answer lands on the right service unchanged.
Includes idempotent reproducer script (mirrors the existing
pfsense-haproxy-bootstrap.php pattern) and the networking.md
mail carve-out paragraph plus the stale .200 → .203 reference.
The service now runs agent calls concurrently (bounded semaphore, per-job
isolated clones) instead of single-flight. Infra side:
- mount git-crypt-key into the main container (each job re-unlocks its own clone)
- MAX_CONCURRENCY=10 env (excess calls queue FIFO)
- bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized
for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom
- docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot
Service code: viktor/claude-agent-service @ 66104a3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
immich-ml at TTL=0 never unloaded models; a heavy OCR library job
inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared
time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder
triage 502'd silently for hours (emails preserved unseen, no loss).
TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded
CLIP/smart-search stays warm.
Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM
isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in
.claude/CLAUDE.md, and a post-mortem.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The offsite Synology hit 97% — the Backup share grew +670G in a week, traced
to the 2026-05-26 change that began mirroring large regenerable services
offsite, plus an unbounded nextcloud.log bloating its backups to 87G.
- nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook
(regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies).
- offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the
SSD no longer ship offsite (re-pullable models).
- daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption
PV, still backed up weekly).
- nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup
now excludes html/ (app code, from image), logs, and preview cache and keeps
only the latest copy (pvc-data holds version history) → <5G (was 87G).
- nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved
the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to
32.0.3; reconciling that drift this session rolled a 32.0.3 pod that
CrashLooped on the downgrade. Pinning eliminates the drift.
Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating:
0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the
pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle,
and the immortal bash loop slowly leaks (kubectl forks + Check-4 process
substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so
the pod never restarts — just silent oom_events.
Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a
MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak
can never accumulate, regardless of how long a node stays pending-reboot.
Docs: post-mortem + automated-upgrades.md gate note.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network
partition, hit the init script's deterministic "pod-0 = bootstrap master"
fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2.
HAProxy's `expect rstring role:master` matched both and round-robined client
connections across the two diverging masters, so Immich enqueued BullMQ jobs on
one while its workers blocked-popped on the other -> every queue wedged and
new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6
weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).
Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy +
init bootstrap configmap + both PDBs; redis container only (+ exporter).
maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both
workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich
BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer
edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).
Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source +
docs only, hence [ci skip].
Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas
now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop.
Docs: rewrite databases.md Redis section (single-instance design + incident
history); add post-mortem 2026-05-30-redis-split-brain.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Large Immich video downloads and uploads failed at a hard ~60s wall. The
websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike
nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps
on total request/response duration, so every transfer slower than 60s was cut
mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s
with an HTTP/2 stream reset.
- writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance
assumes): unlimited download size/duration.
- readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop
(Immich has no resumable upload, so the window must exceed real upload times).
Verified: the same 650MB download now completes fully (650MB / 102s, exit 0).
IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are
inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative
state); this commit syncs source + docs only, hence [ci skip].
Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting),
.claude/CLAUDE.md networking note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add tripit (self-hosted TripIt-clone travel-itinerary PWA) to the
service catalog Optional tier and Non-Proxied DNS list, and to the
CNPG consumer + PostgreSQL rotation lists in the databases doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client
as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so
real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2
only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients
(ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through
the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC
over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh
(config.xml shellcmd), keeping the nginx-off-[::] patch.
Also fixes stale networking.md: Traefik was still documented on the
shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pin immich-server to the GPU node with a time-sliced nvidia.com/gpu slice
so ffmpeg uses hardware NVENC encode + NVDEC decode instead of software.
This frees the ~3-4 CPU cores the software transcoder was burning inside
the request-serving pod (which was slowing thumbnail/photo browsing), and
makes incompatible (HEVC/iPhone) videos playable in seconds. Activation is
ffmpeg.accel=nvenc + accelDecode=true in the DB system-config (Immich app
config is DB-managed here, like oauth/smtp — not Terraform).
Also give immich-frame the same Keel ignore_changes immich-server already
has, so an untargeted apply no longer churns it (pre-existing drift).
Docs: .claude/CLAUDE.md Immich row + compute.md GPU-workloads list.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add new "Data Routing" flowchart up front showing which paths go
where (sda mirror vs Synology-direct vs not-backed-up).
- Overall Backup Flow: split Layer 2 into 2a (nfs-mirror daily 02:00)
and 2b (daily-backup 05:00); show nfs-mirror as an explicit
component; clarify Step 2 is immich-only direct + nfs-ssd.
- Weekly Backup Timeline → Daily Backup Timeline: actual schedule
(00:00 LVM, 00:15 PG, 00:45 MySQL, 02:00 nfs-mirror, 05:00 daily-
backup, 06:00 offsite-sync, 12:00 second LVM); explicit inotify
feeding Step 2.
- Physical Disk Layout: current capacity numbers + dual sdc→sda and
sdc→Synology arrows (immich-only) reflecting the two-leg design.
- Restore Decision Tree: refreshed age tiers (< 12h LVM, 12h-4w sda,
> 4w Synology) + dedicated branch for immich photos (which only
have an offsite copy).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Steady-state delta runs in 10-20 min and the weekly cadence left a
real RPO gap: app data under /srv/nfs/<svc>/ that isn't a PVC
(captured by daily-backup) or a *-backup CronJob (captured daily by
the CronJob writing to /srv/nfs/<svc>-backup/) was on a 7-day worst
case for off-disk durability. Affected paths include nextcloud shared
files, audiobookshelf library, mailserver Maildir, calibre, servarr
metadata, real-estate-crawler scraped data, openclaw agent state.
Daily cadence drops their RPO to ~24h at negligible cost.
Slot: 02:00, 3h ahead of daily-backup (05:00) so the manifest is
populated before offsite-sync reads it at 06:00.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously /srv/nfs/{ollama,audiblez,ebook2audiobook,*-backup} took
the sdc → Synology direct leg. They now ride sdc → sda → Synology
pve-backup/ via nfs-mirror like every other NFS subtree, so sda
becomes the single canonical mirror and Synology only has to ingest
one feed for the bulk of cluster state.
frigate + temp dropped from BOTH legs (no backup anywhere) per
explicit user ask — frigate is a 14d camera ring, temp is scratch.
prometheus/loki/alertmanager dropped as no-op (orphan dirs that
no longer exist on /srv/nfs).
Also: nfs-mirror's manifest collection switched from find -newer
(mtime) to find -cnewer (ctime) — rsync -t preserves source mtime
on dest, so freshly-written files looked "older than \$STAMP" and
the 2026-05-26 full mirror run captured only 2 of 800k transferred
files. Hit during this session, recovered via .force-full-sync.
Operational result post-rollout:
- sda 87% → 70% (anca-elements 423G deleted, +260G new dirs)
- /Viki/nfs/ on Synology: was 24 stale dirs (~430G), now immich only
- Synology free: ~300G → ~430G+ once btrfs reclaim catches up
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reflects the 2026-05-26 decision (commit 44c3770a) to keep Linux VMs
out of Terraform — telmate/proxmox v3.0.2 mangles dynamically-attached
disks (id=539) and doesn't refresh mbps_*_concurrent back from live
state. What stays in TF: the cloud-init templates. Per-VM I/O caps
now driven by the apply-mbps-caps systemd timer (commit 56a338f8).
Replaces the stale note about iSCSI mangling — that rationale is
obsolete (iSCSI gone since 2026-04-11) and the new scope is
intentional, not provisional.
The proxmox-csi-plugin hardcodes a 29-disks-per-VM ceiling in
pkg/csi/utils.go:394 (lun < 30 loop). This is the actual block-
storage scaling bottleneck — NOT QEMU, NOT Proxmox, NOT the kernel.
Adds a "Per-VM SCSI-LUN cap" section to docs/architecture/storage.md
explaining:
- the source-level hardcode and how to recognise it (FailedAttachVolume
"no free lun found")
- why switching scsihw to virtio-scsi-single buys ZERO additional
capacity (perf-only)
- levers in leverage-per-effort order (migrate non-DB to NFS,
add a worker, fork+patch the plugin)
- the Wave 1 NFS migration (2026-05-26) that took 5 services off
block and skipped two more on pre-flight (plotting-book SQLite+WAL,
stirling-pdf H2 .mv.db)
Discovered during the Wave 1 work — see remote memory ids 2788+ for
full context and 2798+ for the related postiz state-drift discovery.
Three immediate fixes surfaced by the backup-pipeline audit:
1. **S1 silent-loss race fix** (daily-backup.sh:142): remove the
`> "${MANIFEST}"` truncation at the start of daily-backup. Truncation
already lives in offsite-sync-backup at line 159, gated on a successful
sync. With both scripts truncating, an offsite-sync failure followed by
the next morning's daily-backup would silently wipe yesterday's
unconsumed manifest entries — those files would only reach Synology
via the monthly full sync (1st-7th of month). Now only offsite-sync
truncates, and only on success.
2. **Missing alert OffsiteBackupSyncFailing**: documented in backup-dr.md
but was never added to prometheus_chart_values.tpl. Step 1 or Step 2
failure pushes offsite_sync_last_status=1 but nothing read it. Added.
3. **wear: drop `-z` from local-only rsyncs** (daily-backup.sh:218 PVC
snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda
transfers — compression wastes CPU and yields nothing (gigabit local
path, intermediate disk doesn't benefit).
Bonus cleanups (zero functional impact):
- "Weekly backup starting/complete" → "daily-backup starting/complete"
(the timer is daily, not weekly — legacy from earlier monthly-rotation
schedule).
- "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no
Step 1 above).
- **wear: pfSense full filesystem tar now Sunday-only** instead of daily.
config.xml stays daily (it's the primary restore artifact and tiny).
Full tar is forensic recovery only — re-tarring ~100MB+ daily writes
~3G/month to sda + Synology for unchanged content. Weekly is plenty.
docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to
reflect today's two-leg architecture; added a "2026-05-24 session"
changelog summary at the top; added a "Synology snapshot management"
subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated
by 2FA so this is the only programmatic path); updated Key Files table
with nfs-mirror + the Synology SSH access notes.
Open follow-ups from the audit (S2 — file as beads if pursued):
- Factor two-leg invariant into /etc/backup-skip-list.conf sourced by
both nfs-mirror.sh and offsite-sync-backup.sh.
- Manifest write-collision flock between nfs-mirror Mon 04:11 and
daily-backup Mon 05:00.
- Unbounded manifest cap (force full sync if > 500k lines).
- Synology free-space scraper + alert.
- LVM thin pool meta-pool fill alert.
- nfs-change-tracker.service heartbeat to Pushgateway.
- Synology config drift TF surface (snap retention, share defs).
Grows pve/nfs-data 3T → 4T (online lvextend + resize2fs) to absorb ~340 GB
of new originals landing under /srv/nfs/immich/upload during the import.
Adds:
- module "nfs_anca_elements_host" — RO PVC over /srv/nfs/anca-elements,
consumed only by the import Job (not mounted in immich-server).
- kubernetes_job_v1.anca_elements_import — immich-go v0.31.0 uploader
posting to immich-server.immich.svc:2283 with Anca's API key (synced
via the existing immich-secrets ExternalSecret from
secret/immich.anca_api_key). Filters to image extensions, bans the
non-photo top-level dirs (filme/, Music/, carti/, courses, installers,
docs, etc.), puts every asset in the album "Poze (Elements)". Default
`--pause-immich-jobs` is disabled — non-admin keys can't pause jobs.
- docs/architecture/storage.md — note the new 4 TB size in 3 places.
- docs/runbooks/grow-pve-nfs-lv.md — captures the one-shot lvextend
procedure (no pve-host TF stack exists for this).
Job is removed in the follow-up cleanup commit once the upload completes;
the PVC stays for a videos batch later.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Before this commit, the in-flight design split anca-elements (its own
mirror script + timer) from the rest of /srv/nfs (still going to
Synology via inotify-tracked offsite-sync). It also meant Synology
received some bytes via both paths (sda → Synology AND direct NFS →
Synology), which doubled consumption.
This commit collapses both into a clean 3-2-1:
Copy 1 (sdc): live /srv/nfs/* + cluster block PVCs
Copy 2 (sda): /mnt/backup/{pvc-data,sqlite-backup,pfsense,
pve-config,<critical-nfs>/}
← daily-backup + nfs-mirror (one script each)
Copy 3 (Synology): /Backup/Viki/{pve-backup,nfs,nfs-ssd}
← offsite-sync-backup Step 1 (sda → Synology)
+ Step 2 (sda-BYPASS paths only → Synology direct)
scripts/nfs-mirror.{sh,service,timer}:
New consolidated weekly mirror. Replaces anca-elements-mirror (to be
removed in a follow-up after the current in-flight rsync completes,
parity-verified, and Synology source-of-truth is deleted). Single
rsync /srv/nfs/ → /mnt/backup/ with an explicit EXCLUDES list that
drops paths not worth a local 2nd copy: immich (1.2T — too big),
frigate (14d ring), prometheus/loki (rebuildable), ollama/llamacpp/
audiblez/ebook2audiobook (re-fetchable), *-backup (already backups),
temp/alertmanager (transient). Nice=10, IOSchedulingClass=idle.
scripts/offsite-sync-backup.sh:
Step 2 (NFS → Synology) filter inverted: instead of `--exclude=
anca-elements/`, it now `--include`s only the sda-BYPASS paths
(immich, frigate, prometheus, *-backup, …). The bypass-include
regex MUST stay in lockstep with nfs-mirror's EXCLUDES — they are
complementary and any drift creates either gaps or duplication on
Synology. Comment in the script flags this.
monitoring alerts: renamed AncaElementsMirror{Stale,Failing} to
NfsMirror{Stale,Failing} matching the new metric job name
`nfs-mirror`. Thresholds unchanged.
docs/architecture/backup-dr.md: rewritten Step 1/Step 2 sections and
added the bypass-list rationale + cross-reference between scripts.
NOT YET DEPLOYED — gated on the in-flight anca-elements-mirror rsync
finishing + parity verification + Synology /volume1/Backup/Anca/
Elements deletion. The old scripts (anca-elements-{mirror,sync.sh})
remain on the PVE host until then, and will be removed in a cleanup
commit.
Synology is being removed as a host for the Anca/Elements archive
(770G). /srv/nfs/anca-elements on PVE becomes the source of truth;
sda /mnt/backup/anca-elements becomes the single-disk-failure mirror.
No offsite for this archive — by design.
- scripts/anca-elements-mirror.sh: rsync -rlt --delete -H, idempotent,
pushes anca_elements_mirror_last_{run_timestamp,status,bytes} to
Pushgateway, lockfile in /run, SIGTERM-safe (status=2 on abort).
- .service: oneshot, Nice=10, IOSchedulingClass=idle, 5h timeout.
- .timer: weekly Mon 04:00, Persistent=true, 15-min randomised delay.
Deployed to PVE host; timer enabled; initial 770G sync running in
background. Synology original to be deleted after first run completes
and parity is verified.
docs/architecture/backup-dr.md: documents Layer 3a + updated path
exclusion rationale (PVE is now upstream, not downstream).
- docs/architecture/storage.md: new "Nextcloud as PVE-NFS browser"
section documenting mount-per-archive + applicable_users model,
why mount-level ACL beats Files Access Control on NC 30/31, the
manifest shape (with current applicableUsers + enableSharing
fields), and the trade-off
- docs/runbooks/nextcloud-add-archive.md: 5-step runbook to surface
a new directory under /srv/nfs/* to specific NC users via the
bootstrap Job
- scripts/anca-elements-sync.sh: deployed at
/usr/local/bin/anca-elements-sync.sh on the PVE host; fpsync from
Synology Anca/Elements to /srv/nfs/anca-elements (idempotent +
resumable). The PVE replica is what the NC /anca-elements mount
serves; the offsite-sync pipeline excludes this path (committed
earlier this session) so we don't write it back to Synology
NC usernames are admin/anca/emo (not display names — admin is
Viktor). Stale "viktor" references in the manifest example dropped.
Stragglers from the same drift as commit b288a59 (monorepo) / the
2026-05-22 viktorbarzin.me apex incident — the `.101` references were
left over from the NodePort exposure era. Technitium's actual MetalLB LB
IP is `.201` (in pool 10.0.20.200-220).
- architecture/vpn.md — Technitium component cell + AdGuard forwarder
example + nslookup troubleshooting hint
- architecture/networking.md — 502 ingress troubleshooting snippet
- plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers
example
First analysis pass over Calico GNP wave1-egress-observe-tier34 data captured
in Loki since 2026-05-19. Pulled ~10000 flow log lines covering 36 source
namespaces (of 82 selected by tier 3+4). Analysis script outputs preserved
on the dev host at /tmp/{analyze_flows2,build_allowlist}.py.
## Findings
**Universal baseline (every observed ns):**
- DNS to kube-system/kube-dns UDP/53
- Often mysql.dbaas TCP/3306 or pg.dbaas TCP/5432
- Often redis.redis TCP/6379
**Rollout tiering by egress fan-out:**
- Tier A (recruiter-responder only): 2 destinations, ideal pilot
- Tier B (29 namespaces): ≤3 external IPs, ≤5 internal — batch rollout
- Tier C (4 namespaces: f1-stream/openclaw/woodpecker/status-page):
needs per-IP investigation
- Tier D (servarr): 130+ external IPs (BitTorrent P2P) — keep Log+Allow
permanently or move to dedicated egress proxy
## Caveats blocking immediate enforce
- Observation horizon too short: ~6h dense data, ~24h total. Need ≥7 days
to catch weekly CronJobs, Vault token rotations, Keel pulls.
- External IPs are dynamic (Cloudflare/AWS rotate). Static IP allowlists
will break — need DNS-based selectors or CIDR ranges.
- Some intra-namespace traffic bypasses the Calico filter chain.
## Recommended next steps
1. Continue observation through 2026-05-29 (full week). Compare destination
set day-over-day; if stable, allowlist is ready.
2. First enforce: recruiter-responder (allowlist = kube-dns + telegram CIDR
+ vault/ESO service IPs).
3. Tier B phased rollout at 3-5 ns/day after pilot proves out.
Full analysis: docs/architecture/wave1-egress-observation-2026-05-22.md
Tracked under beads code-8ywc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico
Enterprise-only field, rejected by OSS v3.26) with the supported primitive:
Calico GlobalNetworkPolicy with `action: Log`.
## Mechanics (verified end-to-end on 2026-05-19)
1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder`
with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`,
`types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in
`cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing
SRC/DST/PROTO/PORT for every NEW egress connection.
## Verified output sample
`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132
DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`
The Allow rule in the GNP keeps egress functional (recruiter-responder
remained 1/1 Running through the apply — verified Python TCP connections to
1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).
## Wave 1 status
W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7
remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"`
samples, build empirical egress allowlist, flip the GNP rules from
`[Log, Allow]` to `[Allow <specific dests>, Deny]`.
Expand observation to additional namespaces by adding entries to
`spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Vault audit-tail sidecar (APPLIED + VERIFIED)
- Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with
`tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume
from the chart's auditStorage), emits JSON audit events to stdout. kubelet
captures the stdout; once Loki+Alloy are deployed (blocked on code-146x),
these logs flow automatically to Loki with `container="audit-tail"`.
- Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly.
- Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled
cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s).
- Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON
audit lines from ESO token issuance, KV reads, etc.
## Doc reality-check
While verifying logs reached Loki, discovered Loki is NOT actually deployed.
`stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but
has a self-referencing `depends_on = [helm_release.loki]` that prevented apply.
No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The
monitoring.md "Loki: deployed" claim was aspirational.
- security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on
code-146x)
- security.md W1.3 row: gated on code-146x added
- monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x
## New beads task
- code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug,
investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0).
## Wave 1 status update
- W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x
- W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR)
- W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Locked design for wave 1 of cluster security hardening. Plan only — implementation lives in beads
code-8ywc and follow-up commits. Captures:
- security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies
with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the
K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7,
S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy
Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4.
- monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1)
and the Loki ruler → Alertmanager → #security routing path.
- runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action
steps, false-positive triage, and SEV1 escalation.
- .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity
allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy,
rationale for not adopting canary tokens.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.
Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.
Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:
preflight (k8s-node1)
→ master (k8s-node1) drains k8s-master
→ worker × 4 (k8s-node1) drains k8s-node{4,3,2}
→ worker (k8s-master + control-plane toleration) drains k8s-node1
→ postflight (no pinning)
Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.
Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).
Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).
Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the legacy `protected = true` reference with the four-tier
`auth` enum that's been live for weeks. Document the anti-exposure
guard (`scripts/check-ingress-auth-comments.py` + `scripts/tg`)
that enforces the inline-comment convention. Fix two stale paths:
- `stacks/platform/modules/ingress_factory/` → `modules/kubernetes/ingress_factory/`
- `stacks/platform/modules/traefik/middleware.tf` → `stacks/traefik/modules/traefik/middleware.tf`
Replace the single `protected = true` example with three: a
default Authentik-gated admin UI, an app-managed backend, and an
intentionally-public webhook receiver. Each example shows the
required comment line above the auth assignment.
[ci skip]
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.
The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
-> etcd snapshot save
-> optional master containerd skew fix
-> apt repo URL rewrite (minor bumps only)
-> drain/upgrade/uncordon master via ssh < update_k8s.sh
-> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
-> post-flight verification
Two new Upgrade Gates alerts catch failure modes:
- K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
- EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)
update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.
Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.
Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
The OS-side counterpart to the service-upgrade pipeline. Covers
the unattended-upgrades + kured + sentinel-gate + Prometheus
halt-on-alert design landed in c0991f7f8.
Runbook: ops procedures (verify health, halt rollout, restore
config to a re-imaged node, roll back a bad upgrade, investigate
which alert is blocking).
Architecture doc: extends the existing service-upgrade flow with
a "K8s Node OS Upgrades" section (stack, sources of truth, day-2
mechanism, why-this-design rationale tied to the March 2026
post-mortem).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reason: GPU multi-tenancy (frigate + ytdlp-highlights + llama-swap +
immich-ml) was hitting 94% memory-request saturation on the old size.
The benchmark on 2026-05-10 surfaced this when llama-swap stayed
Pending despite GPU time-slicing being on (nvidia.com/gpu replicas=100)
- the actual constraint was node1 RAM, not GPU.
Procedure: drained node1, qm shutdown 201, qm set 201 --memory 49152,
qm start 201, kubelet picked up new capacity (47 GiB / 45.5 GiB
allocatable), uncordon, restored llama-swap + immich-ml.
Out-of-band qm set is the path here (not Terraform) because VMID 201
is intentionally not managed by TF yet - the telmate/proxmox provider
trips on iSCSI-disked VMs (see infra/stacks/infra/main.tf line 442).
Adopt this VM into TF once we migrate to bpg/proxmox.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 7 of the vision-LLM benchmark plan. Adds:
- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
per-model analysis, top-N agreement, cost vs cloud APIs, sample
captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
of the flash-attention flag; without the value llama-server exits
before serving any request).
- llama-cpp.md architecture doc links the report so future operators
land on the deployed-and-evaluated model from one entry point.
300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three
GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one
OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc.
Idle TTL 10min so models unload between benchmark batches.
Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot
download Job pulls Q4_K_M GGUF + mmproj per model, creates stable
model.gguf / mmproj.gguf symlinks so the llama-swap config is
filename-agnostic, then warms the kernel page cache.
GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml
to 0 during benchmark windows. wait_for_rollout=false so apply
doesn't block on GPU availability.
Initial use case: vision-LLM benchmark for instagram-poster
candidate scoring; future consumers (HA, agentic tooling) hit
the same endpoint via LiteLLM at the gateway.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts
activations and dedup-skips by product, gauges last-activation timestamp.
Pod template gets the standard prometheus.io/scrape annotations so the
cluster-wide kubernetes-pods job picks it up via pod IP. Memory request
bumped to 48Mi to cover counter dicts + HTTPServer.
Plus docs: networking.md footnotes the windows-kms row noting public WAN
exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60,
overload <virusprot> flush) pfSense filter rule, and a new runbook covers
log locations, rate-limit tuning, and how to revoke the WAN forward.
The matching pfSense rule was tightened in place (TCP-only + rate limits)
via SSH; pfSense isn't Terraform-managed.
daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr
30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount,
which blocked the next run from completing — root cause of the WeeklyBackupStale
alert going silent (the metric never reached its end-of-script push).
Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting
the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as
belt-and-braces for any inherited stuck state from a prior crashed run
- TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of
the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only on success;
any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert
threshold expired)
Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node
OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup
that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to
/srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end:
3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql
image is stripped (no curl/wget/python) — switched to docker.io/library/postgres
matching the dbaas/postgresql-backup pattern with apt-installed curl.
Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed
backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp). Updated to match what's actually emitted, and
added a "default-covered" footnote to the Service Protection Matrix so the
~40 services with PVCs not enumerated in the table are no longer ambiguous.
Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the
loop got SIGTERMed against repeatedly, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
validate the fix end-to-end
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit).
innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals
don't fit in 2 Gi. Bumping limit to 4 Gi (request 3 Gi) leaves headroom
without changing the buffer pool config.
/srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV
1 TiB online and ran resize2fs (now 60% used). Triggered by surfacing
during the 2026-05-09 IO-pressure post-mortem; thinpool had ~4.6 TiB
free.
The post-mortem also covers the stale-NFS-client trigger (legacy
/usr/local/bin/weekly-backup pointing at the decommissioned TrueNAS IP)
and the resulting wedged kthread on the PVE host. Script removed and
node_exporter restarted out-of-band; kthread will clear at next PVE
reboot. See docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Existing NetworkPolicy only admitted port 3000 (Playwright WS) from
labelled client namespaces, blocking Traefik's traffic to the noVNC
sidecar on port 6080. The chrome.viktorbarzin.me ingress would hang
forever — page never loads, eventually times out.
Adds a second ingress rule allowing TCP/6080 from the traefik
namespace only. Authentik forward-auth still gates external access
at the Traefik layer.
Also reconciles the noVNC image to the new Forgejo registry path
(:v4 unchanged) — already declared in TF, just live-state drift from
the Phase 3 registry consolidation.
Updates the architecture doc; the previous text still described the
old nginx static health stub that noVNC replaced.
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml dual-pushes still until next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at line 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The f1-stream verifier's in-process headless Chromium kept tripping
hmembeds' disable-devtool.js Performance detector (CDP latency on
console.log vs console.table) and getting redirected to google.com.
This adds a single-replica chrome-service stack running Playwright
launch-server under Xvfb so callers can connect via WS+token to a
shared headed browser. f1-stream's _ensure_browser now prefers
chromium.connect(CHROME_WS_URL/CHROME_WS_TOKEN) and adds a vendored
stealth init script (webdriver/plugins/languages/Permissions/WebGL
spoofs + querySelector hijack to disarm disable-devtool-auto) on
every new context. Falls back to in-process headless if the env
vars aren't set.
Encrypted PVC for profile + npm cache, NetworkPolicy to TCP/3000
gated by client-namespace label, 6h tar.gz backup CronJob to NFS,
Authentik-gated nginx sidecar at chrome.viktorbarzin.me for human
liveness checks. Image pinned to playwright:v1.48.0-noble in
lockstep with the Python client's playwright==1.48.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reduces 5-min disk-write spikes on PVE sdc. The cronjob was the
heaviest single contributor in our hourly fan-out investigation
(11.2 MB/s burst when it fired). Kea DDNS still handles real-time
DNS auto-registration; phpIPAM inventory just lags by up to 1h,
which we don't need fresher.
Docs (dns.md, networking.md, .claude/CLAUDE.md) updated to match.
All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.
Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.
Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.
Closes: code-gy7h
Document scans (receipts, contracts, IDs) are unambiguously sensitive
PII. Storage decision rule defaults sensitive data to
`proxmox-lvm-encrypted`, but paperless-ngx had been left on plain
`proxmox-lvm` by an abandoned migration attempt that left a dormant,
non-Terraform-managed encrypted PVC sitting unbound for 11 days.
Cleaned up the orphan, added the encrypted PVC properly via Terraform,
rsynced data with deployment scaled to 0, swapped claim_name. Plain
`proxmox-lvm` PVC retained for a 7-day soak before removal.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cmd_prune_count's `log " Pruned: ..."` wrote to stdout, which the
caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward
(7d retention kicked in), pruned snapshots polluted the captured value
with multi-line log text, breaking the Prometheus exposition format
on the metric push (`lvm_snapshot_pruned_total ${pruned}` → 400 from
Pushgateway). Snapshots themselves were always fine; only the metric
push silently failed for ~9 nights, eventually triggering
LVMSnapshotNeverRun (alert has 48h `for:`).
Fix: redirect the inner log call to stderr so cmd_prune_count's stdout
contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh`
as the source-of-truth (was edited only on the PVE host) and updates
backup-dr.md to point at the .sh and document the scp deploy.
Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:
1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped
at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-
backup-sync"} until the next 06:01 UTC push — a ~18h false-negative
window. Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with
--persistence.interval=1m. Chart note: values key is
`prometheus-pushgateway:` (subchart alias), not `pushgateway:`.
2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100
but the NFS mount /srv/nfs/poison-fountain is root:root 755 and
the main Deployment runs as root, so mkdir /data/cache fails
every 6h. Set run_as_user=0 on the CronJob container (no_root_squash
is set on the export).
Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>