infra

Author	SHA1	Message	Date
Viktor Barzin	99f9bf8d89	fan-control: power-tune COOL curve to the 60% efficiency knee Power/temp sweep (2026-06-05) located the cooling-per-watt knee at ~60%: 60->70% buys only -2C for +21W, and 70->100% buys 0C for +54W (the CPU floors ~59C at cluster load, so more airflow does nothing). Re-tune the COOL curve to cap its normal band at 60% (~303W, ~61C); 80/100% become a high-load safety ramp (>=73/79C) before the 83C ceiling. QUIET unchanged (already at the 281W / 4800rpm floor). Saves up to ~75W (~650 kWh/yr) vs full-tilt for the last ~2C. Tests + design doc updated; verified live (63C, 60%, ~267W). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	90ad6b9125	fan-control: presence-aware IPMI fan curve for the R730 PVE host The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even under load (optimises for quiet, not cool). Add a bash daemon + systemd unit that drives the chassis fans from CPU temp on two curves, picked by garage occupancy (the server is in the garage): COOL when empty (measured ~58-65°C under load), QUIET near the silent floor when the ha-sofia garage door shows someone is there (open, or <15min since last activity). Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI failures do the same. Pushgateway metrics (job=fan_control). 36 unit tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN + RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127 (CPU 70->58°C in cool mode, hysteresis stepping confirmed). Design: docs/plans/2026-06-04-pve-fan-control-design.md Runbook: docs/runbooks/fan-control.md [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	ae252f9116	cluster-health: ha_integrations — skip disabled + ignored config entries check_ha_integrations counted any config entry with state=not_loaded as a problem, but HA marks intentionally-off entries that way too: disabled_by set (user/integration disabled it) and source=="ignore" (a discovered integration the user chose to ignore — never meant to load). On ha-sofia 2026-06-04 this false-WARNed on 6 entries that are all intentional — wyoming faster-whisper/piper + ollama (disabled_by=user) and mass_queue/dlna_dms(EMO-LAPTOP2)/yalexs_ble (source=ignore). Skip disabled/ignored entries; only genuine setup_error/setup_retry/ not_loaded (without disabled/ignore) now flag. Verified: check #27 -> PASS "All 96 integrations loaded".	2026-06-05 09:19:11 +00:00
Viktor Barzin	31b8104b43	cluster-health: uptime_kuma check — only count status==0 as down check_uptime_kuma flagged a monitor as down whenever its last heartbeat status != 1, and treated "no beats" as down too. But uptime-kuma status 2 = PENDING (mid-retry) and 3 = MAINTENANCE are not outages, and no-beats = no data. So a monitor caught in a momentary pending/retry state at check time produced a false "internal/external down(N)" WARN — observed twice on 2026-06-04 (Novelapp, then ha-sofia) for monitors uptime-kuma itself logged ZERO downs against over 24h (0/2880 and 0/288 beats). Count a monitor as down ONLY on an explicit DOWN beat (status==0); pending, maintenance, and no-data are not-down. Real outages still flag (uptime-kuma persists status==0 beats for genuine downs).	2026-06-05 09:19:10 +00:00
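The rule introduced in commit 31b8104b43 reduces to a single comparison. The following is a minimal sketch in bash, assuming the latest heartbeat status has already been extracted into a variable; `monitor_is_down` is a hypothetical helper name for illustration, not a function taken from the healthcheck script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) only for an explicit DOWN beat.
# Status values as listed in the commit message:
#   0 = DOWN, 1 = UP, 2 = PENDING, 3 = MAINTENANCE.
# An empty argument stands for "no beats" (no data) and counts as not-down.
monitor_is_down() {
  local status="${1:-}"
  [ "$status" = "0" ]
}

for s in 0 1 2 3 ""; do
  if monitor_is_down "$s"; then
    echo "status='${s}' -> down"
  else
    echo "status='${s}' -> not down"
  fi
done
```

Only the first iteration reports a monitor as down; pending, maintenance, and the empty no-data case all fall through to not-down.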
Viktor Barzin	b64d8d6168	cluster-health: add #47 ghost-disk drift check; fix immich_search set -e crash Check #47 "Proxmox CSI — Ghost-Disk Drift": per node, compares the real virtio-scsi CSI disk count in `qm config <vmid>` (SSH PVE) against the attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (query-pci QMP timeouts) that the scheduler's 28-LUN guard can't see — exactly the drift that wedged the MAM grabber on node3 (13 tracked vs 23 real). PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near the LUN cap). Already flagging node6 at 21 disks. Single `qm list` + one `qm config` per VM keeps it ~3s (the naive once-per-VM version timed out the parallel runner). Also fixes a PRE-EXISTING set -e crash in #46 immich_search (introduced by 138894cd): `pct=$(kubectl exec … | tr -d ' ')` and the dur_ms probe were unguarded, so with `set -o pipefail` a non-zero psql/exec propagated and tripped `set -e`, killing the check before json_add. It silently dropped from every parallel report and broke --serial entirely (whole run aborted). Guarded both substitutions with `|| true`; the existing `=~` numeric checks already handle the empty case. immich_search now reports PASS/WARN instead of vanishing.	2026-06-05 09:19:10 +00:00
Viktor Barzin	23d87d8885	cluster-health #20 : fix false NFS FAIL on Linux (nc -G is macOS-only) The NFS connectivity check fell through to `nc -z -G 3 192.168.1.127 2049` when `showmount` is absent (the DevVM ships no nfs-common). But `-G` is a macOS/Darwin-only connect-timeout flag — OpenBSD/GNU nc on Linux rejects it with "invalid option -- 'G'", so the elif failed and the check reported "NFS unreachable" on every Linux run even though port 2049 was wide open (confirmed via /dev/tcp). All deployment/PVC/statefulset checks were green throughout — a real PVE NFS outage would have taken down 30+ services. Fix: use the portable `-w` timeout flag, and add a final bash /dev/tcp fallback so the probe is correct even on hosts with neither showmount nor a usable nc.	2026-06-05 09:19:07 +00:00
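The fallback chain described in commit 23d87d8885 might look roughly like the sketch below. It assumes bash, an `nc` that accepts `-z` and `-w`, and coreutils `timeout`; `probe_tcp` is a hypothetical name, and the `showmount` branch that precedes it in the real check is omitted here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if a TCP connection to host:port succeeds.
probe_tcp() {
  local host=$1 port=$2
  # Prefer nc with the portable -w connect timeout (not the macOS-only -G).
  if command -v nc >/dev/null 2>&1; then
    nc -z -w 3 "$host" "$port" >/dev/null 2>&1 && return 0
  fi
  # Final fallback: bash's built-in /dev/tcp pseudo-device, so the probe still
  # works on hosts with no usable nc. Host and port are passed as $0 and $1.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" >/dev/null 2>&1
}

# Default target is a placeholder; the real check probes the NFS server on 2049.
host=${1:-127.0.0.1}
port=${2:-2049}
if probe_tcp "$host" "$port"; then
  echo "NFS port ${port} on ${host} reachable"
else
  echo "NFS port ${port} on ${host} unreachable"
fi
```

Falling through to `/dev/tcp` when `nc` fails costs a second timeout in the genuinely unreachable case, which seems an acceptable trade for never reporting a false outage because of an unsupported flag.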
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	a2fa912b44	cluster-health: add check #45 — HA Sofia Status Dashboard Mirrors the verdict of emo's curated Барзини → Статус Lovelace view (dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template cards). Pulls the dashboard config via the HA WebSocket API (one-shot, shared cache), batch-renders every card's secondary Jinja template against /api/template in a single POST, and classifies the rendered text per card: FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data" WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" / "Пълен резервоар" / "Грешка" / "attention" / "Внимание" Roll-up is a single check with a per-section breakdown (Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet non-JSON path lists each offending card with its rendered status line. Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced correctly in both human and JSON output.	2026-06-05 09:19:06 +00:00
Viktor Barzin	c7cf21a986	Revert mail LAN-redirect approach; pending VIP-based redesign The pfSense NAT rdr rules added in f7cf9f07 hardcoded 10.0.20.203 (Traefik LB IP) as the redirect source. That couples mail's LAN path to Traefik's IP choice — if Traefik moves again (it just moved .200 → .203 on 2026-05-30), the mail path silently breaks. Removing the script and the matching doc paragraph; keeping the networking.md .200 → .203 staleness fix (separate correction). Follow-up: give the mail HAProxy listener a dedicated pfSense Virtual IP (IP Alias on opt1), update Technitium internal zone + WAN port-forwards to target the VIP, so mail's LAN-side path is decoupled from any other service's LB IP.	2026-06-03 10:24:25 +00:00
Viktor Barzin	fd35c4f303	pfSense: LAN-side NAT redirect for mail ports landing on Traefik LB IP Technitium's split-horizon rewrites *.viktorbarzin.me to 10.0.20.203 (Traefik LB) for the 192.168.1.0/24 Barzini WiFi (TP-Link router has no hairpin NAT). The rule is name-agnostic so mail.viktorbarzin.me (and imap./smtp.) get sent to .203 too — where Traefik does not listen on 25/465/587/993. iOS Mail on Barzini WiFi silently hangs while Roundcube (port 443 via Traefik) keeps working. Adds pfSense NAT rdr rules so traffic to 10.0.20.203:{25,465,587,993} gets redirected to 10.0.20.1 (the mail HAProxy listener already serving the public path). Loaded on every incoming interface by pfSense rule generation, so any LAN/VPN client falling into the split-horizon answer lands on the right service unchanged. Includes idempotent reproducer script (mirrors the existing pfsense-haproxy-bootstrap.php pattern) and the networking.md mail carve-out paragraph plus the stale .200 → .203 reference.	2026-06-03 10:24:25 +00:00
Viktor Barzin	848cc7211f	t3code: track t3 nightly via health-checked auto-updater Move t3 from pinned stable (0.0.24, catalog capped at opus-4-7) to the nightly channel so new models (Opus 4.8) land as t3 ships them. t3-autoupdate (daily systemd timer) pulls t3@nightly, but applies the Keel-incident lesson: it health-checks the new binary on a throwaway serve and AUTO-ROLLS-BACK on failure, and restarts only IDLE per-user instances (defers any with an active agent child) so an in-flight session is never killed by an update. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	d27df1f321	t3code: dispatch — strip @domain from X-authentik-username (Authentik injects email) Authentik injects the full email (e.g. vbarzin@gmail.com), but /etc/ttyd-user-map and dispatch.json key on the local part (vbarzin), so every real login hit 403 'no instance provisioned'. Strip @domain before lookup, matching the terminal stack's tmux-attach.sh. Verified: vbarzin@gmail.com / emil.barzin@gmail.com -> 302 (own instance); unmapped/no-header -> 403. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	9f551e3c13	t3code: harden dispatch — dedicated user + validated t3-mint + scoped sudoers Run t3-dispatch as an unprivileged dedicated user instead of wizard (who has full sudo). Privileged minting goes through /usr/local/bin/t3-mint, which validates the target against /etc/ttyd-user-map before minting as that user; sudoers permits t3-dispatch to run only that wrapper. Compromise of the network-facing service can mint pairing tokens for mapped users at most. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	0472f67d49	t3code: devvm dispatch + auto-pair service (Go) Routes X-authentik-username -> per-user t3 instance; on no t3_session cookie, mints a pairing token (as the OS user) and exchanges it at /api/auth/bootstrap, injecting the session cookie. Listens :3780, reads /etc/t3-serve/dispatch.json. Constants from the Task-1 auth-contract spike. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	72aba7da32	t3code: reconcile per-user t3 instances from /etc/ttyd-user-map Sticky port allocation (3773+), enables t3-serve@<user>, emits /etc/t3-serve/dispatch.json for the dispatch service. systemd timer (OnBootSec+hourly) mirrors the apply-mbps-caps pattern. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	f8a63fdacd	t3code: per-user t3-serve@ systemd template (User=%i file isolation) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	a382683c0e	infra: fix containerd forgejo-registry redirect .200->.203 (+skip_verify) Traefik moved off shared .200 to its dedicated .203 on 2026-05-30, but the containerd hosts.toml redirect for forgejo.viktorbarzin.me still pointed at the now-dead .200:443 -> every FRESH forgejo pull failed (cached images kept running, so it stayed hidden until a new image tag was pulled). Retarget to .203 and add skip_verify (node dials Traefik by IP; cert is for forgejo.viktorbarzin.me) in both the new-node cloud-init and existing-node deploy scripts. Already rolled to all 7 nodes (rewrite + restart containerd, no drain). Doc fix in .claude/CLAUDE.md.	2026-06-01 21:22:05 +00:00
Viktor Barzin	ddd582a28c	backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image The offsite Synology hit 97% — the Backup share grew +670G in a week, traced to the 2026-05-26 change that began mirroring large regenerable services offsite, plus an unbounded nextcloud.log bloating its backups to 87G. - nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies). - offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the SSD no longer ship offsite (re-pullable models). - daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption PV, still backed up weekly). - nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup now excludes html/ (app code, from image), logs, and preview cache and keeps only the latest copy (pvc-data holds version history) → <5G (was 87G). - nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to 32.0.3; reconciling that drift this session rolled a 32.0.3 pod that CrashLooped on the downgrade. Pinning eliminates the drift. Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	f677794379	cluster_healthcheck.sh: run checks in parallel (~3x speedup) Each check function only reads cluster state and mutates in-memory counters; that makes it safe to isolate each one in a subshell, write stdout to a per-check temp file, and replay outputs in original order after all jobs finish. Counters/JSON_RESULTS replicated through marker lines (###HCK###PASS:N etc.) so the aggregate state matches the serial run exactly. Pre-fetch the HA Sofia cache once in the parent so the four HA checks share a single API round-trip instead of each subshell re-fetching. Auto-fix mode forces --serial so mutation order stays deterministic. New flags: --parallel N (default 12, env HEALTHCHECK_PARALLEL_JOBS), --serial. Diminishing returns past ~12 workers. Benchmark (--quiet, 44 checks): 53s serial -> 18s parallel-12.	2026-05-27 19:46:40 +00:00
Viktor Barzin	3f0c429d46	offsite-sync: add `|| true` to Step 2 HDD grep|while pipeline Mirrors the SSD section's pattern. If the LAST iteration of the `while IFS= read -r f; do [ -f "$f" ] && echo "${f#/srv/nfs/}"; done` body sees a file that was deleted between inotify capture and now (e.g. an immich encoded-video temp file that got cleaned up), the while loop returns 1, pipefail propagates, set -e kills the script silently before reaching the rsync. No log line, just disappears. Pre-existing bug; only exposed today after pruning the bypass regex to immich-only — when the regex was broader, the last match in the sorted dedup'd inotify log happened to be a live file often enough that the bug stayed dormant. Validated by full e2e run: 1120 nfs/immich files + 2285 nfs-ssd files shipped successfully. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 19:55:33 +00:00
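The failure mode in commit 3f0c429d46 can be reproduced in a few lines. This is a sketch under assumptions: the temp directory, the file names, and the `list_paths` function are hypothetical stand-ins for the inotify log and the real NFS paths.

```bash
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
touch "$tmp/live"

# Two paths: the first still exists, the last one has already been deleted.
list_paths() { printf '%s\n' "$tmp/live" "$tmp/gone"; }

# A while loop exits with the status of the last command its body ran. For the
# vanished file `[ -f "$f" ]` fails, so the loop returns 1 and pipefail makes
# that the status of the whole pipeline.
rc=0
list_paths | while IFS= read -r f; do [ -f "$f" ] && echo "$f"; done >/dev/null || rc=$?
echo "unguarded pipeline status: $rc"

# With the trailing `|| true` the same pipeline can no longer trip set -e.
kept=$(list_paths | while IFS= read -r f; do [ -f "$f" ] && echo "${f#"$tmp"/}"; done || true)
echo "kept: $kept"
```

The first pipeline is only safe here because its status is captured with `|| rc=$?`; written as a bare statement it would end the script silently, which matches the behavior the commit describes.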
Viktor Barzin	37d88ce50e	nfs-mirror: weekly Mon 04:00 → daily 02:00 Steady-state delta runs in 10-20 min and the weekly cadence left a real RPO gap: app data under /srv/nfs/<svc>/ that isn't a PVC (captured by daily-backup) or a *-backup CronJob (captured daily by the CronJob writing to /srv/nfs/<svc>-backup/) was on a 7-day worst case for off-disk durability. Affected paths include nextcloud shared files, audiobookshelf library, mailserver Maildir, calibre, servarr metadata, real-estate-crawler scraped data, openclaw agent state. Daily cadence drops their RPO to ~24h at negligible cost. Slot: 02:00, 3h ahead of daily-backup (05:00) so the manifest is populated before offsite-sync reads it at 06:00. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 19:00:10 +00:00
Viktor Barzin	41fb7c4a76	backup pipeline: prune sda-bypass list to immich-only Previously /srv/nfs/{ollama,audiblez,ebook2audiobook,*-backup} took the sdc → Synology direct leg. They now ride sdc → sda → Synology pve-backup/ via nfs-mirror like every other NFS subtree, so sda becomes the single canonical mirror and Synology only has to ingest one feed for the bulk of cluster state. frigate + temp dropped from BOTH legs (no backup anywhere) per explicit user ask — frigate is a 14d camera ring, temp is scratch. prometheus/loki/alertmanager dropped as no-op (orphan dirs that no longer exist on /srv/nfs). Also: nfs-mirror's manifest collection switched from find -newer (mtime) to find -cnewer (ctime) — rsync -t preserves source mtime on dest, so freshly-written files looked "older than \$STAMP" and the 2026-05-26 full mirror run captured only 2 of 800k transferred files. Hit during this session, recovered via .force-full-sync. Operational result post-rollout: - sda 87% → 70% (anca-elements 423G deleted, +260G new dirs) - /Viki/nfs/ on Synology: was 24 stale dirs (~430G), now immich only - Synology free: ~300G → ~430G+ once btrfs reclaim catches up Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:22:01 +00:00
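The mtime versus ctime distinction behind the `find -newer` to `find -cnewer` switch in commit 41fb7c4a76 can be shown without rsync. The sketch below assumes GNU coreutils and findutils, and uses `touch -d` to imitate what `rsync -t` does to a destination file; the directory and file names are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Stamp taken before the transfer starts.
touch "$work/stamp"
sleep 1

# Imitate a file that rsync -t just wrote: the data is new, but the mtime is
# copied from the source and therefore lies in the past.
echo data > "$work/copied"
touch -d '2020-01-01 00:00:00' "$work/copied"

# -newer compares mtime, so the freshly written file looks older than the stamp.
by_mtime=$(find "$work" -type f -name copied -newer "$work/stamp" | wc -l)
# -cnewer compares ctime, which the kernel updated when the file was written.
by_ctime=$(find "$work" -type f -name copied -cnewer "$work/stamp" | wc -l)

echo "found by -newer (mtime):  $by_mtime"
echo "found by -cnewer (ctime): $by_ctime"
```

Because ctime cannot be set from user space, it records when the mirror actually touched the file, which is the property the manifest collection needs.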
Viktor Barzin	8ed427a7e4	cloud-init: hands-off k8s worker provisioning + 5 bug fixes Goal: re-clone the worker template, boot, and have it appear as `kubectl get nodes …Ready` with no manual steps. Adds `scripts/provision-k8s-worker NAME VMID IP` and rebuilds the cloud-init pipeline that was failing five distinct ways on a clean boot. Bugs fixed (all hit during the k8s-node5 + k8s-node6 builds today): 1. `indent(6, containerd_config_update_command)` indented the bodies of `cat >> /etc/containerd/config.toml <<'CONTAINERD_GC'` heredocs, so [plugins.*] TOML sections landed in /etc/containerd/config.toml at col 6 — containerd refused to parse them. Source is now a normal .sh file (`modules/create-template-vm/k8s-node-containerd-setup.sh`) base64-embedded into `write_files`; YAML whitespace never touches the heredoc bodies. 2. The same script tried to `cat >> /etc/containerd/config.toml` `[plugins."io.containerd.gc.v1.scheduler"]` etc., which containerd v2.2.4's `config default` ALREADY emits. Result: `toml: table … already exists`. Patched with sed-in-place overrides instead. 3. Kubelet tuning (sed against /var/lib/kubelet/config.yaml) ran from the containerd setup script — BEFORE `kubeadm join` writes that file. Sed aborted with "No such file or directory", `set -e` killed the script, post-script cloud-init steps kept going (cloud-init doesn't stop on runcmd failure). Split into a dedicated `k8s-node-post-join-tune.sh` invoked AFTER kubeadm join. 4. cloud_init.yaml fallocate'd a 4G swapfile and `swapon`'d it BEFORE kubeadm join. kubelet defaults to failSwapOn=true → exited 1 immediately. Replaced the swap setup with `swapoff -a` (node4 already runs this way and the cluster is fine). 5. Without `hostname:` in the shared user-data snippet, Proxmox's auto-generated meta-data does NOT include local-hostname when `cicustom user=…` is set — so cloud-init falls back to the cloud image's default `ubuntu` and `kubeadm join` registers the wrong node name. 
`provision-k8s-worker` now writes a per-node `<NAME>-meta.yaml` snippet and passes both via `cicustom user=…,meta=…`. Other improvements rolled in while fixing the above: - `ssh_public_key` read from Vault (`secret/viktor.ssh_public_key`, added today) instead of `var.ssh_public_key`. The last `terragrunt apply` was run with that var empty, leaving the snippet's `ssh_authorized_keys` with a single blank entry; the wizard user was effectively locked out of every fresh node. - `cloud_init.yaml` adds `/etc/systemd/resolved.conf.d/global-dns.conf` with `DNS=8.8.8.8 1.1.1.1, FallbackDNS=10.0.20.201`. Without it, systemd-resolved only consulted Technitium (link-level), which returns NXDOMAIN for `forgejo.viktorbarzin.me` — kubelet pulls from the Forgejo registry then failed DNS until I patched it manually on node5. - k8s apt repo bumped v1.32 → v1.34 (matches cluster). - The containerd setup script now creates hosts.toml for forgejo, quay, registry.k8s.io in addition to docker.io + ghcr.io. node3/4 had these added by hand post-bootstrap; now they're baked in. - `config_path` sed matches both `""` (containerd v1) and `''` (containerd v2.x). Without the v2 match, the certs.d mirror dir was silently ignored. - `proxmox-csi` node map adds k8s-node5 + k8s-node6 entries so CSI topology labels (region/zone, max-volume-attachments=28) apply on next `tg apply`. - `stacks/infra/main.tf` shed the 160-line inline containerd setup heredoc — that whole thing now lives in the module as a .sh file. Known unsolved gaps (deferred): - iscsid restart hangs ~90s on first boot before SIGKILL releases it (systemd-resolved restart kicks iscsid via dependency). Adds wall- clock time but doesn't block the join. - `provision-k8s-worker` doesn't run `tg apply` on `proxmox-csi` afterward, so the CSI topology labels need a manual apply after the node joins. Solving cleanly needs the CSI map to derive from `kubectl get nodes` instead of a static local — separate work. 
- `var.containerd_config_update_command` is now ignored when is_k8s_template=true (replaced by the bundled .sh file). Variable kept with a deprecation note to avoid breaking other call sites. E2E proof: k8s-node6 (VMID 206) boots hands-off from `provision-k8s-worker k8s-node6 206 10.0.20.106` and appears as `kubectl get nodes …Ready` ~7 min later (most of which is the apt package_upgrade — separate optimization). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 11:52:00 +00:00
Viktor Barzin	56a338f80b	scripts: hook apply-mbps-caps into the PVE host as a systemd timer The qm-set I/O caps were previously only applied by manual one-shot runs of apply-mbps-caps.sh, so any config drift (manual `qm set`, config restored from /mnt/backup/pve-config like we did on 2026-05-26, fresh VM clone) would leave the affected VM uncapped until someone remembered to re-run the script. Adds apply-mbps-caps.service (Type=oneshot) + apply-mbps-caps.timer firing: - OnBootSec=5min — catches PVE host reboots & restored configs - OnCalendar=hourly — catches manual qm-set drift / fresh clones - Persistent=true — runs missed schedule after PVE downtime - RandomizedDelaySec=2min Same install pattern as the other PVE operational scripts (nfs-mirror, daily-backup, offsite-sync-backup, lvm-pvc-snapshot — memory id=609 + id=542). Source in this repo, deployed to /usr/local/bin + /etc/ systemd/system/ on the PVE host. Script hardening: kept `set -uo pipefail` but dropped `-e` so one missing VM doesn't abort the rest; each VM is gated on `qm status` existence; added a fast-path "already at target" no-op log line for quiet hourly runs. Installed on PVE (192.168.1.127) and smoke-tested: all 8 VMs caps re-applied successfully, next run 12:00 EEST. Journal: `journalctl -u apply-mbps-caps -f` on the PVE host.	2026-05-26 08:12:15 +00:00
Viktor Barzin	232409e798	scripts: per-VM I/O cap script — apply-mbps-caps.sh Idempotent qm-set script for the per-VM I/O caps on the PVE host's sdc thin pool (2026-05-26 session, beads code-9v2j). Caps protect each Linux VM's share of sdc so a runaway workload (e.g. the 2026-05-23/26 alloy IO storm — memory id=2726) cannot saturate the disk for everyone. Was sitting in /tmp on PVE — moving the source under version control and installing to /usr/local/bin/ alongside the other PVE operational scripts (nfs-mirror, daily-backup, offsite-sync-backup; pattern from memory id=609). Survives PVE host reboots; safe to re-run on any node rebuild to restore the caps. VMIDs covered (Linux only — pfSense 101 and Windows10 300 skipped): 102 devvm 60/60 103 home-assistant 40/40 200 k8s-master 100/60 201 k8s-node1 150/120 202 k8s-node2 150/120 203 k8s-node3 150/120 204 k8s-node4 150/120 220 docker-registry 40/40	2026-05-26 08:06:15 +00:00
Viktor Barzin	d5f73ce109	backup: exclude /anca-elements/ from nfs-mirror + offsite Step 1 Anca's photos are being ingested into Immich (started 2026-05-24 afternoon), so /srv/nfs/immich/library/ becomes the canonical copy for those photos. The separate /srv/nfs/anca-elements/ archive tree + its sda mirror at /mnt/backup/anca-elements/ are now redundant. Going forward: - nfs-mirror EXCLUDES /anca-elements/ so future weekly runs don't re-touch the 771G subtree (also no longer required since Immich has the data via its NFS library). - offsite-sync Step 1 also excludes /anca-elements/ — the historical 771G under /mnt/backup/anca-elements/ stays on sda for now but is NOT shipped to Synology pve-backup/ (Immich's library reaches Synology via Step 2 bypass leg anyway). The 771G on /mnt/backup/anca-elements/ will be cleaned up manually once Immich ingest completes and we verify all photos are in the Immich library. Same for /srv/nfs/anca-elements/ on sdc thin pool — freeing both would reclaim ~1.5 TB across sdc + sda. In-flight context: today's nfs-mirror first run was killed mid-flight at ~70% (was at /srv/nfs/postgresql/). The killed run wrote ~200G of service NFS subtrees to /mnt/backup/<svc>/, then sda hit 95% used, prompting this change. Next nfs-mirror run will not touch anca-elements and will fit comfortably (~250G total for the keep-list minus anca-elements).	2026-05-24 18:34:41 +00:00
Viktor Barzin	c948dc0dbe	backup pipeline: flock manifest + cap + drop LAN -z Three more audit fixes from the 2026-05-24 backup-pipeline review: #5 (S1 race) — manifest flock daily-backup and nfs-mirror both append to /mnt/backup/.changed-files. If they overlap (nfs-mirror Mon 04:11 running long, daily-backup starting Mon 05:00), concurrent appends from `find | tee` and `find | sed >>` could interleave mid-line — partial paths would slip past rsync's --files-from. Both scripts now share a manifest_append() helper using `flock -x` on /mnt/backup/.changed-files.lock. The 4 daily-backup call sites + the 1 nfs-mirror call site all pipe through it instead of redirecting directly. #7 (S2 unbounded manifest) daily-backup gains check_manifest_size() invoked after the PVE-config append (the last manifest writer of the run). Above MANIFEST_MAX_LINES (500k) it touches /mnt/backup/.force-full-sync — offsite-sync's Step 1 now treats that flag the same as day-of-month ≤ 7 (full sync with --delete) and clears it on success. Catches the "Synology unreachable for many days" edge case where the manifest would grow unbounded. #9 (wear — drop -z on LAN hops) offsite-sync rsync calls to Synology over the same 192.168.1.0/24 gigabit LAN had `-rltz`. Compression burns CPU on the PVE host (already IO-busy) and gives nothing on a saturated GigE link. Dropped to `-rlt` on all 5 offsite rsync invocations (Step 1 full + Step 1 incremental + Step 2 full nfs + Step 2 full nfs-ssd + Step 2 incremental). Other adjustments: - nfs-mirror's find-after-rsync now also excludes the new state files (.changed-files.lock, .force-full-sync) when populating the manifest. - offsite-sync Step 1 full-sync excludes the same .force-full-sync flag so it doesn't ship to Synology. Deployed to PVE host (/usr/local/bin/{daily-backup,nfs-mirror, offsite-sync-backup}). Currently in-flight nfs-mirror run is unaffected (bash loaded the old script into memory at start). Next runs use the new behaviour. 
Refs: 2026-05-24 audit Section 2 items #1 (manifest race), #4 (unbounded manifest), #6 (LAN -z wear).	2026-05-24 16:27:42 +00:00
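A shared append helper of the kind commit c948dc0dbe describes could be sketched as follows. This is not the deployed `manifest_append()`: the temp-file paths stand in for /mnt/backup/.changed-files and its lock file, the two writers are simulated with `seq`, and `flock` is assumed to be the util-linux command.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for the real manifest and lock file.
MANIFEST=$(mktemp)
LOCK="${MANIFEST}.lock"
trap 'rm -f "$MANIFEST" "$LOCK"' EXIT

# Append stdin to the manifest while holding an exclusive lock on fd 9, so two
# writers can never interleave partial lines.
manifest_append() {
  (
    flock -x 9
    cat >> "$MANIFEST"
  ) 9>"$LOCK"
}

# Two concurrent writers, standing in for daily-backup and nfs-mirror.
seq 1 2000 | sed 's/^/daily-backup\/file-/' | manifest_append &
seq 1 2000 | sed 's/^/nfs-mirror\/file-/' | manifest_append &
wait

total=$(wc -l < "$MANIFEST")
# grep -c exits 1 when it counts zero lines, hence the guard under set -e.
bad=$(grep -Evc '^(daily-backup|nfs-mirror)/file-[0-9]+$' "$MANIFEST" || true)
echo "lines: $total, malformed: $bad"
```

Locking a separate `.lock` file rather than the manifest itself means the lock survives the manifest being truncated or replaced after a successful sync.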
Viktor Barzin	4798583db7	backup pipeline: S1 fixes from 2026-05-24 audit Three immediate fixes surfaced by the backup-pipeline audit: 1. S1 silent-loss race fix (daily-backup.sh:142): remove the `> "${MANIFEST}"` truncation at the start of daily-backup. Truncation already lives in offsite-sync-backup at line 159, gated on a successful sync. With both scripts truncating, an offsite-sync failure followed by the next morning's daily-backup would silently wipe yesterday's unconsumed manifest entries — those files would only reach Synology via the monthly full sync (1st-7th of month). Now only offsite-sync truncates, and only on success. 2. Missing alert OffsiteBackupSyncFailing: documented in backup-dr.md but was never added to prometheus_chart_values.tpl. Step 1 or Step 2 failure pushes offsite_sync_last_status=1 but nothing read it. Added. 3. wear: drop `-z` from local-only rsyncs (daily-backup.sh:218 PVC snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda transfers — compression wastes CPU and yields nothing (gigabit local path, intermediate disk doesn't benefit). Bonus cleanups (zero functional impact): - "Weekly backup starting/complete" → "daily-backup starting/complete" (the timer is daily, not weekly — legacy from earlier monthly-rotation schedule). - "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no Step 1 above). - wear: pfSense full filesystem tar now Sunday-only instead of daily. config.xml stays daily (it's the primary restore artifact and tiny). Full tar is forensic recovery only — re-tarring ~100MB+ daily writes ~3G/month to sda + Synology for unchanged content. Weekly is plenty. 
docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to reflect today's two-leg architecture; added a "2026-05-24 session" changelog summary at the top; added a "Synology snapshot management" subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated by 2FA so this is the only programmatic path); updated Key Files table with nfs-mirror + the Synology SSH access notes. Open follow-ups from the audit (S2 — file as beads if pursued): - Factor two-leg invariant into /etc/backup-skip-list.conf sourced by both nfs-mirror.sh and offsite-sync-backup.sh. - Manifest write-collision flock between nfs-mirror Mon 04:11 and daily-backup Mon 05:00. - Unbounded manifest cap (force full sync if > 500k lines). - Synology free-space scraper + alert. - LVM thin pool meta-pool fill alert. - nfs-change-tracker.service heartbeat to Pushgateway. - Synology config drift TF surface (snap retention, share defs).	2026-05-24 16:18:44 +00:00
Viktor Barzin	9277d71d81	nfs-mirror: append transferred files to offsite-sync manifest Step 1 of offsite-sync-backup is incremental on non-monthly days, driven by /mnt/backup/.changed-files which only daily-backup wrote to. nfs-mirror's writes were therefore invisible to Step 1 until the next monthly --delete pass — which would also wipe data pre-positioned on Synology pve-backup/ (e.g. the in-place btrfs rename we just did to relocate ~160G of NFS subtrees from /Backup/Viki/nfs/<svc>/ to /Backup/Viki/pve-backup/<svc>/). Fix: snapshot a timestamp before rsync, then after rsync use `find -newer $STAMP -type f -printf '%P\n'` to enumerate every file nfs-mirror created/modified and append to the manifest. Paths are relative to /mnt/backup/ (matches Step 1 --files-from expectation). State files are excluded. The current in-flight first run started before this patch was deployed, so its writes won't auto-populate the manifest — a one-off manual backfill will be done after it completes.	2026-05-24 15:32:22 +00:00
Viktor Barzin	15745eab2f	backup: retire anca-elements-mirror + anca-elements-sync.sh Both subsumed by nfs-mirror (deployed earlier this session) — see commit `4d756be4`. anca-elements-sync.sh is now dead code because its upstream (Synology /volume1/Backup/Anca/Elements) was deleted today once the sda mirror was parity-verified (109,624 files / 827,480,937,976 bytes equal both sides). PVE NFS is the source of truth for the archive from here on. Final script inventory on the PVE host (down from 6 to 4): - /usr/local/bin/daily-backup (block PVCs + sqlite + pfsense) - /usr/local/bin/lvm-pvc-snapshot (snapshot management) - /usr/local/bin/nfs-mirror (NFS local mirror to sda) - /usr/local/bin/offsite-sync-backup (sda + bypass-list NFS to Synology)	2026-05-24 14:58:55 +00:00
Viktor Barzin	4d756be4f5	backup: consolidate to one local-mirror script + invert offsite filter Before this commit, the in-flight design split anca-elements (its own mirror script + timer) from the rest of /srv/nfs (still going to Synology via inotify-tracked offsite-sync). It also meant Synology received some bytes via both paths (sda → Synology AND direct NFS → Synology), which doubled consumption. This commit collapses both into a clean 3-2-1: Copy 1 (sdc): live /srv/nfs/* + cluster block PVCs Copy 2 (sda): /mnt/backup/{pvc-data,sqlite-backup,pfsense,pve-config,<critical-nfs>/} ← daily-backup + nfs-mirror (one script each) Copy 3 (Synology): /Backup/Viki/{pve-backup,nfs,nfs-ssd} ← offsite-sync-backup Step 1 (sda → Synology) + Step 2 (sda-BYPASS paths only → Synology direct) scripts/nfs-mirror.{sh,service,timer}: New consolidated weekly mirror. Replaces anca-elements-mirror (to be removed in a follow-up after the current in-flight rsync completes, parity-verified, and Synology source-of-truth is deleted). Single rsync /srv/nfs/ → /mnt/backup/ with an explicit EXCLUDES list that drops paths not worth a local 2nd copy: immich (1.2T — too big), frigate (14d ring), prometheus/loki (rebuildable), ollama/llamacpp/audiblez/ebook2audiobook (re-fetchable), -backup (already backups), temp/alertmanager (transient). Nice=10, IOSchedulingClass=idle. scripts/offsite-sync-backup.sh: Step 2 (NFS → Synology) filter inverted: instead of `--exclude=anca-elements/`, it now `--include`s only the sda-BYPASS paths (immich, frigate, prometheus, -backup, …). The bypass-include regex MUST stay in lockstep with nfs-mirror's EXCLUDES — they are complementary and any drift creates either gaps or duplication on Synology. Comment in the script flags this. 
monitoring alerts: renamed AncaElementsMirror{Stale,Failing} to NfsMirror{Stale,Failing} matching the new metric job name `nfs-mirror`. Thresholds unchanged. docs/architecture/backup-dr.md: rewritten Step 1/Step 2 sections and added the bypass-list rationale + cross-reference between scripts. NOT YET DEPLOYED — gated on the in-flight anca-elements-mirror rsync finishing + parity verification + Synology /volume1/Backup/Anca/Elements deletion. The old scripts (anca-elements-{mirror,sync.sh}) remain on the PVE host until then, and will be removed in a cleanup commit.	2026-05-24 12:49:20 +00:00
Viktor Barzin	6db64fe060	anca-elements: weekly local mirror sdc → sda (replaces Synology as 2nd copy) Synology is being removed as a host for the Anca/Elements archive (770G). /srv/nfs/anca-elements on PVE becomes the source of truth; sda /mnt/backup/anca-elements becomes the single-disk-failure mirror. No offsite for this archive — by design. - scripts/anca-elements-mirror.sh: rsync -rlt --delete -H, idempotent, pushes anca_elements_mirror_last_{run_timestamp,status,bytes} to Pushgateway, lockfile in /run, SIGTERM-safe (status=2 on abort). - .service: oneshot, Nice=10, IOSchedulingClass=idle, 5h timeout. - .timer: weekly Mon 04:00, Persistent=true, 15-min randomised delay. Deployed to PVE host; timer enabled; initial 770G sync running in background. Synology original to be deleted after first run completes and parity is verified. docs/architecture/backup-dr.md: documents Layer 3a + updated path exclusion rationale (PVE is now upstream, not downstream).	2026-05-24 11:51:52 +00:00
Viktor Barzin	34f8c0f537	docs+scripts: lock in nextcloud-as-PVE-NFS-browser surface - docs/architecture/storage.md: new "Nextcloud as PVE-NFS browser" section documenting mount-per-archive + applicable_users model, why mount-level ACL beats Files Access Control on NC 30/31, the manifest shape (with current applicableUsers + enableSharing fields), and the trade-off - docs/runbooks/nextcloud-add-archive.md: 5-step runbook to surface a new directory under /srv/nfs/* to specific NC users via the bootstrap Job - scripts/anca-elements-sync.sh: deployed at /usr/local/bin/anca-elements-sync.sh on the PVE host; fpsync from Synology Anca/Elements to /srv/nfs/anca-elements (idempotent + resumable). The PVE replica is what the NC /anca-elements mount serves; the offsite-sync pipeline excludes this path (committed earlier this session) so we don't write it back to Synology. NC usernames are admin/anca/emo (not display names — admin is Viktor). Stale "viktor" references in the manifest example dropped.	2026-05-24 11:45:01 +00:00
Viktor Barzin	05f047f290	offsite-sync-backup + nfs-change-tracker: exclude /srv/nfs/anca-elements The 771G under /srv/nfs/anca-elements is a downstream replica synced FROM Synology (/volume1/Backup/Anca/Elements) by anca-elements-sync.sh. The offsite-sync pipeline was copying it back to Synology under /volume1/Backup/Viki/nfs/anca-elements, creating a self-duplicate (~122G already partially copied during the last monthly full sync). - nfs-change-tracker.service: drop anca-elements/ from inotify watch (incremental syncs no longer queue these paths) - offsite-sync-backup.sh: --exclude='anca-elements/' on the monthly full rsync; grep -v on the incremental files-from list Deployed to 192.168.1.127:/usr/local/bin/offsite-sync-backup + /etc/systemd/system/nfs-change-tracker.service; service reloaded.	2026-05-24 11:03:09 +00:00
Viktor Barzin	4713c3a6d9	k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore Three changes unblocking the autonomous chain for k8s patch upgrades: 1. phase_master quiesces tigera-operator before drain, restores after. Tigera crashes immediately if apiserver is unreachable (no retry logic) and crashlooping it during master static-pod swaps generates ~500MB/s disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its limit. Quiesce removes the storm contributor; calico data plane keeps running unchanged (data plane is the DaemonSet+Typha, operator is just the reconciler). 2. update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt. For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm writes an identical manifest, hash doesn't update, watch times out and rolls back forever. The skip-etcd retry sidesteps it for the legitimate no-change case while still doing a full etcd upgrade on the first attempt (correct for minor-version bumps). 3. halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait. Both are symptoms-not-causes: ingress latency spikes briefly during any pod-restart wave; high IOwait is exactly what upgrade activity causes (chicken-and-egg). The inline quiet-baseline check (Ready transition <10min) is the real cluster-churn gate. RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale subresource so the chain can do the scale-to-0/back-to-1 on tigera. These three together get the chain past the cascade that's been blocking 1.34.7→1.34.8 for a week. Long-term fix is still HA control plane (beads code-n0ow); these are the bridge. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 08:40:11 +00:00
Viktor Barzin	4830230984	cluster-health #43 : tighten PVE thermal threshold to 65 C Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a signal a VM/workload is using too much CPU and warrants investigation. Previous thresholds were calibrated to the hardware's TjMax (75/83 C) — that was too lax, since cluster-load-driven elevation arrives a long time before throttling. The 65 C cutoff matches the live Prometheus baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the session-observed correlation: above 65 C means the cluster is doing sustained work that should be looked at, even if hardware is still nowhere near its limit. Updated: PASS < 65 C (within 55-65 baseline) WARN 65-82 C (elevated; check top kvm processes for the culprit) FAIL >= 83 C (at/above TjMax — throttling imminent) Verified live: 67 C now WARN (was PASS under the 75 C threshold).	2026-05-22 14:09:08 +00:00
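The threshold change in the commit above can be illustrated with a small classifier. This is a sketch of the PASS/WARN/FAIL bands only; the function name and its parameters are hypothetical, and the real check reads temperatures over SSH, which is not modelled.

```python
def classify_package_temp(temp_c: float, warn_at: float = 65, fail_at: float = 83) -> str:
    """Map a CPU package temperature (deg C) to PASS / WARN / FAIL.

    Defaults follow the tightened bands: PASS < 65, WARN 65-82, FAIL >= 83.
    """
    if temp_c >= fail_at:
        return "FAIL"
    if temp_c >= warn_at:
        return "WARN"
    return "PASS"


# The live reading quoted in the commit: WARN now, PASS under the old 75 C cutoff.
print(classify_package_temp(67))              # WARN
print(classify_package_temp(67, warn_at=75))  # PASS
```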
Viktor Barzin	8228171104	cluster-health: add checks 43 + 44 (PVE host thermals + load) Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL via the standard healthcheck output + JSON. They run alongside the existing 42 checks and surface the same alerts the 2026-05-20/21 optimization session had to gather by hand. #43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip. Thresholds tuned to the live TjMax=83 / Tcrit=93: PASS < 75 °C package WARN 75-82 °C (approaching max, action time) FAIL >= 83 °C (at/above TjMax, throttling imminent) Reports hottest core label too so a single hot core doesn't hide in the package average. #44 PVE Host Load — load avg vs 44-thread capacity Reads /proc/loadavg, compares 5-min to thread count (44): PASS load_5 < 30 (< 70% threads busy) WARN 30-37 (oversubscribed but not saturating) FAIL >= 38 (~85%+ threads busy — scheduler saturation) Uses 5-min so brief work spikes don't false-fail. Both gracefully WARN-degrade if SSH BatchMode fails, matching the existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped 42 -> 44 and the dispatcher updated.	2026-05-22 09:55:11 +00:00
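Check #44 as described above reduces to parsing the 5-minute field of /proc/loadavg and comparing it to fixed bands. A minimal sketch, assuming the standard /proc/loadavg layout (1-, 5- and 15-minute averages as the first three fields); the function names are hypothetical and the SSH transport is left out.

```python
def parse_load5(loadavg_text: str) -> float:
    """Return the 5-minute load average from the contents of /proc/loadavg."""
    fields = loadavg_text.split()
    return float(fields[1])  # fields: 1-min, 5-min, 15-min, running/total, last pid


def classify_load(load_5: float) -> str:
    """PASS < 30, WARN from 30 up to (but not including) 38, FAIL >= 38."""
    if load_5 >= 38:
        return "FAIL"
    if load_5 >= 30:
        return "WARN"
    return "PASS"
```

The commit lists WARN as 30-37; values between 37 and 38 are treated as WARN here, which is an assumption about the intended boundary.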
Viktor Barzin	2dc7e001bd	k8s-version-upgrade: retry kubeadm apply on static-pod-hash timeout kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap to be picked up by the kubelet (it polls the pod's `kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted master with apiserver-to-kubelet status sync lagging, that 5min isn't enough — kubeadm declares the upgrade failed and rolls back. The thing is: the etcd container HAS already been swapped to the new image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0 when this fires). kubeadm's check is just slow to notice. The 2nd attempt sees etcd already on target, skips it, and proceeds cleanly. Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between. Worker phase doesn't need this — `kubeadm upgrade node` has no static-pod-hash waits. Today's autonomous-pipeline session: master phase Failed at 5m on attempt #1 with this exact error, retried, hit same timeout, gave up (backoffLimit=1). The wrapper turns this from a fatal pipeline halt into a "wait a bit, try again" that usually completes on attempt #2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 09:32:29 +00:00
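The 3-attempt wrapper with a 30s pause described above follows a common bounded-retry shape. The sketch below shows that shape in Python, not the shell wrapper itself: `retry` and its parameters are hypothetical names, and the action is any callable returning True on success (in the real pipeline that would be the `kubeadm upgrade apply` invocation).

```python
import time
from typing import Callable


def retry(action: Callable[[], bool], attempts: int = 3,
          delay_s: float = 30.0, sleep: Callable[[float], None] = time.sleep) -> int:
    """Run action until it succeeds; return the successful attempt number, or 0."""
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # no pause after the final failure
    return 0
```

Injecting `sleep` keeps the loop testable without waiting out the real delay.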
Viktor Barzin	7a1751a668	upgrade-state: filter transient registry digest-check errors Keel polls ~175 image manifests hourly against public registries. Transient i/o timeouts and registry 5xx responses are inherent at that scale and auto-recover on the next poll, but they were tripping the Apps row into ⚠ attn — pure noise. Extend benign_re to cover: - failed to check digest + (i/o timeout \| connection refused \| connection reset \| context deadline exceeded \| TLS handshake timeout \| no such host \| EOF) - failed to check digest + non-successful response (status=5xx) Real actionable digest-check failures (HTTP 401 auth, 404 removed tag) still surface. Persistent registry-side 5xx is owned by the registry's own monitoring (forgejo-integrity-probe + RegistryCatalogInaccessible), not by Keel logs. Tested locally: Apps row flips from ⚠ attn → ✓ healthy after the filter is in place; remaining errors-line drops to "(none in last 24h)". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 22:06:21 +00:00
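The benign-error filter described above can be approximated with one regular expression. This is a sketch, not the script's actual `benign_re`: the log line shapes in the usage below are invented for illustration, and matching 5xx as the literal text `status=5` followed by two digits is an assumption about how Keel prints the status.

```python
import re

# Transient causes named in the commit, only when tied to a digest check.
BENIGN_DIGEST_RE = re.compile(
    r"failed to check digest.*("
    r"i/o timeout|connection refused|connection reset|"
    r"context deadline exceeded|TLS handshake timeout|no such host|EOF|"
    r"status=5\d\d)"
)


def is_benign(line: str) -> bool:
    """True if the log line is a transient digest-check failure."""
    return bool(BENIGN_DIGEST_RE.search(line))


print(is_benign('msg="failed to check digest" error="dial tcp: i/o timeout"'))  # True
print(is_benign('msg="failed to check digest" error="response (status=401)"'))  # False
```

Auth (401) and removed-tag (404) failures fall through the filter and still surface, as the commit intends.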
Viktor Barzin	34c1d64a88	upgrade-state: suppress known-benign Keel slack-bot-not-configured noise Keel 1.2.0 registers a Slack socket-mode bot whenever SLACK_BOT_TOKEN is set, then fails because we don't supply an `xapp-` app-level token: bot.slack.Configure(): SLACK_APP_TOKEN must have the prefix "xapp-". bot.Run(): can not get configuration for bot [slack] We don't want the interactive bot — opt-out auto-update + no approval flow (see stacks/keel/main.tf comment). The Slack NOTIFICATION sender works independently and continues posting rollout messages to #general fine. But /upgrade-state's broad `grep level=error` was counting these as real errors → ⚠ on the Apps row every run. Add a small skip-pattern list so the two recurring benign lines drop out; any new genuine Keel error still shows. Reuses `bot.Run()` + `SLACK_APP_TOKEN must have the prev?if\|prefix` (typo in Keel's actual log message preserved as alternation). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 19:45:40 +00:00
Viktor Barzin	9a06a76883	k8s-version-upgrade: switch detection cron from weekly to daily Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC, still outside kured's 02:00-06:00 London window). Concurrency is bounded by Forbid + deterministic job-name idempotency (the detection job exits early if a preflight Job for the same target already exists), so back-to-back days can't pile up parallel runs. - stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment - scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label to "(daily cron)" - .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 18:29:08 +00:00
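The renamed helper's behaviour (next 12:00 UTC, rendered like "Tue 2026-05-19 12:00 UTC") can be sketched as follows. This is a Python illustration of the computation, not the shell function in scripts/upgrade_state.sh; it assumes a timezone-aware input and the default C locale for the weekday abbreviation.

```python
from datetime import datetime, timedelta, timezone


def next_daily_noon_utc(now: datetime) -> str:
    """Return the next 12:00 UTC strictly after `now` (which must be tz-aware)."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=12, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's noon has already passed
    return candidate.strftime("%a %Y-%m-%d 12:00 UTC")


# Using this commit's own timestamp as "now":
print(next_daily_noon_utc(datetime(2026, 5, 18, 18, 29, 8, tzinfo=timezone.utc)))
# Tue 2026-05-19 12:00 UTC
```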
Viktor Barzin	9e045e2c16	upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit Three autonomous-upgrade pipelines run independently — Keel for apps (hourly registry polling), unattended-upgrades+kured for OS, and the k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there was no single place to see whether each was healthy, what's pending, or whether anything's stuck. The /upgrade-state skill collapses the state of all three into one table you can run before each Sunday's k8s-version-check fires. - stacks/keel/main.tf: add Prometheus pod-annotation scrape on container port 9300. Surfaces pending_approvals, poll_trigger_tracked_images, and registries_scanned_total{image} so the skill has a real timeseries (also opens the door to a future "pending_approvals > 0 for 24h" alert). - scripts/upgrade_state.sh: collector + renderer. Three-row table (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2. SSH fan-out (parallel subshells) to all five nodes for apt state + reboot-required + uu log; Prometheus query for Keel; Pushgateway parse for k8s_upgrade_* gauges. Read-only. - .claude/skills/upgrade-state/SKILL.md: hardlinked to ~/.claude/skills/upgrade-state/SKILL.md so the skill is discoverable from both monorepo-rooted and global sessions. Verification: ran the script, stress-tested the ✗ stalled path by pushing in_flight=1 + started_timestamp=-100min to Pushgateway and resetting after — script correctly raised ✗ and exit 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 10:50:43 +00:00
Viktor Barzin	545cd2b854	healthcheck: probe uptime-kuma via internal Service (port-forward), not public URL The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO handshake the uptime-kuma-api library uses, and the library can't complete the OAuth flow, so every healthcheck reported "Connection failed" even though the pod was healthy and serving 225 monitors. Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the uptime-kuma namespace for the duration of the check, connect the library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the port-forward on the way out. The disown is to suppress bash's "Killed" job notification on stderr, which corrupted stdout when stderr was merged for JSON parsing. Verified end-to-end: healthcheck now reports the real signal — "external down(3): www, xray-vless, hermes-agent" — the same 3 Cloudflare-facing endpoints flagging in the uptime-kuma logs.	2026-05-11 20:02:57 +00:00
Viktor Barzin	0712a1b659	infra/scripts/tg: enforce ingress_factory auth-comment convention Every `tg plan/apply/destroy/refresh` now runs `scripts/check-ingress-auth-comments.py` against the current stack before invoking terragrunt. The check fails closed if any `auth = "app"` or `auth = "none"` line in the stack's .tf files lacks an immediately-preceding `# auth = "<tier>": ...` comment documenting what gates the app (for "app") or why the endpoint is intentionally public (for "none"). Why tg-level (not git pre-commit): tg is the universal entry point for all infra changes. CI runs it, headless agents run it, humans run it. A pre-commit hook only catches the human path. Wiring the check into tg means the anti-exposure guard fires regardless of who or what is invoking terragrunt. Stack-scoped: each stack documents itself the next time it's edited. The 30+ existing `auth = "none"` stacks that predate this guard are not blocked from operating today; they'll need the comment added the next time someone runs `tg plan` on them — at which point the gate forces a conscious "yes, this is intentional" moment before any state change can land. Skipped on: init, fmt, validate, output, etc. — anything that doesn't read or write infra state.	2026-05-11 19:18:27 +00:00
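The comment convention enforced above can be illustrated with a short checker. This is a sketch, not the contents of scripts/check-ingress-auth-comments.py: the function name is hypothetical, and requiring the comment's tier to match the `auth` value on the next line is an assumption drawn from the `# auth = "<tier>": ...` form quoted in the commit.

```python
import re

AUTH_LINE = re.compile(r'^\s*auth\s*=\s*"(app|none)"')


def missing_auth_comments(tf_text: str) -> list[int]:
    """Return 1-based line numbers of auth lines lacking a documenting comment."""
    lines = tf_text.splitlines()
    missing = []
    for i, line in enumerate(lines):
        match = AUTH_LINE.match(line)
        if not match:
            continue
        tier = match.group(1)
        prev = lines[i - 1].strip() if i > 0 else ""
        # The comment must sit on the immediately preceding line and say something.
        if not re.match(rf'#\s*auth\s*=\s*"{tier}":\s*\S', prev):
            missing.append(i + 1)
    return missing
```

A wrapper that fails closed would exit non-zero whenever the returned list is non-empty, before terragrunt is invoked.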
Viktor Barzin	2f0e8c88a9	healthcheck: tune noise filters + nvidia-exporter auth=none Six tuning changes to cluster_healthcheck.sh so PASS sections actually reflect "nothing to act on": 1. prometheus_alerts: only count severity=warning\|critical. Info-level alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the alert rule itself sets severity; the script should respect it. 2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the Lets Encrypt wildcard renews weekly; <14d is the only window where human attention is genuinely useful. 3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/ event/image/update domains (transient by design), skip friendly names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and only count entities whose last_changed > 24h. Was 431/1470, most of which were "phone in standby" noise. 4. ha_automations: only flag DISABLED automations as abandoned if they've also been untouched (last_changed) for >180 days; raise stale threshold 30d → 180d. Was flagging seasonal/holiday-only automations as broken. 5. problematic_pods + evicted_pods: exclude pods owned by Jobs. CronJob retry leftovers (Error/Failed phase pods that K8s keeps around for log inspection) aren't problematic at the cluster level. 6. uptime_kuma: retry the WebSocket login 3x with backoff. Single- shot failures were a recurring false-positive even though the service was healthy. Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll /metrics and got 302'd to Authentik like the idrac/snmp ones did. Same fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 22:27:39 +00:00
Viktor Barzin	a58d777059	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-10 19:07:42 +00:00
Viktor Barzin	c647791774	scripts: timeout rsync + sqlite calls in daily-backup Per-PVC rsync had no timeout, so any single hung PVC (e.g. on a corrupted snapshot or a sqlite held open by a writer) blocked the whole script until systemd's 4h TimeoutStartSec kicked in, leaving every later PVC silently unbacked. Today's run hung on mailserver/roundcubemail-enigma-encrypted at 05:09 and didn't recover — hence WeeklyBackupFailing alert. Now: - rsync per PVC: timeout 30 min, exit 124 logged separately - sqlite3 per database: timeout 5 min - /etc/pve rsync: timeout 5 min Each timed-out PVC bumps PVC_FAIL but the loop keeps moving. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 18:39:07 +00:00
Viktor Barzin	64c71615e8	scripts: cluster_healthcheck defaults to ~/.kube/config The previous default of $(pwd)/config required running the script from the infra/ directory or always passing --kubeconfig. From a parent shell or any other working directory, the lookup hit a non-existent file and kubectl returned a stale-token error, masking real check results. Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to $(pwd)/config for backwards compatibility. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 18:12:40 +00:00
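The lookup order described above ($KUBECONFIG, then ~/.kube/config, then $(pwd)/config) can be sketched as a pure function. The name `resolve_kubeconfig` is hypothetical, and checking that ~/.kube/config exists before falling back is an assumption; the commit only states the order.

```python
import os
from typing import Callable, Mapping


def resolve_kubeconfig(env: Mapping[str, str], home: str, cwd: str,
                       exists: Callable[[str], bool] = os.path.exists) -> str:
    """Pick a kubeconfig path: $KUBECONFIG, then ~/.kube/config, then ./config."""
    if env.get("KUBECONFIG"):
        return env["KUBECONFIG"]
    home_cfg = os.path.join(home, ".kube", "config")
    if exists(home_cfg):
        return home_cfg
    return os.path.join(cwd, "config")  # legacy default, kept for compatibility
```

Called as `resolve_kubeconfig(os.environ, os.path.expanduser("~"), os.getcwd())`, it no longer depends on the working directory unless both earlier options are absent.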
Viktor Barzin	cfe969fe43	backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr 30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount, which blocked the next run from completing — root cause of the WeeklyBackupStale alert going silent (the metric never reached its end-of-script push). Fixes: - TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting the wall during week 18 runs) - Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as belt-and-braces for any inherited stuck state from a prior crashed run - TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of the alert going blind on systemd kills - pfsense metric pushed in BOTH success and failure paths (was only on success; any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert threshold expired) Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to /srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end: 3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql image is stripped (no curl/wget/python) — switched to docker.io/library/postgres matching the dbaas/postgresql-backup pattern with apt-installed curl. Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed backup_weekly_last_success_timestamp but the script pushes daily_backup_last_run_timestamp). Updated to match what's actually emitted, and added a "default-covered" footnote to the Service Protection Matrix so the ~40 services with PVCs not enumerated in the table are no longer ambiguous. 
Manual PVE-host actions (out-of-band, not in TF): - unmounted 6 stacked snapshots from /tmp/pvc-mount - pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the loop got SIGTERMed against repeatedly, so prune kept failing) - created /srv/nfs/postiz-backup directory - triggered a one-shot daily-backup run with the new TimeoutStartSec to validate the fix end-to-end Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 17:41:04 +00:00
Viktor Barzin	02a6a955f5	[woodpecker] Programmatic Forgejo repo registration Earlier I claimed the OAuth Web UI flow was the only way to onboard new Forgejo repos in Woodpecker. That's wrong. Two parts to the actual workaround: 1. Woodpecker session JWTs are HS256 signed with the user's per-user `hash` column from the PG `users` table (NOT the global agent secret). Mint a session JWT for the Forgejo viktor user (id=2, forge_id=2), and you're authenticated as that user. 2. POST /api/repos?forge_remote_id=N as viktor → Woodpecker calls Forgejo with viktor's stored OAuth access_token to create the webhook + per-repo signing key. Works. The 500 I saw earlier was from POST'ing as ViktorBarzin (GitHub admin), whose user row has no Forgejo OAuth token — Woodpecker's forge-API call fails for that user, surfacing as a 500. scripts/woodpecker-register-forgejo-repo.sh wraps the whole flow: extract hash from PG → mint JWT → activate repo. Verified against viktor/{broker-sync,claude-agent-service,freedify,hmrc-sync} in this session — all activated cleanly. Also updated the runbook with the actual mechanism + the WOODPECKER_FORGE_TIMEOUT=30s tip (the real root cause of the 'context deadline exceeded' failures, NOT the v3.14 upgrade).	2026-05-07 23:33:26 +00:00


115 commits