infra

Author	SHA1	Message	Date
Viktor Barzin	c7ffbaa204	aiostreams: harden stream-probe + repair sources (RD-451 "few films" fix) Root cause of "barely serving films": Real-Debrid's May-2026 infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate new content), while degraded sources starved candidates. RD account + popular-title availability were healthy throughout (library 32/36 unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams). Runtime config (AIOStreams PG, applied via API — not in this diff): - Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title) and was silently dropping the bulk of its results at the 5s cutoff; Interstellar 430 -> 987 streams after the bump. - Removed MediaFusion preset: broken upstream ("Invalid configuration" -> 500 Internal Server Error), contributed 0 usable streams, only a dead [X] entry in every list. This diff (Terraform): - Harden aiostreams-stream-probe: test series AND movie paths, per-source breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count, success gated on Comet being alive. The old probe counted only Breaking Bad streams and stayed green while new-content playback was broken. - service-catalog: reflect source set + probe behaviour. [ci skip] — probe already applied via targeted `tg apply` + verified (series=378 movie=898 comet=206 errors=0 success=1); skipping the full servarr reconcile to avoid touching unrelated pre-existing drift (qbittorrent MetalLB annotation, tls_secret cert revert). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 07:21:42 +00:00
Viktor Barzin	4cdb9e1886	novelapp: switch Keel to semver (policy=major) now upstream tags are valid Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so drop the :latest+force+match-tag digest workaround and track semver properly: policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to climb to higher semver tags), image floor pinned to v1.1.3. Pull policy -> IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for the mutable :latest). Running v1.1.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 22:56:46 +00:00
Viktor Barzin	551412488b	apiserver: enable audit logging (low-write Metadata) + ship to Loki Resource changes/deletions are now attributable (the novelapp deletion this week was untraceable because apiserver audit was off). Low-write policy: drops reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into the kube-apiserver static-pod manifest + kubeadm-config (v1beta4 extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails /var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}. Root cause that had silently blocked this AND OIDC for weeks: a stray kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate static-pod manifest kubelet ran instead of the real one, dropping every flag added to the real manifest. Removed it. Runbook added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	3696ff5922	novelapp: track :latest by digest (Keel force+match-tag), adopt into TF state Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as `v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see past the highest parseable tag. :latest correctly points at the newest release, so switch to force + match-tag digest-tracking of :latest (Kyverno does not manage match-tag, contrary to the stale code comment). Imports the live Deployment (recreated out-of-band 2026-06-06) back into TF state; running image flipped to :latest -> now on v.1.1.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4d8b782df1	feat(trip-planner): app stack (Tier-1, CNPG, Slack-signed webhook ingress) Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database (static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m cpu request), ClusterIP service port 8080, and ingress_factory with auth=none (Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied; requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	7c12fbba95	monitoring/alloy: drop cosmetic calico-typha 'Endpoints deprecated' warning calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1 Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1 Endpoints API will essentially never be removed (clients keep working indefinitely), and even the latest Calico still watches Endpoints (projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact deprecation message), mirroring the mailserver drop. Real calico warnings/errors kept; reversible. Validated with alloy fmt (exit 0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4b13be6d48	dawarich: upgrade 1.6.1 -> 1.7.11 (removes RailsPulse, drops orphan tables) dawarich 1.6.1 shipped the RailsPulse perf-monitoring gem, which scheduled an hourly Sidekiq SummaryJob INDEPENDENT of its disabled flag; the job hit rails_pulse_routes (no primary key) and retry-looped, logging ~125 UnknownPrimaryKey lines/hr (found via Loki triage 2026-06-06). Upstream removed RailsPulse entirely in 1.7.x (commit a5172cc) with a DropRailsPulseTables migration; 1.7.11 is latest stable. Keel only auto-applies patch bumps within 1.6.x, so the minor jump is manual. Pre-upgrade pg_dump of dawarich (79.9MB) + dawarich_queue taken to devvm. The 5 rails_pulse_* tables are empty (feature never collected data), so cleanup is zero-data-risk; location data (tracks/points/visits/places) untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	8a3bbde38c	mailserver: silence mixed-TLS-directive warning + drop SMTP scanner noise from Loki Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error source, from the 2026-06-06 log triage): 1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file in our postfix-main.cf override were IGNORED and triggered postfix's 'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two legacy lines (functional no-op; chain_files already wins). Verified via live postconf. 2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen half-open drops, rate-limit-exceeded from unknown). Real delivery logs + real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so security posture is unchanged. Validated with 'alloy fmt' (exit 0). Reversible. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	de181a9afc	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	27211acda1	rybbit: recreate missing Postgres database via idempotent init Job rybbit's 'rybbit' PG database was missing from CNPG (the role survived a past cluster rebuild but the database did not), so the app's node-cron logged 'database "rybbit" does not exist' every minute (found via Loki 2026-06-06). Created the DB manually to restore service (app auto-migrated 11 tables); this adds a self-contained init Job so the DB is recreated on any future rebuild -- connects as the rybbit role (has CREATEDB) using the existing rybbit-secrets password, idempotent CREATE DATABASE if absent. Deployment now depends_on the job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	9529eedfe0	docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip] Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to return 200 instead of proxying to the scaled-to-0 poison-fountain. - security.md Layer 1 + tarpit description + troubleshooting (fix stale stacks/platform path -> traefik stack; drop misleading restart-poison-fountain step). - .claude/CLAUDE.md: add matrix to PG rotation list; document that startup-read secret consumers need a Reloader annotation (matrix root cause, found via Loki 2026-06-05). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	9ad7756a94	traefik: make bot-block-proxy a clean no-op while poison-fountain is at 0 bot-block-proxy is the forward-auth target for the ai-bot-block middleware (applied to every anti-AI ingress). It proxied /auth to the poison-fountain bot trap with error_page 5xx=200 fail-open. But poison-fountain is intentionally scaled to 0, so proxy_pass only ever failed and fell open to '200 allowed' -- while logging ~51k errors/hr (the #1 Loki source once pod logs began shipping 2026-06-05) and paying up to 100ms connect-timeout per authed request. Short-circuit /auth to 'return 200 "allowed"' directly (drop the upstream + proxy_pass + fallback). Identical effective behaviour (allow-all), no upstream attempt, no noise, no latency. Reversible: restore the upstream + proxy_pass and scale poison-fountain up. Also add the missing configmap.reloader.stakater.com/reload annotation so openresty picks up ConfigMap changes (it does not hot-reload on its own -- the root reason stale config ran for days). replicas stays 2: critical-path forward-auth target (anti-AI ingresses fail closed if it is down), so HA is retained though each request is now trivial. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	d70a99dc48	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	d661d074ef	matrix: auto-reload Synapse on DB credential rotation (Reloader) Synapse injects the Postgres password into homeserver.yaml only at startup (inject-db-password initContainer). matrix-db-creds is rotated by Vault via ESO (15m refresh), so each rotation left the running pod with a stale password and Synapse DB auth failed silently until a manual rollout restart. Found today via Loki: ~12.9k/hr 'password authentication failed for user matrix' lines; secret password verified working against the DB while the 10-day-old pod held the pre-rotation value. Add the explicit secret.reloader.stakater.com/reload annotation so Reloader rolls the deployment whenever the secret changes (explicit form, not auto/search, because the secret is referenced only in an initContainer env var). Live pod already restarted to restore service; this prevents recurrence on the next rotation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	e7ece3eaf9	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
root	02366103ef	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	d808694af4	docs(storage): record harden-half shipped (orphan cleanup + ghost-reconcile) 2a orphan cleanup (67 Released PVs + 475 LVs removed, VG pve 997->~410) + 2b csi-ghost-reconcile CronJob done — ghost-disk doom loop closed by construction, beads code-dfjn retireable. Cap kept at 28 (lowering would reverse the 2026-05-25 eviction-cascade post-mortem fix). Phase-1: insta2spotify migrated (noted its 3.26GB image re-pull blip on node reschedule). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:39:36 +00:00
Viktor Barzin	1b9d4f1233	storage: migrate insta2spotify off proxmox-lvm to NFS (LUN relief, Phase 1) Config-only PVC (no embedded DB), preflighted. Frees one proxmox-csi slot. NB: pod reschedule re-pulled the 3.26GB backend image (~6min stall) — large-image services incur a pull-delay blip when migration moves them to a fresh node. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:38:01 +00:00
Viktor Barzin	355ca3ee91	proxmox-csi: auto-reconcile CronJob to detach ghost disks (code-dfjn prevention) Closes the ghost-disk doom loop by construction (failed detach -> orphan scsiN with no VolumeAttachment -> invisible oversubscription -> query-pci wedge). Every 15min csi-ghost-reconcile compares each worker VM's real scsi disks (Proxmox API) vs k8s VolumeAttachments and safely detaches ghosts (PUT .../config delete=scsiN -> frees the LUN slot, retains the LV). - detection mirrors cluster-health check #47 - SAFETY: only vm-9999-pvc scsi with no matching VA; 60s re-confirm; per-run cap 5 - scoped CSI API token (VM.Config.Disk), not root SSH; k8s API via injected ClusterIP - verified live: read 66 VAs, 0 ghosts, no false positives - pushes csi_ghosts_detected/detached to Pushgateway Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:25:36 +00:00
Viktor Barzin	e311cbe103	chore(modules): remove vestigial audiblez-web copy + fix glossary note [ci skip] modules/kubernetes/ebook2audiobook/ held a tracked copy of the audiblez-web app source (24 files), sourced by no stack and built by no CI — audiblez-web is GHA-built from its own repo. Bulk-swept in 2026-04-15; removed. Also corrected CONTEXT.md: the "vestigial per-app dirs (immich/, ollama/, ...)" note was wrong — those were untracked local macOS cruft (._main.tf AppleDouble turds), never in the repo; cleaned from the working tree. modules/kubernetes/ now holds exactly the four factory modules (ingress_factory, nfs_volume, anubis_instance, setup_tls_secret). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:38:13 +00:00
Viktor Barzin	a42f4f7b26	trek: trial-deploy TREK group-trip planner behind Authentik (solo eval) Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment trial to evaluate the self-hosted group-trip use case before building a custom app. Solo, single shared instance, Authentik forward-auth. - stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel), service 80->3000, ingress_factory auth=required + proxied DNS at trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data + uploads) -- encrypted per the sensitive-data rule and to avoid the SQLite-over-NFS locking hazard. - Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC, bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO). - kyverno: add mauriceboe/* to require-trusted-registries allowlist (the policy is Enforce since 2026-05-19 -- also fixed the stale "stays in Audit" header comment that said otherwise and misled the deploy). - Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll companion deferred per solo-trial scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:30:07 +00:00
Viktor Barzin	63182730f9	docs(storage): record Wave-2 NFS migration + harden-proxmox-csi decision (option 1) Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS (tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots. - storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history, updated cap-relief levers - topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the decision doc Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:15:21 +00:00
Viktor Barzin	a0b34750ee	storage: migrate hackmd uploads off proxmox-lvm-encrypted to NFS (LUN-cap relief) codimd is MySQL-backed; this PVC holds only pasted image uploads (subPath hackmd, 4.5M) — no embedded DB, NFS-safe. Drops LUKS-at-rest for these low-sensitivity images (accepted). Frees one proxmox-csi SCSI-LUN slot on node6. - swap hackmd-data-encrypted -> nfs_volume module (subPath preserved) - uploads copied + verified (20 files, HTTP 200, codimd listening) - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:11:31 +00:00
Viktor Barzin	e35d693972	storage: migrate send off proxmox-lvm to NFS (LUN-cap relief) Send (timvisee/send) stores encrypted upload blobs on disk with metadata in Redis — no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot on node2. - swap send-data-proxmox -> nfs_volume module - blobs copied + verified (273M, 22 entries, HTTP 200 on NFS) - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:04:37 +00:00
Viktor Barzin	c24b4a21d8	docs(architecture): fix stale 5-node claim -> 7 nodes (k8s-node1..6) [ci skip] Cluster grew to 7 nodes (k8s-master + node1..6; node5/6 added ~10d ago) but several docs still said "5 nodes". Corrected with live specs: - overview.md: 7-node enumeration; node1 is 16c/48GB (doc wrongly said 32GB), node2-6 are 8c/32GB general workers - compute.md: "5-node" -> "7-node" cluster description - dns.md: NodeLocal DNSCache DaemonSet "5 nodes" -> "7 nodes" - mailserver.md: HAProxy backend diagram "node1..4" -> "node1..6" Illustrative "0/5 nodes available" scheduler-error examples left as-is. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:03:58 +00:00
Viktor Barzin	bf3608052b	tripit: GEOCODER_PROVIDER=openmeteo for per-city itinerary weather Enables Open-Meteo geocoding of lodging addresses (results cached in the new geocode_cache table) so the itinerary can show per-city weather. Applied manually via scripts/tg apply. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:01:31 +00:00
Viktor Barzin	6eb683b6e0	storage: migrate speedtest off proxmox-lvm to NFS (LUN-cap relief) speedtest-tracker is MySQL-backed (config dir = Laravel config + logs, no embedded DB), NFS-safe. Frees one proxmox-csi SCSI-LUN slot. - swap speedtest-config-proxmox -> nfs_volume module - config copied + verified (HTTP 302->login,200); excluded 383MB laravel.log - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:59:56 +00:00
Viktor Barzin	060aefbd0b	storage: migrate changedetection off proxmox-lvm to NFS (LUN-cap relief) changedetection uses a file-based JSON datastore (url-watches.json + per-watch dirs + brotli snapshots) — no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of harden-proxmox-csi+NFS plan. - swap changedetection-data-proxmox -> nfs_volume module - data copied + verified (HTTP 200, 4 watches loaded); excluded 200MB test cruft - block PVC removed; block LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:55:03 +00:00
Viktor Barzin	52f5de905d	docs(context): freshen infra glossary (modules, tiers, new concepts) [ci skip] Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs): - Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app factory modules (never existed); name the real four (ingress_factory, nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local / flat distinction; flag vestigial modules/kubernetes/<app> dirs. - Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers); reserve "tier" for State tier + Namespace tier only. - Add local-path entry (cluster default SC; node-local footgun warning). - Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico. - Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC). - Fix node count 5 -> 7 (k8s-master + k8s-node1..6). Doc-sync (same commit per repo rules): - overview.md: replace fictional factory modules with the real shared modules + the flat/stack-local pattern. - .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision table + stale cross-reference (vault migrated off it 2026-04-25). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:34:49 +00:00
Viktor Barzin	aa948be581	storage: migrate tandoor off proxmox-lvm to NFS (LUN-cap relief) tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see docs/plans/2026-06-05-block-storage-harden-nfs-design.md. - swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC) - data copied + verified (HTTP 200 on NFS volume); block PVC removed - block LV retained per SC policy (orphan cleanup tracked in code-dfjn) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:34:47 +00:00
Viktor Barzin	febf12bddd	mail(tripit): send From: plans@viktorbarzin.me instead of spam@ tripit outbound (linked-email verification + trip-share invites) was sent From: spam@viktorbarzin.me. Switch the From to plans@viktorbarzin.me while keeping SMTP auth as spam@ (its password, unchanged). docker-mailserver SPOOF_PROTECTION (reject_sender_login_mismatch) requires the authed login to "own" the From; the @viktorbarzin.me catch-all does NOT grant that per-address, so add an explicit `plans@ -> spam@` virtual alias to authorize it (also keeps inbound plans@ routing to spam@ for the mail-ingest poller). tripit SMTP_FROM flips to plans@. Verified: sender-login probe (auth spam@, MAIL FROM plans@) now 250 (was 553); a real send from the tripit pod logs from=<plans@viktorbarzin.me> accepted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:41:08 +00:00
Viktor Barzin	bc33cd5ac4	monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook The Synology offsite backup target (/mnt/synology-backup, surfaced via the PVE host NFS mount) sits at ~94% by design and was firing NodeFilesystemFull continuously. Per user request, raise the threshold to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem rule, so this also loosens the warning on k8s node/system disks; BackupDiskFull (sda /mnt/backup) stays at 85%. Also adds docs/runbooks/synology-storage.md: how to assess Synology usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup), btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment (94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup candidates (redundant gphotos Takeout, old laptop VM images, archives). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:18:31 +00:00
Viktor Barzin	f526af694d	monitoring: snmp-idrac scrape 1m->30s — faster HA dashboard iDRAC refresh The ha-sofia R730 REST sensors (via prometheus-query.lan) + Grafana iDRAC panels were bound to the 1m snmp-idrac scrape. Halved to 30s so the dashboard-it Server view refreshes uniformly at 30s, matching the fan-control daemon's Pushgateway metrics. SNMP scrape ~3-4s; timeout 15s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:52:07 +00:00
Viktor Barzin	5b5b855528	monitoring(alloy): drop goflow2 + vpa logs from Loki to cut sdc write wear goflow2 emits ~8 GB/day of per-flow NetFlow JSON to stdout (~64% of all cluster log volume) but only its Prometheus aggregate metrics are used; vpa is ~1.3 GB/day of Goldilocks/VPA recommender chatter. Both are low-value and were landing in Loki (PVC on the contended sdc HDD). Drop them at the Alloy relabel. Reversible (remove the drop rule). Loki ingestion drops ~73%. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:44:47 +00:00
Viktor Barzin	dbe115910f	monitoring: add local-only prometheus-query.lan ingress for ha-sofia SNMP sensors ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage, fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120, ~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac scrape), scan_interval 30. This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` → `prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to `/api/v1/query` (read-only instant-query only — not the UI/admin/federation). ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from a REST sensor), so this mirrors the existing local-only `.lan` exporter ingresses HA already queries. The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`) was edited in place (auto-version-controlled by the HA version-control add-on; pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan` was added manually via the API — like the other `.lan` exporter hosts it is NOT auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me` records). Follow-up (already noted for the Loki sensor): extend that sync to manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now vestigial (HA no longer reads it). Verified: all 7 HA sensors report correct fresh values from Prometheus (fan 10800 rpm, CPU 62.0C, power 280W, PSU 230/240V). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:25:06 +00:00
Viktor Barzin	b7cb74f1b5	docs(monitoring): cluster log aggregation (Alloy fix) + Cluster Logs dashboard + HA sensors [ci skip] Document the 2026-06-05 cluster-wide log observability work: the Alloy local.file_match fix (loki.source.file doesn't expand globs) + stage.cri, the new "Cluster Logs" Grafana dashboard, the ha-sofia cluster-log-health REST sensors, and the loki.viktorbarzin.lan Technitium-record follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:15:57 +00:00
Viktor Barzin	7501c2be5d	monitoring(grafana): add professional "Cluster Logs" dashboard (Logs folder) Cluster-wide Loki log observability now that pod logs flow (Alloy fix). New dashboards/cluster-logs.json (Loki DS P8E80F9AEF21F6940): namespace/app/pod dropdowns + free-text regex search; stats (lines/errors/warns/active-ns), log-volume-by-namespace, error/warn rate, top-namespaces-by-errors, top-pods-by-errors, a filterable live-logs panel, and a second row for the node + rpi-sofia systemd journals (volume-by-level + error/warn journal panel). Error/warn use case-insensitive regex line-filters so they work regardless of level-label availability. New "Logs" Grafana folder. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:03:45 +00:00
Viktor Barzin	bb0099b747	monitoring(alloy): fix broken pod-log shipping (missing local.file_match) + parse CRI Cluster pod logs were NOT reaching Loki — only node/Pi journals were. Root cause: loki.source.file was fed the /var/log/pods/*<uid>/<container>/*.log glob directly from discovery.relabel, but loki.source.file does NOT expand globs, so it stat()'d the literal `*` path and shipped zero pod logs ("stat failed: no such file" for every pod). Per Grafana Alloy docs, a local.file_match component must expand the glob into concrete file targets first. Add it. Also add stage.cri {} so Loki stores clean messages + real timestamps instead of raw containerd CRI-prefixed lines. Fixes cluster-wide log observability (regression vs the working 2026-05-26 state). Ship-all-then-measure per the agreed plan; Alloy mem limits stay as the IO-storm safeguard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:57:44 +00:00
Viktor Barzin	6b1d23abbd	monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac module: ~3.7s/scrape at 1m. - snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM (coolingDeviceReading + location lookup) and an amperageProbeLocationName lookup so the "System Board Pwr Consumption" watts probe is label-selectable. - snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*). - Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes. - Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names; temps ÷10 (tenths-degC); DellStatus value-mappings updated. - Demote the Redfish exporter to a slow remnant: trim collectors to system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change. SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan + as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md, docs/architecture/monitoring.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:33:20 +00:00
Viktor Barzin	6442978f07	fan-control: merge Fan %/RPM dashboard cards + RPM estimate fallback [ci skip] The Fan % and Fan RPM sensor-graph cards had identical trend shapes (RPM ∝ %), so merge them into one "Fan speed" card: % trend (stable Pushgateway sensor) + RPM beneath. RPM reads sensor.r730_fan_speed (Redfish) but falls back to the calibrated estimate (rpm≈160·%+1520, shown with a "~" prefix) when that sensor is unavailable — it blips out intermittently, so the readout never goes blank. The Override readout likewise shows both "% · rpm". HA-side only; daemon unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:31:32 +00:00
Viktor Barzin	722a1c9b42	docs(monitoring): document rpi-sofia off-box monitoring + log shipping [ci skip] Add an "External host: rpi-sofia" section to docs/architecture/monitoring.md covering the 2026-06-05 setup: node_exporter + vcgencmd textfile metrics; the full-journal promtail->Loki shipping (job=rpi-sofia-journal — kernel/dmesg via the (none) unit + all systemd units, labeled by unit/level); the RPi Sofia alert group; the dashboard; and the systemd watchdog. Notes the SD-card root cause and that the Pi-side config is hand-managed + backed up off-box. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:25:20 +00:00
Viktor Barzin	405ca79531	fan-control: Override slider now tracks live fan speed while unlocked [ci skip] The dashboard Override slider used to show a stale stored % (e.g. 5%) while the fans were actually at ~53%, which was confusing. Add automation.r730_fan_override_track_live_speed_while_unlocked: while unlocked it mirrors the live commanded % (sensor.r730_fan_control_target) into the Override, so it always shows the actual absolute fan speed and updates as the fan moves. While locked it stops tracking and is the user's editable setpoint. The readout under the slider now shows the live "% · rpm" (actual, not an estimate). HA-side only; daemon unchanged. Verified live: slider forced to 10 → synced to 58 target. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:20:38 +00:00
Viktor Barzin	ddc8bfa8cf	tripit: remove Gmail-scrape ingest-mail CronJob; plans@ becomes sole channel The Gmail All-Mail scrape (tripit-ingest-mail) is retired — Viktor only wants mail ingested when forwarded to plans@viktorbarzin.me, and only from actual users. Dropped the ingest-mail CronJob and removed MAIL_DEFAULT_OWNER_EMAIL from ingest-plans (the app now ignores mail from non-users instead of filing it under the default owner). ingest-plans already carries EMAIL_PROVIDER/SMTP_* for the new sender notifications. Service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:50:53 +00:00
Viktor Barzin	5381beb3b7	monitoring: fix ingress auth-comment guard for loki-write-ingress scripts/tg's check-ingress-auth-comments.py requires the `# auth = "none":` rationale comment DIRECTLY above the `auth = "none"` line; mine was in the module's top block comment, so the guard aborted the whole monitoring apply (this is why the rpi-sofia scrape/alerts/ingress/dashboard never landed on the first push). Move the rationale to the required position. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:36:43 +00:00
Viktor Barzin	c059405632	fan-control: simplify HA dashboard + Lock = freeze-current/algo-off [ci skip] The dashboard-it Server → Fans view is now minimal: fan speed (% + RPM), an Override % slider, and a Lock toggle. Lock now means "freeze the current speed, algorithm off" — a new automation (r730_fan_lock_freeze_current_speed_resume_algo) snapshots the live target % into Override and sets mode=manual on lock-ON, and mode=auto on lock-OFF. The host daemon is unchanged (the toggle just drives the mode it already reads). cool/quiet stay reachable via the entity but are off the simplified view; the 60-min auto-revert is kept as a dormant safety net. Verified live: lock ON → mode=manual + Override captured the live 60%; lock OFF → auto. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:27:46 +00:00
Viktor Barzin	f9376a36ff	monitoring: wire rpi-sofia (Sofia Pi) into Prometheus/Loki/alerts The Sofia Raspberry Pi hung this morning (network wedged ~10:13, HA sensors dead, and its local journal had been silent since Apr 27 — a 2017 SD card intermittently flipping the rootfs read-only). Nothing was captured because logging lived only on the failing card. Ship telemetry off-box so the next failure is diagnosable centrally: - Prometheus scrape job `rpi-sofia` (rpi-sofia.viktorbarzin.lan:9100) — node_exporter + a vcgencmd textfile collector on the Pi exporting under-voltage/throttle/SoC-temp as rpi_* metrics. - Alert group "RPi Sofia": node_exporter Down, rootfs ReadOnly (the exact SD-failure signature), Under-voltage since boot, High SoC temp. - LAN-gated Loki write ingress (loki.viktorbarzin.lan) so the Pi's promtail can push its journal — Loki was ClusterIP-only. - Grafana dashboard "RPi Sofia" (Hardware): status, undervoltage/throttle, temp, load, memory, disk, network. The Pi separately got a systemd hardware watchdog (auto-reboot on a hard hang; today it stayed down ~5h until a manual power-cycle). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:11:40 +00:00
Viktor Barzin	5b96b841fc	f1-stream: right-size memory 1Gi -> 256Mi (CDP-only, no bundled Chromium) Actual usage ~116Mi, Goldilocks/VPA upperBound ~185Mi (incl. live races over 99d). The 1Gi reservation was sized for the old bundled-Chromium image; the app now drives the remote chrome-service over CDP. 256Mi (upperBound x~1.3, bursty) requests=limits per convention; cpu request 100m -> 50m (VPA upperBound 49m). Frees ~768Mi of reserved cluster memory.	2026-06-05 12:57:22 +00:00
Viktor Barzin	d17b25cdcc	fan-control: document the HA Fan Lock (opt out of 60-min auto-revert) [ci skip] A manual/cool/quiet override in HA auto-reverts to `auto` after 60 min. Add a Fan Lock (`input_boolean.r730_fan_lock`) that gates that automation so a deliberate override persists, with a visible "🔒 FAN CONTROL LOCKED" banner on the dashboard-it Server view so it isn't forgotten. The automation re-checks the lock after the hour (locking mid-countdown cancels the revert) and the 83 °C ceiling still wins. HA-side only (helper + automation + dashboard live on ha-sofia, auto-git-tracked there); these docs are the infra-repo record. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 12:22:00 +00:00
Viktor Barzin	51456a96f6	fan-control: estimate + expose fan power (fan_watts_est) The iDRAC reports only total DCMI watts + RPM (no per-fan power), so add a cube-law fan-power estimate: fan_W ~= 0.0205*(RPM/1000)^3, calibrated to the 2026-06-05 sweep (fits within ~3W; ~2W floor -> ~99W full). The daemon reads live RPM each loop and pushes pve_fan_control_fan_rpm + _fan_watts_est. Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the dashboard-it Server view, next to total power. 46 bash tests green; verified live (9120rpm -> ~15W est). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 11:10:27 +00:00
Viktor Barzin	324f2dc3bf	fan-control: continuous linear curve (replaces discrete step-bands) Replace the step-band fan curve with a continuous linear ramp — the bands flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis is the homelab standard; PID is overkill for this slow thermal loop. fan% now interpolates between env-tunable anchors: COOL 50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium) QUIET 68C/20% -> 83C/100% (near-silent until ~70C) Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold. 41 bash tests green; deployed + verified live (59C -> 49%, smooth). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 10:29:35 +00:00
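
The fan-control commits above (324f2dc3bf, 51456a96f6, 6442978f07) quote concrete formulas, so a small worked sketch may help check the numbers. This is not the daemon's code (the log says the daemon is tested with bash tests); it is a Python illustration under assumptions. The anchor temperatures and percentages, the 3% MIN_STEP, the 0.0205 cube-law coefficient, and the rpm≈160·%+1520 fit come from the commit messages. The function names, the clamping below the lower anchor, and the simple min-step rule are hypothetical, and the asymmetric hysteresis in `fc_decide` is not modelled because the log does not describe it.

```python
# Illustrative sketch only; names and clamping behaviour are assumptions.

# (temp_C, fan_%) anchor pairs as quoted in commit 324f2dc3bf.
COOL = ((50.0, 30.0), (83.0, 100.0))
QUIET = ((68.0, 20.0), (83.0, 100.0))
MIN_STEP = 3.0  # minimum % change worth applying (from the commit message)


def fan_percent(temp_c, anchors):
    """Linearly interpolate fan % between two anchors, clamped at both ends."""
    (t_lo, p_lo), (t_hi, p_hi) = anchors
    if temp_c <= t_lo:
        return p_lo  # assumption: hold the floor below the lower anchor
    if temp_c >= t_hi:
        return p_hi  # both profiles reach 100% at the 83C ceiling
    return p_lo + (temp_c - t_lo) * (p_hi - p_lo) / (t_hi - t_lo)


def apply_min_step(current_pct, target_pct, min_step=MIN_STEP):
    """Ignore target changes smaller than min_step to avoid flapping."""
    if abs(target_pct - current_pct) >= min_step:
        return target_pct
    return current_pct


def fan_watts_est(rpm):
    """Cube-law fan power estimate from commit 51456a96f6."""
    return 0.0205 * (rpm / 1000.0) ** 3


def rpm_est(pct):
    """Calibrated RPM fallback from commit 6442978f07."""
    return 160.0 * pct + 1520.0


if __name__ == "__main__":
    print(round(fan_percent(59, COOL)))    # commit reports 59C -> 49%
    print(round(fan_watts_est(9120), 1))   # commit reports 9120rpm -> ~15W
    print(rpm_est(47.5))
```

Under these assumptions the COOL slope is 70/33 ≈ 2.1 %/C, which matches the "~2.1%/C" and "~51% at the ~60C equilibrium" figures in the commit message.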

4053 commits