infra

Author	SHA1	Message	Date
Viktor Barzin	4cdb9e1886	novelapp: switch Keel to semver (policy=major) now upstream tags are valid Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so drop the :latest+force+match-tag digest workaround and track semver properly: policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to climb to higher semver tags), image floor pinned to v1.1.3. Pull policy -> IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for the mutable :latest). Running v1.1.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 22:56:46 +00:00
Viktor Barzin	551412488b	apiserver: enable audit logging (low-write Metadata) + ship to Loki Resource changes/deletions are now attributable (the novelapp deletion this week was untraceable because apiserver audit was off). Low-write policy: drops reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into the kube-apiserver static-pod manifest + kubeadm-config (v1beta4 extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails /var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}. Root cause that had silently blocked this AND OIDC for weeks: a stray kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate static-pod manifest kubelet ran instead of the real one, dropping every flag added to the real manifest. Removed it. Runbook added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	3696ff5922	novelapp: track :latest by digest (Keel force+match-tag), adopt into TF state Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as `v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see past the highest parseable tag. :latest correctly points at the newest release, so switch to force + match-tag digest-tracking of :latest (Kyverno does not manage match-tag, contrary to the stale code comment). Imports the live Deployment (recreated out-of-band 2026-06-06) back into TF state; running image flipped to :latest -> now on v.1.1.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4d8b782df1	feat(trip-planner): app stack (Tier-1, CNPG, Slack-signed webhook ingress) Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database (static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m cpu request), ClusterIP service port 8080, and ingress_factory with auth=none (Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied; requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	7c12fbba95	monitoring/alloy: drop cosmetic calico-typha 'Endpoints deprecated' warning calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1 Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1 Endpoints API will essentially never be removed (clients keep working indefinitely), and even the latest Calico still watches Endpoints (projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact deprecation message), mirroring the mailserver drop. Real calico warnings/errors kept; reversible. Validated with alloy fmt (exit 0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4b13be6d48	dawarich: upgrade 1.6.1 -> 1.7.11 (removes RailsPulse, drops orphan tables) dawarich 1.6.1 shipped the RailsPulse perf-monitoring gem, which scheduled an hourly Sidekiq SummaryJob INDEPENDENT of its disabled flag; the job hit rails_pulse_routes (no primary key) and retry-looped, logging ~125 UnknownPrimaryKey lines/hr (found via Loki triage 2026-06-06). Upstream removed RailsPulse entirely in 1.7.x (commit a5172cc) with a DropRailsPulseTables migration; 1.7.11 is latest stable. Keel only auto-applies patch bumps within 1.6.x, so the minor jump is manual. Pre-upgrade pg_dump of dawarich (79.9MB) + dawarich_queue taken to devvm. The 5 rails_pulse_* tables are empty (feature never collected data), so cleanup is zero-data-risk; location data (tracks/points/visits/places) untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	8a3bbde38c	mailserver: silence mixed-TLS-directive warning + drop SMTP scanner noise from Loki Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error source, from the 2026-06-06 log triage): 1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file in our postfix-main.cf override were IGNORED and triggered postfix's 'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two legacy lines (functional no-op; chain_files already wins). Verified via live postconf. 2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen half-open drops, rate-limit-exceeded from unknown). Real delivery logs + real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so security posture is unchanged. Validated with 'alloy fmt' (exit 0). Reversible. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	de181a9afc	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	27211acda1	rybbit: recreate missing Postgres database via idempotent init Job rybbit's 'rybbit' PG database was missing from CNPG (the role survived a past cluster rebuild but the database did not), so the app's node-cron logged 'database "rybbit" does not exist' every minute (found via Loki 2026-06-06). Created the DB manually to restore service (app auto-migrated 11 tables); this adds a self-contained init Job so the DB is recreated on any future rebuild -- connects as the rybbit role (has CREATEDB) using the existing rybbit-secrets password, idempotent CREATE DATABASE if absent. Deployment now depends_on the job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	9ad7756a94	traefik: make bot-block-proxy a clean no-op while poison-fountain is at 0 bot-block-proxy is the forward-auth target for the ai-bot-block middleware (applied to every anti-AI ingress). It proxied /auth to the poison-fountain bot trap with error_page 5xx=200 fail-open. But poison-fountain is intentionally scaled to 0, so proxy_pass only ever failed and fell open to '200 allowed' -- while logging ~51k errors/hr (the #1 Loki source once pod logs began shipping 2026-06-05) and paying up to 100ms connect-timeout per authed request. Short-circuit /auth to 'return 200 "allowed"' directly (drop the upstream + proxy_pass + fallback). Identical effective behaviour (allow-all), no upstream attempt, no noise, no latency. Reversible: restore the upstream + proxy_pass and scale poison-fountain up. Also add the missing configmap.reloader.stakater.com/reload annotation so openresty picks up ConfigMap changes (it does not hot-reload on its own -- the root reason stale config ran for days). replicas stays 2: critical-path forward-auth target (anti-AI ingresses fail closed if it is down), so HA is retained though each request is now trivial. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	d70a99dc48	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	d661d074ef	matrix: auto-reload Synapse on DB credential rotation (Reloader) Synapse injects the Postgres password into homeserver.yaml only at startup (inject-db-password initContainer). matrix-db-creds is rotated by Vault via ESO (15m refresh), so each rotation left the running pod with a stale password and Synapse DB auth failed silently until a manual rollout restart. Found today via Loki: ~12.9k/hr 'password authentication failed for user matrix' lines; secret password verified working against the DB while the 10-day-old pod held the pre-rotation value. Add the explicit secret.reloader.stakater.com/reload annotation so Reloader rolls the deployment whenever the secret changes (explicit form, not auto/search, because the secret is referenced only in an initContainer env var). Live pod already restarted to restore service; this prevents recurrence on the next rotation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	e7ece3eaf9	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
root	02366103ef	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	1b9d4f1233	storage: migrate insta2spotify off proxmox-lvm to NFS (LUN relief, Phase 1) Config-only PVC (no embedded DB), preflighted. Frees one proxmox-csi slot. NB: pod reschedule re-pulled the 3.26GB backend image (~6min stall) — large-image services incur a pull-delay blip when migration moves them to a fresh node. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:38:01 +00:00
Viktor Barzin	355ca3ee91	proxmox-csi: auto-reconcile CronJob to detach ghost disks (code-dfjn prevention) Closes the ghost-disk doom loop by construction (failed detach -> orphan scsiN with no VolumeAttachment -> invisible oversubscription -> query-pci wedge). Every 15min csi-ghost-reconcile compares each worker VM's real scsi disks (Proxmox API) vs k8s VolumeAttachments and safely detaches ghosts (PUT .../config delete=scsiN -> frees the LUN slot, retains the LV). - detection mirrors cluster-health check #47 - SAFETY: only vm-9999-pvc scsi with no matching VA; 60s re-confirm; per-run cap 5 - scoped CSI API token (VM.Config.Disk), not root SSH; k8s API via injected ClusterIP - verified live: read 66 VAs, 0 ghosts, no false positives - pushes csi_ghosts_detected/detached to Pushgateway Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:25:36 +00:00
Viktor Barzin	a42f4f7b26	trek: trial-deploy TREK group-trip planner behind Authentik (solo eval) Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment trial to evaluate the self-hosted group-trip use case before building a custom app. Solo, single shared instance, Authentik forward-auth. - stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel), service 80->3000, ingress_factory auth=required + proxied DNS at trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data + uploads) -- encrypted per the sensitive-data rule and to avoid the SQLite-over-NFS locking hazard. - Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC, bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO). - kyverno: add mauriceboe/* to require-trusted-registries allowlist (the policy is Enforce since 2026-05-19 -- also fixed the stale "stays in Audit" header comment that said otherwise and misled the deploy). - Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll companion deferred per solo-trial scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:30:07 +00:00
Viktor Barzin	a0b34750ee	storage: migrate hackmd uploads off proxmox-lvm-encrypted to NFS (LUN-cap relief) codimd is MySQL-backed; this PVC holds only pasted image uploads (subPath hackmd, 4.5M) — no embedded DB, NFS-safe. Drops LUKS-at-rest for these low-sensitivity images (accepted). Frees one proxmox-csi SCSI-LUN slot on node6. - swap hackmd-data-encrypted -> nfs_volume module (subPath preserved) - uploads copied + verified (20 files, HTTP 200, codimd listening) - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:11:31 +00:00
Viktor Barzin	e35d693972	storage: migrate send off proxmox-lvm to NFS (LUN-cap relief) Send (timvisee/send) stores encrypted upload blobs on disk with metadata in Redis — no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot on node2. - swap send-data-proxmox -> nfs_volume module - blobs copied + verified (273M, 22 entries, HTTP 200 on NFS) - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:04:37 +00:00
Viktor Barzin	bf3608052b	tripit: GEOCODER_PROVIDER=openmeteo for per-city itinerary weather Enables Open-Meteo geocoding of lodging addresses (results cached in the new geocode_cache table) so the itinerary can show per-city weather. Applied manually via scripts/tg apply. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:01:31 +00:00
Viktor Barzin	6eb683b6e0	storage: migrate speedtest off proxmox-lvm to NFS (LUN-cap relief) speedtest-tracker is MySQL-backed (config dir = Laravel config + logs, no embedded DB), NFS-safe. Frees one proxmox-csi SCSI-LUN slot. - swap speedtest-config-proxmox -> nfs_volume module - config copied + verified (HTTP 302->login,200); excluded 383MB laravel.log - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:59:56 +00:00
Viktor Barzin	060aefbd0b	storage: migrate changedetection off proxmox-lvm to NFS (LUN-cap relief) changedetection uses a file-based JSON datastore (url-watches.json + per-watch dirs + brotli snapshots) — no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of harden-proxmox-csi+NFS plan. - swap changedetection-data-proxmox -> nfs_volume module - data copied + verified (HTTP 200, 4 watches loaded); excluded 200MB test cruft - block PVC removed; block LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:55:03 +00:00
Viktor Barzin	aa948be581	storage: migrate tandoor off proxmox-lvm to NFS (LUN-cap relief) tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see docs/plans/2026-06-05-block-storage-harden-nfs-design.md. - swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC) - data copied + verified (HTTP 200 on NFS volume); block PVC removed - block LV retained per SC policy (orphan cleanup tracked in code-dfjn) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:34:47 +00:00
Viktor Barzin	febf12bddd	mail(tripit): send From: plans@viktorbarzin.me instead of spam@ tripit outbound (linked-email verification + trip-share invites) was sent From: spam@viktorbarzin.me. Switch the From to plans@viktorbarzin.me while keeping SMTP auth as spam@ (its password, unchanged). docker-mailserver SPOOF_PROTECTION (reject_sender_login_mismatch) requires the authed login to "own" the From; the @viktorbarzin.me catch-all does NOT grant that per-address, so add an explicit `plans@ -> spam@` virtual alias to authorize it (also keeps inbound plans@ routing to spam@ for the mail-ingest poller). tripit SMTP_FROM flips to plans@. Verified: sender-login probe (auth spam@, MAIL FROM plans@) now 250 (was 553); a real send from the tripit pod logs from=<plans@viktorbarzin.me> accepted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:41:08 +00:00
Viktor Barzin	bc33cd5ac4	monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook The Synology offsite backup target (/mnt/synology-backup, surfaced via the PVE host NFS mount) sits at ~94% by design and was firing NodeFilesystemFull continuously. Per user request, raise the threshold to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem rule, so this also loosens the warning on k8s node/system disks; BackupDiskFull (sda /mnt/backup) stays at 85%. Also adds docs/runbooks/synology-storage.md: how to assess Synology usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup), btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment (94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup candidates (redundant gphotos Takeout, old laptop VM images, archives). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:18:31 +00:00
Viktor Barzin	f526af694d	monitoring: snmp-idrac scrape 1m->30s — faster HA dashboard iDRAC refresh The ha-sofia R730 REST sensors (via prometheus-query.lan) + Grafana iDRAC panels were bound to the 1m snmp-idrac scrape. Halved to 30s so the dashboard-it Server view refreshes uniformly at 30s, matching the fan-control daemon's Pushgateway metrics. SNMP scrape ~3-4s; timeout 15s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:52:07 +00:00
Viktor Barzin	5b5b855528	monitoring(alloy): drop goflow2 + vpa logs from Loki to cut sdc write wear goflow2 emits ~8 GB/day of per-flow NetFlow JSON to stdout (~64% of all cluster log volume) but only its Prometheus aggregate metrics are used; vpa is ~1.3 GB/day of Goldilocks/VPA recommender chatter. Both are low-value and were landing in Loki (PVC on the contended sdc HDD). Drop them at the Alloy relabel. Reversible (remove the drop rule). Loki ingestion drops ~73%. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:44:47 +00:00
Viktor Barzin	dbe115910f	monitoring: add local-only prometheus-query.lan ingress for ha-sofia SNMP sensors ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage, fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120, ~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac scrape), scan_interval 30. This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` → `prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to `/api/v1/query` (read-only instant-query only — not the UI/admin/federation). ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from a REST sensor), so this mirrors the existing local-only `.lan` exporter ingresses HA already queries. The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`) was edited in place (auto-version-controlled by the HA version-control add-on; pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan` was added manually via the API — like the other `.lan` exporter hosts it is NOT auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me` records). Follow-up (already noted for the Loki sensor): extend that sync to manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now vestigial (HA no longer reads it). Verified: all 7 HA sensors report correct fresh values from Prometheus (fan 10800 rpm, CPU 62.0C, power 280W, PSU 230/240V). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:25:06 +00:00
Viktor Barzin	7501c2be5d	monitoring(grafana): add professional "Cluster Logs" dashboard (Logs folder) Cluster-wide Loki log observability now that pod logs flow (Alloy fix). New dashboards/cluster-logs.json (Loki DS P8E80F9AEF21F6940): namespace/app/pod dropdowns + free-text regex search; stats (lines/errors/warns/active-ns), log-volume-by-namespace, error/warn rate, top-namespaces-by-errors, top-pods-by-errors, a filterable live-logs panel, and a second row for the node + rpi-sofia systemd journals (volume-by-level + error/warn journal panel). Error/warn use case-insensitive regex line-filters so they work regardless of level-label availability. New "Logs" Grafana folder. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:03:45 +00:00
Viktor Barzin	bb0099b747	monitoring(alloy): fix broken pod-log shipping (missing local.file_match) + parse CRI Cluster pod logs were NOT reaching Loki — only node/Pi journals were. Root cause: loki.source.file was fed the /var/log/pods/<uid>/<container>/*.log glob directly from discovery.relabel, but loki.source.file does NOT expand globs, so it stat()'d the literal `*` path and shipped zero pod logs ("stat failed: no such file" for every pod). Per Grafana Alloy docs, a local.file_match component must expand the glob into concrete file targets first. Add it. Also add stage.cri {} so Loki stores clean messages + real timestamps instead of raw containerd CRI-prefixed lines. Fixes cluster-wide log observability (regression vs the working 2026-05-26 state). Ship-all-then-measure per the agreed plan; Alloy mem limits stay as the IO-storm safeguard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:57:44 +00:00
Viktor Barzin	6b1d23abbd	monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac module: ~3.7s/scrape at 1m. - snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM (coolingDeviceReading + location lookup) and an amperageProbeLocationName lookup so the "System Board Pwr Consumption" watts probe is label-selectable. - snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*). - Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes. - Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names; temps ÷10 (tenths-degC); DellStatus value-mappings updated. - Demote the Redfish exporter to a slow remnant: trim collectors to system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change. SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan + as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md, docs/architecture/monitoring.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:33:20 +00:00
Viktor Barzin	ddc8bfa8cf	tripit: remove Gmail-scrape ingest-mail CronJob; plans@ becomes sole channel The Gmail All-Mail scrape (tripit-ingest-mail) is retired — Viktor only wants mail ingested when forwarded to plans@viktorbarzin.me, and only from actual users. Dropped the ingest-mail CronJob and removed MAIL_DEFAULT_OWNER_EMAIL from ingest-plans (the app now ignores mail from non-users instead of filing it under the default owner). ingest-plans already carries EMAIL_PROVIDER/SMTP_* for the new sender notifications. Service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:50:53 +00:00
Viktor Barzin	5381beb3b7	monitoring: fix ingress auth-comment guard for loki-write-ingress scripts/tg's check-ingress-auth-comments.py requires the `# auth = "none":` rationale comment DIRECTLY above the `auth = "none"` line; mine was in the module's top block comment, so the guard aborted the whole monitoring apply (this is why the rpi-sofia scrape/alerts/ingress/dashboard never landed on the first push). Move the rationale to the required position. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:36:43 +00:00
Viktor Barzin	f9376a36ff	monitoring: wire rpi-sofia (Sofia Pi) into Prometheus/Loki/alerts The Sofia Raspberry Pi hung this morning (network wedged ~10:13, HA sensors dead, and its local journal had been silent since Apr 27 — a 2017 SD card intermittently flipping the rootfs read-only). Nothing was captured because logging lived only on the failing card. Ship telemetry off-box so the next failure is diagnosable centrally: - Prometheus scrape job `rpi-sofia` (rpi-sofia.viktorbarzin.lan:9100) — node_exporter + a vcgencmd textfile collector on the Pi exporting under-voltage/throttle/SoC-temp as rpi_* metrics. - Alert group "RPi Sofia": node_exporter Down, rootfs ReadOnly (the exact SD-failure signature), Under-voltage since boot, High SoC temp. - LAN-gated Loki write ingress (loki.viktorbarzin.lan) so the Pi's promtail can push its journal — Loki was ClusterIP-only. - Grafana dashboard "RPi Sofia" (Hardware): status, undervoltage/ throttle, temp, load, memory, disk, network. The Pi separately got a systemd hardware watchdog (auto-reboot on a hard hang; today it stayed down ~5h until a manual power-cycle). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:11:40 +00:00
Viktor Barzin	5b96b841fc	f1-stream: right-size memory 1Gi -> 256Mi (CDP-only, no bundled Chromium) Actual usage ~116Mi, Goldilocks/VPA upperBound ~185Mi (incl. live races over 99d). The 1Gi reservation was sized for the old bundled-Chromium image; the app now drives the remote chrome-service over CDP. 256Mi (upperBound x~1.3, bursty) requests=limits per convention; cpu request 100m -> 50m (VPA upperBound 49m). Frees ~768Mi of reserved cluster memory.	2026-06-05 12:57:22 +00:00
Viktor Barzin	b958935ee0	woodpecker: reload server on Vault PG password rotation [ci skip] woodpecker-server sets reloader.stakater.com/search="true" but the woodpecker-db-creds ExternalSecret never carried the matching reloader.stakater.com/match="true", so Stakater Reloader never restarted the server when Vault rotated the pg-woodpecker password (7-day static role). The DB DSN is injected via envFrom, which does not hot-reload a running pod — so after each rotation the server kept using the revoked password until some unrelated restart (Keel bump, drain, manual) recreated it inside the window. A latent weekly DB-outage masked by incidental restarts. Add the match annotation to the ESO target template and correct the stale "rotated every 24h" comment (actual rotation_period is 604800s = 7 days). Verified end-to-end: forced 'vault write -f database/rotate-role/pg-woodpecker', ESO updated the secret in ~3s, Reloader auto-restarted woodpecker-server in ~36s, new pod reconnected with zero DB errors. [ci skip] because the change was already applied via scripts/tg apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
root	aa25dd488c	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:12 +00:00
Viktor Barzin	e8bfb4d06b	f1-stream: consume Forgejo-registry image; drop in-monorepo source The actively-developed f1-stream (infra files/ copy: 12 active extractors + Playwright/chrome-service verifier) is now its own repo viktor/f1-stream and is the deployed app (replacing the stale March github build). - main.tf: image -> forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag} + image_pull_secrets registry-credentials. Image stays in KEEL_IGNORE_IMAGE. - Remove stacks/f1-stream/files/ (source now in viktor/f1-stream). - docs/plans: extraction design + plan pair. Applied via tg + kubectl set image to forgejo:24857a82; live /health green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	17da37cea3	fire-planner: reset bulk ingest toggle after successful run Job completed: 1,060 examples inserted across 10 FIRE subreddits (1,080 total), 20/24 sub-runs succeeded. Toggle reset to false. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	deb031cc2c	feat(tripit): encrypted personal-document vault PVC + DOCUMENT_ENCRYPTION_KEY Add a proxmox-lvm-encrypted RWO PVC (tripit-personal-documents) mounted at /data/personal-documents on the app container, PERSONAL_STORAGE_DIR env, and the DOCUMENT_ENCRYPTION_KEY ExternalSecret entry (seeded in Vault secret/tripit). A root chown init-container makes the block volume writable by the non-root app without touching the NFS doc vault. Backs the new owner-only encrypted personal document vault in the tripit app. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	27989cd9f1	fire-planner: bulk Reddit FIRE examples ingest + qwen3-8b model upgrade - Enable bulk ingest job (run_examples_bulk_ingest=true) to populate fire_example table from top/all + top/year across 12 FIRE subreddits. Job fire-planner-examples-bulk-202606042150 is currently running. - Upgrade examples_llm_model from qwen3vl-4b to qwen3-8b; GPU has 10.7GB free (immich-ml using ~4GB of 15GB total), so higher-quality model fits. - Add LLM_CONCURRENCY=3 to bulk job container — claude-agent-service is now bounded-concurrency (MAX_CONCURRENCY=10), no longer single-flight. Strictly serial extraction (default 1) is no longer necessary. TODO: flip run_examples_bulk_ingest=false after job completes and re-apply to push the weekly CronJob model upgrade (qwen3vl-4b→qwen3-8b) which didn't land in this apply (TF timed out waiting for Job completion). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	147a8cff40	Restore f1-stream stack — undo accidental bundling into 63fe7d2b Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the shared infra working tree and inadvertently swept in a parallel session's staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals, ci-cd.md + .claude docs, two extraction plan docs). This returns every f1-stream-related path to its pre-63fe7d2b state (3493c347) so that extraction can be committed cleanly by its own session. The fan-control files added in 63fe7d2b are untouched. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	90ad6b9125	fan-control: presence-aware IPMI fan curve for the R730 PVE host The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even under load (optimises for quiet, not cool). Add a bash daemon + systemd unit that drives the chassis fans from CPU temp on two curves, picked by garage occupancy (the server is in the garage): COOL when empty (measured ~58-65°C under load), QUIET near the silent floor when the ha-sofia garage door shows someone is there (open, or <15min since last activity). Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI failures do the same. Pushgateway metrics (job=fan_control). 36 unit tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN + RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127 (CPU 70->58°C in cool mode, hysteresis stepping confirmed). Design: docs/plans/2026-06-04-pve-fan-control-design.md Runbook: docs/runbooks/fan-control.md [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	c6f27fa172	wealth dashboard: enlarge returns numbers (drop stat name labels) [ci skip] At h=4 the two stacked values per window panel were too small because each also rendered its field-name label. Switch textMode value_and_name -> value on 9211-9215 so the numbers get the full cell height; the % suffix / £ prefix keep them self-identifying and the window stays in the panel title. Applied via targeted tg apply of the configmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	dbe10a708c	wealth dashboard: shrink returns stat panels to h=4 [ci skip] The 5 per-window returns widgets (9211-9215) were too tall at h=8. Halve to h=4 (matching the overview stat cards directly above) and pull every panel below up by 4 so the layout stays gap-free. Layout-only change — no panel content/query touched. Applied via targeted tg apply of the configmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	fc1486c3dd	wealth dashboard: replace returns table with per-window stat panels [ci skip] Swap the single "Returns over time windows" table (panel 9201) for 5 stat panels (1d/7d/30d/90d/12mo), each showing Return % (Modified-Dietz) as the headline value + Δ market (£, net of contributions) as a second value, colored red/green by sign. Same per-window Modified-Dietz math as the old table, just scoped to one interval per panel — verified against live wealthfolio_sync PG and reproduced through Grafana's datasource API (e.g. 30d = 8.15% / £86,875, 12mo = 38.68% / £297,846, matching the prior table exactly). Kept the same 24×8 grid footprint so nothing else on the dashboard reflows. Already applied via targeted `tg apply` of the wealth.json configmap; [ci skip] because a full monitoring-stack CI apply would pull in unrelated pre-existing drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	6cec27f8dc	novelapp: bump Keel policy patch -> all (track any upstream version) Explicitly own the keel.sh/policy annotation in TF (was relying on the Kyverno-stamped `patch` default). Set policy=all + trigger=poll + pollSchedule, expand ignore_changes per KEEL_LIFECYCLE_V1 to cover Keel-written runtime annotations (change-cause, update-time, revision, match-tag).	2026-06-05 09:19:11 +00:00
Viktor Barzin	9cb609f21a	nextcloud-todos: register only the Created webhook (drop Updated) The agent acts only on newly-created todos; the Updated listener re-fired on every edit (incl. the agent's own note-append). Live Updated webhook (id=2) already deleted via OCS API.	2026-06-05 09:19:11 +00:00
Viktor Barzin	3d0cba9dcb	openclaw: pin 2026.2.26, resilient startup, SHA-pinned plugin init (recover from agentRuntime + configSchema crashloop) Surfaced while installing the nextcloud-todos-api plugin (a pod roll): - 2026.5.4 gateway rejects an openai-codex `agentRuntime` key it writes itself (commit `4b39cb72`) -> crashloop on any restart. Pinned image back to 2026.2.26. - startup steps (plugins enable / mcp set / memory index) backgrounded + timeout-guarded so a hung npm-install can never block the gateway. - install-nextcloud-todos-plugin init SHA-pinned (:f85c6de1) + Always pull: IfNotPresent served a stale cached :latest, so the plugin manifest (configSchema) fix never landed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
root	c01a28e23c	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:11 +00:00
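
Note on fc1486c3dd: the per-window "Return % (Modified-Dietz)" and "Δ market (£, net of contributions)" figures can be illustrated with a minimal sketch of the standard Modified-Dietz formula. This is not the dashboard's actual query, which runs against the wealthfolio_sync database. The function name, argument names, and the sample figures are hypothetical, and weighting cash flows by calendar days remaining in the window is an assumption.

```python
from datetime import date


def modified_dietz(begin_value, end_value, flows, start, end):
    """Return (gain, rate) for the window [start, end].

    flows is a list of (date, amount) pairs: contributions are positive,
    withdrawals negative. All names here are illustrative assumptions.
    """
    total_days = (end - start).days
    if total_days <= 0:
        raise ValueError("end must be after start")

    net_flow = sum(amount for _, amount in flows)
    # Weight each flow by the fraction of the window it stayed invested.
    weighted_flow = sum(
        amount * (end - day).days / total_days for day, amount in flows
    )

    denominator = begin_value + weighted_flow
    if denominator == 0:
        raise ValueError("average capital is zero")

    # Market gain net of contributions (the second value on each panel).
    gain = end_value - begin_value - net_flow
    return gain, gain / denominator


# Hypothetical 30-day window with one mid-window contribution.
gain, rate = modified_dietz(
    begin_value=1000.0,
    end_value=1200.0,
    flows=[(date(2026, 1, 16), 100.0)],
    start=date(2026, 1, 1),
    end=date(2026, 1, 31),
)
```

With these sample inputs the contribution is invested for half the window, so the denominator is 1000 + 50 and the gain is 100.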

1265 commits