Commit graph

4053 commits

Author SHA1 Message Date
Viktor Barzin
c7ffbaa204 aiostreams: harden stream-probe + repair sources (RD-451 "few films" fix)
Root cause of "barely serving films": Real-Debrid's May-2026
infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate
new content), while degraded sources starved candidates. RD account +
popular-title availability were healthy throughout (library 32/36
unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams).

Runtime config (AIOStreams PG, applied via API — not in this diff):
- Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title)
  and was silently dropping the bulk of its results at the 5s cutoff;
  Interstellar 430 -> 987 streams after the bump.
- Removed MediaFusion preset: broken upstream ("Invalid configuration"
  -> 500 Internal Server Error), contributed 0 usable streams, only a
  dead [X] entry in every list.

This diff (Terraform):
- Harden aiostreams-stream-probe: test series AND movie paths, per-source
  breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count,
  success gated on Comet being alive. The old probe counted only Breaking
  Bad streams and stayed green while new-content playback was broken.
- service-catalog: reflect source set + probe behaviour.

[ci skip] — probe already applied via targeted `tg apply` + verified
(series=378 movie=898 comet=206 errors=0 success=1); skipping the full
servarr reconcile to avoid touching unrelated pre-existing drift
(qbittorrent MetalLB annotation, tls_secret cert revert).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 07:21:42 +00:00
Viktor Barzin
4cdb9e1886 novelapp: switch Keel to semver (policy=major) now upstream tags are valid
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so
drop the :latest+force+match-tag digest workaround and track semver properly:
policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to
climb to higher semver tags), image floor pinned to v1.1.3. Pull policy ->
IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for
the mutable :latest). Running v1.1.3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 22:56:46 +00:00
Viktor Barzin
551412488b apiserver: enable audit logging (low-write Metadata) + ship to Loki
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Resource changes/deletions are now attributable (the novelapp deletion this week
was untraceable because apiserver audit was off). Low-write policy: drops
reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into
the kube-apiserver static-pod manifest + kubeadm-config (v1beta4
extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails
/var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}.

Root cause that had silently blocked this AND OIDC for weeks: a stray
kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate
static-pod manifest kubelet ran instead of the real one, dropping every flag
added to the real manifest. Removed it. Runbook added.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
3696ff5922 novelapp: track :latest by digest (Keel force+match-tag), adopt into TF state
Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as
`v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see
past the highest parseable tag. :latest correctly points at the newest release,
so switch to force + match-tag digest-tracking of :latest (Kyverno does not
manage match-tag, contrary to the stale code comment). Imports the live
Deployment (recreated out-of-band 2026-06-06) back into TF state; running image
flipped to :latest -> now on v.1.1.1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
4d8b782df1 feat(trip-planner): app stack (Tier-1, CNPG, Slack-signed webhook ingress)
Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling
secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database
(static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init
container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m
cpu request), ClusterIP service port 8080, and ingress_factory with auth=none
(Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied;
requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
7c12fbba95 monitoring/alloy: drop cosmetic calico-typha 'Endpoints deprecated' warning
calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1
Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated
in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1
Endpoints API will essentially never be removed (clients keep working
indefinitely), and even the latest Calico still watches Endpoints
(projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure
cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact
deprecation message), mirroring the mailserver drop. Real calico
warnings/errors kept; reversible. Validated with alloy fmt (exit 0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
4b13be6d48 dawarich: upgrade 1.6.1 -> 1.7.11 (removes RailsPulse, drops orphan tables)
dawarich 1.6.1 shipped the RailsPulse perf-monitoring gem, which scheduled
an hourly Sidekiq SummaryJob INDEPENDENT of its disabled flag; the job hit
rails_pulse_routes (no primary key) and retry-looped, logging ~125
UnknownPrimaryKey lines/hr (found via Loki triage 2026-06-06). Upstream
removed RailsPulse entirely in 1.7.x (commit a5172cc) with a
DropRailsPulseTables migration; 1.7.11 is latest stable. Keel only
auto-applies patch bumps within 1.6.x, so the minor jump is manual.

Pre-upgrade pg_dump of dawarich (79.9MB) + dawarich_queue taken to devvm.
The 5 rails_pulse_* tables are empty (feature never collected data), so
cleanup is zero-data-risk; location data (tracks/points/visits/places)
untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
8a3bbde38c mailserver: silence mixed-TLS-directive warning + drop SMTP scanner noise from Loki
Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error
source, from the 2026-06-06 log triage):

1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative
   smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file
   in our postfix-main.cf override were IGNORED and triggered postfix's
   'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two
   legacy lines (functional no-op; chain_files already wins). Verified via
   live postconf.

2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign
   public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen
   half-open drops, rate-limit-exceeded from unknown). Real delivery logs +
   real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so
   security posture is unchanged. Validated with 'alloy fmt' (exit 0).
   Reversible.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
root
de181a9afc Woodpecker CI deploy [CI SKIP] 2026-06-06 16:51:26 +00:00
Viktor Barzin
27211acda1 rybbit: recreate missing Postgres database via idempotent init Job
rybbit's 'rybbit' PG database was missing from CNPG (the role survived a
past cluster rebuild but the database did not), so the app's node-cron
logged 'database "rybbit" does not exist' every minute (found via Loki
2026-06-06). Created the DB manually to restore service (app auto-migrated
11 tables); this adds a self-contained init Job so the DB is recreated on
any future rebuild -- connects as the rybbit role (has CREATEDB) using the
existing rybbit-secrets password, idempotent CREATE DATABASE if absent.
Deployment now depends_on the job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
9529eedfe0 docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip]
Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to
return 200 instead of proxying to the scaled-to-0 poison-fountain.
- security.md Layer 1 + tarpit description + troubleshooting (fix stale
  stacks/platform path -> traefik stack; drop misleading
  restart-poison-fountain step).
- .claude/CLAUDE.md: add matrix to PG rotation list; document that
  startup-read secret consumers need a Reloader annotation (matrix root
  cause, found via Loki 2026-06-05).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
9ad7756a94 traefik: make bot-block-proxy a clean no-op while poison-fountain is at 0
bot-block-proxy is the forward-auth target for the ai-bot-block
middleware (applied to every anti-AI ingress). It proxied /auth to the
poison-fountain bot trap with error_page 5xx=200 fail-open. But
poison-fountain is intentionally scaled to 0, so proxy_pass only ever
failed and fell open to '200 allowed' -- while logging ~51k errors/hr
(the #1 Loki source once pod logs began shipping 2026-06-05) and paying
up to 100ms connect-timeout per authed request.

Short-circuit /auth to 'return 200 "allowed"' directly (drop the
upstream + proxy_pass + fallback). Identical effective behaviour
(allow-all), no upstream attempt, no noise, no latency. Reversible:
restore the upstream + proxy_pass and scale poison-fountain up.

Also add the missing configmap.reloader.stakater.com/reload annotation
so openresty picks up ConfigMap changes (it does not hot-reload on its
own -- the root reason stale config ran for days). replicas stays 2:
critical-path forward-auth target (anti-AI ingresses fail closed if it
is down), so HA is retained though each request is now trivial.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
root
d70a99dc48 Woodpecker CI deploy [CI SKIP] 2026-06-06 16:51:26 +00:00
Viktor Barzin
d661d074ef matrix: auto-reload Synapse on DB credential rotation (Reloader)
Synapse injects the Postgres password into homeserver.yaml only at
startup (inject-db-password initContainer). matrix-db-creds is rotated
by Vault via ESO (15m refresh), so each rotation left the running pod
with a stale password and Synapse DB auth failed silently until a
manual rollout restart. Found today via Loki: ~12.9k/hr 'password
authentication failed for user matrix' lines; secret password verified
working against the DB while the 10-day-old pod held the pre-rotation
value.

Add the explicit secret.reloader.stakater.com/reload annotation so
Reloader rolls the deployment whenever the secret changes (explicit
form, not auto/search, because the secret is referenced only in an
initContainer env var). Live pod already restarted to restore service;
this prevents recurrence on the next rotation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
root
e7ece3eaf9 Woodpecker CI deploy [CI SKIP] 2026-06-06 16:51:26 +00:00
root
02366103ef Woodpecker CI deploy [CI SKIP] 2026-06-06 16:51:26 +00:00
Viktor Barzin
d808694af4 docs(storage): record harden-half shipped (orphan cleanup + ghost-reconcile)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2a orphan cleanup (67 Released PVs + 475 LVs removed, VG pve 997->~410) + 2b
csi-ghost-reconcile CronJob done — ghost-disk doom loop closed by construction,
beads code-dfjn retireable. Cap kept at 28 (lowering would reverse the
2026-05-25 eviction-cascade post-mortem fix). Phase-1: insta2spotify migrated
(noted its 3.26GB image re-pull blip on node reschedule).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:39:36 +00:00
Viktor Barzin
1b9d4f1233 storage: migrate insta2spotify off proxmox-lvm to NFS (LUN relief, Phase 1)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was canceled
Config-only PVC (no embedded DB), preflighted. Frees one proxmox-csi slot.
NB: pod reschedule re-pulled the 3.26GB backend image (~6min stall) — large-image
services incur a pull-delay blip when migration moves them to a fresh node.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:38:01 +00:00
Viktor Barzin
355ca3ee91 proxmox-csi: auto-reconcile CronJob to detach ghost disks (code-dfjn prevention)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Closes the ghost-disk doom loop by construction (failed detach -> orphan scsiN
with no VolumeAttachment -> invisible oversubscription -> query-pci wedge).
Every 15min csi-ghost-reconcile compares each worker VM's real scsi disks
(Proxmox API) vs k8s VolumeAttachments and safely detaches ghosts (PUT
.../config delete=scsiN -> frees the LUN slot, retains the LV).

- detection mirrors cluster-health check #47
- SAFETY: only vm-9999-pvc scsi with no matching VA; 60s re-confirm; per-run cap 5
- scoped CSI API token (VM.Config.Disk), not root SSH; k8s API via injected ClusterIP
- verified live: read 66 VAs, 0 ghosts, no false positives
- pushes csi_ghosts_detected/detached to Pushgateway

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:25:36 +00:00
Viktor Barzin
e311cbe103 chore(modules): remove vestigial audiblez-web copy + fix glossary note [ci skip]
modules/kubernetes/ebook2audiobook/ held a tracked copy of the audiblez-web
app source (24 files), sourced by no stack and built by no CI — audiblez-web
is GHA-built from its own repo. Bulk-swept in 2026-04-15; removed.

Also corrected CONTEXT.md: the "vestigial per-app dirs (immich/, ollama/,
...)" note was wrong — those were untracked local macOS cruft (._main.tf
AppleDouble turds), never in the repo; cleaned from the working tree.
modules/kubernetes/ now holds exactly the four factory modules
(ingress_factory, nfs_volume, anubis_instance, setup_tls_secret).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:38:13 +00:00
Viktor Barzin
a42f4f7b26 trek: trial-deploy TREK group-trip planner behind Authentik (solo eval)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment
trial to evaluate the self-hosted group-trip use case before building a
custom app. Solo, single shared instance, Authentik forward-auth.

- stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel),
  service 80->3000, ingress_factory auth=required + proxied DNS at
  trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data +
  uploads) -- encrypted per the sensitive-data rule and to avoid the
  SQLite-over-NFS locking hazard.
- Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC,
  bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented
  in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO).
- kyverno: add mauriceboe/* to require-trusted-registries allowlist (the
  policy is Enforce since 2026-05-19 -- also fixed the stale "stays in
  Audit" header comment that said otherwise and misled the deploy).
- Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll
  companion deferred per solo-trial scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:30:07 +00:00
Viktor Barzin
63182730f9 docs(storage): record Wave-2 NFS migration + harden-proxmox-csi decision (option 1)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC
mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on
single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS
(tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots.

- storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history,
  updated cap-relief levers
- topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the
  decision doc

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:15:21 +00:00
Viktor Barzin
a0b34750ee storage: migrate hackmd uploads off proxmox-lvm-encrypted to NFS (LUN-cap relief)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
codimd is MySQL-backed; this PVC holds only pasted image uploads (subPath
hackmd, 4.5M) — no embedded DB, NFS-safe. Drops LUKS-at-rest for these
low-sensitivity images (accepted). Frees one proxmox-csi SCSI-LUN slot on node6.

- swap hackmd-data-encrypted -> nfs_volume module (subPath preserved)
- uploads copied + verified (20 files, HTTP 200, codimd listening)
- block PVC removed; LV retained per SC policy (code-dfjn cleanup)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:11:31 +00:00
Viktor Barzin
e35d693972 storage: migrate send off proxmox-lvm to NFS (LUN-cap relief)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Send (timvisee/send) stores encrypted upload blobs on disk with metadata in
Redis — no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot on node2.

- swap send-data-proxmox -> nfs_volume module
- blobs copied + verified (273M, 22 entries, HTTP 200 on NFS)
- block PVC removed; LV retained per SC policy (code-dfjn cleanup)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:04:37 +00:00
Viktor Barzin
c24b4a21d8 docs(architecture): fix stale 5-node claim -> 7 nodes (k8s-node1..6) [ci skip]
Cluster grew to 7 nodes (k8s-master + node1..6; node5/6 added ~10d ago)
but several docs still said "5 nodes". Corrected with live specs:

- overview.md: 7-node enumeration; node1 is 16c/48GB (doc wrongly said
  32GB), node2-6 are 8c/32GB general workers
- compute.md: "5-node" -> "7-node" cluster description
- dns.md: NodeLocal DNSCache DaemonSet "5 nodes" -> "7 nodes"
- mailserver.md: HAProxy backend diagram "node1..4" -> "node1..6"

Illustrative "0/5 nodes available" scheduler-error examples left as-is.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:03:58 +00:00
Viktor Barzin
bf3608052b tripit: GEOCODER_PROVIDER=openmeteo for per-city itinerary weather
Enables Open-Meteo geocoding of lodging addresses (results cached in the
new geocode_cache table) so the itinerary can show per-city weather.
Applied manually via scripts/tg apply.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:01:31 +00:00
Viktor Barzin
6eb683b6e0 storage: migrate speedtest off proxmox-lvm to NFS (LUN-cap relief)
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
speedtest-tracker is MySQL-backed (config dir = Laravel config + logs, no
embedded DB), NFS-safe. Frees one proxmox-csi SCSI-LUN slot.

- swap speedtest-config-proxmox -> nfs_volume module
- config copied + verified (HTTP 302->login,200); excluded 383MB laravel.log
- block PVC removed; LV retained per SC policy (code-dfjn cleanup)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:59:56 +00:00
Viktor Barzin
060aefbd0b storage: migrate changedetection off proxmox-lvm to NFS (LUN-cap relief)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
changedetection uses a file-based JSON datastore (url-watches.json + per-watch
dirs + brotli snapshots) — no embedded DB, NFS-safe. Frees one proxmox-csi
SCSI-LUN slot. Part of harden-proxmox-csi+NFS plan.

- swap changedetection-data-proxmox -> nfs_volume module
- data copied + verified (HTTP 200, 4 watches loaded); excluded 200MB test cruft
- block PVC removed; block LV retained per SC policy (code-dfjn cleanup)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:55:03 +00:00
Viktor Barzin
52f5de905d docs(context): freshen infra glossary (modules, tiers, new concepts) [ci skip]
Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs):

- Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app
  factory modules (never existed); name the real four (ingress_factory,
  nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local
  / flat distinction; flag vestigial modules/kubernetes/<app> dirs.
- Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers);
  reserve "tier" for State tier + Namespace tier only.
- Add local-path entry (cluster default SC; node-local footgun warning).
- Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico.
- Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC).
- Fix node count 5 -> 7 (k8s-master + k8s-node1..6).

Doc-sync (same commit per repo rules):
- overview.md: replace fictional factory modules with the real shared
  modules + the flat/stack-local pattern.
- .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision
  table + stale cross-reference (vault migrated off it 2026-04-25).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:34:49 +00:00
Viktor Barzin
aa948be581 storage: migrate tandoor off proxmox-lvm to NFS (LUN-cap relief)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC
is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden
proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see
docs/plans/2026-06-05-block-storage-harden-nfs-design.md.

- swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC)
- data copied + verified (HTTP 200 on NFS volume); block PVC removed
- block LV retained per SC policy (orphan cleanup tracked in code-dfjn)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:34:47 +00:00
Viktor Barzin
febf12bddd mail(tripit): send From: plans@viktorbarzin.me instead of spam@
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
tripit outbound (linked-email verification + trip-share invites) was sent
From: spam@viktorbarzin.me. Switch the From to plans@viktorbarzin.me while
keeping SMTP auth as spam@ (its password, unchanged).

docker-mailserver SPOOF_PROTECTION (reject_sender_login_mismatch) requires
the authed login to "own" the From; the @viktorbarzin.me catch-all does NOT
grant that per-address, so add an explicit `plans@ -> spam@` virtual alias to
authorize it (also keeps inbound plans@ routing to spam@ for the mail-ingest
poller). tripit SMTP_FROM flips to plans@.

Verified: sender-login probe (auth spam@, MAIL FROM plans@) now 250 (was 553);
a real send from the tripit pod logs from=<plans@viktorbarzin.me> accepted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:41:08 +00:00
Viktor Barzin
bc33cd5ac4 monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook
The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.

Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:18:31 +00:00
Viktor Barzin
f526af694d monitoring: snmp-idrac scrape 1m->30s — faster HA dashboard iDRAC refresh
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The ha-sofia R730 REST sensors (via prometheus-query.lan) + Grafana iDRAC
panels were bound to the 1m snmp-idrac scrape. Halved to 30s so the
dashboard-it Server view refreshes uniformly at 30s, matching the
fan-control daemon's Pushgateway metrics. SNMP scrape ~3-4s; timeout 15s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:52:07 +00:00
Viktor Barzin
5b5b855528 monitoring(alloy): drop goflow2 + vpa logs from Loki to cut sdc write wear
goflow2 emits ~8 GB/day of per-flow NetFlow JSON to stdout (~64% of all cluster
log volume) but only its Prometheus aggregate metrics are used; vpa is ~1.3
GB/day of Goldilocks/VPA recommender chatter. Both are low-value and were
landing in Loki (PVC on the contended sdc HDD). Drop them at the Alloy relabel.
Reversible (remove the drop rule). Loki ingestion drops ~73%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:44:47 +00:00
Viktor Barzin
dbe115910f monitoring: add local-only prometheus-query.lan ingress for ha-sofia SNMP sensors
ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage,
fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120,
~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST
Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac
scrape), scan_interval 30.

This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` →
`prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to
`/api/v1/query` (read-only instant-query only — not the UI/admin/federation).
ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from
a REST sensor), so this mirrors the existing local-only `.lan` exporter
ingresses HA already queries.

The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`)
was edited in place (auto-version-controlled by the HA version-control add-on;
pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The
Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan`
was added manually via the API — like the other `.lan` exporter hosts it is NOT
auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me`
records). Follow-up (already noted for the Loki sensor): extend that sync to
manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now
vestigial (HA no longer reads it).

Verified: all 7 HA sensors report correct fresh values from Prometheus (fan
10800 rpm, CPU 62.0C, power 280W, PSU 230/240V).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:25:06 +00:00
Viktor Barzin
b7cb74f1b5 docs(monitoring): cluster log aggregation (Alloy fix) + Cluster Logs dashboard + HA sensors [ci skip]
Document the 2026-06-05 cluster-wide log observability work: the Alloy
local.file_match fix (loki.source.file doesn't expand globs) + stage.cri, the
new "Cluster Logs" Grafana dashboard, the ha-sofia cluster-log-health REST
sensors, and the loki.viktorbarzin.lan Technitium-record follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:15:57 +00:00
Viktor Barzin
7501c2be5d monitoring(grafana): add professional "Cluster Logs" dashboard (Logs folder)
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Cluster-wide Loki log observability now that pod logs flow (Alloy fix). New
dashboards/cluster-logs.json (Loki DS P8E80F9AEF21F6940): namespace/app/pod
dropdowns + free-text regex search; stats (lines/errors/warns/active-ns),
log-volume-by-namespace, error/warn rate, top-namespaces-by-errors,
top-pods-by-errors, a filterable live-logs panel, and a second row for the
node + rpi-sofia systemd journals (volume-by-level + error/warn journal panel).
Error/warn use case-insensitive regex line-filters so they work regardless of
level-label availability. New "Logs" Grafana folder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:03:45 +00:00
Viktor Barzin
bb0099b747 monitoring(alloy): fix broken pod-log shipping (missing local.file_match) + parse CRI
Cluster pod logs were NOT reaching Loki — only node/Pi journals were. Root cause:
loki.source.file was fed the /var/log/pods/*<uid>/<container>/*.log glob directly
from discovery.relabel, but loki.source.file does NOT expand globs, so it stat()'d
the literal `*` path and shipped zero pod logs ("stat failed: no such file" for
every pod). Per Grafana Alloy docs, a local.file_match component must expand the
glob into concrete file targets first. Add it. Also add stage.cri {} so Loki
stores clean messages + real timestamps instead of raw containerd CRI-prefixed
lines. Fixes cluster-wide log observability (regression vs the working 2026-05-26
state). Ship-all-then-measure per the agreed plan; Alloy mem limits stay as the
IO-storm safeguard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:57:44 +00:00
Viktor Barzin
6b1d23abbd monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant
The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.

- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
  (coolingDeviceReading + location lookup) and an amperageProbeLocationName
  lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
  primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
  fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
  temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
  system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
  metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
  table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
  idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.

SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:33:20 +00:00
Viktor Barzin
6442978f07 fan-control: merge Fan %/RPM dashboard cards + RPM estimate fallback [ci skip]
The Fan % and Fan RPM sensor-graph cards had identical trend shapes (RPM ∝ %),
so merge them into one "Fan speed" card: % trend (stable Pushgateway sensor) +
RPM beneath. RPM reads sensor.r730_fan_speed (Redfish) but falls back to the
calibrated estimate (rpm≈160·%+1520, shown with a "~" prefix) when that sensor
is unavailable — it blips out intermittently, so the readout never goes blank.
The Override readout likewise shows both "% · rpm". HA-side only; daemon
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 14:31:32 +00:00
Viktor Barzin
722a1c9b42 docs(monitoring): document rpi-sofia off-box monitoring + log shipping [ci skip]
Add an "External host: rpi-sofia" section to docs/architecture/monitoring.md
covering the 2026-06-05 setup: node_exporter + vcgencmd textfile metrics; the
full-journal promtail->Loki shipping (job=rpi-sofia-journal — kernel/dmesg via
the (none) unit + all systemd units, labeled by unit/level); the RPi Sofia
alert group; the dashboard; and the systemd watchdog. Notes the SD-card root
cause and that the Pi-side config is hand-managed + backed up off-box.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 14:25:20 +00:00
Viktor Barzin
405ca79531 fan-control: Override slider now tracks live fan speed while unlocked [ci skip]
The dashboard Override slider used to show a stale stored % (e.g. 5%) while the
fans were actually at ~53%, which was confusing. Add
automation.r730_fan_override_track_live_speed_while_unlocked: while unlocked it
mirrors the live commanded % (sensor.r730_fan_control_target) into the Override,
so it always shows the actual absolute fan speed and updates as the fan moves.
While locked it stops tracking and is the user's editable setpoint. The readout
under the slider now shows the live "% · rpm" (actual, not an estimate). HA-side
only; daemon unchanged. Verified live: slider forced to 10 → synced to 58 target.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 14:20:38 +00:00
Viktor Barzin
ddc8bfa8cf tripit: remove Gmail-scrape ingest-mail CronJob; plans@ becomes sole channel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The Gmail All-Mail scrape (tripit-ingest-mail) is retired — Viktor only wants
mail ingested when forwarded to plans@viktorbarzin.me, and only from actual
users. Dropped the ingest-mail CronJob and removed MAIL_DEFAULT_OWNER_EMAIL
from ingest-plans (the app now ignores mail from non-users instead of filing it
under the default owner). ingest-plans already carries EMAIL_PROVIDER/SMTP_* for
the new sender notifications. Service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:50:53 +00:00
Viktor Barzin
5381beb3b7 monitoring: fix ingress auth-comment guard for loki-write-ingress
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
scripts/tg's check-ingress-auth-comments.py requires the `# auth = "none":`
rationale comment DIRECTLY above the `auth = "none"` line; mine was in the
module's top block comment, so the guard aborted the whole monitoring apply
(this is why the rpi-sofia scrape/alerts/ingress/dashboard never landed on the
first push). Move the rationale to the required position.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:36:43 +00:00
Viktor Barzin
c059405632 fan-control: simplify HA dashboard + Lock = freeze-current/algo-off [ci skip]
The dashboard-it Server → Fans view is now minimal: fan speed (% + RPM), an
Override % slider, and a Lock toggle. Lock now means "freeze the current speed,
algorithm off" — a new automation (r730_fan_lock_freeze_current_speed_resume_algo)
snapshots the live target % into Override and sets mode=manual on lock-ON, and
mode=auto on lock-OFF. The host daemon is unchanged (the toggle just drives the
mode it already reads). cool/quiet stay reachable via the entity but are off the
simplified view; the 60-min auto-revert is kept as a dormant safety net. Verified
live: lock ON → mode=manual + Override captured the live 60%; lock OFF → auto.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:27:46 +00:00
Viktor Barzin
f9376a36ff monitoring: wire rpi-sofia (Sofia Pi) into Prometheus/Loki/alerts
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
The Sofia Raspberry Pi hung this morning (network wedged ~10:13, HA
sensors dead, and its local journal had been silent since Apr 27 — a
2017 SD card intermittently flipping the rootfs read-only). Nothing was
captured because logging lived only on the failing card. Ship telemetry
off-box so the next failure is diagnosable centrally:

- Prometheus scrape job `rpi-sofia` (rpi-sofia.viktorbarzin.lan:9100) —
  node_exporter + a vcgencmd textfile collector on the Pi exporting
  under-voltage/throttle/SoC-temp as rpi_* metrics.
- Alert group "RPi Sofia": node_exporter Down, rootfs ReadOnly (the
  exact SD-failure signature), Under-voltage since boot, High SoC temp.
- LAN-gated Loki write ingress (loki.viktorbarzin.lan) so the Pi's
  promtail can push its journal — Loki was ClusterIP-only.
- Grafana dashboard "RPi Sofia" (Hardware): status, undervoltage/
  throttle, temp, load, memory, disk, network.

The Pi separately got a systemd hardware watchdog (auto-reboot on a hard
hang; today it stayed down ~5h until a manual power-cycle).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:11:40 +00:00
Viktor Barzin
5b96b841fc f1-stream: right-size memory 1Gi -> 256Mi (CDP-only, no bundled Chromium)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Actual usage ~116Mi, Goldilocks/VPA upperBound ~185Mi (incl. live races over
99d). The 1Gi reservation was sized for the old bundled-Chromium image; the app
now drives the remote chrome-service over CDP. 256Mi (upperBound x~1.3, bursty)
requests=limits per convention; cpu request 100m -> 50m (VPA upperBound 49m).
Frees ~768Mi of reserved cluster memory.
2026-06-05 12:57:22 +00:00
Viktor Barzin
d17b25cdcc fan-control: document the HA Fan Lock (opt out of 60-min auto-revert) [ci skip]
A manual/cool/quiet override in HA auto-reverts to `auto` after 60 min. Add a
Fan Lock (`input_boolean.r730_fan_lock`) that gates that automation so a
deliberate override persists, with a visible "🔒 FAN CONTROL LOCKED" banner on
the dashboard-it Server view so it isn't forgotten. The automation re-checks the
lock after the hour (locking mid-countdown cancels the revert) and the 83 °C
ceiling still wins. HA-side only (helper + automation + dashboard live on
ha-sofia, auto-git-tracked there); these docs are the infra-repo record.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 12:22:00 +00:00
Viktor Barzin
51456a96f6 fan-control: estimate + expose fan power (fan_watts_est)
The iDRAC reports only total DCMI watts + RPM (no per-fan power), so add a
cube-law fan-power estimate: fan_W ~= 0.0205*(RPM/1000)^3, calibrated to the
2026-06-05 sweep (fits within ~3W; ~2W floor -> ~99W full). The daemon reads
live RPM each loop and pushes pve_fan_control_fan_rpm + _fan_watts_est.
Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the
dashboard-it Server view, next to total power. 46 bash tests green; verified
live (9120rpm -> ~15W est).

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 11:10:27 +00:00
Viktor Barzin
324f2dc3bf fan-control: continuous linear curve (replaces discrete step-bands)
Replace the step-band fan curve with a continuous linear ramp — the bands
flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis
is the homelab standard; PID is overkill for this slow thermal loop.
fan% now interpolates between env-tunable anchors:
  COOL  50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium)
  QUIET 68C/20% -> 83C/100% (near-silent until ~70C)
Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric
hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold.
41 bash tests green; deployed + verified live (59C -> 49%, smooth).

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 10:29:35 +00:00