Commit graph

4062 commits

Author SHA1 Message Date
Viktor Barzin
3d6c5b8bc7 matrix/authentik: remove orphaned Matrix OAuth2 app + provider (post-tuwunel)
The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel
uses native password auth, so nothing consumed it. Deleted application `matrix`
+ OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale
Matrix rows from the SSO reference tables and update the plan's residual list.

Doc-only [ci skip].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:32:49 +00:00
Viktor Barzin
23602f393e matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated)
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).

Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.

Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 11:58:17 +00:00
Viktor Barzin
09514a234b state(vault): update encrypted state 2026-06-08 11:51:06 +00:00
Viktor Barzin
7501ea286b tripit: wire planner subsystem (merged trip-planner) secrets + Slack webhook ingress
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
- ExternalSecret gains SLACK_SIGNING_SECRET / TREK_USER / TREK_PASSWORD /
  CLAUDE_AGENT_TOKEN (SLACK_BOT_TOKEN reused from nudges).
- New auth=none ingress carve-out /api/planner/slack (Slack v0 signature-gated,
  same pattern as the calendar + emails-confirm carve-outs).
- Remove the superseded standalone stacks/trip-planner (merged into tripit per
  the "future travel logic goes in tripit" policy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 09:26:21 +00:00
Viktor Barzin
838343184b stem95su: document on-demand Drive→NFS deploy (no scheduled job)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
CI/CD for the stem95su site is intentionally ON-DEMAND, not a CronJob:
the content is short-term and a scheduled job + Vault secret + ESO +
GCP "publish to Production" would be rotting artifacts. Instead, mirror
the source Google Drive folder "claude" → /srv/nfs/stem-site via a
throwaway rclone container using the existing google_workspace OAuth
creds (secret/viktor), rsync to NFS with an empty-source guard, then
shred the temp config. Verified end-to-end. Recipe in claude-memory.

Doc-only: corrects the service-catalog update-mechanism note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 22:10:06 +00:00
Viktor Barzin
d4ec5768b2 vault-token-renew: version the devvm renewer + user units in the repo
The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.)

- scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access).

- scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions).

- docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery.

- docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 22:10:06 +00:00
Viktor Barzin
f9d5cd6243 feat(tripit): wire real flight (AeroDataBox) + rail (RealtimeTrains) status
Prod ran FLIGHT_PROVIDER=fake, so every flight gate/terminal/time/position was
fabricated from a hash and never matched reality. Switch to real providers:
- FLIGHT_PROVIDER=aerodatabox (RapidAPI free BASIC; AERODATABOX_API_KEY via the
  tripit-secrets ExternalSecret)
- RAIL_PROVIDER=realtimetrains (RTT_API_TOKEN, already in Vault)
- poll-flights cron */30 -> hourly to respect the free 600 req/month cap
  (provider also self-throttles to <=1 req/sec)

Verified live: /api/segments/<LS1468>/status returns source=aerodatabox with
real schedule/terminal/aircraft.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 22:10:06 +00:00
root
b1ccbd12e8 Woodpecker CI Update TLS Certificates Commit 2026-06-07 22:10:06 +00:00
Viktor Barzin
0d445d948c stem95su: host STEM platform for 95. СУ (public NFS-backed static site)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
New public static site at stem95su.viktorbarzin.me serving the school's
Bulgarian STEM platform (dashboard + lessons/games, externally authored
HTML/media exported from Gemini).

- Stock nginx:1.28-alpine serving /srv/nfs/stem-site read-only (nfs_volume),
  NOT image-baked — content updated out-of-band (Nextcloud "PVE NFS Pool"
  or rsync), no rebuild; auto-backed-up offsite by nfs-mirror.
- ingress_factory auth="none" (open; CrowdSec + ai-bot-block at the edge),
  dns_type="proxied" (Cloudflare CNAME auto-created).
- nginx ConfigMap sets index stem_board.html (the dashboard) for "/".
- Docs: service-catalog entry + new "Static Site Hosting" pattern
  (NFS-backed vs image-baked) in patterns.md.

Applied via scripts/tg apply; verified live end-to-end (dashboard, 20MB
page, video byte-range, no Authentik redirect) through the public
Cloudflare path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 15:21:21 +00:00
Viktor Barzin
c7ffbaa204 aiostreams: harden stream-probe + repair sources (RD-451 "few films" fix)
Root cause of "barely serving films": Real-Debrid's May-2026
infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate
new content), while degraded sources starved candidates. RD account +
popular-title availability were healthy throughout (library 32/36
unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams).

Runtime config (AIOStreams PG, applied via API — not in this diff):
- Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title)
  and was silently dropping the bulk of its results at the 5s cutoff;
  Interstellar 430 -> 987 streams after the bump.
- Removed MediaFusion preset: broken upstream ("Invalid configuration"
  -> 500 Internal Server Error), contributed 0 usable streams, only a
  dead [X] entry in every list.

This diff (Terraform):
- Harden aiostreams-stream-probe: test series AND movie paths, per-source
  breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count,
  success gated on Comet being alive. The old probe counted only Breaking
  Bad streams and stayed green while new-content playback was broken.
- service-catalog: reflect source set + probe behaviour.

[ci skip] — probe already applied via targeted `tg apply` + verified
(series=378 movie=898 comet=206 errors=0 success=1); skipping the full
servarr reconcile to avoid touching unrelated pre-existing drift
(qbittorrent MetalLB annotation, tls_secret cert revert).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 07:21:42 +00:00
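The hardened probe logic described in the commit above (series and movie paths, per-source breakdown, error-stream count, success gated on Comet) could look roughly like the sketch below. This is an illustration only, not the probe's actual code: the stream dict shape (`source`, `error` keys) and the `summarize` helper are hypothetical, and requiring both paths to return streams is an assumption beyond what the commit states.

```python
from collections import Counter
from typing import Dict, List


def summarize(series: List[dict], movie: List[dict]) -> Dict[str, int]:
    """Summarize one probe run over a series query and a movie query.

    Each stream is a dict; error streams carry a truthy "error" key
    (hypothetical shape, chosen for illustration).
    """
    def usable(streams: List[dict]) -> List[dict]:
        return [s for s in streams if not s.get("error")]

    ok_series = usable(series)
    ok_movie = usable(movie)
    errors = (len(series) - len(ok_series)) + (len(movie) - len(ok_movie))

    # Per-source breakdown across both paths (comet, torrentio, ...).
    per_source = Counter(s.get("source", "unknown") for s in ok_series + ok_movie)
    comet = per_source.get("comet", 0)

    # Success is gated on Comet being alive; requiring both paths to be
    # non-empty is an assumption of this sketch.
    success = int(len(ok_series) > 0 and len(ok_movie) > 0 and comet > 0)
    return {
        "series": len(ok_series),
        "movie": len(ok_movie),
        "comet": comet,
        "errors": errors,
        "success": success,
    }
```

Gating on the dominant source rather than on a single title's raw stream count is what keeps such a probe from staying green while one content path is broken.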
Viktor Barzin
4cdb9e1886 novelapp: switch Keel to semver (policy=major) now upstream tags are valid
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so
drop the :latest+force+match-tag digest workaround and track semver properly:
policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to
climb to higher semver tags), image floor pinned to v1.1.3. Pull policy ->
IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for
the mutable :latest). Running v1.1.3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 22:56:46 +00:00
Viktor Barzin
551412488b apiserver: enable audit logging (low-write Metadata) + ship to Loki
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Resource changes/deletions are now attributable (the novelapp deletion this week
was untraceable because apiserver audit was off). Low-write policy: drops
reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into
the kube-apiserver static-pod manifest + kubeadm-config (v1beta4
extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails
/var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}.

Root cause that had silently blocked this AND OIDC for weeks: a stray
kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate
static-pod manifest kubelet ran instead of the real one, dropping every flag
added to the real manifest. Removed it. Runbook added.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
3696ff5922 novelapp: track :latest by digest (Keel force+match-tag), adopt into TF state
Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as
`v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see
past the highest parseable tag. :latest correctly points at the newest release,
so switch to force + match-tag digest-tracking of :latest (Kyverno does not
manage match-tag, contrary to the stale code comment). Imports the live
Deployment (recreated out-of-band 2026-06-06) back into TF state; running image
flipped to :latest -> now on v.1.1.1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
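Why a tag like `v.1.1.1` hides newer releases from a semver-based policy can be shown with a minimal sketch. It assumes a simplified strict pattern (optional `v`, then `MAJOR.MINOR.PATCH`); it is not Keel's actual parser, and the helper names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Simplified semver pattern: optional "v" prefix, then MAJOR.MINOR.PATCH.
# Assumption: a real parser also handles pre-release and build metadata.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(tag: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) or None if the tag is not valid."""
    m = _SEMVER.match(tag)
    return tuple(int(part) for part in m.groups()) if m else None


def highest_parseable(tags: List[str]) -> Optional[str]:
    """Return the tag with the highest version among the valid ones."""
    valid = [(parse(t), t) for t in tags if parse(t) is not None]
    return max(valid)[1] if valid else None


# With the malformed upstream tag, the newest visible release is stale.
print(highest_parseable(["v1.0.3", "v.1.1.1", "latest"]))
# With corrected tags, the real newest release is found.
print(highest_parseable(["v1.0.3", "v1.1.1", "v1.1.3"]))
```

Under this parser the malformed tag is simply skipped, so the highest parseable tag stays at the last well-formed one.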
Viktor Barzin
4d8b782df1 feat(trip-planner): app stack (Tier-1, CNPG, Slack-signed webhook ingress)
Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling
secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database
(static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init
container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m
cpu request), ClusterIP service port 8080, and ingress_factory with auth=none
(Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied;
requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
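Since the ingress above is auth=none, the in-app Slack v0 signature check is the only gate. A minimal sketch of that scheme follows, based on Slack's documented signing format (`v0:<timestamp>:<body>`, HMAC-SHA256, sent as `v0=<hex>`). The function names and the 300-second replay window are assumptions of this sketch, not taken from the trip-planner code.

```python
import hashlib
import hmac
import time
from typing import Optional


def slack_signature(signing_secret: str, timestamp: str, body: bytes) -> str:
    """Compute the v0 signature Slack would send for this request."""
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return "v0=" + digest


def verify_slack_request(
    signing_secret: str,
    timestamp: str,          # value of the X-Slack-Request-Timestamp header
    body: bytes,             # raw request body, before any parsing
    signature: str,          # value of the X-Slack-Signature header
    now: Optional[float] = None,
    max_skew: int = 300,     # assumed replay window, in seconds
) -> bool:
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except ValueError:
        return False
    # Reject stale or future-dated requests to limit replay.
    if abs(now - ts) > max_skew:
        return False
    expected = slack_signature(signing_secret, timestamp, body)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The check must run against the raw request body; re-serializing parsed form data would change the bytes and break the signature.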
Viktor Barzin
7c12fbba95 monitoring/alloy: drop cosmetic calico-typha 'Endpoints deprecated' warning
calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1
Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated
in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1
Endpoints API will essentially never be removed (clients keep working
indefinitely), and even the latest Calico still watches Endpoints
(projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure
cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact
deprecation message), mirroring the mailserver drop. Real calico
warnings/errors kept; reversible. Validated with alloy fmt (exit 0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
4b13be6d48 dawarich: upgrade 1.6.1 -> 1.7.11 (removes RailsPulse, drops orphan tables)
dawarich 1.6.1 shipped the RailsPulse perf-monitoring gem, which scheduled
an hourly Sidekiq SummaryJob INDEPENDENT of its disabled flag; the job hit
rails_pulse_routes (no primary key) and retry-looped, logging ~125
UnknownPrimaryKey lines/hr (found via Loki triage 2026-06-06). Upstream
removed RailsPulse entirely in 1.7.x (commit a5172cc) with a
DropRailsPulseTables migration; 1.7.11 is latest stable. Keel only
auto-applies patch bumps within 1.6.x, so the minor jump is manual.

Pre-upgrade pg_dump of dawarich (79.9MB) + dawarich_queue taken to devvm.
The 5 rails_pulse_* tables are empty (feature never collected data), so
cleanup is zero-data-risk; location data (tracks/points/visits/places)
untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
8a3bbde38c mailserver: silence mixed-TLS-directive warning + drop SMTP scanner noise from Loki
Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error
source, from the 2026-06-06 log triage):

1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative
   smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file
   in our postfix-main.cf override were IGNORED and triggered postfix's
   'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two
   legacy lines (functional no-op; chain_files already wins). Verified via
   live postconf.

2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign
   public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen
   half-open drops, rate-limit-exceeded from unknown). Real delivery logs +
   real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so
   security posture is unchanged. Validated with 'alloy fmt' (exit 0).
   Reversible.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
root
de181a9afc Woodpecker CI deploy [CI SKIP] 2026-06-06 16:51:26 +00:00
Viktor Barzin
27211acda1 rybbit: recreate missing Postgres database via idempotent init Job
rybbit's 'rybbit' PG database was missing from CNPG (the role survived a
past cluster rebuild but the database did not), so the app's node-cron
logged 'database "rybbit" does not exist' every minute (found via Loki
2026-06-06). Created the DB manually to restore service (app auto-migrated
11 tables); this adds a self-contained init Job so the DB is recreated on
any future rebuild -- connects as the rybbit role (has CREATEDB) using the
existing rybbit-secrets password, idempotent CREATE DATABASE if absent.
Deployment now depends_on the job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
9529eedfe0 docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip]
Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to
return 200 instead of proxying to the scaled-to-0 poison-fountain.
- security.md Layer 1 + tarpit description + troubleshooting (fix stale
  stacks/platform path -> traefik stack; drop misleading
  restart-poison-fountain step).
- .claude/CLAUDE.md: add matrix to PG rotation list; document that
  startup-read secret consumers need a Reloader annotation (matrix root
  cause, found via Loki 2026-06-05).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
Viktor Barzin
9ad7756a94 traefik: make bot-block-proxy a clean no-op while poison-fountain is at 0
bot-block-proxy is the forward-auth target for the ai-bot-block
middleware (applied to every anti-AI ingress). It proxied /auth to the
poison-fountain bot trap with error_page 5xx=200 fail-open. But
poison-fountain is intentionally scaled to 0, so proxy_pass only ever
failed and fell open to '200 allowed' -- while logging ~51k errors/hr
(the #1 Loki source once pod logs began shipping 2026-06-05) and paying
up to 100ms connect-timeout per authed request.

Short-circuit /auth to 'return 200 "allowed"' directly (drop the
upstream + proxy_pass + fallback). Identical effective behaviour
(allow-all), no upstream attempt, no noise, no latency. Reversible:
restore the upstream + proxy_pass and scale poison-fountain up.

Also add the missing configmap.reloader.stakater.com/reload annotation
so openresty picks up ConfigMap changes (it does not hot-reload on its
own -- the root reason stale config ran for days). replicas stays 2:
critical-path forward-auth target (anti-AI ingresses fail closed if it
is down), so HA is retained though each request is now trivial.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
root
d70a99dc48 Woodpecker CI deploy [CI SKIP] 2026-06-06 16:51:26 +00:00
Viktor Barzin
d661d074ef matrix: auto-reload Synapse on DB credential rotation (Reloader)
Synapse injects the Postgres password into homeserver.yaml only at
startup (inject-db-password initContainer). matrix-db-creds is rotated
by Vault via ESO (15m refresh), so each rotation left the running pod
with a stale password and Synapse DB auth failed silently until a
manual rollout restart. Found today via Loki: ~12.9k/hr 'password
authentication failed for user matrix' lines; secret password verified
working against the DB while the 10-day-old pod held the pre-rotation
value.

Add the explicit secret.reloader.stakater.com/reload annotation so
Reloader rolls the deployment whenever the secret changes (explicit
form, not auto/search, because the secret is referenced only in an
initContainer env var). Live pod already restarted to restore service;
this prevents recurrence on the next rotation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00
root
e7ece3eaf9 Woodpecker CI deploy [CI SKIP] 2026-06-06 16:51:26 +00:00
root
02366103ef Woodpecker CI deploy [CI SKIP] 2026-06-06 16:51:26 +00:00
Viktor Barzin
d808694af4 docs(storage): record harden-half shipped (orphan cleanup + ghost-reconcile)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2a orphan cleanup (67 Released PVs + 475 LVs removed, VG pve 997->~410) + 2b
csi-ghost-reconcile CronJob done — ghost-disk doom loop closed by construction,
beads code-dfjn retireable. Cap kept at 28 (lowering would reverse the
2026-05-25 eviction-cascade post-mortem fix). Phase-1: insta2spotify migrated
(noted its 3.26GB image re-pull blip on node reschedule).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:39:36 +00:00
Viktor Barzin
1b9d4f1233 storage: migrate insta2spotify off proxmox-lvm to NFS (LUN relief, Phase 1)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was canceled
Config-only PVC (no embedded DB), preflighted. Frees one proxmox-csi slot.
NB: pod reschedule re-pulled the 3.26GB backend image (~6min stall) — large-image
services incur a pull-delay blip when migration moves them to a fresh node.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:38:01 +00:00
Viktor Barzin
355ca3ee91 proxmox-csi: auto-reconcile CronJob to detach ghost disks (code-dfjn prevention)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Closes the ghost-disk doom loop by construction (failed detach -> orphan scsiN
with no VolumeAttachment -> invisible oversubscription -> query-pci wedge).
Every 15min csi-ghost-reconcile compares each worker VM's real scsi disks
(Proxmox API) vs k8s VolumeAttachments and safely detaches ghosts (PUT
.../config delete=scsiN -> frees the LUN slot, retains the LV).

- detection mirrors cluster-health check #47
- SAFETY: only vm-9999-pvc scsi with no matching VA; 60s re-confirm; per-run cap 5
- scoped CSI API token (VM.Config.Disk), not root SSH; k8s API via injected ClusterIP
- verified live: read 66 VAs, 0 ghosts, no false positives
- pushes csi_ghosts_detected/detached to Pushgateway

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:25:36 +00:00
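The detection and safety rules listed in the commit above (only `vm-9999-pvc` disks with no matching VolumeAttachment, a re-confirm pass, a per-run cap of 5) can be sketched as pure functions. The data shapes and function names here are hypothetical; the real CronJob reads the Proxmox and Kubernetes APIs, which this sketch leaves out.

```python
from typing import Dict, List, Set, Tuple

PVC_MARKER = "vm-9999-pvc"  # only CSI-provisioned disks are ever candidates
PER_RUN_CAP = 5             # never detach more than this in one run

Ghost = Tuple[str, str]     # (vm name, scsi slot)


def find_ghosts(vm_disks: Dict[str, Dict[str, str]], attached: Set[str]) -> Set[Ghost]:
    """Return (vm, slot) pairs whose CSI disk has no VolumeAttachment.

    vm_disks maps vm -> {scsi slot -> volume name} (hypothetical shape);
    attached is the set of volume names referenced by VolumeAttachments.
    """
    ghosts: Set[Ghost] = set()
    for vm, disks in vm_disks.items():
        for slot, volume in disks.items():
            if PVC_MARKER in volume and volume not in attached:
                ghosts.add((vm, slot))
    return ghosts


def plan_detach(first_scan: Set[Ghost], second_scan: Set[Ghost],
                cap: int = PER_RUN_CAP) -> List[Ghost]:
    """Keep only ghosts seen in both scans, then apply the per-run cap."""
    confirmed = sorted(first_scan & second_scan)
    return confirmed[:cap]
```

Intersecting two scans taken some time apart means a disk that is mid-attach during the first scan is not treated as a ghost, and the cap bounds the damage if detection is ever wrong.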
Viktor Barzin
e311cbe103 chore(modules): remove vestigial audiblez-web copy + fix glossary note [ci skip]
modules/kubernetes/ebook2audiobook/ held a tracked copy of the audiblez-web
app source (24 files), sourced by no stack and built by no CI — audiblez-web
is GHA-built from its own repo. Bulk-swept in 2026-04-15; removed.

Also corrected CONTEXT.md: the "vestigial per-app dirs (immich/, ollama/,
...)" note was wrong — those were untracked local macOS cruft (._main.tf
AppleDouble turds), never in the repo; cleaned from the working tree.
modules/kubernetes/ now holds exactly the four factory modules
(ingress_factory, nfs_volume, anubis_instance, setup_tls_secret).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:38:13 +00:00
Viktor Barzin
a42f4f7b26 trek: trial-deploy TREK group-trip planner behind Authentik (solo eval)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment
trial to evaluate the self-hosted group-trip use case before building a
custom app. Solo, single shared instance, Authentik forward-auth.

- stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel),
  service 80->3000, ingress_factory auth=required + proxied DNS at
  trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data +
  uploads) -- encrypted per the sensitive-data rule and to avoid the
  SQLite-over-NFS locking hazard.
- Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC,
  bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented
  in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO).
- kyverno: add mauriceboe/* to require-trusted-registries allowlist (the
  policy is Enforce since 2026-05-19 -- also fixed the stale "stays in
  Audit" header comment that said otherwise and misled the deploy).
- Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll
  companion deferred per solo-trial scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:30:07 +00:00
Viktor Barzin
63182730f9 docs(storage): record Wave-2 NFS migration + harden-proxmox-csi decision (option 1)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC
mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on
single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS
(tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots.

- storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history,
  updated cap-relief levers
- topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the
  decision doc

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:15:21 +00:00
Viktor Barzin
a0b34750ee storage: migrate hackmd uploads off proxmox-lvm-encrypted to NFS (LUN-cap relief)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
codimd is MySQL-backed; this PVC holds only pasted image uploads (subPath
hackmd, 4.5M) — no embedded DB, NFS-safe. Drops LUKS-at-rest for these
low-sensitivity images (accepted). Frees one proxmox-csi SCSI-LUN slot on node6.

- swap hackmd-data-encrypted -> nfs_volume module (subPath preserved)
- uploads copied + verified (20 files, HTTP 200, codimd listening)
- block PVC removed; LV retained per SC policy (code-dfjn cleanup)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:11:31 +00:00
Viktor Barzin
e35d693972 storage: migrate send off proxmox-lvm to NFS (LUN-cap relief)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Send (timvisee/send) stores encrypted upload blobs on disk with metadata in
Redis — no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot on node2.

- swap send-data-proxmox -> nfs_volume module
- blobs copied + verified (273M, 22 entries, HTTP 200 on NFS)
- block PVC removed; LV retained per SC policy (code-dfjn cleanup)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:04:37 +00:00
Viktor Barzin
c24b4a21d8 docs(architecture): fix stale 5-node claim -> 7 nodes (k8s-node1..6) [ci skip]
Cluster grew to 7 nodes (k8s-master + node1..6; node5/6 added ~10d ago)
but several docs still said "5 nodes". Corrected with live specs:

- overview.md: 7-node enumeration; node1 is 16c/48GB (doc wrongly said
  32GB), node2-6 are 8c/32GB general workers
- compute.md: "5-node" -> "7-node" cluster description
- dns.md: NodeLocal DNSCache DaemonSet "5 nodes" -> "7 nodes"
- mailserver.md: HAProxy backend diagram "node1..4" -> "node1..6"

Illustrative "0/5 nodes available" scheduler-error examples left as-is.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:03:58 +00:00
Viktor Barzin
bf3608052b tripit: GEOCODER_PROVIDER=openmeteo for per-city itinerary weather
Enables Open-Meteo geocoding of lodging addresses (results cached in the
new geocode_cache table) so the itinerary can show per-city weather.
Applied manually via scripts/tg apply.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 20:01:31 +00:00
Viktor Barzin
6eb683b6e0 storage: migrate speedtest off proxmox-lvm to NFS (LUN-cap relief)
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
speedtest-tracker is MySQL-backed (config dir = Laravel config + logs, no
embedded DB), NFS-safe. Frees one proxmox-csi SCSI-LUN slot.

- swap speedtest-config-proxmox -> nfs_volume module
- config copied + verified (HTTP 302->login,200); excluded 383MB laravel.log
- block PVC removed; LV retained per SC policy (code-dfjn cleanup)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:59:56 +00:00
Viktor Barzin
060aefbd0b storage: migrate changedetection off proxmox-lvm to NFS (LUN-cap relief)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
changedetection uses a file-based JSON datastore (url-watches.json + per-watch
dirs + brotli snapshots) — no embedded DB, NFS-safe. Frees one proxmox-csi
SCSI-LUN slot. Part of harden-proxmox-csi+NFS plan.

- swap changedetection-data-proxmox -> nfs_volume module
- data copied + verified (HTTP 200, 4 watches loaded); excluded 200MB test cruft
- block PVC removed; block LV retained per SC policy (code-dfjn cleanup)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:55:03 +00:00
Viktor Barzin
52f5de905d docs(context): freshen infra glossary (modules, tiers, new concepts) [ci skip]
Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs):

- Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app
  factory modules (never existed); name the real four (ingress_factory,
  nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local
  / flat distinction; flag vestigial modules/kubernetes/<app> dirs.
- Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers);
  reserve "tier" for State tier + Namespace tier only.
- Add local-path entry (cluster default SC; node-local footgun warning).
- Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico.
- Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC).
- Fix node count 5 -> 7 (k8s-master + k8s-node1..6).

Doc-sync (same commit per repo rules):
- overview.md: replace fictional factory modules with the real shared
  modules + the flat/stack-local pattern.
- .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision
  table + stale cross-reference (vault migrated off it 2026-04-25).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:34:49 +00:00
Viktor Barzin
aa948be581 storage: migrate tandoor off proxmox-lvm to NFS (LUN-cap relief)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC
is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden
proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see
docs/plans/2026-06-05-block-storage-harden-nfs-design.md.

- swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC)
- data copied + verified (HTTP 200 on NFS volume); block PVC removed
- block LV retained per SC policy (orphan cleanup tracked in code-dfjn)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:34:47 +00:00
Viktor Barzin
febf12bddd mail(tripit): send From: plans@viktorbarzin.me instead of spam@
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
tripit outbound (linked-email verification + trip-share invites) was sent
From: spam@viktorbarzin.me. Switch the From to plans@viktorbarzin.me while
keeping SMTP auth as spam@ (its password, unchanged).

docker-mailserver SPOOF_PROTECTION (reject_sender_login_mismatch) requires
the authed login to "own" the From; the @viktorbarzin.me catch-all does NOT
grant that per-address, so add an explicit `plans@ -> spam@` virtual alias to
authorize it (also keeps inbound plans@ routing to spam@ for the mail-ingest
poller). tripit SMTP_FROM flips to plans@.

Verified: sender-login probe (auth spam@, MAIL FROM plans@) now 250 (was 553);
a real send from the tripit pod logs from=<plans@viktorbarzin.me> accepted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:41:08 +00:00
Viktor Barzin
bc33cd5ac4 monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook
The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.

Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:18:31 +00:00
Viktor Barzin
f526af694d monitoring: snmp-idrac scrape 1m->30s — faster HA dashboard iDRAC refresh
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The ha-sofia R730 REST sensors (via prometheus-query.lan) + Grafana iDRAC
panels were bound to the 1m snmp-idrac scrape. Halved to 30s so the
dashboard-it Server view refreshes uniformly at 30s, matching the
fan-control daemon's Pushgateway metrics. SNMP scrape ~3-4s; timeout 15s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:52:07 +00:00
Viktor Barzin
5b5b855528 monitoring(alloy): drop goflow2 + vpa logs from Loki to cut sdc write wear
goflow2 emits ~8 GB/day of per-flow NetFlow JSON to stdout (~64% of all cluster
log volume) but only its Prometheus aggregate metrics are used; vpa is ~1.3
GB/day of Goldilocks/VPA recommender chatter. Both are low-value and were
landing in Loki (PVC on the contended sdc HDD). Drop them at the Alloy relabel.
Reversible (remove the drop rule). Loki ingestion drops ~73%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:44:47 +00:00
Viktor Barzin
dbe115910f monitoring: add local-only prometheus-query.lan ingress for ha-sofia SNMP sensors
ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage,
fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120,
~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST
Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac
scrape), scan_interval 30.

This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` →
`prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to
`/api/v1/query` (read-only instant-query only — not the UI/admin/federation).
ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from
a REST sensor), so this mirrors the existing local-only `.lan` exporter
ingresses HA already queries.
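
A minimal sketch of what a client of this ingress does, assuming plain HTTP and a placeholder metric name (the commit only says the SNMP series are relabeled `r730_idrac_*`). The host and the `/api/v1/query` path come from the message above; the helper names are hypothetical, and the response shape is the standard Prometheus instant-query vector format.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Host and path are from the commit message; the http scheme is an assumption.
BASE_URL = "http://prometheus-query.viktorbarzin.lan/api/v1/query"


def build_instant_query_url(promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{BASE_URL}?{urlencode({'query': promql})}"


def first_sample_value(body: str) -> Optional[float]:
    """Extract the first sample from an instant-query vector response.

    Returns None when the query failed or matched no series.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        return None
    results = payload["data"]["result"]
    if not results:
        return None
    # Each vector sample is [unix_timestamp, "value-as-string"].
    return float(results[0]["value"][1])
```

A REST sensor would fetch the URL and apply the equivalent of `first_sample_value` to the body; an empty result is the case that would show up as `unavailable`.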

The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`)
was edited in place (auto-version-controlled by the HA version-control add-on;
pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The
Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan`
was added manually via the API — like the other `.lan` exporter hosts it is NOT
auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me`
records). Follow-up (already noted for the Loki sensor): extend that sync to
manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now
vestigial (HA no longer reads it).

Verified: all 7 HA sensors report correct fresh values from Prometheus (fan
10800 rpm, CPU 62.0C, power 280W, PSU 230/240V).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:25:06 +00:00
Viktor Barzin
b7cb74f1b5 docs(monitoring): cluster log aggregation (Alloy fix) + Cluster Logs dashboard + HA sensors [ci skip]
Document the 2026-06-05 cluster-wide log observability work: the Alloy
local.file_match fix (loki.source.file doesn't expand globs) + stage.cri, the
new "Cluster Logs" Grafana dashboard, the ha-sofia cluster-log-health REST
sensors, and the loki.viktorbarzin.lan Technitium-record follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:15:57 +00:00
Viktor Barzin
7501c2be5d monitoring(grafana): add professional "Cluster Logs" dashboard (Logs folder)
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Cluster-wide Loki log observability now that pod logs flow (Alloy fix). New
dashboards/cluster-logs.json (Loki DS P8E80F9AEF21F6940): namespace/app/pod
dropdowns + free-text regex search; stats (lines/errors/warns/active-ns),
log-volume-by-namespace, error/warn rate, top-namespaces-by-errors,
top-pods-by-errors, a filterable live-logs panel, and a second row for the
node + rpi-sofia systemd journals (volume-by-level + error/warn journal panel).
Error/warn use case-insensitive regex line-filters so they work regardless of
level-label availability. New "Logs" Grafana folder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 17:03:45 +00:00
Viktor Barzin
bb0099b747 monitoring(alloy): fix broken pod-log shipping (missing local.file_match) + parse CRI
Cluster pod logs were NOT reaching Loki — only node/Pi journals were. Root cause:
loki.source.file was fed the /var/log/pods/*<uid>/<container>/*.log glob directly
from discovery.relabel, but loki.source.file does NOT expand globs, so it stat()'d
the literal `*` path and shipped zero pod logs ("stat failed: no such file" for
every pod). Per Grafana Alloy docs, a local.file_match component must expand the
glob into concrete file targets first. Add it. Also add stage.cri {} so Loki
stores clean messages + real timestamps instead of raw containerd CRI-prefixed
lines. Fixes cluster-wide log observability (regression vs the working 2026-05-26
state). Ship-all-then-measure per the agreed plan; Alloy mem limits stay as the
IO-storm safeguard.
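
The failure mode can be illustrated outside Alloy. The following Python sketch is an analogy, not Alloy configuration: checking a glob pattern as if it were a literal path finds nothing, while expanding it first yields the concrete log file. The directory and container names are hypothetical.

```python
import glob
import os
import tempfile
from typing import List, Tuple


def literal_path_exists(pattern: str) -> bool:
    """Treat the pattern as a literal path, as a plain stat() would."""
    return os.path.exists(pattern)


def expand_glob(pattern: str) -> List[str]:
    """Expand the pattern into concrete file paths first."""
    return sorted(glob.glob(pattern))


def demo() -> Tuple[bool, List[str]]:
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical pod directory and container names.
        log_dir = os.path.join(root, "ns_pod_1234", "app")
        os.makedirs(log_dir)
        with open(os.path.join(log_dir, "0.log"), "w") as fh:
            fh.write("hello\n")
        pattern = os.path.join(root, "*1234", "app", "*.log")
        matches = [os.path.relpath(p, root) for p in expand_glob(pattern)]
        return literal_path_exists(pattern), matches
```

Here `literal_path_exists` plays the role of the component that was handed the raw glob, and `expand_glob` plays the role of the expansion step that was missing.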

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:57:44 +00:00
Viktor Barzin
6b1d23abbd monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant
The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.

- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
  (coolingDeviceReading + location lookup) and an amperageProbeLocationName
  lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
  primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
  fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
  temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
  system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
  metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
  table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
  idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.

SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.
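
As a rough sketch of the value handling described above: the enum values (OK=3, on=4) and the tenths-of-a-degree scaling are taken from this message, while the constant and function names are illustrative and which SNMP objects each enum applies to is not stated here.

```python
# Enum values quoted in the commit message; names are illustrative only.
DELL_STATUS_OK = 3
DELL_STATE_ON = 4


def status_is_ok(raw: int) -> bool:
    """True when a DellStatus-style health value reports OK."""
    return raw == DELL_STATUS_OK


def state_is_on(raw: int) -> bool:
    """True when a state value reports 'on'."""
    return raw == DELL_STATE_ON


def tenths_to_celsius(raw: float) -> float:
    """Convert a tenths-of-a-degree reading to degrees Celsius."""
    return raw / 10.0
```

An alert or panel expression would apply the same comparisons and the same division by 10 to the SNMP series.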

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:33:20 +00:00
Viktor Barzin
6442978f07 fan-control: merge Fan %/RPM dashboard cards + RPM estimate fallback [ci skip]
The Fan % and Fan RPM sensor-graph cards had identical trend shapes (RPM ∝ %),
so merge them into one "Fan speed" card: % trend (stable Pushgateway sensor) +
RPM beneath. RPM reads sensor.r730_fan_speed (Redfish), which blips out
intermittently; when it is unavailable the card falls back to the calibrated
estimate (rpm≈160·%+1520, shown with a "~" prefix), so the readout never goes blank.
The Override readout likewise shows both "% · rpm". HA-side only; daemon
unchanged.
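
The fallback can be sketched in Python as follows. The linear estimate (160·% + 1520) and the "~" prefix come from this message; the function names and the exact readout format are assumptions, and the real card lives in Home Assistant rather than Python.

```python
from typing import Optional


def estimate_rpm(percent: float) -> float:
    """Calibrated linear estimate from the commit message: rpm = 160 * % + 1520."""
    return 160 * percent + 1520


def fan_rpm_readout(sensor_rpm: Optional[float], percent: float) -> str:
    """Prefer the measured RPM; otherwise show the estimate with a '~' prefix."""
    if sensor_rpm is not None:
        return f"{round(sensor_rpm)} rpm"
    return f"~{round(estimate_rpm(percent))} rpm"
```

For example, `fan_rpm_readout(None, 50)` gives `"~9520 rpm"`, while a live reading is shown without the prefix.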

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 14:31:32 +00:00
Viktor Barzin
722a1c9b42 docs(monitoring): document rpi-sofia off-box monitoring + log shipping [ci skip]
Add an "External host: rpi-sofia" section to docs/architecture/monitoring.md
covering the 2026-06-05 setup: node_exporter + vcgencmd textfile metrics; the
full-journal promtail->Loki shipping (job=rpi-sofia-journal — kernel/dmesg via
the (none) unit + all systemd units, labeled by unit/level); the RPi Sofia
alert group; the dashboard; and the systemd watchdog. Notes the SD-card root
cause and that the Pi-side config is hand-managed + backed up off-box.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 14:25:20 +00:00