Viktor asked to root-cause the frequent t3 code disconnects and rule
infra in or out. The tunnel pods ran bare 'cloudflared tunnel run':
every Cloudflare release made the binary self-update and exit (code 11),
restarting all 3 pods and severing every WebSocket riding the tunnel —
one of the confirmed infra-side drop causes (pods cycled 2026-06-09
20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts,
not in-place binary swaps.
The Actual web app boots with ~70 near-parallel requests (55
/data/migrations/*.sql + statics, all served cache-control max-age=0 so
every page load re-validates them). The shared rate-limit middleware
(average 10, burst 50) 429s the tail of that storm, so every cold boot
shows 'Server returned an error while checking its status' and every
load stalls in retry backoff — measured up to 5min stalls when two
loads from one IP overlap. Viktor asked to relax the limit after the
anca slow-load investigation (beads code-7zv).
Same pattern as immich: dedicated actualbudget-rate-limit middleware in
the traefik stack, budget-* ingresses opt out of the default via
skip_default_rate_limit + extra_middlewares.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Completes the internal port table of the mail front door (10.0.20.1):
443 was squatted by the pfSense webGUI (self-signed cert expired 2022),
so internal webmail and the kuma [External] mail probe hit the firewall
login instead of Roundcube — the last leg of the mail split-brain name.
Design (Viktor): route by what the client asked for. New HAProxy
frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp):
SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge
pattern, no health check per the PROXY-probe gotcha); SNI of
pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI,
which moved to :8443 (invisible to habits — https://10.0.20.1 still
lands on the login page; :8443 doubles as direct fallback). The
reverse-proxy pfsense ingress now targets :8443 directly.
Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml
backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified:
bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI;
pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me ->
Roundcube with STRICT cert validation; :993 IMAPS untouched.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to verify the book-plotting push->build->deploy chain.
The chain itself is healthy, but the Terraform baseline image said
ancamilea/book-plotter:latest while CI (GHA on
PassionProjectsAnca/Plotting-Your-Dream-Book) builds and deploys
viktorbarzin/book-plotter:<sha8> + :latest — a from-scratch apply
would have resurrected a stale March image. Baseline now
viktorbarzin/book-plotter:latest. No live change: the running tag is
CI-owned via ignore_changes, plan confirms the image attr is ignored.
[ci skip] deliberately: plan shows UNRELATED pre-existing drift on
this stack (live ns labels managed-by=vault-user-onboarding +
resource-governance/custom-quota=true would be stripped; deployment
keel.sh/policy=patch annotations removed) — auto-applying that needs
its own reviewed pass.
The roundtrip probe kept failing after the SPF/MX fix: rspamd's actual
junk-score driver was R_DKIM_PERMFAIL(+4.5) on selector brevo2 — Brevo
signs with brevo1/brevo2._domainkey, which are CNAMEs to
b{1,2}.viktorbarzin-me.dkim.brevo.com in public DNS and were absent
from the internal zone (the earlier existence check used ANY queries,
which Cloudflare refuses per RFC 8482 — false negative). The DKIM
permfail also cascaded into DMARC_POLICY_SOFTFAIL(+1.5), totalling the
6.09/6.0 junk threshold; sieve filed probes into \Junk where the INBOX
poll never finds them.
ingress-dns-sync now maintains both selector CNAMEs. Ops notes: rspamd
caches DNS (restart to flush after zone fixes); CoreDNS denial cache
holds NXDOMAINs up to 300s. Verified: roundtrip SUCCESS in 20.5s.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two fixes from the post-DNS-internalization health sweep:
1. The internal viktorbarzin.me zone served only ingress A/CNAME records.
Since the mailserver pods now resolve the domain through it (CoreDNS
viktorbarzin.me:53 -> Technitium, 59a531b8), rspamd's SPF checks on
inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the
Brevo email-roundtrip probe failed from the 16:20 run onward
(EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also
maintains the static mail-auth records (SPF, brevo-code TXT, MX;
DMARC + DKIM were already present), idempotently. Principle: the
internal zone must be a SUPERSET of the public zone for every record
type internal clients consume. Verified in-pod: all four types
resolve; roundtrip re-probe green.
2. cluster_healthcheck #30 queried instant `up`, which goes stale for
~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant
job -> intermittent false "redfish-idrac=missing". Now uses
last_over_time(up[15m]) — same answers for fast jobs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
all pfSense-SNAT'd cluster traffic does).
Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.
Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).
Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.
NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):
- pfSense Unbound domain override viktorbarzin.me -> Technitium
10.0.20.201 (applied via php write_config, backup on-box). Every
Unbound client on every VLAN now gets the internal split-horizon
answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
the ETP=Local LB IP pfSense now returns), all other .me names kept on
public resolvers (pods' pre-existing behavior). Replaces the .:53
forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
bullet, post-mortem final-architecture addendum.
Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.
Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).
tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.
Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):
- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
kept OOMing against. Size for the push spike.
- Activate registry retention (DRY_RUN false). Verified the delete list
against all running viktor/* images first: 0 running images affected.
Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.
- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
scopes container packages per-user, so DELETE on viktor/* returned 403 (the
dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
viktor's write:package PAT. Retention had never actually worked.
- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
gentler-builds layer cache survives daily pruning.
[ci skip] — already applied via scripts/tg.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Boarding-pass images, no embedded DB. Drops LUKS-at-rest (low-sensitivity, accepted).
21.8M copied + verified on NFS; pod 2/2 on NFS; frees one proxmox-csi slot.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New authentik_group 'T3 Users' (members wizard/emo/ancamilea via data lookups — usernames ARE their emails in this Authentik instance) + a branch in the admin-services-restriction expression policy gating t3.viktorbarzin.me to that group, placed BEFORE the ADMIN_ONLY_HOSTS early-return. Surgical two-step targeted apply (group-with-members first, then the gate) → zero lock-out window. Verified: group has all 3 members, the live policy contains the t3 branch, t3 still 302s to Authentik. Membership is HCL for now (FUTURE: roster-reconciled via the Authentik API).
Note: the authentik stack had 3 unrelated pending drift changes (pgbouncer deployment + 2 tls_secrets) — deliberately NOT applied (targeted apply isolated this change; left for the stack owner).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged.
Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser
challenges break native clients). Bot defense is layered instead:
- Traefik rate-limit Middleware on a path-scoped /register ingress carve-out,
keyed on request Host (GLOBAL /register cap) not source IP — the host is
reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE
tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min,
burst 20, per replica; CrowdSec is the hard backstop on both paths.
- Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing
#security Slack receiver (matches "registered on this server", never the
rejection line). tuwunel's admin bot also posts signups to the admin room.
Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert).
Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so
[ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).
Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.
Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- ExternalSecret gains SLACK_SIGNING_SECRET / TREK_USER / TREK_PASSWORD /
CLAUDE_AGENT_TOKEN (SLACK_BOT_TOKEN reused from nudges).
- New auth=none ingress carve-out /api/planner/slack (Slack v0 signature-gated,
same pattern as the calendar + emails-confirm carve-outs).
- Remove the superseded standalone stacks/trip-planner (merged into tripit per
the "future travel logic goes in tripit" policy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prod ran FLIGHT_PROVIDER=fake, so every flight gate/terminal/time/position was
fabricated from a hash and never matched reality. Switch to real providers:
- FLIGHT_PROVIDER=aerodatabox (RapidAPI free BASIC; AERODATABOX_API_KEY via the
tripit-secrets ExternalSecret)
- RAIL_PROVIDER=realtimetrains (RTT_API_TOKEN, already in Vault)
- poll-flights cron */30 -> hourly to respect the free 600 req/month cap
(provider also self-throttles to <=1 req/sec)
Verified live: /api/segments/<LS1468>/status returns source=aerodatabox with
real schedule/terminal/aircraft.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New public static site at stem95su.viktorbarzin.me serving the school's
Bulgarian STEM platform (dashboard + lessons/games, externally authored
HTML/media exported from Gemini).
- Stock nginx:1.28-alpine serving /srv/nfs/stem-site read-only (nfs_volume),
NOT image-baked — content updated out-of-band (Nextcloud "PVE NFS Pool"
or rsync), no rebuild; auto-backed-up offsite by nfs-mirror.
- ingress_factory auth="none" (open; CrowdSec + ai-bot-block at the edge),
dns_type="proxied" (Cloudflare CNAME auto-created).
- nginx ConfigMap sets index stem_board.html (the dashboard) for "/".
- Docs: service-catalog entry + new "Static Site Hosting" pattern
(NFS-backed vs image-baked) in patterns.md.
Applied via scripts/tg apply; verified live end-to-end (dashboard, 20MB
page, video byte-range, no Authentik redirect) through the public
Cloudflare path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Root cause of "barely serving films": Real-Debrid's May-2026
infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate
new content), while degraded sources starved candidates. RD account +
popular-title availability were healthy throughout (library 32/36
unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams).
Runtime config (AIOStreams PG, applied via API — not in this diff):
- Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title)
and was silently dropping the bulk of its results at the 5s cutoff;
Interstellar 430 -> 987 streams after the bump.
- Removed MediaFusion preset: broken upstream ("Invalid configuration"
-> 500 Internal Server Error), contributed 0 usable streams, only a
dead [X] entry in every list.
This diff (Terraform):
- Harden aiostreams-stream-probe: test series AND movie paths, per-source
breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count,
success gated on Comet being alive. The old probe counted only Breaking
Bad streams and stayed green while new-content playback was broken.
- service-catalog: reflect source set + probe behaviour.
[ci skip] — probe already applied via targeted `tg apply` + verified
(series=378 movie=898 comet=206 errors=0 success=1); skipping the full
servarr reconcile to avoid touching unrelated pre-existing drift
(qbittorrent MetalLB annotation, tls_secret cert revert).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so
drop the :latest+force+match-tag digest workaround and track semver properly:
policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to
climb to higher semver tags), image floor pinned to v1.1.3. Pull policy ->
IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for
the mutable :latest). Running v1.1.3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resource changes/deletions are now attributable (the novelapp deletion this week
was untraceable because apiserver audit was off). Low-write policy: drops
reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into
the kube-apiserver static-pod manifest + kubeadm-config (v1beta4
extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails
/var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}.
Root cause that had silently blocked this AND OIDC for weeks: a stray
kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate
static-pod manifest kubelet ran instead of the real one, dropping every flag
added to the real manifest. Removed it. Runbook added.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as
`v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see
past the highest parseable tag. :latest correctly points at the newest release,
so switch to force + match-tag digest-tracking of :latest (Kyverno does not
manage match-tag, contrary to the stale code comment). Imports the live
Deployment (recreated out-of-band 2026-06-06) back into TF state; running image
flipped to :latest -> now on v.1.1.1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling
secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database
(static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init
container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m
cpu request), ClusterIP service port 8080, and ingress_factory with auth=none
(Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied;
requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1
Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated
in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1
Endpoints API will essentially never be removed (clients keep working
indefinitely), and even the latest Calico still watches Endpoints
(projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure
cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact
deprecation message), mirroring the mailserver drop. Real calico
warnings/errors kept; reversible. Validated with alloy fmt (exit 0).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dawarich 1.6.1 shipped the RailsPulse perf-monitoring gem, which scheduled
an hourly Sidekiq SummaryJob INDEPENDENT of its disabled flag; the job hit
rails_pulse_routes (no primary key) and retry-looped, logging ~125
UnknownPrimaryKey lines/hr (found via Loki triage 2026-06-06). Upstream
removed RailsPulse entirely in 1.7.x (commit a5172cc) with a
DropRailsPulseTables migration; 1.7.11 is latest stable. Keel only
auto-applies patch bumps within 1.6.x, so the minor jump is manual.
Pre-upgrade pg_dump of dawarich (79.9MB) + dawarich_queue taken to devvm.
The 5 rails_pulse_* tables are empty (feature never collected data), so
cleanup is zero-data-risk; location data (tracks/points/visits/places)
untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error
source, from the 2026-06-06 log triage):
1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative
smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file
in our postfix-main.cf override were IGNORED and triggered postfix's
'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two
legacy lines (functional no-op; chain_files already wins). Verified via
live postconf.
2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign
public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen
half-open drops, rate-limit-exceeded from unknown). Real delivery logs +
real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so
security posture is unchanged. Validated with 'alloy fmt' (exit 0).
Reversible.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
rybbit's 'rybbit' PG database was missing from CNPG (the role survived a
past cluster rebuild but the database did not), so the app's node-cron
logged 'database "rybbit" does not exist' every minute (found via Loki
2026-06-06). Created the DB manually to restore service (app auto-migrated
11 tables); this adds a self-contained init Job so the DB is recreated on
any future rebuild -- connects as the rybbit role (has CREATEDB) using the
existing rybbit-secrets password, idempotent CREATE DATABASE if absent.
Deployment now depends_on the job.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bot-block-proxy is the forward-auth target for the ai-bot-block
middleware (applied to every anti-AI ingress). It proxied /auth to the
poison-fountain bot trap with error_page 5xx=200 fail-open. But
poison-fountain is intentionally scaled to 0, so proxy_pass only ever
failed and fell open to '200 allowed' -- while logging ~51k errors/hr
(the #1 Loki source once pod logs began shipping 2026-06-05) and paying
up to 100ms connect-timeout per authed request.
Short-circuit /auth to 'return 200 "allowed"' directly (drop the
upstream + proxy_pass + fallback). Identical effective behaviour
(allow-all), no upstream attempt, no noise, no latency. Reversible:
restore the upstream + proxy_pass and scale poison-fountain up.
Also add the missing configmap.reloader.stakater.com/reload annotation
so openresty picks up ConfigMap changes (it does not hot-reload on its
own -- the root reason stale config ran for days). replicas stays 2:
critical-path forward-auth target (anti-AI ingresses fail closed if it
is down), so HA is retained though each request is now trivial.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Synapse injects the Postgres password into homeserver.yaml only at
startup (inject-db-password initContainer). matrix-db-creds is rotated
by Vault via ESO (15m refresh), so each rotation left the running pod
with a stale password and Synapse DB auth failed silently until a
manual rollout restart. Found today via Loki: ~12.9k/hr 'password
authentication failed for user matrix' lines; secret password verified
working against the DB while the 10-day-old pod held the pre-rotation
value.
Add the explicit secret.reloader.stakater.com/reload annotation so
Reloader rolls the deployment whenever the secret changes (explicit
form, not auto/search, because the secret is referenced only in an
initContainer env var). Live pod already restarted to restore service;
this prevents recurrence on the next rotation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Config-only PVC (no embedded DB), preflighted. Frees one proxmox-csi slot.
NB: pod reschedule re-pulled the 3.26GB backend image (~6min stall) — large-image
services incur a pull-delay blip when migration moves them to a fresh node.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closes the ghost-disk doom loop by construction (failed detach -> orphan scsiN
with no VolumeAttachment -> invisible oversubscription -> query-pci wedge).
Every 15min csi-ghost-reconcile compares each worker VM's real scsi disks
(Proxmox API) vs k8s VolumeAttachments and safely detaches ghosts (PUT
.../config delete=scsiN -> frees the LUN slot, retains the LV).
- detection mirrors cluster-health check #47
- SAFETY: only vm-9999-pvc scsi with no matching VA; 60s re-confirm; per-run cap 5
- scoped CSI API token (VM.Config.Disk), not root SSH; k8s API via injected ClusterIP
- verified live: read 66 VAs, 0 ghosts, no false positives
- pushes csi_ghosts_detected/detached to Pushgateway
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment
trial to evaluate the self-hosted group-trip use case before building a
custom app. Solo, single shared instance, Authentik forward-auth.
- stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel),
service 80->3000, ingress_factory auth=required + proxied DNS at
trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data +
uploads) -- encrypted per the sensitive-data rule and to avoid the
SQLite-over-NFS locking hazard.
- Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC,
bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented
in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO).
- kyverno: add mauriceboe/* to require-trusted-registries allowlist (the
policy is Enforce since 2026-05-19 -- also fixed the stale "stays in
Audit" header comment that said otherwise and misled the deploy).
- Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll
companion deferred per solo-trial scope.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Enables Open-Meteo geocoding of lodging addresses (results cached in the
new geocode_cache table) so the itinerary can show per-city weather.
Applied manually via scripts/tg apply.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC
is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden
proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see
docs/plans/2026-06-05-block-storage-harden-nfs-design.md.
- swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC)
- data copied + verified (HTTP 200 on NFS volume); block PVC removed
- block LV retained per SC policy (orphan cleanup tracked in code-dfjn)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tripit outbound (linked-email verification + trip-share invites) was sent
From: spam@viktorbarzin.me. Switch the From to plans@viktorbarzin.me while
keeping SMTP auth as spam@ (its password, unchanged).
docker-mailserver SPOOF_PROTECTION (reject_sender_login_mismatch) requires
the authed login to "own" the From; the @viktorbarzin.me catch-all does NOT
grant that per-address, so add an explicit `plans@ -> spam@` virtual alias to
authorize it (also keeps inbound plans@ routing to spam@ for the mail-ingest
poller). tripit SMTP_FROM flips to plans@.
Verified: sender-login probe (auth spam@, MAIL FROM plans@) now 250 (was 553);
a real send from the tripit pod logs from=<plans@viktorbarzin.me> accepted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.
Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>