The instagram_poster.benchmark CLI was writing scores to a sqlite file
on the pod's data PVC. Moving it to the shared CNPG cluster so the
benchmark scoring path is stateless on the pod, scores survive pod
recreation, and the rotation/backup pipeline applies automatically.
- dbaas: null_resource.pg_instagram_poster_db creates role + DB
(idempotent CREATE IF NOT EXISTS, password placeholder) — same
shape as pg_postiz_dbs / pg_wealthfolio_sync_db.
- vault: vault_database_secret_backend_static_role.pg_instagram_poster
+ add to allowed_roles. 7d rotation_period.
- instagram-poster: second ExternalSecret (vault-database store) →
K8s Secret instagram-poster-benchmark-db with BENCHMARK_PG_HOST/
PORT/USER/PASSWORD/DATABASE. env_from on the deployment.
reloader.stakater.com/match=true bounces the pod on rotation.
Code-side: instagram_poster/benchmark.py now resolves the DB URL from
BENCHMARK_DB_URL or BENCHMARK_PG_* env vars; falls back to sqlite for
local DevVM scratch runs. Schema bootstraps via Base.metadata.create_all,
no alembic step needed for the benchmark-only side-DB.
Verified end-to-end via DevVM port-forward: ESO synced, K8s Secret has
all 5 fields, pod env shows BENCHMARK_PG_*, smoke-test scoring 3 photos
landed in the new PG table with subject_category populated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update `.claude/reference/authentik-state.md`:
- Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session
Duration table with the gotcha that the gorilla session store binds
the value once at outpost startup (rollout restart needed).
- Replace the "session storage moved to Postgres in 2025.10" note that
falsely implied the migration was automatic — explain that the
`Outpost.managed` field gates the postgres path and our outpost
silently stayed on `FilesystemStore` until 2026-05-10.
- Document the goauthentik 2026.2.2 service-selector bug
(service.py:52) and the JSON-patch workaround.
- Document that the standalone embedded-outpost deployment needs
`AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the
`app.kubernetes.io/component=server` pod label.
- Note the "Terraform doesn't expose `Outpost.managed`" assumption
that holds the `managed=embedded` value in place across applies.
Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`:
- P2 codify-in-Terraform: DONE.
- P3 access_token_validity reduce: DONE-alt (we did the opposite —
bumped to 4 weeks — because postgres backend mooted the storage
concern).
- P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses
the loss-of-state class on the embedded outpost itself).
The HCL declared `IfNotPresent` since module creation but the live
deployment reconciled to `Always` somewhere along the way (likely a
Helm/operator default). Since the image is `:latest`, `Always` is the
correct value — `IfNotPresent` would skip pulling updated images on
pod restart, defeating the point of the floating tag.
Drops the lone remaining drift in the authentik stack so plan-to-zero
holds across the whole stack, not just the resources I just adopted.
Bump access_token_validity to weeks=4 (was hours=168, UI-managed in
ignore_changes). Drives the cookie Max-Age and the proxysession.expires
TTL — keeps users logged in for 28d instead of 7d.
Adopt the embedded outpost into Terraform so the postgres-session-backend
fix from earlier today (2026-05-10) is described as code:
- kubernetes_json_patches.deployment carries dshm 2Gi tmpfs, resource
requests/limits, the app.kubernetes.io/component=server pod label
(workaround for goauthentik 2026.2.2 service.py:52 selector mismatch
on standalone embedded outposts), and AUTHENTIK_POSTGRESQL__* envFrom
the shared `goauthentik` Secret so the postgres session backend can
connect to the dbaas cluster.
- kubernetes_json_patches.service replaces the controller-set selector
(which targets app.kubernetes.io/name=authentik / the goauthentik-server
pods) with the outpost's own labels — without this, endpoints are
empty and auth-proxy falls back to Basic-Auth realm "Emergency Access".
The `managed` field ("goauthentik.io/outposts/embedded") is server-set
and not in the Terraform provider's schema, so TF preserves it across
applies (writes only fields it knows about). Plan-to-zero verified.
Reason: GPU multi-tenancy (frigate + ytdlp-highlights + llama-swap +
immich-ml) was hitting 94% memory-request saturation on the old size.
The benchmark on 2026-05-10 surfaced this when llama-swap stayed
Pending despite GPU time-slicing being on (nvidia.com/gpu replicas=100)
- the actual constraint was node1 RAM, not GPU.
Procedure: drained node1, qm shutdown 201, qm set 201 --memory 49152,
qm start 201, kubelet picked up new capacity (47 GiB / 45.5 GiB
allocatable), uncordon, restored llama-swap + immich-ml.
Out-of-band qm set is the path here (not Terraform) because VMID 201
is intentionally not managed by TF yet - the telmate/proxmox provider
trips on iSCSI-disked VMs (see infra/stacks/infra/main.tf line 442).
Adopt this VM into TF once we migrate to bpg/proxmox.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 7 of the vision-LLM benchmark plan. Adds:
- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
per-model analysis, top-N agreement, cost vs cloud APIs, sample
captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
of the flash-attention flag; without the value llama-server exits
before serving any request).
- llama-cpp.md architecture doc links the report so future operators
land on the deployed-and-evaluated model from one entry point.
300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PrivateBin's XHR `POST /` (paste creation) was the trigger — Anubis's
catch-all CHALLENGE rule served an HTML challenge page where the JS
expected JSON, breaking paste creation entirely. Same shape will hit
any SPA XHR or CORS preflight on the other 8 Anubis-fronted sites
(homepage actions, kms upload-then-poll, wrongmove search refresh,
jsoncrack share, etc.) the moment it gets exercised.
Add an `ALLOW` rule keyed on `method != "GET"` between the AI/UA-block
imports and the catch-all CHALLENGE. Rationale:
* AI scrapers consume GET response bodies — they don't POST.
* State-mutating XHRs and OPTIONS preflight need to bypass the
challenge or the app breaks.
* CrowdSec + per-route rate-limit + app-level auth already cover
abuse on mutating methods, so this gives up nothing.
* Hard-deny rules for known-bad bots run first, so a declared bad
bot can't sneak through by sending a POST.
Also added a `checksum/policy` annotation on the Anubis pod template
sourced from `sha256(coalesce(var.policy_yaml, default_policy_yaml))`
so future policy changes auto-roll the deployment instead of needing
a manual `kubectl rollout restart`.
f1-stream had its own policy override (path carve-outs for SvelteKit
asset hashes and JSON data routes); mirrored the new rule there too.
Applied to all 8 Anubis-fronted stacks: blog, kms, f1-stream,
travel_blog, real-estate-crawler, homepage, cyberchef, jsoncrack.
Verified per stack: GET / returns the Anubis challenge page; POST,
PUT, DELETE, OPTIONS pass through to the backend (HTTP 301/405/502
from the upstream app, never the Anubis "not a bot" HTML).
PrivateBin's UI POSTs the encrypted blob to `/` via XHR. With Anubis in
front, the catch-all CHALLENGE rule returned an HTML challenge page
where the JS expected JSON, so paste creation failed silently for every
user. The challenge cookie didn't bypass it — Anubis appears to issue a
fresh challenge on POST regardless of cookie state.
Pastes are client-side encrypted; AI scrapers gain nothing from
indexing them, so the default `anti_ai_scraping` middleware is enough
protection. Restoring the ingress to point straight at the privatebin
service. CSP `wasm-unsafe-eval` retained — PrivateBin's zlib.wasm
needs it independent of Anubis.
This matches the rule already documented in infra/.claude/CLAUDE.md:
"DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints — clients
without JS can't solve PoW." A SPA's XHR is the same shape.
Verified: GET / returns PrivateBin HTML (not the Anubis challenge),
POST / returns PrivateBin's own JSON error envelope.
Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three
GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one
OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc.
Idle TTL 10min so models unload between benchmark batches.
Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot
download Job pulls Q4_K_M GGUF + mmproj per model, creates stable
model.gguf / mmproj.gguf symlinks so the llama-swap config is
filename-agnostic, then warms the kernel page cache.
GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml
to 0 during benchmark windows. wait_for_rollout=false so apply
doesn't block on GPU availability.
Initial use case: vision-LLM benchmark for instagram-poster
candidate scoring; future consumers (HA, agentic tooling) hit
the same endpoint via LiteLLM at the gateway.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LAN clients with DNS suffix viktorbarzin.lan now activate with zero
configuration — Windows queries _vlmcs._tcp.viktorbarzin.lan SRV by
default and the chain resolves through vlmcs.viktorbarzin.lan to the
new 10.0.20.202 KMS IP.
DNS state (Technitium primary, replicated to secondary+tertiary by the
existing technitium-zone-sync CronJob every 30 min):
- _vlmcs._tcp.viktorbarzin.lan SRV 0 0 1688 vlmcs.viktorbarzin.lan
(was: target=kms.viktorbarzin.lan)
- vlmcs.viktorbarzin.lan A 10.0.20.202 (added)
- kms.viktorbarzin.lan A 10.0.20.200 (unchanged — still the
Traefik LB for the user-facing website at kms.viktorbarzin.lan/)
vlmcs.viktorbarzin.lan was added as a dedicated KMS-server hostname
rather than retargeting kms.viktorbarzin.lan so the LAN-direct website
keeps working without depending on hairpin NAT through pfSense.
Verified end-to-end on WIN10Pro-DS32 (192.168.1.230):
slmgr /ckms → slmgr /ato → "Product activated successfully" with
"KMS machine name from DNS: vlmcs.viktorbarzin.lan:1688" and
"KMS machine IP address: 10.0.20.202". Real client IP 192.168.1.230
appears in vlmcsd log and in the slack-notifier sent line; second
activation within the dedup window correctly increments
kms_activations_dedup_skipped_total.
Bug found via E2E test against the Windows VM (VMID 300). The single
shared `state` dict in slack-notifier.py worked when vlmcsd processed
one connection at a time, but real Windows KMS activations hold the
connection open ~30 seconds (handshake + keep-alive). During that
window vlmcsd accepts other concurrent connections — most relevantly
the new kubelet TCP readiness probe every 5s — and each new OPEN line
reset the shared state, wiping the in-flight activation's
app/product/host before its CLOSE arrived. Result: real activations
were misclassified as probes (no Slack post, no metric increment).
Fix: state is now a dict keyed by `ip:port` with one sub-dict per
in-flight connection. A `__current` pointer tracks the most recent
OPEN so unkeyed detail lines (Application ID, Workstation name, etc.)
can be attributed correctly — vlmcsd writes detail lines immediately
after the OPEN and before any subsequent OPEN, so the heuristic holds.
Orphan CLOSEs (notifier started mid-conn) are now silently dropped
instead of emitting an empty probe event.
Two new regression tests:
- test_kubelet_probe_during_long_activation: 5s probe interleaved into
a 31s activation block — exact production failure mode.
- test_orphan_close_no_event: bare CLOSE without prior OPEN.
Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato
on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier
posted to Slack with ip=192.168.1.230 source=external
product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan'
and kms_activations_total{product=Windows 10 Professional,
status=Licensed} 1 — real WAN client IP preserved through the
ETP=Local + dedicated MetalLB IP chain end to end.
After a rollout-restart, the main container (default Always for :latest)
pulled the new image with alembic 0003, but the init container
defaulted to IfNotPresent and reused a cached old image lacking 0003 →
"Can't locate revision identified by '0003'" → CrashLoopBackOff.
Setting Always on the init container so both containers stay in lockstep
across rollouts. Longer term we should switch the deployment to 8-char
git-SHA tags per the cluster policy in .claude/CLAUDE.md, but this
unblocks the Wave 1 deploy in the meantime.
Two coupled fixes for the hourly Slack noise + missing client IPs:
1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP
10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real
WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips
kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19.
Sharing 10.0.20.200 is blocked because all 10 services there are
ETP=Cluster and MetalLB requires consistent ETP per shared IP.
2. Slack notifier now suppresses Slack posts for bare TCP open/close
pairs (no Application/Activation block) — these are Uptime Kuma's
port monitor and the new kubelet readiness/liveness probes. Probe
counts go to a new metric kms_connection_probes_total{source} where
source classifies the IP as internal_pod / cluster_node / external.
Real activations are unaffected.
Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod
Ready on the listener actually being up — required for ETP=Local so
MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving.
pfSense side (applied separately, not codified):
- New alias k8s_kms_lb = 10.0.20.202 (KMS-only)
- WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb
- All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks,
smtps, etc.) untouched
Runbook updated. Tests added for classify_source / is_probe / process_line.
Adds 3 new keys (ACTUALBUDGET_API_URL/KEY/SYNC_ID) sourced from
Vault secret/fire-planner so the FastAPI backend can read viktor's
spending from the in-cluster actualbudget HTTP API and prefill the
Annual spending field on the WhatIf form. Vault keys seeded manually
ahead of this commit; ESO has already synced the K8s Secret.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The per-site `x402_instance` module created one Deployment + Service +
PDB per protected host (9 in total, 9×64Mi). Every pod was running the
exact same logic with the same config — the only thing that varied
was the upstream URL, which we don't even need since the gateway can
return 200 to "allow" and Traefik handles the upstream itself.
Refactor to the same pattern as `ai-bot-block`:
* single deployment + service in `traefik` namespace, 2 replicas, HA
* Traefik `Middleware` CRD `x402` (forwardAuth → x402-gateway:8080/auth)
* each consumer ingress just appends `traefik-x402@kubernetescrd` to
its middleware chain via `extra_middlewares`
x402-gateway gains a `MODE=forwardauth` env var that returns 200 (allow)
or 402 (with x402 PaymentRequiredResponse body) instead of reverse-
proxying. Image: ghcr ... f4804d62.
Pod count: 9 → 2 (78% memory saved). All 9 sites verified still
serving the Anubis challenge to plain curl with identical TTFB.
DRY_RUN until `var.x402_wallet_address` is set on the traefik stack.
Removes `modules/kubernetes/x402_instance/` (dead code now).
Adds modules/kubernetes/x402_instance/ — a small Go reverse proxy
(forgejo.viktorbarzin.me/viktor/x402-gateway:ce333419) that selectively
issues HTTP 402 Payment Required to declared AI-bot User-Agents and
validates X-PAYMENT headers against a Coinbase x402 facilitator.
Browsers are forwarded transparently to Anubis (which then handles the
JS PoW gate as before).
Wired into all nine Anubis-fronted sites:
ingress -> x402-X -> anubis-X -> backend
While `wallet_address` is empty the gateway runs in DRY_RUN — every
request is transparent-proxied, no 402s issued. This lets the pod sit
in the request path with zero behavioural impact today; flipping the
wallet variable in the per-stack module call activates payment-required
mode for AI-bot UAs.
Default config: Base mainnet USDC, $0.01/req, x402.org/facilitator,
catch-all UA list (ClaudeBot|GPTBot|Bytespider|meta-externalagent|
PerplexityBot|GoogleOther|cohere-ai|Diffbot|Amazonbot|
Applebot-Extended|FacebookBot|ImagesiftBot|YouBot|anthropic-ai|
Claude-Web|petalbot|spawning-ai|scrapy|python-requests).
Verified post-apply: 9/9 pods Running, all 9 sites still serve the
Anubis challenge to plain curl with identical TTFB, x402 logs confirm
"dry_run":true on every instance.
Earlier f1 revert left the host fully unprotected (no Anubis,
exclude_crowdsec=true on the ingress already). Re-add Anubis with
a custom policy_yaml that:
- ALLOWs /_app/* (SvelteKit immutable JS/CSS chunks loaded before
any cookie exists), /openapi.json, /docs, /api/* (FastAPI meta).
- ALLOWs the 9 known JSON/proxy routes (schedule, streams,
embed, embed-asset, extract, extractors, health, proxy, relay)
so the SvelteKit SPA's XHRs return JSON instead of the challenge
HTML.
- Catch-all CHALLENGE for everything else — the SPA HTML pages
(which fall through to FastAPI's `/{path}` catch-all) get the
PoW gate.
The ALLOWed JSON routes are technically scrapeable by a determined
bot, but the user's stated goal is "avoid accidental scrapes" — the
HTML/SPA is the AI-training target, and that stays gated.
Verified: / → Anubis challenge HTML; /schedule, /streams → JSON;
/_app/.../app.js → text/javascript; ClaudeBot UA → Anubis deny page.
f1.viktorbarzin.me is a SPA whose JS fetches /schedule, /embed,
/embed-asset, … on the same path tree. With Anubis fronting `/`,
those XHRs land on the challenge HTML even when the cookie *should*
be valid, breaking the page with `Unexpected token '<', "<!doctype "
... is not valid JSON`. Removed Anubis from f1 — would need a path
carve-out (the way wrongmove does for /api) to re-enable. Added a
top-of-block comment so future me remembers why.
Plus four new Prometheus alerts in `Slow Ingress Latency` group
(stacks/monitoring/.../prometheus_chart_values.tpl):
- IngressTTFBHigh (warn, 10m, avg latency >1s)
- IngressTTFBCritical (crit, 5m, avg latency >3s)
- IngressErrorRate5xxHigh (crit, 5m, 5xx >5%)
- AnubisChallengeStoreErrors (crit, 5m, any 5xx on *anubis* services
via Traefik — proxies for the in-pod challenge-store error since
Anubis itself only exposes Go-runtime metrics)
Notes from the alert author: avg-not-p95 because the existing
Prometheus scrape config drops traefik bucket series; once those
are restored, swap to histogram_quantile(0.95). TraefikDown inhibit
rule extended to suppress these four during a Traefik outage.
Browser visits to viktorbarzin.me started returning HTTP 500 with
`store: key not found: "challenge:..."` in pod logs. Root cause:
each Anubis pod stores in-flight challenges in process memory; with
2 replicas behind a ClusterIP, the PoW-solved request can be
routed to a different pod than the one that issued the challenge.
Anubis upstream documents the same caveat ("when running multiple
instances on the same base domain, the key must be the same across
all instances" — true for the ed25519 signing key, but the
challenge store is still pod-local without a shared backend).
Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on
pod restart. Real fix (Redis-backed challenge store) noted as a
follow-up in CLAUDE.md.
Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json),
privatebin (pb), homepage (home), real-estate-crawler (wrongmove
UI only — `/api` ingress stays direct via path-based ingress carve-
out so XHRs from the SPA bypass the challenge).
End-state: 9 public hosts now Anubis-fronted (blog, www, kms,
travel, f1, cc, json, pb, home, wrongmove). All return the
challenge HTML to bare curl/browser; verified-IP search engines and
/robots.txt + /.well-known still skip via the strict-policy
allowlist.
The default upstream policy only WEIGHs Mozilla|Opera UAs and lets
everything else (curl, wget, python-requests, scrapy, headless CLI
scrapers) fall through to the implicit ALLOW. On non-CDN-fronted
hosts (kms, anything dns_type=non-proxied) this meant a plain
`curl https://kms.viktorbarzin.me/` returned the real backend
content with no challenge — defeating the whole point of the
"avoid casual scrapers" intent.
Now the module ships a custom POLICY_FNAME mounted via ConfigMap:
- Imports the upstream deny-pathological / ai-block-aggressive /
allow-good-crawlers / keep-internet-working snippets unchanged
- Adds a final `path_regex: .*` → action: CHALLENGE catch-all
Result: only IP-verified search engines (Googlebot from Google IPs,
Bingbot, etc.) and well-known paths (robots.txt, .well-known,
favicon, sitemap) skip the challenge. Everything else — including
spoofed-Googlebot-UA-from-random-IP — solves PoW or gets nothing.
Verified post-apply: curl default UA on viktorbarzin.me + kms +
travel returns the Anubis challenge HTML; /robots.txt still 200s
straight through.
The SPA can't carry an Authentik session on its own fetch() XHRs in
all cases (cross-origin redirect to authentik.viktorbarzin.me on a
stale cookie returns HTML, fetch().json() parse fails). Splitting
the ingress so /api/ paths skip forward-auth lets the React app talk
to its API end-to-end. The browser still has to log in via
Authentik to load the SPA at /.
Verified end-to-end via chrome-service Playwright: dashboard load,
scenario list, what-if run with real Monte Carlo, save-as-scenario
round-trip, run-now on detail, delete — all pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds modules/kubernetes/anubis_instance/ — a per-site reverse proxy
instance pinned to ghcr.io/techarohq/anubis:v1.25.0. Each instance
issues a 30-day JWT cookie scoped to viktorbarzin.me after a tiny
proof-of-work (difficulty 2 ≈ 250 ms desktop / 700 ms mobile). The
shared ed25519 signing key (Vault: secret/viktor → anubis_ed25519_key)
makes a single solve good across every Anubis-fronted subdomain.
Wired into blog (viktorbarzin.me + www), kms.viktorbarzin.me, and
travel.viktorbarzin.me — each with anti_ai_scraping=false on the
ingress so the redundant ai-bot-block forwardAuth is dropped from the
chain. Skipped forgejo (Git/API clients can't solve PoW) and resume
(replicas=0).
Also tightens bot-block-proxy nginx timeouts (3s/5s → 100ms/200ms) so
any ingress still using the ai-bot-block forwardAuth pays at most
~150 ms when poison-fountain is scaled down, instead of 3 s.
End-to-end TTFB on viktorbarzin.me dropped from ~3.2 s to ~150-200 ms.
Docs: .claude/reference/patterns.md "Anti-AI Scraping" updated to
4 layers; .claude/CLAUDE.md adds the Anubis usage paragraph and
Forgejo/API caveat.
ingress_factory's port var defaults to 80, but fire-planner publishes
on 8080. Traefik logged 'Cannot create service error="service port
not found"' and 404'd every request. Cloudflare's standard
origin-error decoy page (with the noindex meta + cdn-cgi/content
honeypot link) made it look like a bot-block, but it was just the
upstream coming back 404.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts
activations and dedup-skips by product, gauges last-activation timestamp.
Pod template gets the standard prometheus.io/scrape annotations so the
cluster-wide kubernetes-pods job picks it up via pod IP. Memory request
bumped to 48Mi to cover counter dicts + HTTPServer.
Plus docs: networking.md footnotes the windows-kms row noting public WAN
exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60,
overload <virusprot> flush) pfSense filter rule, and a new runbook covers
log locations, rate-limit tuning, and how to revoke the WAN forward.
The matching pfSense rule was tightened in place (TCP-only + rate limits)
via SSH; pfSense isn't Terraform-managed.
daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr
30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount,
which blocked the next run from completing — root cause of the WeeklyBackupStale
alert going silent (the metric never reached its end-of-script push).
Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting
the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as
belt-and-braces for any inherited stuck state from a prior crashed run
- TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of
the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only on success;
any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert
threshold expired)
Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node
OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup
that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to
/srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end:
3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql
image is stripped (no curl/wget/python) — switched to docker.io/library/postgres
matching the dbaas/postgresql-backup pattern with apt-installed curl.
Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed
backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp). Updated to match what's actually emitted, and
added a "default-covered" footnote to the Service Protection Matrix so the
~40 services with PVCs not enumerated in the table are no longer ambiguous.
Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the
loop got SIGTERMed against repeatedly, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
validate the fix end-to-end
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wealth, Payslips, and Job-Hunter Grafana datasources all baked the
rotating PG password into their ConfigMap at TF-apply time, so every
7-day Vault static-role rotation silently broke the panels until a
manual `terragrunt apply`. Same family as the recurring grafana-mysql
backend bug — Grafana caches creds at startup and never picks up the
new ESO-synced password without a restart.
Fix:
- Each source stack now creates an ExternalSecret in `monitoring`
exposing the rotating password as `<NAME>_PG_PASSWORD` env-var.
- Grafana mounts those via `envFromSecrets` (optional=true so a
missing source stack doesn't block boot) and the datasource
ConfigMaps reference `$__env{<NAME>_PG_PASSWORD}` instead of a
literal password.
- `reloader.stakater.com/auto: "true"` on the Grafana pod restarts
it whenever any of the four DB-cred Secrets is updated.
Tested end-to-end: forced `vault write -force database/rotate-role/
pg-wealthfolio-sync` → ESO synced (~30s) → reloader fired →
Grafana booted with new env in ~50s total → all three /api/datasources
/uid/*/health endpoints return "Database Connection OK".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit).
innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals
don't fit in 2 Gi. Bumping limit to 4 Gi (request 3 Gi) leaves headroom
without changing the buffer pool config.
/srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV
1 TiB online and ran resize2fs (now 60% used). Triggered by surfacing
during the 2026-05-09 IO-pressure post-mortem; thinpool had ~4.6 TiB
free.
The post-mortem also covers the stale-NFS-client trigger (legacy
/usr/local/bin/weekly-backup pointing at the decommissioned TrueNAS IP)
and the resulting wedged kthread on the PVE host. Script removed and
node_exporter restarted out-of-band; kthread will clear at next PVE
reboot. See docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
instagram-approval: after every tap, immediately fetch /candidates?limit=1
and send the next photo as a fresh inline-keyboard message — the user's
tap chains back into this same workflow, so the loop is user-paced.
When the pool is exhausted, send an 'all caught up' summary with the
backlog count + cumulative training stats.
instagram-discover: cron throttled from every-30-min to daily 09:00.
The chain handles ongoing training; the daily run only kickstarts a
session if the user hasn't been tapping. Limit reduced from 3 → 1 so
each kickstart sends a single photo (chain takes over).
Stories+feed posts via Postiz failed with state=ERROR and Postiz
mistranslated the cause as 'Invalid Instagram image resolution
max: 1920x1080px'. Real cause: Postiz hands Meta an upload URL
under https://postiz.viktorbarzin.me/uploads/... and Meta gets a
302 to the Authentik login page instead of bytes. Meta returns
error 36001 (image not fetchable) which Postiz maps to that
misleading resolution string.
Split the ingress: /uploads/* on a public ingress (matches the
instagram-poster /image+/original pattern), everything else
remains behind Authentik forward-auth. /uploads contents are
random UUIDs, low blast radius if scraped.
The canonical proxmox-lvm and proxmox-lvm-encrypted PVC templates were
missing `lifecycle { ignore_changes = [spec[0].resources[0].requests] }`.
Without it, every PVC created from these templates becomes a drift bomb
the moment pvc-autoresizer expands it: the next `tg apply` on that stack
will try to shrink the PVC back to the TF-declared size, K8s rejects the
shrink, and apply fails.
This was latent because pvc-autoresizer was silently broken cluster-wide
(commit 9d5da4d8 fixed it by allow-listing kubelet_volume_stats_available_bytes
in Prometheus). Now that the autoresizer actually works, every existing
proxmox-lvm/encrypted PVC without ignore_changes is at risk.
Sweep needed (separate task): grep for kubernetes_persistent_volume_claim
across stacks/ and add ignore_changes to any with resize.topolvm.io
annotations.
Standalone provider (instagram-standalone OAuth flow) is what the user
is trying after the FB-Login path was blocked by their Business Account
ad-policy flag. Uses modern scope names (instagram_business_*), so no
JS patch needed unlike the FB-Login provider.
Same fix as default.yml — drift-detection cron also runs terragrunt
plan on every stack, which requires the kubeconfig at <repo>/config
that terragrunt.hcl injects via -var kube_config_path. Pipeline #547
(latest scheduled drift-detection run) failed with the same
'config_path refers to an invalid path' error.
terragrunt.hcl injects -var kube_config_path=${repo_root}/config for
every terraform invocation, but the pipeline never created that file.
Every commit that touched a TF stack since #545 (2026-05-08) failed
with 'config_path refers to an invalid path: \"../../config\": no such
file or directory' followed by the kubernetes provider falling back
to localhost:80.
Add a step that writes a kubeconfig at <repo>/config using the
projected SA token + cluster CA. The woodpecker namespace's default
SA is already cluster-admin (woodpecker-default ClusterRoleBinding),
so the projected token is sufficient for any stack apply. Using
tokenFile (not an inline token) lets the provider re-read it if
kubelet rotates the projected token mid-pipeline.
#545 was the last green run because that commit only changed the
build-cli pipeline — 0 stacks applied so the missing kubeconfig
never mattered.
The old port-5050 R/W private registry was decommissioned 2026-05-07
(forgejo-registry-consolidation Phase 4). The reverse-proxy ingress
+ ExternalName service + Cloudflare DNS record kept pointing at the
dead backend, returning 502 to anyone hitting registry.viktorbarzin.me.
This was driving 3 monitoring artifacts that auto-cleared on cleanup:
- Uptime Kuma external monitor #586 (deleted)
- Pushgateway stale registry-integrity-probe metrics (deleted)
- ExternalAccessDivergence + RegistryIntegrityProbeStale alerts
The Prometheus scrape config for the kubernetes-nodes job kept
capacity_bytes + used_bytes but dropped available_bytes. pvc-autoresizer
computes utilization from available/capacity, so without that metric it
was silent for every PVC in the cluster — including mailserver, which
filled to 89% (1.7G/2.0G) and started rejecting all inbound mail with
'452 4.3.1 Insufficient system storage' (15+ hours, all real senders:
Brevo, Gmail, Facebook).
Also bumps the floors of mailserver (2Gi -> 5Gi, limit 10Gi) and forgejo
(15Gi -> 30Gi) PVCs to recover from the immediate outage, and adds
ignore_changes on requests.storage so future autoresizer expansions
don't cause TF drift.
User dropped Postiz/Instagram OAuth (Meta Business Account flagged
+ Postiz scope drift). New pipeline ends at Telegram — full-quality
JPEG delivered to the bot chat, manually uploaded to IG by the user.
- Image bumped to 25e46efd: adds /deliver/{asset_id} endpoint that
multipart-uploads to Telegram (URL-fetch fails through Cloudflare
for >5MB), then tags 'posted' in Immich.
- ESO now syncs telegram_bot_token + telegram_chat_id from Vault.
- Public ingress paths grow to ['/image', '/original'] (Authentik
bypass on /original is harmless — files are user-tagged, low blast
radius — and useful for ad-hoc browser downloads).
- Memory limit 512Mi -> 1500Mi: full-resolution Pillow HEIC decode
was OOMing on 12MP+ phone photos.
- discover.json simplified to scan -> deliver per item; approval and
post workflows already deactivated. Telegram bot webhook removed.