infra

Author	SHA1	Message	Date
Viktor Barzin	7b6eee49c4	infra: drop Authentik forward-auth from 7 self-authed apps (auth = "none") Apps with their own user auth + bearer-token APIs were being broken by Traefik → Authentik forward-auth: every iOS/Android/native client got a 302 to authentik.viktorbarzin.me instead of the JSON they expected. Authentik's 302+cookie dance can only be followed by a real browser. Changed: - immich (Immich mobile app + bearer-token /api) - linkwarden (NextAuth + Linkwarden mobile clients) - tandoor (Django auth + Tandoor mobile clients) - freshrss (Fever/GReader API used by Reeder/FeedMe/etc.) - affine (workspace auth + AFFiNE desktop/mobile sync) - actualbudget (server password + Actual mobile/sync clients) - ebooks/abs (Audiobookshelf iOS/Android app) Each app's own auth is the gate now. CrowdSec + rate-limit + anti-AI UA filter still front the ingresses. Same pattern as the novelapp change earlier this session. [ci skip]	2026-05-22 14:16:44 +00:00
Viktor Barzin	f98c3f2049	infra/novelapp: drop Authentik forward-auth (auth = "none") novelapp handles its own user auth via NextAuth + Google OAuth, so the ingress-level Authentik forward-auth was double-gating. Mobile webviews (iOS/Android) can't follow the Authentik 302/cookie dance — they saw HTML challenges where they expected JSON. CrowdSec + rate-limit + anti-AI UA filter remain in front; novelapp's own login handles users. [ci skip]	2026-05-22 14:16:44 +00:00
root	77492b3131	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:44 +00:00
Viktor Barzin	9be0672aa3	claude-memory / resume: unblock terragrunt apply (var defaults + psql -d postgres) Two pre-existing apply failures uncovered during the Phase 4 mass apply, unrelated to the auth refactor but blocking 100% rollout. claude-memory: - `var.claude_memory_db_password` had no default and wasn't passed by terragrunt → fall back to Vault `secret/claude-memory.db_password` via `coalesce(var.x, data.vault.data["db_password"])`. - db-init Job was failing with `database "root" does not exist` because psql defaults the database name to the user when -d is omitted. Added `-d postgres` to all five psql invocations. resume: - `var.resume_database_url` had no default and wasn't passed → default to empty string. Vault carries the real value at `secret/resume.database_url` consumed at the deployment env-var level; the variable here just needs a value to satisfy the apply. Also: priority-pass had lost most of its TF state (only 3 of 8 resources tracked); imported namespace/service/pvc/deployment/ingress/tls-secret to re-bind state with live K8s resources. No code change needed there. Verified after re-apply: - claude-memory.viktorbarzin.me → 200 (auth=none, native MCP responses) - priority-pass.viktorbarzin.me → 302 → authentik (auth=required) - resume.viktorbarzin.me → 302 → authentik public outpost (auth=public) - 6 of 7 previously-failing applies now green; only vault remains, blocked by an unrelated helm chart immutable-StatefulSet-field issue. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:44 +00:00
Viktor Barzin	a168277213	healthcheck: tune noise filters + nvidia-exporter auth=none Six tuning changes to cluster_healthcheck.sh so PASS sections actually reflect "nothing to act on": 1. prometheus_alerts: only count severity=warning\|critical. Info-level alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the alert rule itself sets severity; the script should respect it. 2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the Lets Encrypt wildcard renews weekly; <14d is the only window where human attention is genuinely useful. 3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/ event/image/update domains (transient by design), skip friendly names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and only count entities whose last_changed > 24h. Was 431/1470, most of which were "phone in standby" noise. 4. ha_automations: only flag DISABLED automations as abandoned if they've also been untouched (last_changed) for >180 days; raise stale threshold 30d → 180d. Was flagging seasonal/holiday-only automations as broken. 5. problematic_pods + evicted_pods: exclude pods owned by Jobs. CronJob retry leftovers (Error/Failed phase pods that K8s keeps around for log inspection) aren't problematic at the cluster level. 6. uptime_kuma: retry the WebSocket login 3x with backoff. Single- shot failures were a recurring false-positive even though the service was healthy. Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll /metrics and got 302'd to Authentik like the idrac/snmp ones did. Same fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
root	8483ca59ba	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:43 +00:00
Viktor Barzin	dc7c19d88e	frigate: lan ingress auth=none for HA Sofia integration The frigate-lan.viktorbarzin.lan ingress had Authentik forward-auth in front. HA Sofia's frigate integration polls /api/config and only knows how to use Frigate's own API key (not browser SSO), so every poll got a 302 to authentik.viktorbarzin.me and the integration entered the errors-state. Same pattern as idrac-redfish-exporter (5c594291). allow_local_access_only IP allowlist + Frigate's API key are enough. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	dc134011eb	fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes After fixing the threshold=80% misconfig and seeing two PVCs (prometheus + technitium primary) get stuck Terminating, a 3rd round showed four more PVCs (frigate, hackmd, immich-postgresql, paperless-ngx) in the same state. Same root cause: TF spec'd a smaller storage size than the autoresizer-grown live value, K8s rejected the shrink, TF force-replaced the PVC, and the pvc-protection finalizer held it in Terminating while the pod kept using the underlying volume. Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests] on every kubernetes_persistent_volume_claim block that has resize.topolvm.io/threshold annotations. The pattern was already documented in .claude/CLAUDE.md but ~63 stacks were missing it. Live PVCs are unaffected; this only prevents future TF applies from attempting the destroy+recreate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	dd2b7de291	fix: HA Sofia REST sensors + PVC drift safety Two real issues found while triaging HomeAssistantCriticalSensorUnavailable alerts and the prometheus + technitium PVC Terminating-but-in-use state from the earlier session. 1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required → auth=none. HA Sofia REST sensors scrape these endpoints programmatically; with Authentik forward-auth in front, every request got a 302 to authentik.viktorbarzin.me and the REST sensors parsed the HTML login page instead of metrics — leaving the R730, UPS, and ~20 other sensors permanently unavailable. The allow_local_access_only IP allowlist (192.168.0.0/16 + 10.0.0.0/8) already gates external access, so authentik on top was breaking machine-to-machine traffic for no security gain. 2. prometheus_server_pvc + technitium primary_config_encrypted: add lifecycle.ignore_changes = [spec[0].resources[0].requests]. The autoresizer expands these PVCs; PVCs can't shrink. Without the ignore, every TF apply tried to revert the live size back to the TF spec value, hit K8s's shrink-forbidden rule, and force-replaced the PVC. Because the pod still mounted it, the PVC went into Terminating-but-protected limbo — fine until a pod restart would have orphaned the volume. Root cause of the 2026-05-10 PVC Terminating incident. Bonus: prometheus_server_pvc threshold was the inverted "90%" (the same bug the bulk fecfa211 sweep fixed elsewhere; my regex only matched "80%" so this one slipped through). Now "10%". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	7e69951cb9	state(dbaas): update encrypted state	2026-05-22 14:16:43 +00:00
Viktor Barzin	ee47197f3b	vault: enroll audit-vault-0 in pvc-autoresizer (10Gi limit) audit-vault-0 fills steadily with raft audit logs; without autoresizer annotations it hits the 2Gi ceiling and Vault stalls on writes (PVAutoExpanding alert was firing at 81% used). The Vault Helm chart copies server.auditStorage.annotations onto the PVC at create time. Live PVC already has the annotations applied via kubectl annotate; this just keeps TF in sync. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	0fdadcc3dd	dbaas: pg-cluster threshold 80%→10% in CNPG inheritedMetadata Same misconfig as the bulk fecfa211 sweep, but the pg-cluster YAML is buried inside a null_resource local-exec heredoc so the regex didn't catch it. CNPG operator inherits these annotations onto each member PVC (pg-cluster-1, pg-cluster-2), and reapplies them on every reconcile — patching the live PVCs alone bounces back within seconds. Live state already patched via kubectl patch cluster, this just keeps TF in sync. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	3f2b2f9d32	fix: pvc-autoresizer threshold should be 10%, not 80% topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE percentage below which expansion fires (per upstream README). Setting it to "80%" means "expand when free-space drops below 80%", i.e. as soon as the PVC crosses 20% utilization — which caused prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi in 70 minutes (six 10% bumps, all when the volume was only ~14% used). Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix landed (02a12f1a), the autoresizer started actively misfiring across 75+ PVCs cluster-wide. Flip the value to "10%" everywhere — that's "expand when free-space drops below 10%", i.e. at 90% utilization, which is the conventional semantic and matches the alert thresholds in prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp at 95%). The CLAUDE.md PVC template was the source of the misconfig, so update it too. Live PVC annotations were patched in parallel via kubectl annotate; TF apply on each affected stack will be a no-op against those live values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	dc4ce46411	k8s-version-upgrade: detection script refresh apt before madison + DRY_RUN_OVERRIDE Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while apt-cache madison (without prior apt-get update) was reporting v1.34.5 — so the CronJob would have dispatched the agent against a stale target. Now do `sudo apt-get update -qq` for just the kubernetes repo before querying madison. Also add a DRY_RUN_OVERRIDE env precedence so future test invocations can override DRY_RUN without an apply cycle — but Job spec env is immutable post-create, so this is only useful for CronJob spec edits (suspend, then add env, then resume). Documented in the runbook.	2026-05-22 14:16:43 +00:00
Viktor Barzin	ae6dde45c2	k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to ssh into master and run etcdctl against a non-existent /mnt/main mount. The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to 10 min, then parses the backup-manage container log for "Backup done" line + byte count. Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works end-to-end at the planning level. Expanded the claude-agent ServiceAccount's privileges via a sibling ClusterRole (claude-agent-upgrade-ops): - patch namespaces/k8s-upgrade (in-flight annotation) - create batch/jobs (trigger etcd snapshot Job) - patch nodes (cordon/uncordon) - create pods/eviction (drain) - delete pods (drain fallback)	2026-05-22 14:16:43 +00:00
Viktor Barzin	e75bcaf394	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-22 14:16:42 +00:00
Viktor Barzin	09f83b4e83	fire-planner / k8s-portal / insta2spotify: revert auth=public to auth=none The Phase 4 audit promoted three "smoke-test candidates" from `protected = false` to `auth = "public"`, but all three are XHR / curl-driven endpoints (fetch() calls, automation scripts) that don't survive the 302+cookie redirect dance that the public-auto-login flow requires on first visit. fire-planner's SPA broke immediately — every fetch() to /api/* hit a cross-origin redirect and CORS preflight rejected it. Important learning for the `auth = "public"` design: `auth = "public"` is functionally equivalent to a normal Authentik forward-auth for the FIRST request — it issues a 302 to authentik to set a guest session cookie, then 302s back. This is invisible for top-level browser navigation but BREAKS: - XHR/fetch() under CORS preflight (preflight rejects redirects) - curl/automation scripts that don't preserve cookies across requests - Mobile / native clients that can't follow OAuth-style redirects Use `auth = "public"` only for top-level HTML pages where the user navigates via the browser address bar (or links). For XHR APIs, native-client surfaces, webhooks, OAuth callbacks — use `auth = "none"`. The plan's "smoke test 3 candidates" were misjudged on this front. Reverting all three to `auth = "none"` (their previous behaviour). The end-to-end public flow IS verified working via curl + flow API — the design is sound, just the test targets were wrong. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
root	faad99cff3	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:42 +00:00
Viktor Barzin	143413dc0b	owntracks: explicit auth = "none" — Phase 5 audit completion The Phase 4 audit pass missed this site because the previous agent scoped out owntracks (it overrides the factory's middleware list via extra_annotations to use its own basic-auth middleware). Adding the explicit auth = "none" satisfies Phase 5's "every ingress has an explicit decision" goal and makes the intent visible — mobile OwnTracks clients post location data via HTTP basic-auth and can't follow Authentik forward-auth 302s. Closes the loop on Phase 5: 122/122 active ingress_factory call sites now carry an explicit auth = "..." decision (zero callers rely on the default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	ff5538a667	ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default false → unprotected) variable in `modules/kubernetes/ingress_factory` with `auth = string` enum (default "required" → fail-closed). Touches every ingress_factory caller so the audit decision is recorded explicitly in code. ingress_factory (Phase 3): - `auth = "required"`: standard Authentik forward-auth (the legacy `protected = true` semantic). - `auth = "public"`: forward-auth via the new `authentik-forward-auth-public` middleware → dedicated public outpost → guest auto-bind. Logged-in users keep their real identity. - `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost itself. - `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated ingresses don't need anti-AI noise; the auth flow already discourages bots). Audit pass (Phase 4) across 96 ingress_factory call sites: - 49 explicit `protected = true` → `auth = "required"` - 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3) - 64 previously-default (no protected line) → `auth = "required"` ADDED, then reviewed individually: * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack, homepage, wrongmove UI, privatebin) → `auth = "none"` * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC, xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich location ingestion, immich frame kiosk, headscale CP, send anonymous drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) → `auth = "none"` * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal UIs, services without app-level auth) - Smoke-test promotions to `auth = "public"`: fire-planner public UI, k8s-portal API, insta2spotify callback. Three call sites in wrapper modules (`stacks/freedify/factory/`, `stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected` bool — they translate to `auth` internally, out of scope for this rename. Behavior change: previously-default ingresses now fail closed (require Authentik login) unless explicitly flipped to `auth = "none"` or `auth = "public"`. This is the audit goal — no more accidentally-unprotected surfaces. Sites that were intentionally public (Anubis content, native APIs, webhooks) are now explicitly recorded as `auth = "none"`. Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via `terraform fmt -recursive` during the audit. Behavior-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	88e57fdddb	instagram-poster: disable ig-ingest-stories CronJob until /ig-ingest ships The endpoint exists in the working copy of instagram_poster/app.py but isn't committed/built/deployed, so every cron fire returned 404 and triggered JobFailed alerts every 30 min. Set count = 0 to leave the resource declaration in place — re-enable by removing that line once the endpoint is in a built image. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	d2be0921e8	scripts: timeout rsync + sqlite calls in daily-backup Per-PVC rsync had no timeout, so any single hung PVC (e.g. on a corrupted snapshot or a sqlite held open by a writer) blocked the whole script until systemd's 4h TimeoutStartSec kicked in, leaving every later PVC silently unbacked. Today's run hung on mailserver/roundcubemail-enigma-encrypted at 05:09 and didn't recover — hence WeeklyBackupFailing alert. Now: - rsync per PVC: timeout 30 min, exit 124 logged separately - sqlite3 per database: timeout 5 min - /etc/pve rsync: timeout 5 min Each timed-out PVC bumps PVC_FAIL but the loop keeps moving. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor	fddf168ecb	cloudflare: disable AI bot edge-block so x402 can issue payment offers CF zone was returning 403 to declared AI-bot UAs at the edge (`ai_bots_protection: "block"`). That meant the in-cluster x402 gateway never saw the request and could never issue an HTTP 402 with the wallet payment requirements — the bot just bounced. Adopt `cloudflare_bot_management.zone` via root-module import block, flip ai_bots_protection to "disabled". Bot Fight Mode (`fight_mode`), crawler challenge (`crawler_protection`), and managed robots.txt are unaffected — generic automated traffic still gets the bot fight gate. End-to-end verified: `User-Agent: Mozilla/5.0 (compatible; ClaudeBot/ 1.0;...)` on viktorbarzin.me now returns HTTP 402 (was 403 CF block) with `payTo=0xCc33...659f`, `amount=10000` micro-USDC, `network=base`. Trade-off: bots that don't pay still hit origin (instead of CF blackholing them), so a small bandwidth uptick. Negligible at our traffic level.	2026-05-22 14:16:42 +00:00
Viktor Barzin	4103ea2ba0	monitoring(prometheus): keep all 4 kubelet_volume_stats_inodes metrics pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if all four kubelet_volume_stats metrics (available_bytes, capacity_bytes, inodes_free, inodes) are retrieved. The keep-list in the kubernetes-nodes scrape job had available_bytes and capacity_bytes (post 9d5da4d8) but was missing the two inode metrics, so the autoresizer's reconcile logged "failed to get volume stats" for every PVC and never resized anything. Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free to the regex. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	6d3308c848	authentik: add public guest auto-login flow + dedicated outpost + traefik public middleware Phase 1+2 of default-deny ingress plan. Adds the infrastructure for an `auth = "public"` ingress tier that auto-binds anonymous requests to a `guest` Authentik user (no UI prompt), so public sites are still recorded as authenticated by Authentik for audit purposes — but as `guest`, not by leaking the standard catchall flow. - guest user in `Public Guests` group (NOT `Allow Login Users`). - `public-auto-login` flow: stage_binding policy sets `pending_user = guest`, `evaluate_on_plan = false` + `re_evaluate_policies = true` so flow_plan is populated when the policy mutates it; `authentication = none` lets anonymous requests enter. - `Provider for Public` proxy provider (forward_domain, cookie_domain viktorbarzin.me) with `authentication_flow = public-auto-login`. - Dedicated `public` outpost: only the public provider bound, deployed as `ak-outpost-public` Deployment+Service in the `authentik` namespace by Authentik's K8s controller. - `public-auth.viktorbarzin.me` ingress exposes the public outpost's `/outpost.goauthentik.io/*` so OAuth callbacks land on it (the embedded outpost doesn't know about the public provider, so `authentik.viktorbarzin.me` callbacks would fail). - `authentik-forward-auth-public` traefik middleware points at the public outpost service (not via the auth-proxy nginx fallback). The plan's `?app=public` dispatch idea was tested and rejected — the embedded outpost dispatches purely by Host header, so a dedicated outpost was the only way to isolate the public flow without conflicts. No ingresses use the new middleware yet — Phase 3+4 (the ingress_factory `auth` variable refactor + audit pass) wires it up. This commit is additive and behaviour-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	ff5416ff40	proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true) Without this annotation on the StorageClass, pvc-autoresizer's controller filters the SC out at the index lookup stage and never patches any of its PVCs, regardless of utilization or per-PVC threshold/increase/storage_limit annotations. Internal metric pvcautoresizer_loop_seconds_total ticked but no PVCs were ever evaluated — visible cluster-wide as PVAutoExpanding alerts firing for forgejo-data-encrypted (82%) and audit-vault-0 (81%) without any ResizeStarted events ever following. The Prometheus scrape-config fix in 9d5da4d8 was a prerequisite (autoresizer reads kubelet_volume_stats_available_bytes) but not sufficient on its own. Also pinning chart version to 0.5.6 so the next apply doesn't incidentally bump to 0.5.7. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor	ea9b5542d1	x402: flip gateway live with Viktor's wallet + Slack payment notifications Wires the traefik stack to read two new fields from secret/viktor: * x402_wallet_address -> 0xCc33BD250d39752e0ceaB616f8a05F72274a659f * alertmanager_slack_api_url (existing) -> reused as the per-payment notification webhook so payment events arrive in the same Slack channel as other infra alerts. Gateway now runs `wallet_set:true, dry_run:false`. Verified end-to-end: - Browser UA on all 9 sites -> 200 (passes through to Anubis) - python-requests/2.31 + scrapy + ClaudeBot UA -> 402 with PaymentRequiredResponse, payTo == Viktor's wallet, amount=10000 micro-USDC, network=base, asset=Base USDC contract - Direct Slack-webhook test from inside cluster -> HTTP 200 Image bumped to forgejo.../x402-gateway:d9b83125 with Slack-format notification payload (text=..., username=x402-gateway, icon_emoji=💰; auxiliary fields preserved for richer receivers). Notifications fire on every successful X-PAYMENT validation; failures on Slack webhook are logged at WARN, never block the request, never double-charge the bot.	2026-05-22 14:16:41 +00:00
Viktor Barzin	58789cde8b	kured(sentinel-gate): fix auth + write-perm so safety checks actually run Test 3 validation surfaced two latent bugs in the sentinel-gate DaemonSet that have been masked since 2026-04-18 (when uu was off, nothing wrote /var/run/reboot-required, so the gate never had to fire): 1. automount_service_account_token=false on both the SA and the pod spec → kubectl in the script falls back to localhost:8080 on every call. Each check (`kubectl get nodes`, `kubectl get pods -n calico-system`, transition-time read) errors to stderr and emits empty stdout. `wc -l` reports 0 → checks "pass" with no real data. 2. bitnami/kubectl:latest runs as uid=1001 by default. The hostPath /var/run is root:root 0755 → final `touch /host/var-run/gated-reboot-required` failed with EACCES. Fail-safe by accident — but if anything had ever loosened those perms, the broken checks above would have green-lit the gate with no real validation. Fix: enable token mount on the SA + pod, set securityContext.run_as_user=0 on the container. Verified post-fix: kubectl returns all 5 nodes, touch succeeds, sentinel-gate now reports the correct `BLOCKED: A node transitioned Ready within the last 24 hours (soak window)` when triggered with k8s-node1's recent reboot within the cool-down period. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	a2377a38df	scripts: cluster_healthcheck defaults to ~/.kube/config The previous default of $(pwd)/config required running the script from the infra/ directory or always passing --kubeconfig. From a parent shell or any other working directory, the lookup hit a non-existent file and kubectl returned a stale-token error, masking real check results. Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to $(pwd)/config for backwards compatibility. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	f5b1fb179a	docs: add k8s node auto-upgrade runbook + architecture section The OS-side counterpart to the service-upgrade pipeline. Covers the unattended-upgrades + kured + sentinel-gate + Prometheus halt-on-alert design landed in c0991f7f8. Runbook: ops procedures (verify health, halt rollout, restore config to a re-imaged node, roll back a bad upgrade, investigate which alert is blocking). Architecture doc: extends the existing service-upgrade flow with a "K8s Node OS Upgrades" section (stack, sources of truth, day-2 mechanism, why-this-design rationale tied to the March 2026 post-mortem). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	278ef5f19b	monitoring(grafana): swap python3 for jq in folder-ACL local-exec CI image (ci/Dockerfile) is alpine + jq, no python3. The grafana_admin_only_folder_acl null_resource was parsing /api/folders with a python3 oneliner, which crashed every CI apply with "python3: command not found" and made every monitoring stack apply fail in CI (worked locally because the dev VM has python3). jq is already in the CI image and produces the same output.	2026-05-22 14:16:41 +00:00
Viktor Barzin	b99e30e798	docs/plans: 2026-04-20 infra audit design (post-research, post-challenge) Adds the infra audit plan: 5 parallel research agents (Reliability, Declarative, Maintenance, Scalability, Security) → 91 raw findings → 2 independent challengers → filtered/corrected/ranked backlog. Already incorporates the challenger corrections (drops bad metric pulls, reframes intentional-by-design items). Source for several follow-ups already shipped this week (kured-prometheus gating, NFS fsid post-mortem fixes, Authentik outpost postgres-backend).	2026-05-22 14:16:41 +00:00
Viktor Barzin	5c0ea96a91	infra: re-enable unattended-upgrades with kured prometheus-gating Reverses the March 2026 outage mitigation that disabled unattended- upgrades cluster-wide. Now re-enables it on the k8s template VM with: - Allowed-Origins limited to security/updates pockets - Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark hold on the cluster-critical components) - Automatic-Reboot disabled — kured drives the actual reboots - Compatible with the existing kured + sentinel-gate flow kured side: - rebootDelay 30s, concurrency 1 - Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak window from the post-mortem) - prometheusUrl + alertFilterRegexp wired so any firing non-ignored alert halts the rollout. Ignore-list excludes self-referential alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/ InfoInhibitor) that would otherwise deadlock kured. Prometheus side (already partly landed in 6c4e0966 — the "Upgrade Gates" rule group): - Refine `KubeQuotaAlmostFull` to include the resourcequota label in both the on-clause and the summary, so multi-quota namespaces (authentik, beads-server, frigate) report the quota name correctly. grafana.tf: terraform fmt whitespace only. Together with the post-mortem 2026-03-22 (memory id=390) the loop is closed: unattended-upgrades runs again, kernel-class updates can land, but only when cluster health is green and the reboot window is open.	2026-05-22 14:16:41 +00:00
Viktor Barzin	fe75fad467	monitoring: protect grafana ingress with authentik + disable anonymous - add traefik-authentik-forward-auth to grafana ingress middleware list - disable auth.anonymous (was Viewer-by-default for the public) - enable auth.proxy with X-authentik-username so Authentik users get signed in seamlessly (no double-login UX) Prometheus and Alertmanager already had forward-auth — no change.	2026-05-22 14:16:41 +00:00
Viktor Barzin	6c294d4bb0	authentik: zero-endpoints alert + upgrade-validation checklist Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which is the symptom of the auth-proxy Emergency-Access fallback firing — in turn caused by zero ready endpoints on the outpost service. Why this rule and not `kube_endpoint_address_available == 0`: kube-state-metrics endpoint metrics exist as series names but never have current values in this Prometheus pipeline (something is dropping them silently). Detecting the failure at the edge via Traefik is more reliable than instrumenting the broken middle. Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex — the service label is `authentik-ak-outpost-...`, not `authentik-authentik-outpost-...`, so the alert never matched any series and never could have fired. Verified in Prometheus before/after the fix. Add an "Upgrade Validation Checklist" section to `.claude/reference/authentik-state.md` with the seven-step smoke test to run after Authentik chart bumps, provider bumps, or outpost pod recreation. Covers the brittle surfaces (Service selector, JSON patches, postgres backend wiring, access_token_validity TTL, edge auth flow, plan-to-zero).	2026-05-22 14:16:41 +00:00
Viktor Barzin	dc87a9bffe	infra/instagram-poster: shared CNPG-backed benchmark DB, no PVC for scores The instagram_poster.benchmark CLI was writing scores to a sqlite file on the pod's data PVC. Moving it to the shared CNPG cluster so the benchmark scoring path is stateless on the pod, scores survive pod recreation, and the rotation/backup pipeline applies automatically. - dbaas: null_resource.pg_instagram_poster_db creates role + DB (idempotent CREATE IF NOT EXISTS, password placeholder) — same shape as pg_postiz_dbs / pg_wealthfolio_sync_db. - vault: vault_database_secret_backend_static_role.pg_instagram_poster + add to allowed_roles. 7d rotation_period. - instagram-poster: second ExternalSecret (vault-database store) → K8s Secret instagram-poster-benchmark-db with BENCHMARK_PG_HOST/PORT/USER/PASSWORD/DATABASE. env_from on the deployment. reloader.stakater.com/match=true bounces the pod on rotation. Code-side: instagram_poster/benchmark.py now resolves the DB URL from BENCHMARK_DB_URL or BENCHMARK_PG_* env vars; falls back to sqlite for local DevVM scratch runs. Schema bootstraps via Base.metadata.create_all, no alembic step needed for the benchmark-only side-DB. Verified end-to-end via DevVM port-forward: ESO synced, K8s Secret has all 5 fields, pod env shows BENCHMARK_PG_*, smoke-test scoring 3 photos landed in the new PG table with subject_category populated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	93ee45bd25	docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items Update `.claude/reference/authentik-state.md`: - Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session Duration table with the gotcha that the gorilla session store binds the value once at outpost startup (rollout restart needed). - Replace the "session storage moved to Postgres in 2025.10" note that falsely implied the migration was automatic — explain that the `Outpost.managed` field gates the postgres path and our outpost silently stayed on `FilesystemStore` until 2026-05-10. - Document the goauthentik 2026.2.2 service-selector bug (service.py:52) and the JSON-patch workaround. - Document that the standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the `app.kubernetes.io/component=server` pod label. - Note the "Terraform doesn't expose `Outpost.managed`" assumption that holds the `managed=embedded` value in place across applies. Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`: - P2 codify-in-Terraform: DONE. - P3 access_token_validity reduce: DONE-alt (we did the opposite — bumped to 4 weeks — because postgres backend mooted the storage concern). - P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses the loss-of-state class on the embedded outpost itself).	2026-05-22 14:16:41 +00:00
Viktor Barzin	94dfbb9a9c	state(vault): update encrypted state	2026-05-22 14:16:41 +00:00
Viktor Barzin	fbf97dfc5c	state(dbaas): update encrypted state	2026-05-22 14:16:41 +00:00
Viktor Barzin	1fcf911269	authentik/pgbouncer: image_pull_policy IfNotPresent -> Always (match live) The HCL declared `IfNotPresent` since module creation but the live deployment reconciled to `Always` somewhere along the way (likely a Helm/operator default). Since the image is `:latest`, `Always` is the correct value — `IfNotPresent` would skip pulling updated images on pod restart, defeating the point of the floating tag. Drops the lone remaining drift in the authentik stack so plan-to-zero holds across the whole stack, not just the resources I just adopted.	2026-05-22 14:16:41 +00:00
Viktor Barzin	24795ec203	authentik: codify proxy provider TTL + adopt embedded outpost Bump access_token_validity to weeks=4 (was hours=168, UI-managed in ignore_changes). Drives the cookie Max-Age and the proxysession.expires TTL — keeps users logged in for 28d instead of 7d. Adopt the embedded outpost into Terraform so the postgres-session-backend fix from earlier today (2026-05-10) is described as code: - kubernetes_json_patches.deployment carries dshm 2Gi tmpfs, resource requests/limits, the app.kubernetes.io/component=server pod label (workaround for goauthentik 2026.2.2 service.py:52 selector mismatch on standalone embedded outposts), and AUTHENTIK_POSTGRESQL__* envFrom the shared `goauthentik` Secret so the postgres session backend can connect to the dbaas cluster. - kubernetes_json_patches.service replaces the controller-set selector (which targets app.kubernetes.io/name=authentik / the goauthentik-server pods) with the outpost's own labels — without this, endpoints are empty and auth-proxy falls back to Basic-Auth realm "Emergency Access". The `managed` field ("goauthentik.io/outposts/embedded") is server-set and not in the Terraform provider's schema, so TF preserves it across applies (writes only fields it knows about). Plan-to-zero verified.	2026-05-22 14:16:41 +00:00
Viktor Barzin	63fc1e00de	infra/compute: bump k8s-node1 RAM 32 -> 48 GiB Reason: GPU multi-tenancy (frigate + ytdlp-highlights + llama-swap + immich-ml) was hitting 94% memory-request saturation on the old size. The benchmark on 2026-05-10 surfaced this when llama-swap stayed Pending despite GPU time-slicing being on (nvidia.com/gpu replicas=100) - the actual constraint was node1 RAM, not GPU. Procedure: drained node1, qm shutdown 201, qm set 201 --memory 49152, qm start 201, kubelet picked up new capacity (47 GiB / 45.5 GiB allocatable), uncordon, restored llama-swap + immich-ml. Out-of-band qm set is the path here (not Terraform) because VMID 201 is intentionally not managed by TF yet - the telmate/proxmox provider trips on iSCSI-disked VMs (see infra/stacks/infra/main.tf line 442). Adopt this VM into TF once we migrate to bpg/proxmox. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	6e7fe96a40	infra/llama-cpp: benchmark report + -fa flag fix Phase 7 of the vision-LLM benchmark plan. Adds: - docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR, per-model analysis, top-N agreement, cost vs cloud APIs, sample captions). Verdict: qwen3vl-4b for the request path (3.55 s p50, 100% parse, decisive top-N distro); qwen3vl-8b for caption polish. - docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump for diff-checking against future runs. - main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form of the flash-attention flag; without the value llama-server exits before serving any request). - llama-cpp.md architecture doc links the report so future operators land on the deployed-and-evaluated model from one entry point. 300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the GPU exclusively allocated. immich-ml was scaled to 0 for the run (node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a follow-up). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	3da01e6e1e	anubis: only challenge GET requests; allow everything else PrivateBin's XHR `POST /` (paste creation) was the trigger — Anubis's catch-all CHALLENGE rule served an HTML challenge page where the JS expected JSON, breaking paste creation entirely. Same shape will hit any SPA XHR or CORS preflight on the other 8 Anubis-fronted sites (homepage actions, kms upload-then-poll, wrongmove search refresh, jsoncrack share, etc.) the moment it gets exercised. Add an `ALLOW` rule keyed on `method != "GET"` between the AI/UA-block imports and the catch-all CHALLENGE. Rationale: * AI scrapers consume GET response bodies — they don't POST. * State-mutating XHRs and OPTIONS preflight need to bypass the challenge or the app breaks. * CrowdSec + per-route rate-limit + app-level auth already cover abuse on mutating methods, so this gives up nothing. * Hard-deny rules for known-bad bots run first, so a declared bad bot can't sneak through by sending a POST. Also added a `checksum/policy` annotation on the Anubis pod template sourced from `sha256(coalesce(var.policy_yaml, default_policy_yaml))` so future policy changes auto-roll the deployment instead of needing a manual `kubectl rollout restart`. f1-stream had its own policy override (path carve-outs for SvelteKit asset hashes and JSON data routes); mirrored the new rule there too. Applied to all 8 Anubis-fronted stacks: blog, kms, f1-stream, travel_blog, real-estate-crawler, homepage, cyberchef, jsoncrack. Verified per stack: GET / returns the Anubis challenge page; POST, PUT, DELETE, OPTIONS pass through to the backend (HTTP 301/405/502 from the upstream app, never the Anubis "not a bot" HTML).	2026-05-22 14:16:40 +00:00
root	ff3d64159a	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:40 +00:00
Viktor Barzin	1f0bd11d3f	privatebin: drop Anubis — broke XHR paste creation PrivateBin's UI POSTs the encrypted blob to `/` via XHR. With Anubis in front, the catch-all CHALLENGE rule returned an HTML challenge page where the JS expected JSON, so paste creation failed silently for every user. The challenge cookie didn't bypass it — Anubis appears to issue a fresh challenge on POST regardless of cookie state. Pastes are client-side encrypted; AI scrapers gain nothing from indexing them, so the default `anti_ai_scraping` middleware is enough protection. Restoring the ingress to point straight at the privatebin service. CSP `wasm-unsafe-eval` retained — PrivateBin's zlib.wasm needs it independent of Anubis. This matches the rule already documented in infra/.claude/CLAUDE.md: "DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints — clients without JS can't solve PoW." A SPA's XHR is the same shape. Verified: GET / returns PrivateBin HTML (not the Anubis challenge), POST / returns PrivateBin's own JSON error envelope.	2026-05-22 14:16:40 +00:00
Viktor Barzin	9c617e6d38	infra/llama-cpp: add stack — llama-swap fronting Qwen3-VL + MiniCPM-V Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc. Idle TTL 10min so models unload between benchmark batches. Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot download Job pulls Q4_K_M GGUF + mmproj per model, creates stable model.gguf / mmproj.gguf symlinks so the llama-swap config is filename-agnostic, then warms the kernel page cache. GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml to 0 during benchmark windows. wait_for_rollout=false so apply doesn't block on GPU availability. Initial use case: vision-LLM benchmark for instagram-poster candidate scoring; future consumers (HA, agentic tooling) hit the same endpoint via LiteLLM at the gateway. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:40 +00:00
Viktor Barzin	0752bd49c8	kms: document native DNS auto-discovery (no client config needed) LAN clients with DNS suffix viktorbarzin.lan now activate with zero configuration — Windows queries _vlmcs._tcp.viktorbarzin.lan SRV by default and the chain resolves through vlmcs.viktorbarzin.lan to the new 10.0.20.202 KMS IP. DNS state (Technitium primary, replicated to secondary+tertiary by the existing technitium-zone-sync CronJob every 30 min): - _vlmcs._tcp.viktorbarzin.lan SRV 0 0 1688 vlmcs.viktorbarzin.lan (was: target=kms.viktorbarzin.lan) - vlmcs.viktorbarzin.lan A 10.0.20.202 (added) - kms.viktorbarzin.lan A 10.0.20.200 (unchanged — still the Traefik LB for the user-facing website at kms.viktorbarzin.lan/) vlmcs.viktorbarzin.lan was added as a dedicated KMS-server hostname rather than retargeting kms.viktorbarzin.lan so the LAN-direct website keeps working without depending on hairpin NAT through pfSense. Verified end-to-end on WIN10Pro-DS32 (192.168.1.230): slmgr /ckms → slmgr /ato → "Product activated successfully" with "KMS machine name from DNS: vlmcs.viktorbarzin.lan:1688" and "KMS machine IP address: 10.0.20.202". Real client IP 192.168.1.230 appears in vlmcsd log and in the slack-notifier sent line; second activation within the dedup window correctly increments kms_activations_dedup_skipped_total.	2026-05-22 14:16:40 +00:00
Viktor Barzin	d85b54d89d	kms: per-connection state in notifier (vlmcsd is multi-threaded) Bug found via E2E test against the Windows VM (VMID 300). The single shared `state` dict in slack-notifier.py worked when vlmcsd processed one connection at a time, but real Windows KMS activations hold the connection open ~30 seconds (handshake + keep-alive). During that window vlmcsd accepts other concurrent connections — most relevantly the new kubelet TCP readiness probe every 5s — and each new OPEN line reset the shared state, wiping the in-flight activation's app/product/host before its CLOSE arrived. Result: real activations were misclassified as probes (no Slack post, no metric increment). Fix: state is now a dict keyed by `ip:port` with one sub-dict per in-flight connection. A `__current` pointer tracks the most recent OPEN so unkeyed detail lines (Application ID, Workstation name, etc.) can be attributed correctly — vlmcsd writes detail lines immediately after the OPEN and before any subsequent OPEN, so the heuristic holds. Orphan CLOSEs (notifier started mid-conn) are now silently dropped instead of emitting an empty probe event. Two new regression tests: - test_kubelet_probe_during_long_activation: 5s probe interleaved into a 31s activation block — exact production failure mode. - test_orphan_close_no_event: bare CLOSE without prior OPEN. Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier posted to Slack with ip=192.168.1.230 source=external product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan' and kms_activations_total{product=Windows 10 Professional, status=Licensed} 1 — real WAN client IP preserved through the ETP=Local + dedicated MetalLB IP chain end to end.	2026-05-22 14:16:40 +00:00
Viktor Barzin	4a3ca572e8	fire-planner: imagePullPolicy=Always on alembic-migrate init container After a rollout-restart, the main container (default Always for :latest) pulled the new image with alembic 0003, but the init container defaulted to IfNotPresent and reused a cached old image lacking 0003 → "Can't locate revision identified by '0003'" → CrashLoopBackOff. Setting Always on the init container so both containers stay in lockstep across rollouts. Longer term we should switch the deployment to 8-char git-SHA tags per the cluster policy in .claude/CLAUDE.md, but this unblocks the Wave 1 deploy in the meantime.	2026-05-22 14:16:40 +00:00
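
Note on d85b54d89d (kms: per-connection state in notifier): the fix it describes can be sketched roughly as below. This is a minimal illustration only, not the actual slack-notifier.py. The `OPEN <ip:port>` / `CLOSE <ip:port>` / `Field: value` line format is a simplified assumption and not vlmcsd's real log output, and the function name `process` is hypothetical. The `current` variable plays the role of the `__current` pointer from the commit message.

```python
# Sketch of per-connection state keyed by "ip:port".
# ASSUMPTION: the log format here (OPEN/CLOSE/"Field: value") is a simplified
# stand-in, not vlmcsd's real output; `process` is a hypothetical name.

def process(lines):
    """Return one event dict per CLOSE that had a matching OPEN."""
    state = {}      # "ip:port" -> details collected for that in-flight connection
    current = None  # key of the most recent OPEN (the "__current" pointer)
    events = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("OPEN "):
            # A new connection gets its own sub-dict; other connections are untouched.
            current = line[len("OPEN "):]
            state[current] = {}
        elif line.startswith("CLOSE "):
            key = line[len("CLOSE "):]
            details = state.pop(key, None)
            if details is None:
                continue  # orphan CLOSE (notifier started mid-connection): drop it
            if current == key:
                current = None
            # A connection that carried no detail lines is treated as a probe.
            kind = "activation" if details else "probe"
            events.append({"conn": key, "kind": kind, **details})
        elif ": " in line and current in state:
            # Unkeyed detail lines are attributed to the most recent OPEN.
            field, value = line.split(": ", 1)
            state[current][field] = value
    return events
```

Feeding it an activation block with a readiness probe interleaved before the activation's CLOSE yields a probe event followed by an activation event that still has its detail fields, which is the failure mode the commit's first regression test covers. A bare CLOSE with no prior OPEN yields no event.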

3205 commits