infra

Author	SHA1	Message	Date
Viktor Barzin	4103ea2ba0	monitoring(prometheus): keep all 4 kubelet_volume_stats_inodes metrics pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if all four kubelet_volume_stats metrics (available_bytes, capacity_bytes, inodes_free, inodes) are retrieved. The keep-list in the kubernetes-nodes scrape job had available_bytes and capacity_bytes (post 9d5da4d8) but was missing the two inode metrics, so the autoresizer's reconcile logged "failed to get volume stats" for every PVC and never resized anything. Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free to the regex. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
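The failure mode in the commit above, a keep-list that passes only some of the four series, can be checked mechanically. The following is a minimal sketch, not the repository's scrape config: the two keep-list strings are hypothetical reconstructions, and `missing_from_keep_list` is an illustrative helper name. It relies on Prometheus anchoring relabel regexes at both ends, which `re.fullmatch` approximates for plain alternations like these.

```python
import re

# The four series the commit says pvc-autoresizer needs for each PVC.
REQUIRED = [
    "kubelet_volume_stats_available_bytes",
    "kubelet_volume_stats_capacity_bytes",
    "kubelet_volume_stats_inodes_free",
    "kubelet_volume_stats_inodes",
]


def missing_from_keep_list(keep_regex: str, required=REQUIRED) -> list[str]:
    """Return the required metric names that a `keep` regex would drop.

    Relabel regexes are fully anchored in Prometheus, so fullmatch is used
    rather than search.
    """
    pattern = re.compile(keep_regex)
    return [name for name in required if pattern.fullmatch(name) is None]


# Hypothetical keep-lists for illustration; not copied from the repository.
before = "kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes"
after = before + "|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_free"

print(missing_from_keep_list(before))
print(missing_from_keep_list(after))
```

A check of this shape could run in CI against the rendered scrape config, so that a future trim of the keep-list cannot silently disable the autoresizer again.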
Viktor Barzin	6d3308c848	authentik: add public guest auto-login flow + dedicated outpost + traefik public middleware Phase 1+2 of default-deny ingress plan. Adds the infrastructure for an `auth = "public"` ingress tier that auto-binds anonymous requests to a `guest` Authentik user (no UI prompt), so public sites are still recorded as authenticated by Authentik for audit purposes — but as `guest`, not by leaking the standard catchall flow. - guest user in `Public Guests` group (NOT `Allow Login Users`). - `public-auto-login` flow: stage_binding policy sets `pending_user = guest`, `evaluate_on_plan = false` + `re_evaluate_policies = true` so flow_plan is populated when the policy mutates it; `authentication = none` lets anonymous requests enter. - `Provider for Public` proxy provider (forward_domain, cookie_domain viktorbarzin.me) with `authentication_flow = public-auto-login`. - Dedicated `public` outpost: only the public provider bound, deployed as `ak-outpost-public` Deployment+Service in the `authentik` namespace by Authentik's K8s controller. - `public-auth.viktorbarzin.me` ingress exposes the public outpost's `/outpost.goauthentik.io/*` so OAuth callbacks land on it (the embedded outpost doesn't know about the public provider, so `authentik.viktorbarzin.me` callbacks would fail). - `authentik-forward-auth-public` traefik middleware points at the public outpost service (not via the auth-proxy nginx fallback). The plan's `?app=public` dispatch idea was tested and rejected — the embedded outpost dispatches purely by Host header, so a dedicated outpost was the only way to isolate the public flow without conflicts. No ingresses use the new middleware yet — Phase 3+4 (the ingress_factory `auth` variable refactor + audit pass) wires it up. This commit is additive and behaviour-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	ff5416ff40	proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true) Without this annotation on the StorageClass, pvc-autoresizer's controller filters the SC out at the index lookup stage and never patches any of its PVCs, regardless of utilization or per-PVC threshold/increase/storage_limit annotations. Internal metric pvcautoresizer_loop_seconds_total ticked but no PVCs were ever evaluated — visible cluster-wide as PVAutoExpanding alerts firing for forgejo-data-encrypted (82%) and audit-vault-0 (81%) without any ResizeStarted events ever following. The Prometheus scrape-config fix in 9d5da4d8 was a prerequisite (autoresizer reads kubelet_volume_stats_available_bytes) but not sufficient on its own. Also pinning chart version to 0.5.6 so the next apply doesn't incidentally bump to 0.5.7. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor	ea9b5542d1	x402: flip gateway live with Viktor's wallet + Slack payment notifications Wires the traefik stack to read two new fields from secret/viktor: * x402_wallet_address -> 0xCc33BD250d39752e0ceaB616f8a05F72274a659f * alertmanager_slack_api_url (existing) -> reused as the per-payment notification webhook so payment events arrive in the same Slack channel as other infra alerts. Gateway now runs `wallet_set:true, dry_run:false`. Verified end-to-end: - Browser UA on all 9 sites -> 200 (passes through to Anubis) - python-requests/2.31 + scrapy + ClaudeBot UA -> 402 with PaymentRequiredResponse, payTo == Viktor's wallet, amount=10000 micro-USDC, network=base, asset=Base USDC contract - Direct Slack-webhook test from inside cluster -> HTTP 200 Image bumped to forgejo.../x402-gateway:d9b83125 with Slack-format notification payload (text=..., username=x402-gateway, icon_emoji=💰; auxiliary fields preserved for richer receivers). Notifications fire on every successful X-PAYMENT validation; failures on Slack webhook are logged at WARN, never block the request, never double-charge the bot.	2026-05-22 14:16:41 +00:00
Viktor Barzin	58789cde8b	kured(sentinel-gate): fix auth + write-perm so safety checks actually run Test 3 validation surfaced two latent bugs in the sentinel-gate DaemonSet that have been masked since 2026-04-18 (when uu was off, nothing wrote /var/run/reboot-required, so the gate never had to fire): 1. automount_service_account_token=false on both the SA and the pod spec → kubectl in the script falls back to localhost:8080 on every call. Each check (`kubectl get nodes`, `kubectl get pods -n calico-system`, transition-time read) errors to stderr and emits empty stdout. `wc -l` reports 0 → checks "pass" with no real data. 2. bitnami/kubectl:latest runs as uid=1001 by default. The hostPath /var/run is root:root 0755 → final `touch /host/var-run/gated-reboot-required` failed with EACCES. Fail-safe by accident — but if anything had ever loosened those perms, the broken checks above would have green-lit the gate with no real validation. Fix: enable token mount on the SA + pod, set securityContext.run_as_user=0 on the container. Verified post-fix: kubectl returns all 5 nodes, touch succeeds, sentinel-gate now reports the correct `BLOCKED: A node transitioned Ready within the last 24 hours (soak window)` when triggered with k8s-node1's recent reboot within the cool-down period. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	278ef5f19b	monitoring(grafana): swap python3 for jq in folder-ACL local-exec CI image (ci/Dockerfile) is alpine + jq, no python3. The grafana_admin_only_folder_acl null_resource was parsing /api/folders with a python3 oneliner, which crashed every CI apply with "python3: command not found" and made every monitoring stack apply fail in CI (worked locally because the dev VM has python3). jq is already in the CI image and produces the same output.	2026-05-22 14:16:41 +00:00
Viktor Barzin	5c0ea96a91	infra: re-enable unattended-upgrades with kured prometheus-gating Reverses the March 2026 outage mitigation that disabled unattended- upgrades cluster-wide. Now re-enables it on the k8s template VM with: - Allowed-Origins limited to security/updates pockets - Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark hold on the cluster-critical components) - Automatic-Reboot disabled — kured drives the actual reboots - Compatible with the existing kured + sentinel-gate flow kured side: - rebootDelay 30s, concurrency 1 - Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak window from the post-mortem) - prometheusUrl + alertFilterRegexp wired so any firing non-ignored alert halts the rollout. Ignore-list excludes self-referential alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/ InfoInhibitor) that would otherwise deadlock kured. Prometheus side (already partly landed in 6c4e0966 — the "Upgrade Gates" rule group): - Refine `KubeQuotaAlmostFull` to include the resourcequota label in both the on-clause and the summary, so multi-quota namespaces (authentik, beads-server, frigate) report the quota name correctly. grafana.tf: terraform fmt whitespace only. Together with the post-mortem 2026-03-22 (memory id=390) the loop is closed: unattended-upgrades runs again, kernel-class updates can land, but only when cluster health is green and the reboot window is open.	2026-05-22 14:16:41 +00:00
Viktor Barzin	fe75fad467	monitoring: protect grafana ingress with authentik + disable anonymous - add traefik-authentik-forward-auth to grafana ingress middleware list - disable auth.anonymous (was Viewer-by-default for the public) - enable auth.proxy with X-authentik-username so Authentik users get signed in seamlessly (no double-login UX) Prometheus and Alertmanager already had forward-auth — no change.	2026-05-22 14:16:41 +00:00
Viktor Barzin	6c294d4bb0	authentik: zero-endpoints alert + upgrade-validation checklist Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which is the symptom of the auth-proxy Emergency-Access fallback firing — in turn caused by zero ready endpoints on the outpost service. Why this rule and not `kube_endpoint_address_available == 0`: kube-state-metrics endpoint metrics exist as series names but never have current values in this Prometheus pipeline (something is dropping them silently). Detecting the failure at the edge via Traefik is more reliable than instrumenting the broken middle. Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex — the service label is `authentik-ak-outpost-...`, not `authentik-authentik-outpost-...`, so the alert never matched any series and never could have fired. Verified in Prometheus before/after the fix. Add an "Upgrade Validation Checklist" section to `.claude/reference/authentik-state.md` with the seven-step smoke test to run after Authentik chart bumps, provider bumps, or outpost pod recreation. Covers the brittle surfaces (Service selector, JSON patches, postgres backend wiring, access_token_validity TTL, edge auth flow, plan-to-zero).	2026-05-22 14:16:41 +00:00
Viktor Barzin	dc87a9bffe	infra/instagram-poster: shared CNPG-backed benchmark DB, no PVC for scores The instagram_poster.benchmark CLI was writing scores to a sqlite file on the pod's data PVC. Moving it to the shared CNPG cluster so the benchmark scoring path is stateless on the pod, scores survive pod recreation, and the rotation/backup pipeline applies automatically. - dbaas: null_resource.pg_instagram_poster_db creates role + DB (idempotent CREATE IF NOT EXISTS, password placeholder) — same shape as pg_postiz_dbs / pg_wealthfolio_sync_db. - vault: vault_database_secret_backend_static_role.pg_instagram_poster + add to allowed_roles. 7d rotation_period. - instagram-poster: second ExternalSecret (vault-database store) → K8s Secret instagram-poster-benchmark-db with BENCHMARK_PG_HOST/ PORT/USER/PASSWORD/DATABASE. env_from on the deployment. reloader.stakater.com/match=true bounces the pod on rotation. Code-side: instagram_poster/benchmark.py now resolves the DB URL from BENCHMARK_DB_URL or BENCHMARK_PG_* env vars; falls back to sqlite for local DevVM scratch runs. Schema bootstraps via Base.metadata.create_all, no alembic step needed for the benchmark-only side-DB. Verified end-to-end via DevVM port-forward: ESO synced, K8s Secret has all 5 fields, pod env shows BENCHMARK_PG_*, smoke-test scoring 3 photos landed in the new PG table with subject_category populated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
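The resolution order described in the commit above (explicit `BENCHMARK_DB_URL`, then the `BENCHMARK_PG_*` fields, then a sqlite fallback) might look roughly like the sketch below. The function name, the sqlite filename, and the exact URL scheme are assumptions; the real code in `instagram_poster/benchmark.py` is not shown in this log.

```python
import os
from urllib.parse import quote


def resolve_benchmark_db_url(env=None, sqlite_path="benchmark.sqlite3") -> str:
    """Pick a database URL: explicit URL, then PG_* fields, then sqlite.

    `sqlite_path` is a hypothetical default for local scratch runs.
    """
    env = os.environ if env is None else env

    # 1. A complete URL wins outright.
    if env.get("BENCHMARK_DB_URL"):
        return env["BENCHMARK_DB_URL"]

    # 2. Otherwise assemble one from the five fields the K8s Secret provides.
    keys = ("HOST", "PORT", "USER", "PASSWORD", "DATABASE")
    values = {k: env.get(f"BENCHMARK_PG_{k}") for k in keys}
    if all(values.values()):
        # Rotated passwords may contain URL-reserved characters, so quote them.
        user = quote(values["USER"], safe="")
        password = quote(values["PASSWORD"], safe="")
        return (
            f"postgresql://{user}:{password}"
            f"@{values['HOST']}:{values['PORT']}/{values['DATABASE']}"
        )

    # 3. Nothing configured: fall back to a local sqlite file.
    return f"sqlite:///{sqlite_path}"
```

Requiring all five fields before choosing Postgres is a design choice of this sketch: a partially synced Secret then falls back to sqlite instead of producing a malformed URL.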
Viktor Barzin	1fcf911269	authentik/pgbouncer: image_pull_policy IfNotPresent -> Always (match live) The HCL declared `IfNotPresent` since module creation but the live deployment reconciled to `Always` somewhere along the way (likely a Helm/operator default). Since the image is `:latest`, `Always` is the correct value — `IfNotPresent` would skip pulling updated images on pod restart, defeating the point of the floating tag. Drops the lone remaining drift in the authentik stack so plan-to-zero holds across the whole stack, not just the resources I just adopted.	2026-05-22 14:16:41 +00:00
Viktor Barzin	24795ec203	authentik: codify proxy provider TTL + adopt embedded outpost Bump access_token_validity to weeks=4 (was hours=168, UI-managed in ignore_changes). Drives the cookie Max-Age and the proxysession.expires TTL — keeps users logged in for 28d instead of 7d. Adopt the embedded outpost into Terraform so the postgres-session-backend fix from earlier today (2026-05-10) is described as code: - kubernetes_json_patches.deployment carries dshm 2Gi tmpfs, resource requests/limits, the app.kubernetes.io/component=server pod label (workaround for goauthentik 2026.2.2 service.py:52 selector mismatch on standalone embedded outposts), and AUTHENTIK_POSTGRESQL__* envFrom the shared `goauthentik` Secret so the postgres session backend can connect to the dbaas cluster. - kubernetes_json_patches.service replaces the controller-set selector (which targets app.kubernetes.io/name=authentik / the goauthentik-server pods) with the outpost's own labels — without this, endpoints are empty and auth-proxy falls back to Basic-Auth realm "Emergency Access". The `managed` field ("goauthentik.io/outposts/embedded") is server-set and not in the Terraform provider's schema, so TF preserves it across applies (writes only fields it knows about). Plan-to-zero verified.	2026-05-22 14:16:41 +00:00
Viktor Barzin	6e7fe96a40	infra/llama-cpp: benchmark report + -fa flag fix Phase 7 of the vision-LLM benchmark plan. Adds: - docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR, per-model analysis, top-N agreement, cost vs cloud APIs, sample captions). Verdict: qwen3vl-4b for the request path (3.55 s p50, 100% parse, decisive top-N distro); qwen3vl-8b for caption polish. - docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump for diff-checking against future runs. - main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form of the flash-attention flag; without the value llama-server exits before serving any request). - llama-cpp.md architecture doc links the report so future operators land on the deployed-and-evaluated model from one entry point. 300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the GPU exclusively allocated. immich-ml was scaled to 0 for the run (node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a follow-up). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00
Viktor Barzin	3da01e6e1e	anubis: only challenge GET requests; allow everything else PrivateBin's XHR `POST /` (paste creation) was the trigger — Anubis's catch-all CHALLENGE rule served an HTML challenge page where the JS expected JSON, breaking paste creation entirely. Same shape will hit any SPA XHR or CORS preflight on the other 8 Anubis-fronted sites (homepage actions, kms upload-then-poll, wrongmove search refresh, jsoncrack share, etc.) the moment it gets exercised. Add an `ALLOW` rule keyed on `method != "GET"` between the AI/UA-block imports and the catch-all CHALLENGE. Rationale: * AI scrapers consume GET response bodies — they don't POST. * State-mutating XHRs and OPTIONS preflight need to bypass the challenge or the app breaks. * CrowdSec + per-route rate-limit + app-level auth already cover abuse on mutating methods, so this gives up nothing. * Hard-deny rules for known-bad bots run first, so a declared bad bot can't sneak through by sending a POST. Also added a `checksum/policy` annotation on the Anubis pod template sourced from `sha256(coalesce(var.policy_yaml, default_policy_yaml))` so future policy changes auto-roll the deployment instead of needing a manual `kubectl rollout restart`. f1-stream had its own policy override (path carve-outs for SvelteKit asset hashes and JSON data routes); mirrored the new rule there too. Applied to all 8 Anubis-fronted stacks: blog, kms, f1-stream, travel_blog, real-estate-crawler, homepage, cyberchef, jsoncrack. Verified per stack: GET / returns the Anubis challenge page; POST, PUT, DELETE, OPTIONS pass through to the backend (HTTP 301/405/502 from the upstream app, never the Anubis "not a bot" HTML).	2026-05-22 14:16:40 +00:00
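The rule ordering in the commit above (hard-deny known bots first, then allow non-GET, then the catch-all challenge) can be illustrated with a small first-match evaluator. This is a sketch of the ordering only, not Anubis's policy engine or its YAML syntax; `BLOCKED_UA_SUBSTRINGS` is a hypothetical stand-in for the imported AI/UA block lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    method: str
    user_agent: str


# Hypothetical stand-in for the imported AI/UA block rules.
BLOCKED_UA_SUBSTRINGS = ("ClaudeBot", "GPTBot")


def decide(req: Request) -> str:
    """First matching rule wins, in the order the commit describes."""
    # 1. Hard-deny rules run first, so a declared bad bot cannot get
    #    through just by sending a POST.
    ua = req.user_agent.lower()
    if any(s.lower() in ua for s in BLOCKED_UA_SUBSTRINGS):
        return "DENY"
    # 2. Anything that is not a GET (XHR POSTs, OPTIONS preflight, ...)
    #    bypasses the challenge.
    if req.method.upper() != "GET":
        return "ALLOW"
    # 3. Catch-all: plain GETs get the proof-of-work challenge.
    return "CHALLENGE"
```

Moving step 2 above step 1 would let a blocked User-Agent through on any POST, which is why the commit places the new rule between the block imports and the catch-all.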
root	ff3d64159a	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:40 +00:00
Viktor Barzin	1f0bd11d3f	privatebin: drop Anubis — broke XHR paste creation PrivateBin's UI POSTs the encrypted blob to `/` via XHR. With Anubis in front, the catch-all CHALLENGE rule returned an HTML challenge page where the JS expected JSON, so paste creation failed silently for every user. The challenge cookie didn't bypass it — Anubis appears to issue a fresh challenge on POST regardless of cookie state. Pastes are client-side encrypted; AI scrapers gain nothing from indexing them, so the default `anti_ai_scraping` middleware is enough protection. Restoring the ingress to point straight at the privatebin service. CSP `wasm-unsafe-eval` retained — PrivateBin's zlib.wasm needs it independent of Anubis. This matches the rule already documented in infra/.claude/CLAUDE.md: "DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints — clients without JS can't solve PoW." A SPA's XHR is the same shape. Verified: GET / returns PrivateBin HTML (not the Anubis challenge), POST / returns PrivateBin's own JSON error envelope.	2026-05-22 14:16:40 +00:00
Viktor Barzin	9c617e6d38	infra/llama-cpp: add stack — llama-swap fronting Qwen3-VL + MiniCPM-V Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc. Idle TTL 10min so models unload between benchmark batches. Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot download Job pulls Q4_K_M GGUF + mmproj per model, creates stable model.gguf / mmproj.gguf symlinks so the llama-swap config is filename-agnostic, then warms the kernel page cache. GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml to 0 during benchmark windows. wait_for_rollout=false so apply doesn't block on GPU availability. Initial use case: vision-LLM benchmark for instagram-poster candidate scoring; future consumers (HA, agentic tooling) hit the same endpoint via LiteLLM at the gateway. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:40 +00:00
Viktor Barzin	d85b54d89d	kms: per-connection state in notifier (vlmcsd is multi-threaded) Bug found via E2E test against the Windows VM (VMID 300). The single shared `state` dict in slack-notifier.py worked when vlmcsd processed one connection at a time, but real Windows KMS activations hold the connection open ~30 seconds (handshake + keep-alive). During that window vlmcsd accepts other concurrent connections — most relevantly the new kubelet TCP readiness probe every 5s — and each new OPEN line reset the shared state, wiping the in-flight activation's app/product/host before its CLOSE arrived. Result: real activations were misclassified as probes (no Slack post, no metric increment). Fix: state is now a dict keyed by `ip:port` with one sub-dict per in-flight connection. A `__current` pointer tracks the most recent OPEN so unkeyed detail lines (Application ID, Workstation name, etc.) can be attributed correctly — vlmcsd writes detail lines immediately after the OPEN and before any subsequent OPEN, so the heuristic holds. Orphan CLOSEs (notifier started mid-conn) are now silently dropped instead of emitting an empty probe event. Two new regression tests: - test_kubelet_probe_during_long_activation: 5s probe interleaved into a 31s activation block — exact production failure mode. - test_orphan_close_no_event: bare CLOSE without prior OPEN. Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier posted to Slack with ip=192.168.1.230 source=external product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan' and kms_activations_total{product=Windows 10 Professional, status=Licensed} 1 — real WAN client IP preserved through the ETP=Local + dedicated MetalLB IP chain end to end.	2026-05-22 14:16:40 +00:00
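The per-connection fix described above can be sketched as a small tracker. This is an illustration under assumptions, not the code in slack-notifier.py: parsing of vlmcsd log lines is left out (events arrive as already-parsed method calls), the class and method names are hypothetical, and the "most recent OPEN" pointer is an attribute here rather than the `__current` key the commit mentions.

```python
class ConnectionTracker:
    """Track in-flight connections keyed by "ip:port"."""

    def __init__(self):
        self.state = {}      # "ip:port" -> details collected so far
        self.current = None  # key of the most recent OPEN

    def on_open(self, key):
        # A new OPEN creates its own sub-dict; other connections are untouched.
        self.state[key] = {}
        self.current = key

    def on_detail(self, name, value):
        # Detail lines carry no connection key, so attribute them to the
        # most recent OPEN (the heuristic the commit describes).
        if self.current in self.state:
            self.state[self.current][name] = value

    def on_close(self, key):
        details = self.state.pop(key, None)
        if self.current == key:
            self.current = None
        if details is None:
            return None  # orphan CLOSE: notifier started mid-connection
        kind = "activation" if details else "probe"
        return {"conn": key, "kind": kind, **details}


# The production failure mode: a probe interleaved into a long activation.
tracker = ConnectionTracker()
tracker.on_open("192.168.1.230:50000")
tracker.on_detail("product", "Windows 10 Professional")
tracker.on_open("10.0.20.1:40000")               # kubelet probe arrives
probe = tracker.on_close("10.0.20.1:40000")      # probe closes first
activation = tracker.on_close("192.168.1.230:50000")
print(probe["kind"], activation["kind"])
```

With a single shared dict, the second `on_open` would have wiped the product collected for the first connection. Keying by `ip:port` keeps the two apart.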
Viktor Barzin	4a3ca572e8	fire-planner: imagePullPolicy=Always on alembic-migrate init container After a rollout-restart, the main container (default Always for :latest) pulled the new image with alembic 0003, but the init container defaulted to IfNotPresent and reused a cached old image lacking 0003 → "Can't locate revision identified by '0003'" → CrashLoopBackOff. Setting Always on the init container so both containers stay in lockstep across rollouts. Longer term we should switch the deployment to 8-char git-SHA tags per the cluster policy in .claude/CLAUDE.md, but this unblocks the Wave 1 deploy in the meantime.	2026-05-22 14:16:40 +00:00
Viktor Barzin	67b11a964a	kms: dedicate MetalLB IP 10.0.20.202 + filter probe noise Two coupled fixes for the hourly Slack noise + missing client IPs: 1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP 10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19. Sharing 10.0.20.200 is blocked because all 10 services there are ETP=Cluster and MetalLB requires consistent ETP per shared IP. 2. Slack notifier now suppresses Slack posts for bare TCP open/close pairs (no Application/Activation block) — these are Uptime Kuma's port monitor and the new kubelet readiness/liveness probes. Probe counts go to a new metric kms_connection_probes_total{source} where source classifies the IP as internal_pod / cluster_node / external. Real activations are unaffected. Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod Ready on the listener actually being up — required for ETP=Local so MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving. pfSense side (applied separately, not codified): - New alias k8s_kms_lb = 10.0.20.202 (KMS-only) - WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb - All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks, smtps, etc.) untouched Runbook updated. Tests added for classify_source / is_probe / process_line.	2026-05-22 14:16:40 +00:00
Viktor Barzin	a5e9fd8c71	fire-planner: expose actualbudget creds via ExternalSecret Adds 3 new keys (ACTUALBUDGET_API_URL/KEY/SYNC_ID) sourced from Vault secret/fire-planner so the FastAPI backend can read viktor's spending from the in-cluster actualbudget HTTP API and prefill the Annual spending field on the WhatIf form. Vault keys seeded manually ahead of this commit; ESO has already synced the K8s Secret. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:40 +00:00
Viktor Barzin	753e9bb971	x402: consolidate to a single shared forwardAuth gateway The per-site `x402_instance` module created one Deployment + Service + PDB per protected host (9 in total, 9×64Mi). Every pod was running the exact same logic with the same config — the only thing that varied was the upstream URL, which we don't even need since the gateway can return 200 to "allow" and Traefik handles the upstream itself. Refactor to the same pattern as `ai-bot-block`: * single deployment + service in `traefik` namespace, 2 replicas, HA * Traefik `Middleware` CRD `x402` (forwardAuth → x402-gateway:8080/auth) * each consumer ingress just appends `traefik-x402@kubernetescrd` to its middleware chain via `extra_middlewares` x402-gateway gains a `MODE=forwardauth` env var that returns 200 (allow) or 402 (with x402 PaymentRequiredResponse body) instead of reverse- proxying. Image: ghcr ... f4804d62. Pod count: 9 → 2 (78% memory saved). All 9 sites verified still serving the Anubis challenge to plain curl with identical TTFB. DRY_RUN until `var.x402_wallet_address` is set on the traefik stack. Removes `modules/kubernetes/x402_instance/` (dead code now).	2026-05-10 11:12:40 +00:00
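In forwardAuth mode the gateway only has to answer "allow" (200) or "payment required" (402); Traefik handles the upstream itself. The decision might be sketched as below. The real gateway is a Go service whose code is not shown in this log, so this is an assumption-laden illustration: the function name is hypothetical, the User-Agent pattern is a subset of the default list quoted in the x402 deployment commit, case-sensitive matching is assumed, and facilitator validation of the `X-PAYMENT` header is reduced to a boolean.

```python
import re

# Subset of the default AI-bot User-Agent list; case-sensitive matching
# is an assumption of this sketch.
BOT_UA = re.compile(r"ClaudeBot|GPTBot|scrapy|python-requests")


def forward_auth_status(user_agent: str, wallet_address: str,
                        payment_valid: bool = False) -> int:
    """Return the HTTP status a forwardAuth endpoint would answer with."""
    # No wallet configured: DRY_RUN, every request is allowed.
    if not wallet_address:
        return 200
    # Browsers and other undeclared clients pass straight through.
    if BOT_UA.search(user_agent) is None:
        return 200
    # Declared bots need a validated payment; `payment_valid` stands in
    # for the facilitator check on the X-PAYMENT header.
    return 200 if payment_valid else 402
```

Because the function carries no upstream URL, one shared deployment can serve every protected host, which is the consolidation the commit describes.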
Viktor Barzin	ce4a75d79a	x402: deploy payment gateway in front of Anubis on all 9 public sites Adds modules/kubernetes/x402_instance/ — a small Go reverse proxy (forgejo.viktorbarzin.me/viktor/x402-gateway:ce333419) that selectively issues HTTP 402 Payment Required to declared AI-bot User-Agents and validates X-PAYMENT headers against a Coinbase x402 facilitator. Browsers are forwarded transparently to Anubis (which then handles the JS PoW gate as before). Wired into all nine Anubis-fronted sites: ingress -> x402-X -> anubis-X -> backend While `wallet_address` is empty the gateway runs in DRY_RUN — every request is transparent-proxied, no 402s issued. This lets the pod sit in the request path with zero behavioural impact today; flipping the wallet variable in the per-stack module call activates payment-required mode for AI-bot UAs. Default config: Base mainnet USDC, $0.01/req, x402.org/facilitator, catch-all UA list (ClaudeBot\|GPTBot\|Bytespider\|meta-externalagent\| PerplexityBot\|GoogleOther\|cohere-ai\|Diffbot\|Amazonbot\| Applebot-Extended\|FacebookBot\|ImagesiftBot\|YouBot\|anthropic-ai\| Claude-Web\|petalbot\|spawning-ai\|scrapy\|python-requests). Verified post-apply: 9/9 pods Running, all 9 sites still serve the Anubis challenge to plain curl with identical TTFB, x402 logs confirm "dry_run":true on every instance.	2026-05-10 11:12:40 +00:00
root	a1b659de2a	Woodpecker CI deploy [CI SKIP]	2026-05-10 11:12:40 +00:00
Viktor Barzin	04cb22fd3b	anubis: re-protect f1 with a per-host policy that allows JSON routes Earlier f1 revert left the host fully unprotected (no Anubis, exclude_crowdsec=true on the ingress already). Re-add Anubis with a custom policy_yaml that: - ALLOWs /_app/* (SvelteKit immutable JS/CSS chunks loaded before any cookie exists), /openapi.json, /docs, /api/* (FastAPI meta). - ALLOWs the 9 known JSON/proxy routes (schedule, streams, embed, embed-asset, extract, extractors, health, proxy, relay) so the SvelteKit SPA's XHRs return JSON instead of the challenge HTML. - Catch-all CHALLENGE for everything else — the SPA HTML pages (which fall through to FastAPI's `/{path}` catch-all) get the PoW gate. The ALLOWed JSON routes are technically scrapeable by a determined bot, but the user's stated goal is "avoid accidental scrapes" — the HTML/SPA is the AI-training target, and that stays gated. Verified: / → Anubis challenge HTML; /schedule, /streams → JSON; /_app/.../app.js → text/javascript; ClaudeBot UA → Anubis deny page.	2026-05-10 11:12:40 +00:00
Viktor Barzin	a89d4a7d2a	anubis: pull f1 off Anubis (XHR-vs-challenge collision) + add latency alerts f1.viktorbarzin.me is a SPA whose JS fetches /schedule, /embed, /embed-asset, … on the same path tree. With Anubis fronting `/`, those XHRs land on the challenge HTML even when the cookie should be valid, breaking the page with `Unexpected token '<', "<!doctype " ... is not valid JSON`. Removed Anubis from f1 — would need a path carve-out (the way wrongmove does for /api) to re-enable. Added a top-of-block comment so future me remembers why. Plus four new Prometheus alerts in `Slow Ingress Latency` group (stacks/monitoring/.../prometheus_chart_values.tpl): - IngressTTFBHigh (warn, 10m, avg latency >1s) - IngressTTFBCritical (crit, 5m, avg latency >3s) - IngressErrorRate5xxHigh (crit, 5m, 5xx >5%) - AnubisChallengeStoreErrors (crit, 5m, any 5xx on anubis services via Traefik — proxies for the in-pod challenge-store error since Anubis itself only exposes Go-runtime metrics) Notes from the alert author: avg-not-p95 because the existing Prometheus scrape config drops traefik bucket series; once those are restored, swap to histogram_quantile(0.95). TraefikDown inhibit rule extended to suppress these four during a Traefik outage.	2026-05-10 11:12:40 +00:00
Viktor Barzin	8197842646	anubis: fix 500 on multi-replica + roll out to 6 more public sites Browser visits to viktorbarzin.me started returning HTTP 500 with `store: key not found: "challenge:..."` in pod logs. Root cause: each Anubis pod stores in-flight challenges in process memory; with 2 replicas behind a ClusterIP, the PoW-solved request can be routed to a different pod than the one that issued the challenge. Anubis upstream documents the same caveat ("when running multiple instances on the same base domain, the key must be the same across all instances" — true for the ed25519 signing key, but the challenge store is still pod-local without a shared backend). Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on pod restart. Real fix (Redis-backed challenge store) noted as a follow-up in CLAUDE.md. Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json), privatebin (pb), homepage (home), real-estate-crawler (wrongmove UI only — `/api` ingress stays direct via path-based ingress carve- out so XHRs from the SPA bypass the challenge). End-state: 9 public hosts now Anubis-fronted (blog, www, kms, travel, f1, cc, json, pb, home, wrongmove). All return the challenge HTML to bare curl/browser; verified-IP search engines and /robots.txt + /.well-known still skip via the strict-policy allowlist.	2026-05-10 11:12:40 +00:00
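The root cause above, a challenge issued by one pod and verified by another, can be reproduced with a toy model. This is a simulation of the described behaviour, not Anubis code: `Pod`, `issue`, and `verify` are hypothetical names, and the 500 mirrors the `store: key not found` error from the pod logs.

```python
class Pod:
    """One replica with a process-local, in-memory challenge store."""

    def __init__(self, name: str):
        self.name = name
        self.challenges = set()

    def issue(self, challenge_id: str) -> None:
        self.challenges.add(challenge_id)

    def verify(self, challenge_id: str) -> int:
        # A pod that never issued the challenge cannot find it.
        if challenge_id not in self.challenges:
            return 500  # "store: key not found"
        self.challenges.discard(challenge_id)
        return 200


pod_a, pod_b = Pod("a"), Pod("b")

# Two replicas: the ClusterIP may route the solved request to the other pod.
pod_a.issue("challenge:1")
cross_pod = pod_b.verify("challenge:1")

# One replica: issue and verify always land on the same store.
pod_a.issue("challenge:2")
same_pod = pod_a.verify("challenge:2")

print(cross_pod, same_pod)
```

Dropping to one replica removes the cross-pod path entirely. A shared store, as noted in the follow-up, would let both replicas see every challenge.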
Viktor Barzin	2d6812f951	fire-planner: dual ingress — /api/* unprotected, / behind Authentik The SPA can't carry an Authentik session on its own fetch() XHRs in all cases (cross-origin redirect to authentik.viktorbarzin.me on a stale cookie returns HTML, fetch().json() parse fails). Splitting the ingress so /api/ paths skip forward-auth lets the React app talk to its API end-to-end. The browser still has to log in via Authentik to load the SPA at /. Verified end-to-end via chrome-service Playwright: dashboard load, scenario list, what-if run with real Monte Carlo, save-as-scenario round-trip, run-now on detail, delete — all pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:40 +00:00
Viktor Barzin	58fd4025f8	anubis: per-site PoW reverse proxy on blog + kms + travel-blog Adds modules/kubernetes/anubis_instance/ — a per-site reverse proxy instance pinned to ghcr.io/techarohq/anubis:v1.25.0. Each instance issues a 30-day JWT cookie scoped to viktorbarzin.me after a tiny proof-of-work (difficulty 2 ≈ 250 ms desktop / 700 ms mobile). The shared ed25519 signing key (Vault: secret/viktor → anubis_ed25519_key) makes a single solve good across every Anubis-fronted subdomain. Wired into blog (viktorbarzin.me + www), kms.viktorbarzin.me, and travel.viktorbarzin.me — each with anti_ai_scraping=false on the ingress so the redundant ai-bot-block forwardAuth is dropped from the chain. Skipped forgejo (Git/API clients can't solve PoW) and resume (replicas=0). Also tightens bot-block-proxy nginx timeouts (3s/5s → 100ms/200ms) so any ingress still using the ai-bot-block forwardAuth pays at most ~150 ms when poison-fountain is scaled down, instead of 3 s. End-to-end TTFB on viktorbarzin.me dropped from ~3.2 s to ~150-200 ms. Docs: .claude/reference/patterns.md "Anti-AI Scraping" updated to 4 layers; .claude/CLAUDE.md adds the Anubis usage paragraph and Forgejo/API caveat.	2026-05-10 11:12:40 +00:00
Viktor Barzin	248279605b	postiz: disable signups (DISABLE_REGISTRATION=true) Admin account already exists; we don't want random users registering on the public-facing instance. Sign-in only from now on.	2026-05-10 11:12:40 +00:00
Viktor Barzin	9904561c26	fire-planner: ingress port 8080 (was defaulting to 80) ingress_factory's port var defaults to 80, but fire-planner publishes on 8080. Traefik logged 'Cannot create service error="service port not found"' and 404'd every request. Cloudflare's standard origin-error decoy page (with the noindex meta + cdn-cgi/content honeypot link) made it look like a bot-block, but it was just the upstream coming back 404. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:39 +00:00
root	017e139b80	Woodpecker CI deploy [CI SKIP]	2026-05-10 11:12:39 +00:00
Viktor Barzin	08edd92b22	kms: deploy slack-notifier sidecar with Prometheus metrics + document public exposure Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts activations and dedup-skips by product, gauges last-activation timestamp. Pod template gets the standard prometheus.io/scrape annotations so the cluster-wide kubernetes-pods job picks it up via pod IP. Memory request bumped to 48Mi to cover counter dicts + HTTPServer. Plus docs: networking.md footnotes the windows-kms row noting public WAN exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60, overload <virusprot> flush) pfSense filter rule, and a new runbook covers log locations, rate-limit tuning, and how to revoke the WAN forward. The matching pfSense rule was tightened in place (TCP-only + rate limits) via SSH; pfSense isn't Terraform-managed.	2026-05-10 11:12:39 +00:00
Viktor Barzin	0d8e0ca6fc	backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr 30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount, which blocked the next run from completing — root cause of the WeeklyBackupStale alert going silent (the metric never reached its end-of-script push). Fixes: - TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting the wall during week 18 runs) - Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as belt-and-braces for any inherited stuck state from a prior crashed run - TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of the alert going blind on systemd kills - pfsense metric pushed in BOTH success and failure paths (was only on success; any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert threshold expired) Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to /srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end: 3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql image is stripped (no curl/wget/python) — switched to docker.io/library/postgres matching the dbaas/postgresql-backup pattern with apt-installed curl. Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed backup_weekly_last_success_timestamp but the script pushes daily_backup_last_run_timestamp). Updated to match what's actually emitted, and added a "default-covered" footnote to the Service Protection Matrix so the ~40 services with PVCs not enumerated in the table are no longer ambiguous. Manual PVE-host actions (out-of-band, not in TF): - unmounted 6 stacked snapshots from /tmp/pvc-mount - pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the loop got SIGTERMed against repeatedly, so prune kept failing) - created /srv/nfs/postiz-backup directory - triggered a one-shot daily-backup run with the new TimeoutStartSec to validate the fix end-to-end Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:39 +00:00
Viktor Barzin	8c619278d3	grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards Wealth, Payslips, and Job-Hunter Grafana datasources all baked the rotating PG password into their ConfigMap at TF-apply time, so every 7-day Vault static-role rotation silently broke the panels until a manual `terragrunt apply`. Same family as the recurring grafana-mysql backend bug — Grafana caches creds at startup and never picks up the new ESO-synced password without a restart. Fix: - Each source stack now creates an ExternalSecret in `monitoring` exposing the rotating password as `<NAME>_PG_PASSWORD` env-var. - Grafana mounts those via `envFromSecrets` (optional=true so a missing source stack doesn't block boot) and the datasource ConfigMaps reference `$__env{<NAME>_PG_PASSWORD}` instead of a literal password. - `reloader.stakater.com/auto: "true"` on the Grafana pod restarts it whenever any of the four DB-cred Secrets is updated. Tested end-to-end: forced `vault write -force database/rotate-role/pg-wealthfolio-sync` → ESO synced (~30s) → reloader fired → Grafana booted with new env in ~50s total → all three /api/datasources/uid/*/health endpoints return "Database Connection OK". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:39 +00:00
Viktor Barzin	57250cfda2	mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit). innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals don't fit in 2 Gi. Bumping limit to 4 Gi (request 3 Gi) leaves headroom without changing the buffer pool config. /srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV 1 TiB online and ran resize2fs (now 60% used). Triggered by surfacing during the 2026-05-09 IO-pressure post-mortem; thinpool had ~4.6 TiB free. The post-mortem also covers the stale-NFS-client trigger (legacy /usr/local/bin/weekly-backup pointing at the decommissioned TrueNAS IP) and the resulting wedged kthread on the PVE host. Script removed and node_exporter restarted out-of-band; kthread will clear at next PVE reboot. See docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:38 +00:00
Viktor Barzin	b254c536f9	ig-poster: bump to da5b4191 (auto-curate from recent favorites)	2026-05-10 11:12:38 +00:00
root	a5a54aebe3	Woodpecker CI deploy [CI SKIP]	2026-05-10 11:12:38 +00:00
Viktor Barzin	72013a0890	n8n: real-time training loop + decoupled posting instagram-approval: after every tap, immediately fetch /candidates?limit=1 and send the next photo as a fresh inline-keyboard message — the user's tap chains back into this same workflow, so the loop is user-paced. When the pool is exhausted, send an 'all caught up' summary with the backlog count + cumulative training stats. instagram-discover: cron throttled from every-30-min to daily 09:00. The chain handles ongoing training; the daily run only kickstarts a session if the user hasn't been tapping. Limit reduced from 3 → 1 so each kickstart sends a single photo (chain takes over).	2026-05-10 11:12:38 +00:00
Viktor Barzin	ff2f32a33e	ig-poster b17a9737 + n8n discover rewritten to use /candidates with CLIP scoring	2026-05-10 11:12:38 +00:00
Viktor Barzin	94e2f34e2a	ig-poster: bump to 3b862fe4 (EXIF orientation + auto-pending /candidates)	2026-05-10 11:12:38 +00:00
Viktor Barzin	29bb434e1e	ig-poster: 69e395f2 + sync IMMICH_PG_* via ESO for CLIP scoring; postiz publish-notify n8n workflow	2026-05-10 11:12:38 +00:00
Viktor Barzin	cb83972b79	ig-poster: bump to cac6fa97 + sync POSTIZ_INTEGRATION_ID via ESO	2026-05-10 11:12:37 +00:00
Viktor Barzin	40ca011bd6	postiz: expose /uploads publicly so Meta IG fetcher can pull JPEGs Stories+feed posts via Postiz failed with state=ERROR and Postiz mistranslated the cause as 'Invalid Instagram image resolution max: 1920x1080px'. Real cause: Postiz hands Meta an upload URL under https://postiz.viktorbarzin.me/uploads/... and Meta gets a 302 to the Authentik login page instead of bytes. Meta returns error 36001 (image not fetchable) which Postiz maps to that misleading resolution string. Split the ingress: /uploads/* on a public ingress (matches the instagram-poster /image+/original pattern), everything else remains behind Authentik forward-auth. /uploads contents are random UUIDs, low blast radius if scraped.	2026-05-10 11:12:37 +00:00
Viktor Barzin	ce9bf5b676	postiz: wire INSTAGRAM_APP_ID/SECRET via ESO for IG-standalone provider Standalone provider (instagram-standalone OAuth flow) is what the user is trying after the FB-Login path was blocked by their Business Account ad-policy flag. Uses modern scope names (instagram_business_*), so no JS patch needed unlike the FB-Login provider.	2026-05-10 11:12:37 +00:00
Viktor Barzin	9c1df3ad96	chore: remove decommissioned registry.viktorbarzin.me ingress The old port-5050 R/W private registry was decommissioned 2026-05-07 (forgejo-registry-consolidation Phase 4). The reverse-proxy ingress + ExternalName service + Cloudflare DNS record kept pointing at the dead backend, returning 502 to anyone hitting registry.viktorbarzin.me. This was driving 3 monitoring artifacts that auto-cleared on cleanup: - Uptime Kuma external monitor #586 (deleted) - Pushgateway stale registry-integrity-probe metrics (deleted) - ExternalAccessDivergence + RegistryIntegrityProbeStale alerts	2026-05-10 11:12:37 +00:00
Viktor Barzin	8c09543391	fix: restore pvc-autoresizer by allow-listing kubelet_volume_stats_available_bytes The Prometheus scrape config for the kubernetes-nodes job kept capacity_bytes + used_bytes but dropped available_bytes. pvc-autoresizer computes utilization from available/capacity, so without that metric it was silent for every PVC in the cluster — including mailserver, which filled to 89% (1.7G/2.0G) and started rejecting all inbound mail with '452 4.3.1 Insufficient system storage' (15+ hours, all real senders: Brevo, Gmail, Facebook). Also bumps the floors of mailserver (2Gi -> 5Gi, limit 10Gi) and forgejo (15Gi -> 30Gi) PVCs to recover from the immediate outage, and adds ignore_changes on requests.storage so future autoresizer expansions don't cause TF drift.	2026-05-10 11:12:37 +00:00
Viktor Barzin	c44d855960	ig-poster: pivot to Telegram-only delivery (manual IG upload) User dropped Postiz/Instagram OAuth (Meta Business Account flagged + Postiz scope drift). New pipeline ends at Telegram — full-quality JPEG delivered to the bot chat, manually uploaded to IG by the user. - Image bumped to 25e46efd: adds /deliver/{asset_id} endpoint that multipart-uploads to Telegram (URL-fetch fails through Cloudflare for >5MB), then tags 'posted' in Immich. - ESO now syncs telegram_bot_token + telegram_chat_id from Vault. - Public ingress paths grow to ['/image', '/original'] (Authentik bypass on /original is harmless — files are user-tagged, low blast radius — and useful for ad-hoc browser downloads). - Memory limit 512Mi -> 1500Mi: full-resolution Pillow HEIC decode was OOMing on 12MP+ phone photos. - discover.json simplified to scan -> deliver per item; approval and post workflows already deactivated. Telegram bot webhook removed.	2026-05-10 11:12:37 +00:00
Viktor Barzin	bd8dbbc76f	postiz: wire FACEBOOK_APP_ID/SECRET via ESO for IG-Business integration	2026-05-10 11:12:37 +00:00
Viktor Barzin	02e28294e9	postiz: idempotent Job to drop default Text search attributes (Temporal SQL visibility caps at 3 Text attrs; auto-setup ships with 2, Postiz adds 2 more — gitroomhq/postiz-app#1504)	2026-05-10 11:12:37 +00:00
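
The pvc-autoresizer fix in 8c09543391 above comes down to a keep-list regex that silently dropped one metric. Below is a minimal Python sketch of that failure mode, not the actual scrape config. The two regexes, the `SCRAPED` list, and the `utilization` formula are illustrative assumptions; the log only says that capacity_bytes and used_bytes were kept, available_bytes was dropped, and utilization is derived from available/capacity.

```python
import re

# Hypothetical keep-lists modelled on a Prometheus "keep" relabel rule.
# The real regex in the repository is not shown in the log.
KEEP_BEFORE = re.compile(r"kubelet_volume_stats_(capacity_bytes|used_bytes)")
KEEP_AFTER = re.compile(
    r"kubelet_volume_stats_(capacity_bytes|used_bytes|available_bytes)"
)

# Illustrative sample of metric names exposed by the kubelet.
SCRAPED = [
    "kubelet_volume_stats_capacity_bytes",
    "kubelet_volume_stats_used_bytes",
    "kubelet_volume_stats_available_bytes",
]


def kept(metrics, keep_regex):
    """Return the metric names that survive a keep rule.

    Prometheus anchors relabel regexes at both ends, so fullmatch is
    the closest Python equivalent.
    """
    return [name for name in metrics if keep_regex.fullmatch(name)]


def utilization(available_bytes, capacity_bytes):
    """Fraction of the volume in use, assuming used = capacity - available."""
    return 1.0 - available_bytes / capacity_bytes


if __name__ == "__main__":
    print("before:", kept(SCRAPED, KEEP_BEFORE))
    print("after: ", kept(SCRAPED, KEEP_AFTER))
```

With the first regex the available_bytes series never reaches Prometheus, so a consumer that needs it has nothing to compute from, regardless of how full the volume is.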

850 commits