infra

Author	SHA1	Message	Date
Viktor Barzin	04fd241679	terminal: inline session switching via sidebar + iframe Replace full-page navigation with a two-pane lobby. Sidebar holds the session list as clickable cards; an iframe in the content pane swaps its src on click so switching sessions takes one click instead of two navigations. - #lobby-shell grid (260px sidebar + iframe pane) - Cards become role=button, kill button stops propagation - activateSession/deactivateSession with hash routing (location.hash <-> active session, replaceState so back stack stays clean) - Killed active session deactivates the iframe before re-render - 5s session poll preserves currentActive; deactivates if gone - Mobile media query collapses to one column CSP frame-ancestors already permits same-origin embedding (*.viktorbarzin.me), no infra changes needed. Direct-link ?arg=<name> path is unchanged.	2026-05-22 14:16:45 +00:00
root	7663b5c36e	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:45 +00:00
Viktor Barzin	43affc3cdc	actualbudget: add `enabled` flag to factory, disable emo Emo isn't using the instance and the daily bank-sync CronJob has been failing because the budget has zero accounts (deleted from the UI), triggering BankSyncStale. Adds an `enabled` toggle that gates the core Deployment + Service + Ingress + http-api + CronJob behind a single plan-time bool while preserving the PVC, so we can flip back to true later to restore the instance as-was. Also fixes a latent bug where the http-api Service was always created even when `enable_http_api=false`. Apply: 7 resources destroyed (emo deploy/svc/ingress/cf dns/http-api deploy+svc/cronjob), 0 changes for viktor/anca (moved blocks migrated their state cleanly to the new [0] addresses). Pushgateway job bank-sync-emo cleared manually; orphaned external-monitor synced out by external-monitor-sync.	2026-05-22 14:16:45 +00:00
Viktor Barzin	9fce3c7b09	terminal: per-Authentik-user OS-user isolation; deny unmapped users Restores the kernel-level isolation the pre-cutover ttyd-session.sh had, but keeps the multi-session lobby UX: - ttyd.service gets `-H X-authentik-username` back. `tmux-attach.sh` reads $TTYD_USER, looks up the local part in /etc/ttyd-user-map, denies the connection (no fallback to wizard) if there's no mapping, otherwise `sudo -n -H -u <os_user> tmux …`. Each Authentik identity → its own Unix user → its own `/tmp/tmux-<uid>/default` socket. - tmux-api scopes every request to the same OS user via the same header. Adds /whoami so the lobby HTML can preflight access and render "logged in as <os_user> (<authentik>)" instead of leaving the user to discover the deny via a reconnect loop. - Commits /etc/ttyd-user-map and the matching /etc/sudoers.d/ttyd-users fragment under files/devvm/ so future operators see one canonical source of truth. Current mappings: vbarzin → wizard, emil.barzin → emo. Adding a user is now: append a line to ttyd-user-map + a NOPASSWD sudoers line + `useradd -m`. README walks through it. No Terraform changes — this is all DevVM-side + lobby JS.	2026-05-22 14:16:45 +00:00
Viktor Barzin	aff4f67671	terminal: cut over to multi-session lobby on terminal.viktorbarzin.me Promotes the staged multi-session UX from term.viktorbarzin.me to the primary terminal.viktorbarzin.me hostname. `ttyd.service` on the DevVM moves to the same ExecStart that `ttyd-multi.service` was running: `/usr/local/bin/ttyd -W -a -t enableClipboard=true -I /usr/local/share/ttyd/index.html -p 7681 /usr/local/bin/tmux-attach.sh`. The lobby HTML supersedes the old per-user-attach index.html (ttyd-session.sh wrapper retired alongside). Terraform: retires the `terminal-multi` Service+Endpoints and the term.viktorbarzin.me ingress (Cloudflare DNS record for `term` is released by module deletion). The tmux-api Service+Endpoints stay, but its IngressRoute now matches terminal.viktorbarzin.me — same path-prefix specificity wins against the catch-all ingress. DevVM follow-up (applied manually as before — see files/devvm/README.md): restart ttyd to pick up the new unit, stop+disable ttyd-multi.service.	2026-05-22 14:16:45 +00:00
root	86a2c66c8e	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:45 +00:00
Viktor Barzin	b1b2cb1974	terminal: add multi-tmux-session lobby on term.viktorbarzin.me (additive) New hostname term.viktorbarzin.me serves a session-picker UI that lists, creates, and kills tmux sessions. Visiting ?arg=<name> attaches to that session (auto-creates via tmux -A). Builds on a fresh ttyd instance (7685) plus a tmux-api Go binary (7684) on the DevVM, both running as User=wizard alongside (not replacing) the existing ttyd.service (7681), ttyd-ro.service (7682), and clipboard-upload (7683). Cutover of terminal.viktorbarzin.me to the multi-session setup is deferred. Terraform diff is purely additive — terminal-multi/tmux-api Service + Endpoints + ingress_multi (term.viktorbarzin.me, Authentik-gated) + an IngressRoute that path-prefixes /api/sessions/* to tmux-api with the matching strip-prefix Middleware. DevVM-side units ship under files/devvm/ with a README — manual scp + systemctl install (see files/devvm/README.md). ttyd 1.7.7 already deployed there (≥1.7 needed for -a). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:45 +00:00
Viktor Barzin	726fb25182	monitoring(wealth): paint declining segments red on growth chart Mirror the panel 5 treatment on panel 7 (Growth = market value − contribution). Second SQL column emits the growth value only when the point is part of a declining segment; field override paints it red with no fill, spanNulls=false.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cc47da87b0	payslip-ingest, instagram-poster: suspend two chronic-failure cronjobs Identified during alert-noise review as steady sources of JobFailed. Suspending them stops the noise; unsuspend after the per-job blocker is cleared. * payslip-ingest/actualbudget-payroll-sync — blocked on Vault `secret/payslip-ingest` missing `actualbudget_encryption_password`. `actualbudget_api_key` and `actualbudget_budget_sync_id` were added (copied from `secret/fire-planner`) in the same session; the encryption password is not stored anywhere in Vault and needs to be populated separately. ExternalSecret sync has been failing since 2026-04-25. * instagram-poster/ig-refresh-token — the deployed image (:da5b4191) does not contain the `POST /ig-refresh-token` route; the route is defined in uncommitted working-copy changes at `instagram-poster/instagram_poster/app.py:695`. Unsuspend after the new image rolls. Each `suspend = true` line carries an inline comment with the unsuspend trigger.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cbd0f71a3b	monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h Three improvements identified in the 7d alert-noise review: A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures node-level pull error rate, which doesn't catch a single pod stuck in ImagePullBackOff — council-complaints sat broken for ~10h on 2026-05-12 without paging. The new rule fires per-pod after 30m. B. Two new inhibit_rules: - PVFillingUp (95% used, critical) suppresses PVPredictedFull (linear projection, warning) on the same PVC. Pair was producing ~24h of redundant firing per 7d. - EmailRoundtripFailing (active probe failure) suppresses EmailRoundtripStale (derivative >60min no-success). Same outage windows, ~14.5h of duplicate firing per 7d. C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old 30-minute window paged on the first failed iteration before the next run could recover. 2h means "still failing across at least two cron iterations" — much more actionable. Verified live: rules loaded, inhibitors in alertmanager config, PodImagePullBackOff is currently inactive (council-complaints ImagePullBackOff actively detected — see separate fix).	2026-05-22 14:16:45 +00:00
Viktor Barzin	70292b9e23	monitoring: TraefikReplicaConfigStale — drop false-positive on stale series The initial formulation used clamp_min(min(rate[2h]), 0.0001), which made a recently-deleted pod's lingering rate=0 drive the ratio toward infinity for up to 2h until the stale series aged out of the rate window. With for: 2h, this was a near-miss for spurious firing in the immediate aftermath of restarting the bad replica (our remediation path). Tighter formulation: * 30m rate window — stale series ages out within minutes, not hours * `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12 incident) sits well above it, so true positives still trip * for: 1h — fast enough to catch the next incident, long enough that short rate dips don't flap Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0 results with the live cluster's tight rate spread (~0.00065–0.0007/s across all three Traefik replicas).	2026-05-22 14:16:45 +00:00
Viktor Barzin	165bb7258e	monitoring: detect stale Traefik replicas + reduce alert-storm cascading Two new alertmanager inhibit rules and one new Prometheus alert, informed by the 2026-05-12 incident where Traefik pod traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers vs 119 on healthy peers (stale K8s informer cache) and served 404 for ~1/3 of viktorbarzin.me traffic. * New alert TraefikReplicaConfigStale: fires when max/min reload-rate ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h for-clause tolerates legitimate post-restart ramp-up; the bug pattern persists indefinitely. * New inhibit: TraefikReplicaConfigStale suppresses the symptom alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical}, IngressErrorRate5xxHigh, TraefikHighOpenConnections, ForwardAuthFallbackActive, AnubisChallengeStoreErrors, ExternalAccessDivergence) so only the actionable root cause pages. * New inhibit: HomeAssistantDown suppresses HomeAssistantCriticalSensorUnavailable and HomeAssistantMetricsMissing — when HA itself is down, every sensor going unavailable is noise (10x firings observed in the last 12h). * Extend NodeDown and NFSServerUnresponsive target lists to also suppress HomeAssistantCriticalSensorUnavailable.	2026-05-22 14:16:45 +00:00
Viktor Barzin	448bc0c0f6	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:45 +00:00
root	8e13f1528e	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:45 +00:00
Viktor Barzin	e8854f9230	wealthfolio, paperless-ngx: drop migration-leftover -proxmox PVCs The 2026-04-13 encrypted-PVC migration replaced the wealthfolio and paperless-ngx data volumes with -encrypted variants but never removed the original -proxmox PVC blocks from TF — both were sitting orphaned with no pod mounting them, occupying 1Gi each of LVM thin pool. The autoresizer also logged repeated "failed to get volume stats" for them (no kubelet stats without a mounted pod), masking real signal. * wealthfolio: removed kubernetes_persistent_volume_claim.data_proxmox * paperless-ngx: removed kubernetes_persistent_volume_claim.data_proxmox (the paperless PVC turned out to be out-of-TF-state, so deleted via kubectl after the TF block removal.)	2026-05-22 14:16:45 +00:00
Viktor Barzin	701b0e3c57	claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent The 10Gi proxmox-lvm-encrypted PVC `claude-agent-workspace-encrypted` was declared in TF but never wired into the deployment — the `workspace` volume_mount pointed at an emptyDir, so the PVC sat allocated and idle from 2026-04-15 to 2026-05-11. Restructured per the design intent: * `workspace` (emptyDir) — fast per-pod ephemeral scratch for git clones. Each agent job clones the infra repo fresh, so persistence doesn't buy anything and emptyDir avoids RWO contention if the deployment is ever scaled past 1 replica. * `persistent` (5Gi NFS-backed RWX) — mounted at /persistent for cases where the agent needs to write state that should survive pod restarts (caches, ad-hoc outputs). RWX so all replicas share it; the service's sequential-mutex lock prevents concurrent writes. Also fixed `fix-perms` init container: the Dockerfile's `WORKDIR /workspace/infra` causes kubelet to create that path inside the emptyDir as root:fsGroup with the setgid bit, which uid 1000 can't write to. Pre-create the path + chmod 0775 to make it writable. NFS export already exists on the PVE host (/srv/nfs/claude-agent-persistent, owned 1000:1000). Verified: pod runs 1/1; `/persistent` writable as agent uid 1000; git-init successfully clones infra into /workspace/infra.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cd13b9d062	monitoring: drop PVAutoExpanding alert — info-only noise, not actionable PVAutoExpanding fired at >80% used (info severity), but pvc-autoresizer's threshold is 10% free (= 90% used) — the alert always fired ~10 points before any action would have been taken, and there was nothing for an operator to do during that window either. It was a "heads up" that didn't surface a problem. Real failure modes are already covered: * PVFillingUp (critical, >95% for 10m) — autoresizer didn't keep up * PVPredictedFull (warning, predict_linear 24h) — trend toward exhaustion Sharpened PVFillingUp's annotation to spell out the likely causes (storage_limit reached, expansion failing, or missing autoresizer annotations) so the responder doesn't have to recall the runbook.	2026-05-22 14:16:44 +00:00
Viktor Barzin	396cce82cf	monitoring(wealth): paint declining segments red on portfolio chart Add a second SQL column on panel 5 that returns net_worth only when the current point's previous or next neighbor is lower — i.e. the point is part of a declining segment (including the peak and trough endpoints). A field override draws this 'decline' series in red with no fill and spanNulls=false, overlaying the green base line so down periods show up as red on top of the climb.	2026-05-22 14:16:44 +00:00
Viktor Barzin	a699d5bedf	vault: move audit-PVC autoresizer annotations to kubernetes_annotations Background: 2026-05-10 someone added `server.auditStorage.annotations` to vault/main.tf attempting to enable pvc-autoresizer on audit-vault-N PVCs. The vault helm chart maps that block into the StatefulSet's volumeClaimTemplates, which is immutable post-creation on existing StatefulSets. Result: 4 consecutive helm upgrade attempts (rev 16-19) all rejected with "StatefulSet spec: Forbidden", leaving the release stuck in failed state since 22:47 UTC that day. Live PVCs were hand-annotated via `kubectl annotate` as a workaround, but the IaC declared a path that couldn't be applied — every subsequent tg apply on the vault stack would re-fail. Fix: * Remove `annotations` block from `server.auditStorage` values (with a comment recording why it can't live there). * Add `kubernetes_annotations` resources for audit-vault-{0,1,2} with `force = true`, so Terraform adopts the existing annotations and tracks the desired-state in IaC going forward. The autoresizer cares about PVC annotations, not StatefulSet template annotations, so this is functionally equivalent. Done out-of-band before commit (helm state was already corrupted): `helm rollback vault 15 -n vault` → revision 20 deployed (clean). Verified: helm status vault = deployed; audit-vault-0 still has threshold=10% storage_limit=10Gi annotations; cluster healthcheck no longer reports vault/vault=failed.	2026-05-22 14:16:44 +00:00
Viktor Barzin	2ba36436c8	real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed) Wires celery-beat to fire two periodic scrapes via the existing in-app SchedulesConfig mechanism. Replaces the empty-string fallback with two inline schedules expressed as Terraform-managed JSON: - london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed, £1900-4000 - london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed, £400k-1.2M Schedules live in `local.scrape_schedules` (jsonencode'd) rather than Vault — they're configuration, not secrets, and benefit from being version-controlled. The previous Vault-backed lookup (`local.notification_settings["scrape_schedules"]`) was unused. Verified live: new celery-beat pod logs `Registering periodic task: london-rent-daily at 3:0` and `london-buy-weekly at 4:0` immediately after roll-out. Also tightens the comment above the wrongmove-api `auth = "none"` line so it passes the new `scripts/check-ingress-auth-comments.py` guard (pre-existing tech debt that blocked the apply). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:44 +00:00
Viktor Barzin	f10784ddb6	infra: document auth = "app|none" tier on every legacy ingress Sweep through the 30+ stacks that predated the auth = "app" tier and were tagged auth = "none" without a comment explaining why they weren't behind Authentik. Each is now self-documenting at the call site, so the tg-level anti-exposure guard passes and future readers don't have to reverse-engineer the intent. Flipped 6 stacks from "none" to "app" because their backends have their own user auth, which the new tier records more accurately: - navidrome (Subsonic user/password) - ntfy (deny-all default + user.db tokens) - nextcloud (WebDAV/CalDAV/CardDAV app passwords) - vaultwarden (Bitwarden-compatible token auth) - headscale (OIDC + preauth keys for Tailscale nodes) - paperless-ngx (app-layer login + API tokens) Kept "none" with a comment on the rest, which are genuinely public, webhook receivers, native-protocol endpoints, OAuth callbacks, or Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt), claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api, fire-planner /api, forgejo (git/OCI native clients), frigate (HA integration), immich/frame, insta2spotify /api, instagram-poster (meta fetcher), k8s-portal, matrix (native bearer), monitoring×2 (HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT), owntracks (HTTP Basic), postiz, privatebin (client-side enc), rybbit (analytics tracker), send (E2E file drop), tuya-bridge (API key), vault (own auth + CLI), webhook_handler, woodpecker (forgejo webhooks + OAuth), xray (×3 VPN transports). real-estate-crawler/main.tf:400 already had its comment from a prior edit, so it is not touched here. No live state changes: auth = "app" produces the same middleware chain as auth = "none" (verified earlier this session). This commit is purely documentation + intent-tagging.	2026-05-22 14:16:44 +00:00
Viktor Barzin	20774f794d	dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts Cluster grew past the 100-conn default — steady-state idle was 90/100, leaving zero headroom for terragrunt applies or transient surges. The ceiling was being discovered by Terraform crashing (pq: "remaining connection slots are reserved for roles with the SUPERUSER attribute"), not by alerting, because we had no PG scrape config at all. dbaas (Tier 0): * max_connections: 100 → 200 * shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory) * effective_cache_size: 1536MB → 2560MB (scaled with pod memory) * pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers + ~16MB work_mem * concurrent sorts + OS cache + overhead) * Triggers bump on null_resource.pg_cluster forces CNPG to re-apply, which rolls the cluster (standby first, then primary failover). monitoring: * New scrape job 'cnpg' on dbaas namespace pods labeled cnpg.io/podRole=instance, port name=metrics (9187). Relabels add cnpg_cluster + cnpg_role labels for alert grouping. * PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion. * PGConnectionsCritical (critical, >95% for 3m) — last call before refusing connections. Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections metric=200, alert ratio 0.42 → both alerts inactive.	2026-05-22 14:16:44 +00:00
Viktor Barzin	eb529d60e4	infra/ingress_factory: add auth = "app" mode for self-authed backends Adds a fourth auth tier alongside required/public/none. "app" is functionally identical to "none" — no Authentik middleware attached — but the distinct name records intent at the call site: this backend has its own user login (NextAuth, Django, OAuth, bearer-token API, etc.) and Authentik would only break it. Why the new tier: with only required/none, every "the app has its own auth so drop Authentik" decision looked identical at the call site to "this is an OAuth callback / webhook receiver / native-client API". Future readers couldn't tell whether a stack was intentionally unauthenticated or relying on backend auth. Now they can. Migrates the 8 stacks flipped earlier this session (novelapp, immich, linkwarden, tandoor, freshrss, affine, actualbudget, ebooks/audiobookshelf) from "none" to "app". Confirmed no-op: `tg plan` on novelapp showed "No changes" — same middleware chain, same live state. The variable description and the .claude/CLAUDE.md Auth section now spell out the anti-exposure rule: only pick "app" or "none" AFTER verifying the app has its own user auth ("app") or the endpoint is intentionally public ("none"). Default stays "required" so accidental omission fails closed. [ci skip]	2026-05-22 14:16:44 +00:00
root	6b9f5e8027	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:44 +00:00
Viktor Barzin	665b6b2934	actualbudget+monitoring: per-account bank-sync metrics, drop noisy alert The bank-sync CronJob was posting to /accounts/banksync which fans out to ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls per-account per-24h quota, a single rate-limited account would 500 the whole call, and `bank_sync_success` would flip to 0 even though the data itself was still flowing through manual UI syncs. Result: BankSyncFailing fired routinely whenever the user had been active in the UI that day — a structural false positive. Fix: * CronJob: enumerate accounts via GET /accounts, POST per-account /accounts/{id}/banksync, emit bank_sync_account_success and bank_sync_account_last_success_timestamp labelled by account name. Roll up bank_sync_success = 1 iff any account succeeded. * Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale at 48h (global drought). Add BankSyncAccountStale at 72h (catches single-account auth expiry — the real signal we wanted). Verified: manual run on bank-sync-viktor pushes 6 per-account success + timestamp series; roll-up bank_sync_success=1; no firing alerts.	2026-05-22 14:16:44 +00:00
Viktor Barzin	7b6eee49c4	infra: drop Authentik forward-auth from 7 self-authed apps (auth = "none") Apps with their own user auth + bearer-token APIs were being broken by Traefik → Authentik forward-auth: every iOS/Android/native client got a 302 to authentik.viktorbarzin.me instead of the JSON they expected. Authentik's 302+cookie dance can only be followed by a real browser. Changed: - immich (Immich mobile app + bearer-token /api) - linkwarden (NextAuth + Linkwarden mobile clients) - tandoor (Django auth + Tandoor mobile clients) - freshrss (Fever/GReader API used by Reeder/FeedMe/etc.) - affine (workspace auth + AFFiNE desktop/mobile sync) - actualbudget (server password + Actual mobile/sync clients) - ebooks/abs (Audiobookshelf iOS/Android app) Each app's own auth is the gate now. CrowdSec + rate-limit + anti-AI UA filter still front the ingresses. Same pattern as the novelapp change earlier this session. [ci skip]	2026-05-22 14:16:44 +00:00
Viktor Barzin	f98c3f2049	infra/novelapp: drop Authentik forward-auth (auth = "none") novelapp handles its own user auth via NextAuth + Google OAuth, so the ingress-level Authentik forward-auth was double-gating. Mobile webviews (iOS/Android) can't follow the Authentik 302/cookie dance — they saw HTML challenges where they expected JSON. CrowdSec + rate-limit + anti-AI UA filter remain in front; novelapp's own login handles users. [ci skip]	2026-05-22 14:16:44 +00:00
root	77492b3131	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:44 +00:00
Viktor Barzin	9be0672aa3	claude-memory / resume: unblock terragrunt apply (var defaults + psql -d postgres) Two pre-existing apply failures uncovered during the Phase 4 mass apply, unrelated to the auth refactor but blocking 100% rollout. claude-memory: - `var.claude_memory_db_password` had no default and wasn't passed by terragrunt → fall back to Vault `secret/claude-memory.db_password` via `coalesce(var.x, data.vault.data["db_password"])`. - db-init Job was failing with `database "root" does not exist` because psql defaults the database name to the user when -d is omitted. Added `-d postgres` to all five psql invocations. resume: - `var.resume_database_url` had no default and wasn't passed → default to empty string. Vault carries the real value at `secret/resume.database_url` consumed at the deployment env-var level; the variable here just needs a value to satisfy the apply. Also: priority-pass had lost most of its TF state (only 3 of 8 resources tracked); imported namespace/service/pvc/deployment/ingress/tls-secret to re-bind state with live K8s resources. No code change needed there. Verified after re-apply: - claude-memory.viktorbarzin.me → 200 (auth=none, native MCP responses) - priority-pass.viktorbarzin.me → 302 → authentik (auth=required) - resume.viktorbarzin.me → 302 → authentik public outpost (auth=public) - 6 of 7 previously-failing applies now green; only vault remains, blocked by an unrelated helm chart immutable-StatefulSet-field issue. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:44 +00:00
Viktor Barzin	a168277213	healthcheck: tune noise filters + nvidia-exporter auth=none Six tuning changes to cluster_healthcheck.sh so PASS sections actually reflect "nothing to act on": 1. prometheus_alerts: only count severity=warning|critical. Info-level alerts (RecentNodeReboot soak, PVAutoExpanding) are by design: the alert rule itself sets severity; the script should respect it. 2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the Let's Encrypt wildcard renews weekly; <14d is the only window where human attention is genuinely useful. 3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/event/image/update domains (transient by design), skip friendly names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and only count entities whose last_changed > 24h. Was 431/1470, most of which were "phone in standby" noise. 4. ha_automations: only flag DISABLED automations as abandoned if they've also been untouched (last_changed) for >180 days; raise stale threshold 30d → 180d. Was flagging seasonal/holiday-only automations as broken. 5. problematic_pods + evicted_pods: exclude pods owned by Jobs. CronJob retry leftovers (Error/Failed phase pods that K8s keeps around for log inspection) aren't problematic at the cluster level. 6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-shot failures were a recurring false-positive even though the service was healthy. Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll /metrics and got 302'd to Authentik like the idrac/snmp ones did. Same fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
root	8483ca59ba	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:43 +00:00
Viktor Barzin	dc7c19d88e	frigate: lan ingress auth=none for HA Sofia integration The frigate-lan.viktorbarzin.lan ingress had Authentik forward-auth in front. HA Sofia's frigate integration polls /api/config and only knows how to use Frigate's own API key (not browser SSO), so every poll got a 302 to authentik.viktorbarzin.me and the integration entered the errors-state. Same pattern as idrac-redfish-exporter (5c594291). allow_local_access_only IP allowlist + Frigate's API key are enough. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	dc134011eb	fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes After fixing the threshold=80% misconfig and seeing two PVCs (prometheus + technitium primary) get stuck Terminating, a 3rd round showed four more PVCs (frigate, hackmd, immich-postgresql, paperless-ngx) in the same state. Same root cause: TF spec'd a smaller storage size than the autoresizer-grown live value, K8s rejected the shrink, TF force-replaced the PVC, and the pvc-protection finalizer held it in Terminating while the pod kept using the underlying volume. Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests] on every kubernetes_persistent_volume_claim block that has resize.topolvm.io/threshold annotations. The pattern was already documented in .claude/CLAUDE.md but ~63 stacks were missing it. Live PVCs are unaffected; this only prevents future TF applies from attempting the destroy+recreate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	dd2b7de291	fix: HA Sofia REST sensors + PVC drift safety Two real issues found while triaging HomeAssistantCriticalSensorUnavailable alerts and the prometheus + technitium PVC Terminating-but-in-use state from the earlier session. 1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required → auth=none. HA Sofia REST sensors scrape these endpoints programmatically; with Authentik forward-auth in front, every request got a 302 to authentik.viktorbarzin.me and the REST sensors parsed the HTML login page instead of metrics — leaving the R730, UPS, and ~20 other sensors permanently unavailable. The allow_local_access_only IP allowlist (192.168.0.0/16 + 10.0.0.0/8) already gates external access, so authentik on top was breaking machine-to-machine traffic for no security gain. 2. prometheus_server_pvc + technitium primary_config_encrypted: add lifecycle.ignore_changes = [spec[0].resources[0].requests]. The autoresizer expands these PVCs; PVCs can't shrink. Without the ignore, every TF apply tried to revert the live size back to the TF spec value, hit K8s's shrink-forbidden rule, and force-replaced the PVC. Because the pod still mounted it, the PVC went into Terminating-but-protected limbo — fine until a pod restart would have orphaned the volume. Root cause of the 2026-05-10 PVC Terminating incident. Bonus: prometheus_server_pvc threshold was the inverted "90%" (the same bug the bulk fecfa211 sweep fixed elsewhere; my regex only matched "80%" so this one slipped through). Now "10%". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	7e69951cb9	state(dbaas): update encrypted state	2026-05-22 14:16:43 +00:00
Viktor Barzin	ee47197f3b	vault: enroll audit-vault-0 in pvc-autoresizer (10Gi limit) audit-vault-0 fills steadily with raft audit logs; without autoresizer annotations it hits the 2Gi ceiling and Vault stalls on writes (PVAutoExpanding alert was firing at 81% used). The Vault Helm chart copies server.auditStorage.annotations onto the PVC at create time. Live PVC already has the annotations applied via kubectl annotate; this just keeps TF in sync. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	0fdadcc3dd	dbaas: pg-cluster threshold 80%→10% in CNPG inheritedMetadata Same misconfig as the bulk fecfa211 sweep, but the pg-cluster YAML is buried inside a null_resource local-exec heredoc so the regex didn't catch it. CNPG operator inherits these annotations onto each member PVC (pg-cluster-1, pg-cluster-2), and reapplies them on every reconcile — patching the live PVCs alone bounces back within seconds. Live state already patched via kubectl patch cluster, this just keeps TF in sync. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	3f2b2f9d32	fix: pvc-autoresizer threshold should be 10%, not 80% topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE percentage below which expansion fires (per upstream README). Setting it to "80%" means "expand when free-space drops below 80%", i.e. as soon as the PVC crosses 20% utilization — which caused prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi in 70 minutes (six 10% bumps, all when the volume was only ~14% used). Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix landed (02a12f1a), the autoresizer started actively misfiring across 75+ PVCs cluster-wide. Flip the value to "10%" everywhere — that's "expand when free-space drops below 10%", i.e. at 90% utilization, which is the conventional semantic and matches the alert thresholds in prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp at 95%). The CLAUDE.md PVC template was the source of the misconfig, so update it too. Live PVC annotations were patched in parallel via kubectl annotate; TF apply on each affected stack will be a no-op against those live values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
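The free-space semantic described in 3f2b2f9d32 can be checked with a little arithmetic. The following is a minimal sketch, not the autoresizer's actual code: the function names are hypothetical, and only the percentage form of the threshold used in this commit is handled.

```python
def trigger_utilization(threshold: str) -> float:
    """Return the percent-used level at which expansion would fire.

    `threshold` is a FREE-SPACE percentage such as "10%" (the semantic
    described in the commit message). Hypothetical helper for illustration.
    """
    free_pct = float(threshold.rstrip("%"))
    if not 0.0 <= free_pct <= 100.0:
        raise ValueError(f"threshold out of range: {threshold!r}")
    return 100.0 - free_pct


def should_expand(capacity_bytes: int, available_bytes: int, threshold: str) -> bool:
    """True when the free-space percentage has dropped below the threshold."""
    free_pct = 100.0 * available_bytes / capacity_bytes
    return free_pct < float(threshold.rstrip("%"))


if __name__ == "__main__":
    # "80%" fires once usage passes 20%; "10%" waits until usage passes 90%.
    print(trigger_utilization("80%"), trigger_utilization("10%"))
    # A volume that is 25% used: expands under "80%", left alone under "10%".
    print(should_expand(100, 75, "80%"), should_expand(100, 75, "10%"))
```

Read this way, "80%" and "10%" are mirror images of the utilization figures most alert rules use, which is how the two were confused.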
Viktor Barzin	dc4ce46411	k8s-version-upgrade: detection script refresh apt before madison + DRY_RUN_OVERRIDE Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while apt-cache madison (without prior apt-get update) was reporting v1.34.5 — so the CronJob would have dispatched the agent against a stale target. Now do `sudo apt-get update -qq` for just the kubernetes repo before querying madison. Also add a DRY_RUN_OVERRIDE env precedence so future test invocations can override DRY_RUN without an apply cycle — but Job spec env is immutable post-create, so this is only useful for CronJob spec edits (suspend, then add env, then resume). Documented in the runbook.	2026-05-22 14:16:43 +00:00
Viktor Barzin	ae6dde45c2	k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to ssh into master and run etcdctl against a non-existent /mnt/main mount. The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to 10 min, then parses the backup-manage container log for "Backup done" line + byte count. Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works end-to-end at the planning level. Expanded the claude-agent ServiceAccount's privileges via a sibling ClusterRole (claude-agent-upgrade-ops): - patch namespaces/k8s-upgrade (in-flight annotation) - create batch/jobs (trigger etcd snapshot Job) - patch nodes (cordon/uncordon) - create pods/eviction (drain) - delete pods (drain fallback)	2026-05-22 14:16:43 +00:00
Viktor Barzin	e75bcaf394	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-22 14:16:42 +00:00
Viktor Barzin	09f83b4e83	fire-planner / k8s-portal / insta2spotify: revert auth=public to auth=none The Phase 4 audit promoted three "smoke-test candidates" from `protected = false` to `auth = "public"`, but all three are XHR / curl-driven endpoints (fetch() calls, automation scripts) that don't survive the 302+cookie redirect dance that the public-auto-login flow requires on first visit. fire-planner's SPA broke immediately — every fetch() to /api/* hit a cross-origin redirect and CORS preflight rejected it. Important learning for the `auth = "public"` design: `auth = "public"` is functionally equivalent to a normal Authentik forward-auth for the FIRST request — it issues a 302 to authentik to set a guest session cookie, then 302s back. This is invisible for top-level browser navigation but BREAKS: - XHR/fetch() under CORS preflight (preflight rejects redirects) - curl/automation scripts that don't preserve cookies across requests - Mobile / native clients that can't follow OAuth-style redirects Use `auth = "public"` only for top-level HTML pages where the user navigates via the browser address bar (or links). For XHR APIs, native-client surfaces, webhooks, OAuth callbacks — use `auth = "none"`. The plan's "smoke test 3 candidates" were misjudged on this front. Reverting all three to `auth = "none"` (their previous behaviour). The end-to-end public flow IS verified working via curl + flow API — the design is sound, just the test targets were wrong. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
root	faad99cff3	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:42 +00:00
Viktor Barzin	143413dc0b	owntracks: explicit auth = "none" — Phase 5 audit completion The Phase 4 audit pass missed this site because the previous agent scoped out owntracks (it overrides the factory's middleware list via extra_annotations to use its own basic-auth middleware). Adding the explicit auth = "none" satisfies Phase 5's "every ingress has an explicit decision" goal and makes the intent visible — mobile OwnTracks clients post location data via HTTP basic-auth and can't follow Authentik forward-auth 302s. Closes the loop on Phase 5: 122/122 active ingress_factory call sites now carry an explicit auth = "..." decision (zero callers rely on the default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	ff5538a667	ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default false → unprotected) variable in `modules/kubernetes/ingress_factory` with `auth = string` enum (default "required" → fail-closed). Touches every ingress_factory caller so the audit decision is recorded explicitly in code. ingress_factory (Phase 3): - `auth = "required"`: standard Authentik forward-auth (the legacy `protected = true` semantic). - `auth = "public"`: forward-auth via the new `authentik-forward-auth-public` middleware → dedicated public outpost → guest auto-bind. Logged-in users keep their real identity. - `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost itself. - `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated ingresses don't need anti-AI noise; the auth flow already discourages bots). 
Audit pass (Phase 4) across 96 ingress_factory call sites: - 49 explicit `protected = true` → `auth = "required"` - 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3) - 64 previously-default (no protected line) → `auth = "required"` ADDED, then reviewed individually: * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack, homepage, wrongmove UI, privatebin) → `auth = "none"` * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC, xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich location ingestion, immich frame kiosk, headscale CP, send anonymous drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) → `auth = "none"` * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal UIs, services without app-level auth) - Smoke-test promotions to `auth = "public"`: fire-planner public UI, k8s-portal API, insta2spotify callback. Three call sites in wrapper modules (`stacks/freedify/factory/`, `stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected` bool — they translate to `auth` internally, out of scope for this rename. Behavior change: previously-default ingresses now fail closed (require Authentik login) unless explicitly flipped to `auth = "none"` or `auth = "public"`. This is the audit goal — no more accidentally-unprotected surfaces. Sites that were intentionally public (Anubis content, native APIs, webhooks) are now explicitly recorded as `auth = "none"`. Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via `terraform fmt -recursive` during the audit. Behavior-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	88e57fdddb	instagram-poster: disable ig-ingest-stories CronJob until /ig-ingest ships The endpoint exists in the working copy of instagram_poster/app.py but isn't committed/built/deployed, so every cron fire returned 404 and triggered JobFailed alerts every 30 min. Set count = 0 to leave the resource declaration in place — re-enable by removing that line once the endpoint is in a built image. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor	fddf168ecb	cloudflare: disable AI bot edge-block so x402 can issue payment offers CF zone was returning 403 to declared AI-bot UAs at the edge (`ai_bots_protection: "block"`). That meant the in-cluster x402 gateway never saw the request and could never issue an HTTP 402 with the wallet payment requirements — the bot just bounced. Adopt `cloudflare_bot_management.zone` via root-module import block, flip ai_bots_protection to "disabled". Bot Fight Mode (`fight_mode`), crawler challenge (`crawler_protection`), and managed robots.txt are unaffected — generic automated traffic still gets the bot fight gate. End-to-end verified: `User-Agent: Mozilla/5.0 (compatible; ClaudeBot/1.0;...)` on viktorbarzin.me now returns HTTP 402 (was 403 CF block) with `payTo=0xCc33...659f`, `amount=10000` micro-USDC, `network=base`. Trade-off: bots that don't pay still hit origin (instead of CF blackholing them), so a small bandwidth uptick. Negligible at our traffic level.	2026-05-22 14:16:42 +00:00
Viktor Barzin	4103ea2ba0	monitoring(prometheus): keep all 4 kubelet_volume_stats_inodes metrics pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if all four kubelet_volume_stats metrics (available_bytes, capacity_bytes, inodes_free, inodes) are retrieved. The keep-list in the kubernetes-nodes scrape job had available_bytes and capacity_bytes (post 9d5da4d8) but was missing the two inode metrics, so the autoresizer's reconcile logged "failed to get volume stats" for every PVC and never resized anything. Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free to the regex. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
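The keep-list gap described in 4103ea2ba0 can be caught with a quick check before applying. This is a sketch under assumptions: the two regex strings below are hypothetical stand-ins for the real scrape-job keep-list (which is not shown in the commit), and Python's `re` is used in place of Prometheus's RE2 engine, which behaves the same for plain alternations like these. `fullmatch` mirrors the fact that Prometheus relabel regexes are fully anchored.

```python
import re

# The four metrics the commit message says pvc-autoresizer needs.
REQUIRED_METRICS = (
    "kubelet_volume_stats_available_bytes",
    "kubelet_volume_stats_capacity_bytes",
    "kubelet_volume_stats_inodes_free",
    "kubelet_volume_stats_inodes",
)


def missing_from_keep_regex(keep_regex: str) -> list[str]:
    """Return the required metrics that a `keep` relabel regex would drop."""
    pattern = re.compile(keep_regex)
    return [m for m in REQUIRED_METRICS if pattern.fullmatch(m) is None]


if __name__ == "__main__":
    # Hypothetical keep-lists, before and after the fix.
    before = r"kubelet_volume_stats_(available_bytes|capacity_bytes)"
    after = r"kubelet_volume_stats_(available_bytes|capacity_bytes|inodes|inodes_free)"
    print(missing_from_keep_regex(before))
    print(missing_from_keep_regex(after))
```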
Viktor Barzin	6d3308c848	authentik: add public guest auto-login flow + dedicated outpost + traefik public middleware Phase 1+2 of default-deny ingress plan. Adds the infrastructure for an `auth = "public"` ingress tier that auto-binds anonymous requests to a `guest` Authentik user (no UI prompt), so public sites are still recorded as authenticated by Authentik for audit purposes — but as `guest`, not by leaking the standard catchall flow. - guest user in `Public Guests` group (NOT `Allow Login Users`). - `public-auto-login` flow: stage_binding policy sets `pending_user = guest`, `evaluate_on_plan = false` + `re_evaluate_policies = true` so flow_plan is populated when the policy mutates it; `authentication = none` lets anonymous requests enter. - `Provider for Public` proxy provider (forward_domain, cookie_domain viktorbarzin.me) with `authentication_flow = public-auto-login`. - Dedicated `public` outpost: only the public provider bound, deployed as `ak-outpost-public` Deployment+Service in the `authentik` namespace by Authentik's K8s controller. - `public-auth.viktorbarzin.me` ingress exposes the public outpost's `/outpost.goauthentik.io/*` so OAuth callbacks land on it (the embedded outpost doesn't know about the public provider, so `authentik.viktorbarzin.me` callbacks would fail). - `authentik-forward-auth-public` traefik middleware points at the public outpost service (not via the auth-proxy nginx fallback). The plan's `?app=public` dispatch idea was tested and rejected — the embedded outpost dispatches purely by Host header, so a dedicated outpost was the only way to isolate the public flow without conflicts. No ingresses use the new middleware yet — Phase 3+4 (the ingress_factory `auth` variable refactor + audit pass) wires it up. This commit is additive and behaviour-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	ff5416ff40	proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true) Without this annotation on the StorageClass, pvc-autoresizer's controller filters the SC out at the index lookup stage and never patches any of its PVCs, regardless of utilization or per-PVC threshold/increase/storage_limit annotations. Internal metric pvcautoresizer_loop_seconds_total ticked but no PVCs were ever evaluated — visible cluster-wide as PVAutoExpanding alerts firing for forgejo-data-encrypted (82%) and audit-vault-0 (81%) without any ResizeStarted events ever following. The Prometheus scrape-config fix in 9d5da4d8 was a prerequisite (autoresizer reads kubelet_volume_stats_available_bytes) but not sufficient on its own. Also pinning chart version to 0.5.6 so the next apply doesn't incidentally bump to 0.5.7. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:41 +00:00

897 commits