infra

Author	SHA1	Message	Date
Viktor Barzin	3025879478	claude-agent-service: ship recruiter-triage agent + restore missing terragrunt.hcl - main.tf: bump image_tag to 1b3350c0 (carries the new agent), init container also copies recruiter-triage.md into /home/agent/.claude/agents/. - terragrunt.hcl: restored (file was missing — apply was blocked). Standard root include + platform/vault/external-secrets dependencies. Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42) via recruiter-responder REST API → 102.5s, $0.43, structured markdown report with comp bands vs £600k floor, culture signals, remote policy, recent news, sources cited. End-to-end Tier-2 is live. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	ea2342b8e2	docs: add CONTEXT.md domain glossary [ci skip] Adds the per-repo domain glossary that engineering skills (diagnose, tdd, improve-codebase-architecture, grill-with-docs) read before working in this repo. Terms only — no implementation detail. Six clusters (code organization, cluster, networking, storage, secrets, CI/CD), 22 terms, plus relationships, an example dialogue, and five flagged ambiguities. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	ce5f3ec209	recruiter-responder: expose Gmail IMAP creds for backtest CLI Pulls vbarzin@gmail.com app password from secret/recruiter-responder (seeded from secret/wealthfolio.imap_password — same Gmail credential that wealthfolio uses for broker-statement ingestion). Env vars GMAIL_IMAP_USER + GMAIL_IMAP_PASS, consumed by 'backtest gmail'. Backtest verified 2026-05-16 against folder 'companies-I-dont-take-seriously': 20/20 recruiter, 100% company extraction (9 stated, 6 subject, 4 sender_domain, 1 body), 30% comp, avg 12s latency. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	065982d978	kured: fix sentinel path mismatch that stalled rolling reboots The kured Helm chart derives the sentinel hostPath from `dirname(configuration.rebootSentinel)`. Previously, rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at `/sentinel/` (an empty auto-created directory on every host) while the kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required. These are two different host directories, so kured never saw the open gate, even though the gate's checks were all green every 5 min on every node. Result: unattended-upgrades has had packages waiting on every node since 2026-05-10 (when uu was re-enabled), and kured's hourly log has said "Reboot not required" for the entire period. Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts hostPath /var/run, the same directory the gate writes to. The in-pod mountPath (/sentinel) is hardcoded by the chart and doesn't matter; the symlink chain works out: /sentinel/<file> inside the pod resolves to /var/run/<file> on the host. Verified: kured pod can now list /sentinel/gated-reboot-required (0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15). First gated reboot will land Mon 2026-05-18 02:00 London. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	80e6314bf0	recruiter-responder: bump image_tag to 559e5c57 PDF extraction, tech_stack list, aggressive company/comp inference, no-phone-call drafts, backtest CLI. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	8e11caff8d	recruiter-responder: bump image_tag to bbd178da (structured Telegram + comp floor) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	391c002f9a	service-catalog: add aiostreams entry Stremio stream aggregator now has its own row in the Active Use tier. Captures the auth model (own UUID+password, not Authentik), monitoring posture (canary probe + 3 alerts), and backup pipeline (weekly NFS dumps of both decrypted config and the Stremio account addon collection). Follow-up from the 2026-05-15/16 hardening session: 5 commits on servarr/aiostreams, none previously catalogued.	2026-05-22 14:16:47 +00:00
Viktor Barzin	24ce3e267d	aiostreams: weekly backup of Stremio account addon collection Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from the AIOStreams config-backup at 03:00): - Logs into api.strem.io with credentials from Vault (secret/viktor.stremio_email + stremio_password, now also synced into the aiostreams-probe-secrets ExternalSecret) - Fetches the full addonCollection via addonCollectionGet - Writes timestamped JSON to the existing aiostreams-backup PVC (NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600) - 90-day retention, logs out to invalidate the auth key - Pushgateway metrics: stremio_account_backup_{success,bytes, addon_count,duration_seconds,last_run_timestamp} Protects against: accidental "uninstall all" / API regression / wrong account login wiping the curated set of 22 addons (Cinemeta + 16 MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local). Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.	2026-05-22 14:16:47 +00:00
Viktor Barzin	aa6e9b0242	recruiter-responder: public /cb ingress for Telegram URL-button callbacks - Add ingress_factory module (auth=none, HMAC + expiry are the gate); ingress_path=["/cb"] only — /api stays internal, /healthz cluster. dns_type=proxied. anti_ai_scraping=false. - Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret` auto-clones the wildcard cert into every namespace. - Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS hostname relax). - Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me. - Drop git-crypt-encrypted wildcard cert files into stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new .gitleaksignore — git-crypt encrypts at rest but the working-tree copy is plaintext, so gitleaks can't tell. Smoke-tested end-to-end 2026-05-15 23:45: synthetic email -> Telegram with ✅/❌ buttons -> ✅ tapped via curl -> 'Sent' HTML page -> thread.status=sent, decision row recorded with decided_via=telegram_button, outbound message threaded correctly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	77010b769a	aiostreams: whitelist Vidhin + Tamtaro sync URLs Adds two env vars on the AIOStreams deployment: - WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex (TRaSH-aligned) so syncedRankedRegexUrls works for the user - WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions + Tamtaro's ISE/PSE/ESE-standard Gotcha: AIOStreams validates each synced* field against the matching whitelist — stream-expression files (incl. Vidhin's expressions.json) go in WHITELISTED_SEL_URLS, not the regex one, even though they live in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG. User config: enabled Vidhin's regex + ranked expressions + Tamtaro's ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering; can be added later from the same whitelist.	2026-05-22 14:16:47 +00:00
Viktor Barzin	c396092c86	aiostreams: weekly NFS backup of decrypted user config Adds aiostreams-config-backup CronJob (Sun 03:00 weekly): - Pulls /api/v1/user via internal ClusterIP with UUID + password from the existing aiostreams-probe-secrets ExternalSecret - Writes timestamped JSON to nfs-backup PVC mounted at /backup - 90-day retention, prunes older files - Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp} NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite to Synology via the existing offsite-sync-backup CronJob). Complements the daily postgresql-backup-per-db pipeline (which dumps the encrypted blob) by storing the decrypted JSON — usable for human inspection / disaster recovery even without the AIOStreams password. Verified: manual job wrote 12931 bytes, file present on NFS.	2026-05-22 14:16:47 +00:00
root	1177a82452	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:47 +00:00
Viktor Barzin	a98b00324d	recruiter-responder: pin image tag + run plugin installer init as root - stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3 (300s LLM timeouts + IMAP BODY.PEEK[] fix). - stacks/openclaw/main.tf: install-recruiter-plugin init container now runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the recruiter-responder image otherwise drops to uid 10001 which can't write or chown. Smoke-tested end-to-end 2026-05-15 ~23:15: Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage (12.1s, JSON output complete with company/role/salary/location/tech) -> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK. Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from the n8n Postgres workflow_entity table. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	a72590db7d	recruiter-responder: vault DB role + switch proactive push to Telegram - stacks/vault/main.tf: register pg-recruiter-responder static role on the postgresql connection (7d password rotation). Adds the role to allowed_roles and creates vault_database_secret_backend_static_role for `recruiter_responder` user. - stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID. Updated header doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:46 +00:00
Viktor Barzin	89e9471e87	state(vault): update encrypted state	2026-05-22 14:16:46 +00:00
Viktor Barzin	7e1580ba8c	recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount Three coupled changes for the new recruiter-responder pipeline: 1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the download Job script + cmd renderer to handle text_only=true (skip mmproj download + --mmproj flag). The 3 existing vision models stay on text_only=false; no behaviour change for them. 2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets (app secrets from secret/recruiter-responder, DB creds from Vault DB engine static-creds/pg-recruiter-responder), Deployment (replicas=1, Recreate -- IMAP IDLE + APScheduler want single leader), Service ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder. 3. stacks/openclaw/: add init container `install-recruiter-plugin` that uses the recruiter-responder image to copy the .mjs plugin into /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin version to the recruiter-responder image tag. Also injects RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token from openclaw-secrets.recruiter_responder_bearer_token, optional). Pre-apply checklist for recruiter-responder stack: - Vault: seed secret/recruiter-responder with webhook_bearer_token, imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token, task_webhook_token. - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as above webhook_bearer_token). - dbaas: create DB recruiter_responder + role recruiter_responder, and Vault DB-engine role static-creds/pg-recruiter-responder. - Build + push image via Woodpecker (recruiter-responder repo CI). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:46 +00:00
Viktor Barzin	95b9f7bc89	aiostreams: 1h stream cache + canary stream-count probe + 3 alerts Hardening pass following the empty-stream-list incident: 1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 / disabled). Default behaviour hit all 5 upstream addons on every Stremio request; with a 1h TTL repeat requests for the same title are instant, while RD cache invalidations still propagate quickly. 2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's encryptedPassword via the internal ClusterIP, runs a canary stream search for Breaking Bad S01E01, pushes streams_count + probe_success to Pushgateway. Uses an ExternalSecret pulling UUID + password from Vault secret/viktor. Same pattern as email-roundtrip-monitor. 3. Three alerts in monitoring's prometheus_chart_values.tpl: - AIOStreamsStreamCountLow (< 50 streams for 30m) - AIOStreamsProbeFailing (probe_success == 0 for 30m) - AIOStreamsProbeStale (last_run_timestamp > 30min for 10m) Verified: probe returned streams=411 success=1 on first run; all 3 alerts loaded into Prometheus with state=inactive health=ok.	2026-05-22 14:16:46 +00:00
root	fba5ee2df4	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:46 +00:00
Viktor Barzin	c73234982f	aiostreams: pin nightly + switch to auth=app - Pin viren070/aiostreams:nightly → :2026.05.14.1326-nightly (avoid stale-pull cache, matches 8-char SHA convention for rolling tags) - Switch ingress auth tier required → app: Authentik forward-auth blocks Stremio clients (cannot follow OAuth 302), and AIOStreams already enforces UUID + password on /configure and /api/*, with Stremio addon URLs using encryptedPassword as a bearer token. Result: empty-stream-list issue fixed for public Stremio clients. Verified: 410 streams returned via public URL for Breaking Bad S01E01 with no cookies, vs 0 before (502→Authentik OIDC redirect).	2026-05-22 14:16:46 +00:00
Viktor Barzin	2903ab9778	monitoring(wealth): move Positions table under contrib/growth row Positions panel now sits at y=32 (immediately below the contrib-vs-market + growth row at y=22..32), and everything from the per-account stack down shifts 8 rows lower.	2026-05-22 14:16:46 +00:00
Viktor Barzin	8461275308	wealth: positions table panel (shares + cost basis + unrealised return) pg-sync sidecar now mirrors three extra views from the wealthfolio SQLite: assets (id/symbol/name/currency), quote_latest (one row per asset, preferring YAHOO over MANUAL on same-day collisions), and positions_latest (currently-held positions extracted from the TOTAL aggregate row of holdings_snapshots — quantity, average cost, total cost basis). Wealth dashboard gets a new bottom Positions table joining the three: symbol, name, shares, avg cost, last price, market value, cost, gain, return %. Gain and return % are color-text with red<0, green>=0 thresholds.	2026-05-22 14:16:46 +00:00
Viktor Barzin	d6049ff7a0	terminal: extract app code to viktor/terminal-lobby on Forgejo The lobby has grown enough (frontend, two Go services, devvm units + scripts + config) that it earns its own repo. Code now lives at https://forgejo.viktorbarzin.me/viktor/terminal-lobby with scripts/deploy.sh covering the manual deploy until CI activation lands (Woodpecker forge_id=2 activation still 500s; Forgejo Actions not yet enabled). This stack now owns only the K8s side — Services, Endpoints, IngressRoutes, middlewares. main.tf comment block updated to point at the new repo and the full DevVM port map. Removed: - stacks/terminal/files/ (index.html + DevVM artefacts) - stacks/terminal/tmux-api/ (Go service) - stacks/terminal/clipboard-upload/ (Go service)	2026-05-22 14:16:46 +00:00
Viktor Barzin	c135c04c79	terminal: make slate the default theme	2026-05-22 14:16:46 +00:00
Viktor Barzin	a44aa52e1a	terminal: theme picker (carbon/slate/mono/ink) replacing violet Drops the hardcoded violet/indigo palette. Four themes are defined as CSS variables on body.theme-{carbon,slate,mono,ink}: - Carbon (default): warm dark, ivory text, restrained amber accent. - Slate: cool dark, GitHub/Linear-ish charcoal with electric blue. - Mono: strict greyscale, off-white accent. - Ink: warm paper light, deep ink, terracotta accent. The lobby reads the choice from localStorage and applies the class before render. The picker lives at the bottom of the sidebar (margin-top: auto pins it). On change, the iframe is bounced through about:blank so the inner xterm picks up the new computed CSS vars (--terminal-bg/fg/cursor/selection) on the next mount. Picker UI uses native buttons, current theme highlighted with the accent border + color. No gradients, hairline borders only.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cbe83597c0	terminal: rename sessions + drag-and-drop reorder Backend: POST /sessions/<name>/rename in tmux-api runs tmux rename-session as the mapped OS user. 400 on bad name, 404 on missing source, 409 on duplicate target, 401 on missing auth header. Frontend: - Rename button per card → prompt() dialog, validates against the shared regex. Updates currentActive + hash + iframe.src if the renamed session was active. - Session order is now user-driven, persisted in localStorage keyed per osUser. New sessions append at the bottom. The previous sort-by-lastActivity is gone. - HTML5 drag-and-drop reorders cards live during dragover; dragend captures the DOM order into localStorage. - Polling renderLobby is suppressed while a drag is in flight so the 5s tick doesn't yank the list out from under the user.	2026-05-22 14:16:45 +00:00
Viktor Barzin	04fd241679	terminal: inline session switching via sidebar + iframe Replace full-page navigation with a two-pane lobby. Sidebar holds the session list as clickable cards; an iframe in the content pane swaps its src on click so switching sessions takes one click instead of two navigations. - #lobby-shell grid (260px sidebar + iframe pane) - Cards become role=button, kill button stops propagation - activateSession/deactivateSession with hash routing (location.hash <-> active session, replaceState so back stack stays clean) - Killed active session deactivates the iframe before re-render - 5s session poll preserves currentActive; deactivates if gone - Mobile media query collapses to one column CSP frame-ancestors already permits same-origin embedding (*.viktorbarzin.me), no infra changes needed. Direct-link ?arg=<name> path is unchanged.	2026-05-22 14:16:45 +00:00
root	7663b5c36e	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:45 +00:00
Viktor Barzin	43affc3cdc	actualbudget: add `enabled` flag to factory, disable emo Emo isn't using the instance and the daily bank-sync CronJob has been failing because the budget has zero accounts (deleted from the UI), triggering BankSyncStale. Adds an `enabled` toggle that gates the core Deployment + Service + Ingress + http-api + CronJob behind a single plan-time bool while preserving the PVC, so we can flip back to true later to restore the instance as-was. Also fixes a latent bug where the http-api Service was always created even when `enable_http_api=false`. Apply: 7 resources destroyed (emo deploy/svc/ingress/cf dns/http-api deploy+svc/cronjob), 0 changes for viktor/anca (moved blocks migrated their state cleanly to the new [0] addresses). Pushgateway job bank-sync-emo cleared manually; orphaned external-monitor synced out by external-monitor-sync.	2026-05-22 14:16:45 +00:00
Viktor Barzin	9fce3c7b09	terminal: per-Authentik-user OS-user isolation; deny unmapped users Restores the kernel-level isolation the pre-cutover ttyd-session.sh had, but keeps the multi-session lobby UX: - ttyd.service gets `-H X-authentik-username` back. `tmux-attach.sh` reads $TTYD_USER, looks up the local part in /etc/ttyd-user-map, denies the connection (no fallback to wizard) if there's no mapping, otherwise `sudo -n -H -u <os_user> tmux …`. Each Authentik identity → its own Unix user → its own `/tmp/tmux-<uid>/default` socket. - tmux-api scopes every request to the same OS user via the same header. Adds /whoami so the lobby HTML can preflight access and render "logged in as <os_user> (<authentik>)" instead of leaving the user to discover the deny via a reconnect loop. - Commits /etc/ttyd-user-map and the matching /etc/sudoers.d/ttyd-users fragment under files/devvm/ so future operators see one canonical source of truth. Current mappings: vbarzin → wizard, emil.barzin → emo. Adding a user is now: append a line to ttyd-user-map + a NOPASSWD sudoers line + `useradd -m`. README walks through it. No Terraform changes — this is all DevVM-side + lobby JS.	2026-05-22 14:16:45 +00:00
Viktor Barzin	aff4f67671	terminal: cut over to multi-session lobby on terminal.viktorbarzin.me Promotes the staged multi-session UX from term.viktorbarzin.me to the primary terminal.viktorbarzin.me hostname. `ttyd.service` on the DevVM moves to the same ExecStart that `ttyd-multi.service` was running: `/usr/local/bin/ttyd -W -a -t enableClipboard=true -I /usr/local/share/ttyd/index.html -p 7681 /usr/local/bin/tmux-attach.sh`. The lobby HTML supersedes the old per-user-attach index.html (ttyd-session.sh wrapper retired alongside). Terraform: retires the `terminal-multi` Service+Endpoints and the term.viktorbarzin.me ingress (Cloudflare DNS record for `term` is released by module deletion). The tmux-api Service+Endpoints stay, but its IngressRoute now matches terminal.viktorbarzin.me — same path-prefix specificity wins against the catch-all ingress. DevVM follow-up (applied manually as before — see files/devvm/README.md): restart ttyd to pick up the new unit, stop+disable ttyd-multi.service.	2026-05-22 14:16:45 +00:00
root	86a2c66c8e	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:45 +00:00
Viktor Barzin	b1b2cb1974	terminal: add multi-tmux-session lobby on term.viktorbarzin.me (additive) New hostname term.viktorbarzin.me serves a session-picker UI that lists, creates, and kills tmux sessions. Visiting ?arg=<name> attaches to that session (auto-creates via tmux -A). Builds on a fresh ttyd instance (7685) plus a tmux-api Go binary (7684) on the DevVM, both running as User=wizard alongside (not replacing) the existing ttyd.service (7681), ttyd-ro.service (7682), and clipboard-upload (7683). Cutover of terminal.viktorbarzin.me to the multi-session setup is deferred. Terraform diff is purely additive — terminal-multi/tmux-api Service + Endpoints + ingress_multi (term.viktorbarzin.me, Authentik-gated) + an IngressRoute that path-prefixes /api/sessions/* to tmux-api with the matching strip-prefix Middleware. DevVM-side units ship under files/devvm/ with a README — manual scp + systemctl install (see files/devvm/README.md). ttyd 1.7.7 already deployed there (≥1.7 needed for -a). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:45 +00:00
Viktor Barzin	726fb25182	monitoring(wealth): paint declining segments red on growth chart Mirror the panel 5 treatment on panel 7 (Growth = market value − contribution). Second SQL column emits the growth value only when the point is part of a declining segment; field override paints it red with no fill, spanNulls=false.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cc47da87b0	payslip-ingest, instagram-poster: suspend two chronic-failure cronjobs Identified during alert-noise review as steady sources of JobFailed. Suspending them stops the noise; unsuspend after the per-job blocker is cleared. * payslip-ingest/actualbudget-payroll-sync — blocked on Vault `secret/payslip-ingest` missing `actualbudget_encryption_password`. `actualbudget_api_key` and `actualbudget_budget_sync_id` were added (copied from `secret/fire-planner`) in the same session; the encryption password is not stored anywhere in Vault and needs to be populated separately. ExternalSecret sync has been failing since 2026-04-25. * instagram-poster/ig-refresh-token — the deployed image (:da5b4191) does not contain the `POST /ig-refresh-token` route; the route is defined in uncommitted working-copy changes at `instagram-poster/instagram_poster/app.py:695`. Unsuspend after the new image rolls. Each `suspend = true` line carries an inline comment with the unsuspend trigger.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cbd0f71a3b	monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h Three improvements identified in the 7d alert-noise review: A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures node-level pull error rate, which doesn't catch a single pod stuck in ImagePullBackOff — council-complaints sat broken for ~10h on 2026-05-12 without paging. The new rule fires per-pod after 30m. B. Two new inhibit_rules: - PVFillingUp (95% used, critical) suppresses PVPredictedFull (linear projection, warning) on the same PVC. Pair was producing ~24h of redundant firing per 7d. - EmailRoundtripFailing (active probe failure) suppresses EmailRoundtripStale (derivative >60min no-success). Same outage windows, ~14.5h of duplicate firing per 7d. C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old 30-minute window paged on the first failed iteration before the next run could recover. 2h means "still failing across at least two cron iterations" — much more actionable. Verified live: rules loaded, inhibitors in alertmanager config, PodImagePullBackOff is currently inactive (council-complaints ImagePullBackOff actively detected — see separate fix).	2026-05-22 14:16:45 +00:00
Viktor Barzin	70292b9e23	monitoring: TraefikReplicaConfigStale — drop false-positive on stale series The initial formulation used clamp_min(min(rate[2h]), 0.0001), which made a recently-deleted pod's lingering rate=0 drive the ratio toward infinity for up to 2h until the stale series aged out of the rate window. With for: 2h, this was a near-miss for spurious firing in the immediate aftermath of restarting the bad replica (our remediation path). Tighter formulation: * 30m rate window — stale series ages out within minutes, not hours * `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12 incident) sits well above it, so true positives still trip * for: 1h — fast enough to catch the next incident, long enough that short rate dips don't flap Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0 results with the live cluster's tight rate spread (~0.00065–0.0007/s across all three Traefik replicas).	2026-05-22 14:16:45 +00:00
Viktor Barzin	165bb7258e	monitoring: detect stale Traefik replicas + reduce alert-storm cascading Two new alertmanager inhibit rules and one new Prometheus alert, informed by the 2026-05-12 incident where Traefik pod traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers vs 119 on healthy peers (stale K8s informer cache) and served 404 for ~1/3 of viktorbarzin.me traffic. * New alert TraefikReplicaConfigStale: fires when max/min reload-rate ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h for-clause tolerates legitimate post-restart ramp-up; the bug pattern persists indefinitely. * New inhibit: TraefikReplicaConfigStale suppresses the symptom alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical}, IngressErrorRate5xxHigh, TraefikHighOpenConnections, ForwardAuthFallbackActive, AnubisChallengeStoreErrors, ExternalAccessDivergence) so only the actionable root cause pages. * New inhibit: HomeAssistantDown suppresses HomeAssistantCriticalSensorUnavailable and HomeAssistantMetricsMissing — when HA itself is down, every sensor going unavailable is noise (10x firings observed in the last 12h). * Extend NodeDown and NFSServerUnresponsive target lists to also suppress HomeAssistantCriticalSensorUnavailable.	2026-05-22 14:16:45 +00:00
Viktor Barzin	448bc0c0f6	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:45 +00:00
root	8e13f1528e	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:45 +00:00
Viktor Barzin	e8854f9230	wealthfolio, paperless-ngx: drop migration-leftover -proxmox PVCs The 2026-04-13 encrypted-PVC migration replaced the wealthfolio and paperless-ngx data volumes with -encrypted variants but never removed the original -proxmox PVC blocks from TF — both were sitting orphaned with no pod mounting them, occupying 1Gi each of LVM thin pool. The autoresizer also logged repeated "failed to get volume stats" for them (no kubelet stats without a mounted pod), masking real signal. * wealthfolio: removed kubernetes_persistent_volume_claim.data_proxmox * paperless-ngx: removed kubernetes_persistent_volume_claim.data_proxmox (the paperless PVC turned out to be out-of-TF-state, so deleted via kubectl after the TF block removal.)	2026-05-22 14:16:45 +00:00
Viktor Barzin	701b0e3c57	claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent The 10Gi proxmox-lvm-encrypted PVC `claude-agent-workspace-encrypted` was declared in TF but never wired into the deployment — the `workspace` volume_mount pointed at an emptyDir, so the PVC sat allocated and idle from 2026-04-15 to 2026-05-11. Restructured per the design intent: * `workspace` (emptyDir) — fast per-pod ephemeral scratch for git clones. Each agent job clones the infra repo fresh, so persistence doesn't buy anything and emptyDir avoids RWO contention if the deployment is ever scaled past 1 replica. * `persistent` (5Gi NFS-backed RWX) — mounted at /persistent for cases where the agent needs to write state that should survive pod restarts (caches, ad-hoc outputs). RWX so all replicas share it; the service's sequential-mutex lock prevents concurrent writes. Also fixed `fix-perms` init container: the Dockerfile's `WORKDIR /workspace/infra` causes kubelet to create that path inside the emptyDir as root:fsGroup with the setgid bit, which uid 1000 can't write to. Pre-create the path + chmod 0775 to make it writable. NFS export already exists on the PVE host (/srv/nfs/claude-agent-persistent, owned 1000:1000). Verified: pod runs 1/1; `/persistent` writable as agent uid 1000; git-init successfully clones infra into /workspace/infra.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cd13b9d062	monitoring: drop PVAutoExpanding alert — info-only noise, not actionable PVAutoExpanding fired at >80% used (info severity), but pvc-autoresizer's threshold is 10% free (= 90% used) — the alert always fired ~10 points before any action would have been taken, and there was nothing for an operator to do during that window either. It was a "heads up" that didn't surface a problem. Real failure modes are already covered: * PVFillingUp (critical, >95% for 10m) — autoresizer didn't keep up * PVPredictedFull (warning, predict_linear 24h) — trend toward exhaustion Sharpened PVFillingUp's annotation to spell out the likely causes (storage_limit reached, expansion failing, or missing autoresizer annotations) so the responder doesn't have to recall the runbook.	2026-05-22 14:16:44 +00:00
Viktor Barzin	396cce82cf	monitoring(wealth): paint declining segments red on portfolio chart Add a second SQL column on panel 5 that returns net_worth only when the current point's previous or next neighbor is lower — i.e. the point is part of a declining segment (including the peak and trough endpoints). A field override draws this 'decline' series in red with no fill and spanNulls=false, overlaying the green base line so down periods show up as red on top of the climb.	2026-05-22 14:16:44 +00:00
Viktor Barzin	30eff178e9	healthcheck: probe uptime-kuma via internal Service (port-forward), not public URL The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO handshake the uptime-kuma-api library uses, and the library can't complete the OAuth flow, so every healthcheck reported "Connection failed" even though the pod was healthy and serving 225 monitors. Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the uptime-kuma namespace for the duration of the check, connect the library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the port-forward on the way out. The disown is to suppress bash's "Killed" job notification on stderr, which corrupted stdout when stderr was merged for JSON parsing. Verified end-to-end: healthcheck now reports the real signal — "external down(3): www, xray-vless, hermes-agent" — the same 3 Cloudflare-facing endpoints flagging in the uptime-kuma logs.	2026-05-22 14:16:44 +00:00
Viktor Barzin	a699d5bedf	vault: move audit-PVC autoresizer annotations to kubernetes_annotations Background: 2026-05-10 someone added `server.auditStorage.annotations` to vault/main.tf attempting to enable pvc-autoresizer on audit-vault-N PVCs. The vault helm chart maps that block into the StatefulSet's volumeClaimTemplates, which is immutable post-creation on existing StatefulSets. Result: 4 consecutive helm upgrade attempts (rev 16-19) all rejected with "StatefulSet spec: Forbidden", leaving the release stuck in failed state since 22:47 UTC that day. Live PVCs were hand-annotated via `kubectl annotate` as a workaround, but the IaC declared a path that couldn't be applied — every subsequent tg apply on the vault stack would re-fail. Fix: * Remove `annotations` block from `server.auditStorage` values (with a comment recording why it can't live there). * Add `kubernetes_annotations` resources for audit-vault-{0,1,2} with `force = true`, so Terraform adopts the existing annotations and tracks the desired-state in IaC going forward. The autoresizer cares about PVC annotations, not StatefulSet template annotations, so this is functionally equivalent. Done out-of-band before commit (helm state was already corrupted): `helm rollback vault 15 -n vault` → revision 20 deployed (clean). Verified: helm status vault = deployed; audit-vault-0 still has threshold=10% storage_limit=10Gi annotations; cluster healthcheck no longer reports vault/vault=failed.	2026-05-22 14:16:44 +00:00
Viktor Barzin	18a17891c4	state(vault): update encrypted state	2026-05-22 14:16:44 +00:00
Viktor Barzin	bc5c10b38d	ci: retrigger image rebuild — prior pipeline aborted during PG outage	2026-05-22 14:16:44 +00:00
Viktor Barzin	b278a8f158	docs/auth: sync to current `auth` enum (required/app/public/none) Replace the legacy `protected = true` reference with the four-tier `auth` enum that's been live for weeks. Document the anti-exposure guard (`scripts/check-ingress-auth-comments.py` + `scripts/tg`) that enforces the inline-comment convention. Fix two stale paths: - `stacks/platform/modules/ingress_factory/` → `modules/kubernetes/ingress_factory/` - `stacks/platform/modules/traefik/middleware.tf` → `stacks/traefik/modules/traefik/middleware.tf` Replace the single `protected = true` example with three: a default Authentik-gated admin UI, an app-managed backend, and an intentionally-public webhook receiver. Each example shows the required comment line above the auth assignment. [ci skip]	2026-05-22 14:16:44 +00:00
Viktor Barzin	2ba36436c8	real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed) Wires celery-beat to fire two periodic scrapes via the existing in-app SchedulesConfig mechanism. Replaces the empty-string fallback with two inline schedules expressed as Terraform-managed JSON: - london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed, £1900-4000 - london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed, £400k-1.2M Schedules live in `local.scrape_schedules` (jsonencode'd) rather than Vault — they're configuration, not secrets, and benefit from being version-controlled. The previous Vault-backed lookup (`local.notification_settings["scrape_schedules"]`) was unused. Verified live: new celery-beat pod logs `Registering periodic task: london-rent-daily at 3:0` and `london-buy-weekly at 4:0` immediately after roll-out. Also tightens the comment above the wrongmove-api `auth = "none"` line so it passes the new `scripts/check-ingress-auth-comments.py` guard (pre-existing tech debt that blocked the apply). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:44 +00:00
Viktor Barzin	b3c1631597	ci: add python3 to infra-ci image — unblocks scripts/tg auth-comment check Commit `0712a1b6` added a Python-based ingress_factory auth-comment check that runs from scripts/tg on every plan/apply. The CI image (forgejo.viktorbarzin.me/viktor/infra-ci) doesn't ship python3, so every CI apply has been failing since with: env: can't execute 'python3': No such file or directory Adding python3 to the apk install line restores CI applies for all stacks. The build-ci-image.yml pipeline auto-fires on this commit (path filter on ci/Dockerfile), so the rebuild + retag happens without manual action.	2026-05-22 14:16:44 +00:00
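
The decline-overlay rule from commit 396cce82cf (red segments on the portfolio chart) can be sketched outside SQL. This is a minimal illustration, not the panel's actual query: the function name `decline_overlay` and the sample values are hypothetical, and it assumes that "part of a declining segment" means the value fell from the previous point or falls to the next one, which keeps both the peak and the trough endpoints. `None` stands in for the SQL NULLs that `spanNulls=false` leaves as gaps in the red series.

```python
from typing import Optional, Sequence


def decline_overlay(values: Sequence[float]) -> list[Optional[float]]:
    """Keep only the points that sit on a declining segment.

    A point is kept when the series fell into it from the previous point
    or falls out of it to the next point, so the peak and trough that
    bound each decline are both included. Every other point becomes None.
    """
    overlay: list[Optional[float]] = []
    for i, cur in enumerate(values):
        # The previous point was higher: the series declined into this point.
        falls_in = i > 0 and values[i - 1] > cur
        # The next point is lower: the series declines out of this point.
        falls_out = i + 1 < len(values) and values[i + 1] < cur
        overlay.append(cur if (falls_in or falls_out) else None)
    return overlay


if __name__ == "__main__":
    net_worth = [100, 110, 105, 95, 120, 130]  # hypothetical sample data
    print(decline_overlay(net_worth))
```

Drawn on top of the full series, the non-null points form the red overlay; rising stretches stay null, so only the base line shows there.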

3262 commits