Commit graph

908 commits

Author SHA1 Message Date
a72590db7d recruiter-responder: vault DB role + switch proactive push to Telegram
- stacks/vault/main.tf: register pg-recruiter-responder static role on
  the postgresql connection (7d password rotation). Adds the role to
  allowed_roles and creates vault_database_secret_backend_static_role
  for `recruiter_responder` user.
- stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap
  TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID.
  Updated header doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:46 +00:00
7e1580ba8c recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount
Three coupled changes for the new recruiter-responder pipeline:

1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses
   unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the
   download Job script + cmd renderer to handle text_only=true (skip
   mmproj download + --mmproj flag). The 3 existing vision models stay
   on text_only=false; no behaviour change for them.

2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets
   (app secrets from secret/recruiter-responder, DB creds from Vault DB
   engine static-creds/pg-recruiter-responder), Deployment (replicas=1,
   Recreate -- IMAP IDLE + APScheduler want single leader), Service
   ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder.

3. stacks/openclaw/: add init container `install-recruiter-plugin` that
   uses the recruiter-responder image to copy the .mjs plugin into
   /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin
   version to the recruiter-responder image tag. Also injects
   RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token
   from openclaw-secrets.recruiter_responder_bearer_token, optional).

Pre-apply checklist for recruiter-responder stack:
  - Vault: seed secret/recruiter-responder with webhook_bearer_token,
    imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token,
    task_webhook_token.
  - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as
    above webhook_bearer_token).
  - dbaas: create DB recruiter_responder + role recruiter_responder,
    and Vault DB-engine role static-creds/pg-recruiter-responder.
  - Build + push image via Woodpecker (recruiter-responder repo CI).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:46 +00:00
Viktor Barzin
95b9f7bc89 aiostreams: 1h stream cache + canary stream-count probe + 3 alerts
Hardening pass following the empty-stream-list incident:

1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 /
   disabled). Default behaviour hit all 5 upstream addons on every
   Stremio request; with a 1h TTL repeat requests for the same title
   are instant, while RD cache invalidations still propagate quickly.

2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's
   encryptedPassword via the internal ClusterIP, runs a canary stream
   search for Breaking Bad S01E01, pushes streams_count + probe_success
   to Pushgateway. Uses an ExternalSecret pulling UUID + password from
   Vault secret/viktor. Same pattern as email-roundtrip-monitor.

3. Three alerts in monitoring's prometheus_chart_values.tpl:
   - AIOStreamsStreamCountLow  (< 50 streams for 30m)
   - AIOStreamsProbeFailing    (probe_success == 0 for 30m)
   - AIOStreamsProbeStale      (last_run_timestamp > 30min for 10m)

Verified: probe returned streams=411 success=1 on first run; all 3
alerts loaded into Prometheus with state=inactive health=ok.
2026-05-22 14:16:46 +00:00
root
fba5ee2df4 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:46 +00:00
Viktor Barzin
c73234982f aiostreams: pin nightly + switch to auth=app
- Pin viren070/aiostreams:nightly → :2026.05.14.1326-nightly (avoid
  stale-pull cache, matches 8-char SHA convention for rolling tags)
- Switch ingress auth tier required → app: Authentik forward-auth
  blocks Stremio clients (cannot follow OAuth 302), and AIOStreams
  already enforces UUID + password on /configure and /api/*, with
  Stremio addon URLs using encryptedPassword as a bearer token.
  Result: empty-stream-list issue fixed for public Stremio clients.

Verified: 410 streams returned via public URL for Breaking Bad S01E01
with no cookies, vs 0 before (502→Authentik OIDC redirect).
2026-05-22 14:16:46 +00:00
2903ab9778 monitoring(wealth): move Positions table under contrib/growth row
Positions panel now sits at y=32 (immediately below the
contrib-vs-market + growth row at y=22..32), and everything from
the per-account stack down shifts 8 rows lower.
2026-05-22 14:16:46 +00:00
8461275308 wealth: positions table panel (shares + cost basis + unrealised return)
pg-sync sidecar now mirrors three extra views from the wealthfolio
SQLite: assets (id/symbol/name/currency), quote_latest (one row per
asset, preferring YAHOO over MANUAL on same-day collisions), and
positions_latest (currently-held positions extracted from the TOTAL
aggregate row of holdings_snapshots — quantity, average cost,
total cost basis).

Wealth dashboard gets a new bottom Positions table joining the three:
symbol, name, shares, avg cost, last price, market value, cost,
gain, return %. Gain and return % are color-text with red<0, green>=0
thresholds.
2026-05-22 14:16:46 +00:00
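
As a rough illustration of the derived columns in the Positions table from commit 8461275308, the sketch below computes market value, cost, gain, and return % from shares, average cost, and last price. The formulas and the `position_row` helper are assumptions made for illustration; the commit does not show the panel's actual SQL.

```python
from typing import Dict, Union


def position_row(shares: float, avg_cost: float, last_price: float) -> Dict[str, Union[float, str]]:
    """Hypothetical helper: derive the computed Positions columns for one holding."""
    market_value = shares * last_price
    cost = shares * avg_cost  # assumed: total cost basis = shares x average cost
    gain = market_value - cost
    # Guard against a zero cost basis instead of dividing by zero.
    return_pct = gain * 100 / cost if cost else 0.0
    # Threshold from the commit message: red below zero, green at zero or above.
    color = "red" if gain < 0 else "green"
    return {
        "market_value": market_value,
        "cost": cost,
        "gain": gain,
        "return_pct": return_pct,
        "color": color,
    }
```

A holding of 10 shares bought at 100 and priced at 110 would come out green with a positive gain; the same holding priced below cost would come out red.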
d6049ff7a0 terminal: extract app code to viktor/terminal-lobby on Forgejo
The lobby has grown enough (frontend, two Go services, devvm units +
scripts + config) that it earns its own repo. Code now lives at
https://forgejo.viktorbarzin.me/viktor/terminal-lobby with
scripts/deploy.sh covering the manual deploy until CI activation
lands (Woodpecker forge_id=2 activation still 500s; Forgejo Actions
not yet enabled).

This stack now owns only the K8s side — Services, Endpoints,
IngressRoutes, middlewares. main.tf comment block updated to point
at the new repo and the full DevVM port map.

Removed:
- stacks/terminal/files/        (index.html + DevVM artefacts)
- stacks/terminal/tmux-api/     (Go service)
- stacks/terminal/clipboard-upload/ (Go service)
2026-05-22 14:16:46 +00:00
c135c04c79 terminal: make slate the default theme 2026-05-22 14:16:46 +00:00
a44aa52e1a terminal: theme picker (carbon/slate/mono/ink) replacing violet
Drops the hardcoded violet/indigo palette. Four themes are defined as
CSS variables on body.theme-{carbon,slate,mono,ink}:

- Carbon (default): warm dark, ivory text, restrained amber accent.
- Slate: cool dark, GitHub/Linear-ish charcoal with electric blue.
- Mono: strict greyscale, off-white accent.
- Ink: warm paper light, deep ink, terracotta accent.

The lobby reads the choice from localStorage and applies the class
before render. The picker lives at the bottom of the sidebar
(margin-top: auto pins it). On change, the iframe is bounced through
about:blank so the inner xterm picks up the new computed CSS vars
(--terminal-bg/fg/cursor/selection) on the next mount.

Picker UI uses native buttons, current theme highlighted with the
accent border + color. No gradients, hairline borders only.
2026-05-22 14:16:45 +00:00
cbe83597c0 terminal: rename sessions + drag-and-drop reorder
Backend: POST /sessions/<name>/rename in tmux-api runs tmux
rename-session as the mapped OS user. 400 on bad name, 404 on missing
source, 409 on duplicate target, 401 on missing auth header.

Frontend:
- Rename button per card → prompt() dialog, validates against the
  shared regex. Updates currentActive + hash + iframe.src if the
  renamed session was active.
- Session order is now user-driven, persisted in localStorage
  keyed per osUser. New sessions append at the bottom. The previous
  sort-by-lastActivity is gone.
- HTML5 drag-and-drop reorders cards live during dragover; dragend
  captures the DOM order into localStorage.
- Polling renderLobby is suppressed while a drag is in flight so the
  5s tick doesn't yank the list out from under the user.
2026-05-22 14:16:45 +00:00
Viktor Barzin
04fd241679 terminal: inline session switching via sidebar + iframe
Replace full-page navigation with a two-pane lobby. Sidebar holds the
session list as clickable cards; an iframe in the content pane swaps
its src on click so switching sessions takes one click instead of two
navigations.

- #lobby-shell grid (260px sidebar + iframe pane)
- Cards become role=button, kill button stops propagation
- activateSession/deactivateSession with hash routing
  (location.hash <-> active session, replaceState so back stack stays
  clean)
- Killed active session deactivates the iframe before re-render
- 5s session poll preserves currentActive; deactivates if gone
- Mobile media query collapses to one column

CSP frame-ancestors already permits same-origin embedding
(*.viktorbarzin.me), no infra changes needed. Direct-link
?arg=<name> path is unchanged.
2026-05-22 14:16:45 +00:00
root
7663b5c36e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
43affc3cdc actualbudget: add enabled flag to factory, disable emo
Emo isn't using the instance and the daily bank-sync CronJob has been
failing because the budget has zero accounts (deleted from the UI),
triggering BankSyncStale. Adds an `enabled` toggle that gates the core
Deployment + Service + Ingress + http-api + CronJob behind a single
plan-time bool while preserving the PVC, so we can flip back to true
later to restore the instance as-was.

Also fixes a latent bug where the http-api Service was always created
even when `enable_http_api=false`.

Apply: 7 resources destroyed (emo deploy/svc/ingress/cf dns/http-api
deploy+svc/cronjob), 0 changes for viktor/anca (moved blocks
migrated their state cleanly to the new [0] addresses). Pushgateway
job bank-sync-emo cleared manually; orphaned external-monitor
synced out by external-monitor-sync.
2026-05-22 14:16:45 +00:00
9fce3c7b09 terminal: per-Authentik-user OS-user isolation; deny unmapped users
Restores the kernel-level isolation the pre-cutover ttyd-session.sh had,
but keeps the multi-session lobby UX:

- ttyd.service gets `-H X-authentik-username` back. `tmux-attach.sh` reads
  $TTYD_USER, looks up the local part in /etc/ttyd-user-map, denies the
  connection (no fallback to wizard) if there's no mapping, otherwise
  `sudo -n -H -u <os_user> tmux …`. Each Authentik identity → its own
  Unix user → its own `/tmp/tmux-<uid>/default` socket.
- tmux-api scopes every request to the same OS user via the same header.
  Adds /whoami so the lobby HTML can preflight access and render
  "logged in as <os_user> (<authentik>)" instead of leaving the user to
  discover the deny via a reconnect loop.
- Commits /etc/ttyd-user-map and the matching /etc/sudoers.d/ttyd-users
  fragment under files/devvm/ so future operators see one canonical
  source of truth. Current mappings: vbarzin → wizard, emil.barzin → emo.

Adding a user is now: append a line to ttyd-user-map + a NOPASSWD
sudoers line + `useradd -m`. README walks through it.

No Terraform changes — this is all DevVM-side + lobby JS.
2026-05-22 14:16:45 +00:00
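
The deny-by-default lookup described in commit 9fce3c7b09 can be sketched roughly as follows. The real logic lives in `tmux-attach.sh` and tmux-api; the two-column, whitespace-separated file format and the helper names here are assumptions, since the commit does not show the contents of /etc/ttyd-user-map.

```python
from typing import Dict, Optional


def parse_user_map(text: str) -> Dict[str, str]:
    """Parse an assumed '<authentik_local_part> <os_user>' format, one mapping per line."""
    mapping: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # tolerate comments and blank lines
        if not line:
            continue
        parts = line.split()
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


def resolve_os_user(authentik_username: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the mapped OS user, or None to deny (no fallback user)."""
    local_part = authentik_username.split("@", 1)[0]
    return mapping.get(local_part)
```

Returning `None` rather than a default account mirrors the commit's "no fallback to wizard" rule: an unmapped identity is refused instead of landing in someone else's session.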
aff4f67671 terminal: cut over to multi-session lobby on terminal.viktorbarzin.me
Promotes the staged multi-session UX from term.viktorbarzin.me to the
primary terminal.viktorbarzin.me hostname. `ttyd.service` on the DevVM
moves to the same ExecStart that `ttyd-multi.service` was running:
`/usr/local/bin/ttyd -W -a -t enableClipboard=true -I
/usr/local/share/ttyd/index.html -p 7681 /usr/local/bin/tmux-attach.sh`.
The lobby HTML supersedes the old per-user-attach index.html
(ttyd-session.sh wrapper retired alongside).

Terraform: retires the `terminal-multi` Service+Endpoints and the
term.viktorbarzin.me ingress (Cloudflare DNS record for `term` is
released by module deletion). The tmux-api Service+Endpoints stay, but
its IngressRoute now matches terminal.viktorbarzin.me — same path-prefix
specificity wins against the catch-all ingress.

DevVM follow-up (applied manually as before — see files/devvm/README.md):
restart ttyd to pick up the new unit, stop+disable ttyd-multi.service.
2026-05-22 14:16:45 +00:00
root
86a2c66c8e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
b1b2cb1974 terminal: add multi-tmux-session lobby on term.viktorbarzin.me (additive)
New hostname term.viktorbarzin.me serves a session-picker UI that lists,
creates, and kills tmux sessions. Visiting ?arg=<name> attaches to that
session (auto-creates via tmux -A). Builds on a fresh ttyd instance
(7685) plus a tmux-api Go binary (7684) on the DevVM, both running as
User=wizard alongside (not replacing) the existing ttyd.service (7681),
ttyd-ro.service (7682), and clipboard-upload (7683). Cutover of
terminal.viktorbarzin.me to the multi-session setup is deferred.

Terraform diff is purely additive — terminal-multi/tmux-api Service +
Endpoints + ingress_multi (term.viktorbarzin.me, Authentik-gated) + an
IngressRoute that path-prefixes /api/sessions/* to tmux-api with the
matching strip-prefix Middleware.

DevVM-side units ship under files/devvm/ with a README — manual scp +
systemctl install (see files/devvm/README.md). ttyd 1.7.7 already
deployed there (≥1.7 needed for -a).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
726fb25182 monitoring(wealth): paint declining segments red on growth chart
Mirror the panel 5 treatment on panel 7 (Growth = market value −
contribution). Second SQL column emits the growth value only when
the point is part of a declining segment; field override paints it
red with no fill, spanNulls=false.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cc47da87b0 payslip-ingest, instagram-poster: suspend two chronic-failure cronjobs
Identified during alert-noise review as steady sources of JobFailed.
Suspending them stops the noise; unsuspend after the per-job blocker is
cleared.

* payslip-ingest/actualbudget-payroll-sync — blocked on Vault
  `secret/payslip-ingest` missing `actualbudget_encryption_password`.
  `actualbudget_api_key` and `actualbudget_budget_sync_id` were added
  (copied from `secret/fire-planner`) in the same session; the
  encryption password is not stored anywhere in Vault and needs to be
  populated separately. ExternalSecret sync has been failing since
  2026-04-25.

* instagram-poster/ig-refresh-token — the deployed image (:da5b4191)
  does not contain the `POST /ig-refresh-token` route; the route is
  defined in uncommitted working-copy changes at
  `instagram-poster/instagram_poster/app.py:695`. Unsuspend after the
  new image rolls.

Each `suspend = true` line carries an inline comment with the unsuspend
trigger.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cbd0f71a3b monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h
Three improvements identified in the 7d alert-noise review:

A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures
   node-level pull error rate, which doesn't catch a single pod stuck
   in ImagePullBackOff — council-complaints sat broken for ~10h on
   2026-05-12 without paging. The new rule fires per-pod after 30m.

B. Two new inhibit_rules:
   - PVFillingUp (95% used, critical) suppresses PVPredictedFull
     (linear projection, warning) on the same PVC. Pair was producing
     ~24h of redundant firing per 7d.
   - EmailRoundtripFailing (active probe failure) suppresses
     EmailRoundtripStale (derivative >60min no-success). Same outage
     windows, ~14.5h of duplicate firing per 7d.

C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old
   30-minute window paged on the first failed iteration before the
   next run could recover. 2h means "still failing across at least
   two cron iterations" — much more actionable.

Verified live: rules loaded, inhibitors in alertmanager config,
PodImagePullBackOff is currently inactive (council-complaints
ImagePullBackOff actively detected — see separate fix).
2026-05-22 14:16:45 +00:00
Viktor Barzin
70292b9e23 monitoring: TraefikReplicaConfigStale — drop false-positive on stale series
The initial formulation used clamp_min(min(rate[2h]), 0.0001), which
made a recently-deleted pod's lingering rate=0 drive the ratio toward
infinity for up to 2h until the stale series aged out of the rate
window. With for: 2h, this was a near-miss for spurious firing in the
immediate aftermath of restarting the bad replica (our remediation
path).

Tighter formulation:
* 30m rate window — stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod
  ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12
  incident) sits well above it, so true positives still trip
* for: 1h — fast enough to catch the next incident, long enough that
  short rate dips don't flap

Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0
results with the live cluster's tight rate spread (~0.00065–0.0007/s
across all three Traefik replicas).
2026-05-22 14:16:45 +00:00
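
To make the tightened condition in commit 70292b9e23 concrete, here is a minimal sketch of the `(max/min) > 5 AND min > 0.0005` predicate evaluated over per-replica reload rates. It models only the instantaneous expression, not the 30m rate window or the `for: 1h` clause, and the function name and sample rates are hypothetical.

```python
from typing import Iterable


def replica_config_stale(
    reload_rates: Iterable[float],
    ratio_limit: float = 5.0,
    min_floor: float = 0.0005,
) -> bool:
    """True when the spread of reload rates across replicas looks like a stale replica."""
    rates = list(reload_rates)
    if len(rates) < 2:
        return False  # nothing to compare against
    lowest, highest = min(rates), max(rates)
    # The floor drops stale-zero and freshly started series before dividing.
    if lowest <= min_floor:
        return False
    return highest / lowest > ratio_limit
```

With a tight spread such as the 0.00065 to 0.0007 range quoted in the commit, the predicate stays false; a lingering zero-rate series is also filtered by the floor rather than driving the ratio toward infinity.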
Viktor Barzin
165bb7258e monitoring: detect stale Traefik replicas + reduce alert-storm cascading
Two new alertmanager inhibit rules and one new Prometheus alert,
informed by the 2026-05-12 incident where Traefik pod
traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers
vs 119 on healthy peers (stale K8s informer cache) and served 404 for
~1/3 of viktorbarzin.me traffic.

* New alert TraefikReplicaConfigStale: fires when max/min reload-rate
  ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h
  for-clause tolerates legitimate post-restart ramp-up; the bug
  pattern persists indefinitely.
* New inhibit: TraefikReplicaConfigStale suppresses the symptom
  alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical},
  IngressErrorRate5xxHigh, TraefikHighOpenConnections,
  ForwardAuthFallbackActive, AnubisChallengeStoreErrors,
  ExternalAccessDivergence) so only the actionable root cause pages.
* New inhibit: HomeAssistantDown suppresses
  HomeAssistantCriticalSensorUnavailable and
  HomeAssistantMetricsMissing — when HA itself is down, every sensor
  going unavailable is noise (10x firings observed in the last 12h).
* Extend NodeDown and NFSServerUnresponsive target lists to also
  suppress HomeAssistantCriticalSensorUnavailable.
2026-05-22 14:16:45 +00:00
Viktor Barzin
448bc0c0f6 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 3 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
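
The Job chain and naming scheme from commit 448bc0c0f6 can be sketched as a plain data structure. This is an illustration only: the real pipeline is `scripts/upgrade-step.sh` plus `job-template.yaml`, and both the exact rendering of the version string and the choice to put a node suffix only on worker phases are assumptions.

```python
from typing import List, NamedTuple, Optional


class UpgradeJob(NamedTuple):
    name: str
    pinned_to: Optional[str]  # node the Job runs on (None = no pinning)
    drains: Optional[str]     # node the Job drains, if any


def job_name(phase: str, target_version: str, node: Optional[str] = None) -> str:
    """k8s-upgrade-<phase>-<target_version>[-<node>], as described in the commit."""
    parts = ["k8s-upgrade", phase, target_version]
    if node:
        parts.append(node)
    return "-".join(parts)


def plan_chain(target_version: str) -> List[UpgradeJob]:
    """Ordered chain: preflight, master, workers, the last worker, postflight."""
    v = target_version
    jobs = [
        UpgradeJob(job_name("preflight", v), "k8s-node1", None),
        UpgradeJob(job_name("master", v), "k8s-node1", "k8s-master"),
    ]
    for node in ("k8s-node4", "k8s-node3", "k8s-node2"):
        jobs.append(UpgradeJob(job_name("worker", v, node), "k8s-node1", node))
    # k8s-node1 hosts the earlier Jobs, so its drain runs from k8s-master instead.
    jobs.append(UpgradeJob(job_name("worker", v, "k8s-node1"), "k8s-master", "k8s-node1"))
    jobs.append(UpgradeJob(job_name("postflight", v), None, None))
    return jobs
```

Because the names depend only on phase, version, and node, planning the same upgrade twice yields identical names, which is what lets a re-applied Job replace a failed one instead of duplicating it. No Job in the plan drains the node it is pinned to, which is the self-preemption the rewrite avoids.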
root
8e13f1528e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
e8854f9230 wealthfolio, paperless-ngx: drop migration-leftover -proxmox PVCs
The 2026-04-13 encrypted-PVC migration replaced the wealthfolio and
paperless-ngx data volumes with -encrypted variants but never removed
the original -proxmox PVC blocks from TF — both were sitting orphaned
with no pod mounting them, occupying 1Gi each of LVM thin pool. The
autoresizer also logged repeated "failed to get volume stats" for them
(no kubelet stats without a mounted pod), masking real signal.

  * wealthfolio: removed kubernetes_persistent_volume_claim.data_proxmox
  * paperless-ngx: removed kubernetes_persistent_volume_claim.data_proxmox
  (the paperless PVC turned out to be out-of-TF-state, so deleted via
   kubectl after the TF block removal.)
2026-05-22 14:16:45 +00:00
Viktor Barzin
701b0e3c57 claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent
The 10Gi proxmox-lvm-encrypted PVC `claude-agent-workspace-encrypted` was
declared in TF but never wired into the deployment — the `workspace`
volume_mount pointed at an emptyDir, so the PVC sat allocated and idle
from 2026-04-15 to 2026-05-11.

Restructured per the design intent:
  * `workspace` (emptyDir) — fast per-pod ephemeral scratch for git clones.
    Each agent job clones the infra repo fresh, so persistence doesn't
    buy anything and emptyDir avoids RWO contention if the deployment
    is ever scaled past 1 replica.
  * `persistent` (5Gi NFS-backed RWX) — mounted at /persistent for cases
    where the agent needs to write state that should survive pod
    restarts (caches, ad-hoc outputs). RWX so all replicas share it;
    the service's sequential-mutex lock prevents concurrent writes.

Also fixed `fix-perms` init container: the Dockerfile's `WORKDIR
/workspace/infra` causes kubelet to create that path inside the
emptyDir as root:fsGroup with the setgid bit, which uid 1000 can't
write to. Pre-create the path + chmod 0775 to make it writable.

NFS export already exists on the PVE host
(/srv/nfs/claude-agent-persistent, owned 1000:1000).

Verified: pod runs 1/1; `/persistent` writable as agent uid 1000;
git-init successfully clones infra into /workspace/infra.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cd13b9d062 monitoring: drop PVAutoExpanding alert — info-only noise, not actionable
PVAutoExpanding fired at >80% used (info severity), but pvc-autoresizer's
threshold is 10% free (= 90% used) — the alert always fired ~10 points
before any action would have been taken, and there was nothing for an
operator to do during that window either. It was a "heads up" that
didn't surface a problem.

Real failure modes are already covered:
  * PVFillingUp (critical, >95% for 10m) — autoresizer didn't keep up
  * PVPredictedFull (warning, predict_linear 24h) — trend toward exhaustion

Sharpened PVFillingUp's annotation to spell out the likely causes
(storage_limit reached, expansion failing, or missing autoresizer
annotations) so the responder doesn't have to recall the runbook.
2026-05-22 14:16:44 +00:00
396cce82cf monitoring(wealth): paint declining segments red on portfolio chart
Add a second SQL column on panel 5 that returns net_worth only when the
current point is lower than its previous neighbor or higher than its next
neighbor, i.e. the point is part of a declining segment (including the
peak and trough endpoints). A field override draws this 'decline' series
in red with no fill and spanNulls=false, overlaying the green base line
so down periods show up as red on top of the climb.
2026-05-22 14:16:44 +00:00
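
The "declining segment" column from commit 396cce82cf can be sketched outside SQL as below. The sketch assumes a point belongs to a declining segment when it is lower than its predecessor or higher than its successor, which is what "including the peak and trough endpoints" implies; the dashboard's actual query is not shown in the commit, and `decline_series` is a hypothetical name.

```python
from typing import List, Optional


def decline_series(values: List[float]) -> List[Optional[float]]:
    """Keep a value only where it sits on a falling stretch; emit None elsewhere."""
    out: List[Optional[float]] = []
    last = len(values) - 1
    for i, v in enumerate(values):
        falls_into = i > 0 and values[i - 1] > v      # trough side of a decline
        falls_from = i < last and values[i + 1] < v   # peak side of a decline
        out.append(v if (falls_into or falls_from) else None)
    return out
```

The `None` gaps are what make spanNulls=false matter: the red overlay is drawn only across consecutive declining points instead of bridging the rising stretches between them.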
Viktor Barzin
a699d5bedf vault: move audit-PVC autoresizer annotations to kubernetes_annotations
Background: on 2026-05-10 someone added `server.auditStorage.annotations`
to vault/main.tf attempting to enable pvc-autoresizer on audit-vault-N
PVCs. The vault helm chart maps that block into the StatefulSet's
volumeClaimTemplates, which is immutable post-creation on existing
StatefulSets. Result: 4 consecutive helm upgrade attempts (rev 16-19)
all rejected with "StatefulSet spec: Forbidden", leaving the release
stuck in failed state since 22:47 UTC that day. Live PVCs were
hand-annotated via `kubectl annotate` as a workaround, but the IaC
declared a path that couldn't be applied — every subsequent tg apply
on the vault stack would re-fail.

Fix:
  * Remove `annotations` block from `server.auditStorage` values
    (with a comment recording why it can't live there).
  * Add `kubernetes_annotations` resources for audit-vault-{0,1,2}
    with `force = true`, so Terraform adopts the existing annotations
    and tracks the desired-state in IaC going forward. The autoresizer
    cares about PVC annotations, not StatefulSet template annotations,
    so this is functionally equivalent.

Done out-of-band before commit (helm state was already corrupted):
  `helm rollback vault 15 -n vault` → revision 20 deployed (clean).

Verified: helm status vault = deployed; audit-vault-0 still has
threshold=10% storage_limit=10Gi annotations; cluster healthcheck
no longer reports vault/vault=failed.
2026-05-22 14:16:44 +00:00
Viktor Barzin
2ba36436c8 real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed)
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:

- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
  £1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
  £400k-1.2M

Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.

Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.

Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:44 +00:00
Viktor Barzin
f10784ddb6 infra: document auth = "app|none" tier on every legacy ingress
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.

Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
  - navidrome   (Subsonic user/password)
  - ntfy        (deny-all default + user.db tokens)
  - nextcloud   (WebDAV/CalDAV/CardDAV app passwords)
  - vaultwarden (Bitwarden-compatible token auth)
  - headscale   (OIDC + preauth keys for Tailscale nodes)
  - paperless-ngx (app-layer login + API tokens)

Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).

real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.

No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
2026-05-22 14:16:44 +00:00
Viktor Barzin
20774f794d dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.

dbaas (Tier 0):
  * max_connections: 100 → 200
  * shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
  * effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
  * pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
    + ~16MB work_mem * concurrent sorts + OS cache + overhead)
  * Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
    which rolls the cluster (standby first, then primary failover).

monitoring:
  * New scrape job 'cnpg' on dbaas namespace pods labeled
    cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
    cnpg_cluster + cnpg_role labels for alert grouping.
  * PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
  * PGConnectionsCritical (critical, >95% for 3m) — last call before
    refusing connections.

Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
2026-05-22 14:16:44 +00:00
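
The two connection alerts in commit 20774f794d reduce to a ratio check against max_connections. The sketch below shows that arithmetic with the thresholds from the commit message; it ignores the `for:` durations (10m and 3m), and the function name is hypothetical.

```python
def pg_connection_state(backends: int, max_connections: int) -> str:
    """Classify connection usage with the 85% / 95% thresholds from the commit."""
    ratio = backends / max_connections
    if ratio > 0.95:
        return "critical"  # PGConnectionsCritical territory
    if ratio > 0.85:
        return "warning"   # PGConnectionsHigh territory
    return "ok"
```

The verified figures in the commit (84 backends against a limit of 200) give a ratio of 0.42, well under both thresholds, whereas the earlier steady state of 90 out of 100 would already have been in the warning band.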
Viktor Barzin
eb529d60e4 infra/ingress_factory: add auth = "app" mode for self-authed backends
Adds a fourth auth tier alongside required/public/none. "app" is
functionally identical to "none" — no Authentik middleware attached —
but the distinct name records intent at the call site: this backend
has its own user login (NextAuth, Django, OAuth, bearer-token API,
etc.) and Authentik would only break it.

Why the new tier: with only required/none, every "the app has its
own auth so drop Authentik" decision looked identical at the call
site to "this is an OAuth callback / webhook receiver / native-client
API". Future readers couldn't tell whether a stack was intentionally
unauthenticated or relying on backend auth. Now they can.

Migrates the 8 stacks flipped earlier this session (novelapp, immich,
linkwarden, tandoor, freshrss, affine, actualbudget, ebooks/audiobookshelf)
from "none" to "app". Confirmed no-op: `tg plan` on novelapp showed
"No changes" — same middleware chain, same live state.

The variable description and the .claude/CLAUDE.md Auth section now
spell out the anti-exposure rule: only pick "app" or "none" AFTER
verifying the app has its own user auth ("app") or the endpoint is
intentionally public ("none"). Default stays "required" so accidental
omission fails closed.

[ci skip]
2026-05-22 14:16:44 +00:00
root
6b9f5e8027 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:44 +00:00
Viktor Barzin
665b6b2934 actualbudget+monitoring: per-account bank-sync metrics, drop noisy alert
The bank-sync CronJob was posting to /accounts/banksync which fans out to
ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls
per-account per-24h quota, a single rate-limited account would 500 the
whole call, and `bank_sync_success` would flip to 0 even though the data
itself was still flowing through manual UI syncs. Result: BankSyncFailing
fired routinely whenever the user had been active in the UI that day —
a structural false positive.

Fix:
  * CronJob: enumerate accounts via GET /accounts, POST per-account
    /accounts/{id}/banksync, emit bank_sync_account_success and
    bank_sync_account_last_success_timestamp labelled by account name.
    Roll up bank_sync_success = 1 iff any account succeeded.
  * Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale
    at 48h (global drought). Add BankSyncAccountStale at 72h (catches
    single-account auth expiry — the real signal we wanted).

Verified: manual run on bank-sync-viktor pushes 6 per-account success +
timestamp series; roll-up bank_sync_success=1; no firing alerts.
2026-05-22 14:16:44 +00:00
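
The roll-up rule in commit 665b6b2934 ("bank_sync_success = 1 iff any account succeeded") can be sketched as follows. The metric names come from the commit message; the `account` label name, the rendering function, and the omission of the timestamp series are simplifications assumed for illustration, and the real CronJob script is not shown.

```python
from typing import Dict, List


def render_bank_sync_metrics(results: Dict[str, bool]) -> List[str]:
    """Render per-account success gauges plus the global roll-up as exposition lines."""
    lines: List[str] = []
    for account, ok in sorted(results.items()):
        lines.append(f'bank_sync_account_success{{account="{account}"}} {int(ok)}')
    # Roll-up is 1 if at least one account synced, so a single rate-limited
    # account no longer flips the global signal to 0.
    lines.append(f"bank_sync_success {int(any(results.values()))}")
    return lines
```

With this shape, a single failing account shows up only in its own series, which is the signal BankSyncAccountStale watches, while the global gauge drops to 0 only when every account fails.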
Viktor Barzin
7b6eee49c4 infra: drop Authentik forward-auth from 7 self-authed apps (auth = "none")
Apps with their own user auth + bearer-token APIs were being broken by
Traefik → Authentik forward-auth: every iOS/Android/native client got a
302 to authentik.viktorbarzin.me instead of the JSON they expected.
Authentik's 302+cookie dance can only be followed by a real browser.

Changed:
  - immich         (Immich mobile app + bearer-token /api)
  - linkwarden     (NextAuth + Linkwarden mobile clients)
  - tandoor        (Django auth + Tandoor mobile clients)
  - freshrss       (Fever/GReader API used by Reeder/FeedMe/etc.)
  - affine         (workspace auth + AFFiNE desktop/mobile sync)
  - actualbudget   (server password + Actual mobile/sync clients)
  - ebooks/abs     (Audiobookshelf iOS/Android app)

Each app's own auth is the gate now. CrowdSec + rate-limit + anti-AI
UA filter still front the ingresses. Same pattern as the novelapp
change earlier this session.

[ci skip]
2026-05-22 14:16:44 +00:00
Viktor Barzin
f98c3f2049 infra/novelapp: drop Authentik forward-auth (auth = "none")
novelapp handles its own user auth via NextAuth + Google OAuth, so the
ingress-level Authentik forward-auth was double-gating. Mobile webviews
(iOS/Android) can't follow the Authentik 302/cookie dance — they saw
HTML challenges where they expected JSON. CrowdSec + rate-limit +
anti-AI UA filter remain in front; novelapp's own login handles users.

[ci skip]
2026-05-22 14:16:44 +00:00
root
77492b3131 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:44 +00:00
Viktor Barzin
9be0672aa3 claude-memory / resume: unblock terragrunt apply (var defaults + psql -d postgres)
Two pre-existing apply failures uncovered during the Phase 4 mass apply,
unrelated to the auth refactor but blocking 100% rollout.

claude-memory:
- `var.claude_memory_db_password` had no default and wasn't passed by
  terragrunt → fall back to Vault `secret/claude-memory.db_password` via
  `coalesce(var.x, data.vault.data["db_password"])`.
- db-init Job was failing with `database "root" does not exist` because
  psql defaults the database name to the user when -d is omitted. Added
  `-d postgres` to all five psql invocations.

resume:
- `var.resume_database_url` had no default and wasn't passed → default to
  empty string. Vault carries the real value at `secret/resume.database_url`
  consumed at the deployment env-var level; the variable here just needs
  a value to satisfy the apply.

Also: priority-pass had lost most of its TF state (only 3 of 8 resources
tracked); imported namespace/service/pvc/deployment/ingress/tls-secret to
re-bind state with live K8s resources. No code change needed there.

Verified after re-apply:
- claude-memory.viktorbarzin.me → 200 (auth=none, native MCP responses)
- priority-pass.viktorbarzin.me → 302 → authentik (auth=required)
- resume.viktorbarzin.me → 302 → authentik public outpost (auth=public)
- 6 of 7 previously-failing applies now green; only vault remains, blocked
  by an unrelated helm chart immutable-StatefulSet-field issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:44 +00:00
Viktor Barzin
a168277213 healthcheck: tune noise filters + nvidia-exporter auth=none
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":

1. prometheus_alerts: only count severity=warning|critical. Info-level
   alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
   alert rule itself sets severity; the script should respect it.

2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
   auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
   Let's Encrypt wildcard renews weekly; <14d is the only window where
   human attention is genuinely useful.

3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
   event/image/update domains (transient by design), skip friendly
   names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
   only count entities whose last_changed is more than 24h old. The
   count was 431/1470, most of it "phone in standby" noise.

4. ha_automations: only flag DISABLED automations as abandoned if
   they've also been untouched (last_changed) for >180 days; raise
   stale threshold 30d → 180d. Was flagging seasonal/holiday-only
   automations as broken.

5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
   CronJob retry leftovers (Error/Failed phase pods that K8s keeps
   around for log inspection) aren't problematic at the cluster level.

6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
   shot failures were a recurring false-positive even though the
   service was healthy.
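
The retry in item 6 could look roughly like the sketch below. The real
check lives in a shell script, so this Python version is illustrative
only; the doubling delay and the `retry` helper name are assumptions
("3x with backoff" is all the change specifies).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action up to `attempts` times, doubling the delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Delays of base_delay, 2*base_delay, ... between attempts.
                sleep(base_delay * 2 ** attempt)
    assert last_error is not None
    raise last_error
```

The `sleep` parameter is injectable so the helper can be exercised without
actually waiting.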

Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
root
8483ca59ba Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:43 +00:00
Viktor Barzin
dc7c19d88e frigate: lan ingress auth=none for HA Sofia integration
The frigate-lan.viktorbarzin.lan ingress had Authentik forward-auth in
front. HA Sofia's frigate integration polls /api/config and only knows
how to use Frigate's own API key (not browser SSO), so every poll got
a 302 to authentik.viktorbarzin.me and the integration went into an
error state. Same pattern as idrac-redfish-exporter (5c594291).

allow_local_access_only IP allowlist + Frigate's API key are enough.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dc134011eb fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.

Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.

Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dd2b7de291 fix: HA Sofia REST sensors + PVC drift safety
Two real issues found while triaging HomeAssistantCriticalSensorUnavailable
alerts and the prometheus + technitium PVC Terminating-but-in-use
state from the earlier session.

1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required →
   auth=none. HA Sofia REST sensors scrape these endpoints
   programmatically; with Authentik forward-auth in front, every
   request got a 302 to authentik.viktorbarzin.me and the REST
   sensors parsed the HTML login page instead of metrics — leaving
   the R730, UPS, and ~20 other sensors permanently unavailable.
   The allow_local_access_only IP allowlist (192.168.0.0/16 +
   10.0.0.0/8) already gates external access, so authentik on top
   was breaking machine-to-machine traffic for no security gain.

2. prometheus_server_pvc + technitium primary_config_encrypted:
   add lifecycle.ignore_changes = [spec[0].resources[0].requests].
   The autoresizer expands these PVCs; PVCs can't shrink. Without
   the ignore, every TF apply tried to revert the live size back
   to the TF spec value, hit K8s's shrink-forbidden rule, and
   force-replaced the PVC. Because the pod still mounted it, the
   PVC went into Terminating-but-protected limbo: harmless until the
   next pod restart, which would have orphaned the volume. Root cause
   of the 2026-05-10 PVC Terminating incident.

Bonus: prometheus_server_pvc threshold was the inverted "90%" (the
same bug the bulk fecfa211 sweep fixed elsewhere; my regex only
matched "80%" so this one slipped through). Now "10%".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
7e69951cb9 state(dbaas): update encrypted state 2026-05-22 14:16:43 +00:00
Viktor Barzin
ee47197f3b vault: enroll audit-vault-0 in pvc-autoresizer (10Gi limit)
audit-vault-0 fills steadily with raft audit logs; without autoresizer
annotations it hits the 2Gi ceiling and Vault stalls on writes
(PVAutoExpanding alert was firing at 81% used). The Vault Helm chart
copies server.auditStorage.annotations onto the PVC at create time.

Live PVC already has the annotations applied via kubectl annotate;
this just keeps TF in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
0fdadcc3dd dbaas: pg-cluster threshold 80%→10% in CNPG inheritedMetadata
Same misconfig as the bulk fecfa211 sweep, but the pg-cluster YAML
is buried inside a null_resource local-exec heredoc so the regex
didn't catch it. CNPG operator inherits these annotations onto each
member PVC (pg-cluster-1, pg-cluster-2), and reapplies them on every
reconcile — patching the live PVCs alone bounces back within seconds.

Live state already patched via kubectl patch cluster, this just keeps
TF in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
3f2b2f9d32 fix: pvc-autoresizer threshold should be 10%, not 80%
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.

Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).
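
To make the free-space semantic concrete, here is a minimal sketch of the
comparison. It is not the autoresizer's code: the function name is made up
and the strict less-than is an assumption about the exact boundary.

```python
def should_expand(used_pct: float, threshold_free_pct: float) -> bool:
    """True when free space has dropped below the threshold percentage."""
    free_pct = 100.0 - used_pct
    return free_pct < threshold_free_pct


# Threshold "80%": a volume that is only 25% used already qualifies.
print(should_expand(used_pct=25.0, threshold_free_pct=80.0))
# Threshold "10%": the same volume is left alone until it is over 90% used.
print(should_expand(used_pct=25.0, threshold_free_pct=10.0))
print(should_expand(used_pct=91.0, threshold_free_pct=10.0))
```

Read this way, the annotation value is "how little free space to tolerate",
not "how full the volume may get".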

The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dc4ce46411 k8s-version-upgrade: detection script refresh apt before madison + DRY_RUN_OVERRIDE
Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while
apt-cache madison (without prior apt-get update) was reporting v1.34.5
— so the CronJob would have dispatched the agent against a stale
target. Now do `sudo apt-get update -qq` for just the kubernetes repo
before querying madison.

Also add a DRY_RUN_OVERRIDE env precedence so future test invocations
can override DRY_RUN without an apply cycle — but Job spec env is
immutable post-create, so this is only useful for CronJob spec edits
(suspend, then add env, then resume). Documented in the runbook.
2026-05-22 14:16:43 +00:00