Commit graph

880 commits

Author SHA1 Message Date
396cce82cf monitoring(wealth): paint declining segments red on portfolio chart
Add a second SQL column on panel 5 that returns net_worth only when the
current point's previous or next neighbor is lower — i.e. the point is
part of a declining segment (including the peak and trough endpoints).
A field override draws this 'decline' series in red with no fill and
spanNulls=false, overlaying the green base line so down periods show
up as red on top of the climb.
2026-05-22 14:16:44 +00:00
Viktor Barzin
a699d5bedf vault: move audit-PVC autoresizer annotations to kubernetes_annotations
Background: 2026-05-10 someone added `server.auditStorage.annotations`
to vault/main.tf attempting to enable pvc-autoresizer on audit-vault-N
PVCs. The vault helm chart maps that block into the StatefulSet's
volumeClaimTemplates, which is immutable post-creation on existing
StatefulSets. Result: 4 consecutive helm upgrade attempts (rev 16-19)
all rejected with "StatefulSet spec: Forbidden", leaving the release
stuck in failed state since 22:47 UTC that day. Live PVCs were
hand-annotated via `kubectl annotate` as a workaround, but the IaC
declared a path that couldn't be applied — every subsequent tg apply
on the vault stack would re-fail.

Fix:
  * Remove `annotations` block from `server.auditStorage` values
    (with a comment recording why it can't live there).
  * Add `kubernetes_annotations` resources for audit-vault-{0,1,2}
    with `force = true`, so Terraform adopts the existing annotations
    and tracks the desired-state in IaC going forward. The autoresizer
    cares about PVC annotations, not StatefulSet template annotations,
    so this is functionally equivalent.

Done out-of-band before commit (helm state was already corrupted):
  `helm rollback vault 15 -n vault` → revision 20 deployed (clean).

Verified: helm status vault = deployed; audit-vault-0 still has
threshold=10% storage_limit=10Gi annotations; cluster healthcheck
no longer reports vault/vault=failed.
2026-05-22 14:16:44 +00:00
Viktor Barzin
2ba36436c8 real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed)
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:

- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
  £1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
  £400k-1.2M

Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.

Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.

Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:44 +00:00
Viktor Barzin
f10784ddb6 infra: document auth = "app|none" tier on every legacy ingress
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.

Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
  - navidrome   (Subsonic user/password)
  - ntfy        (deny-all default + user.db tokens)
  - nextcloud   (WebDAV/CalDAV/CardDAV app passwords)
  - vaultwarden (Bitwarden-compatible token auth)
  - headscale   (OIDC + preauth keys for Tailscale nodes)
  - paperless-ngx (app-layer login + API tokens)

Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).

real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.

No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
2026-05-22 14:16:44 +00:00
Viktor Barzin
20774f794d dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.

dbaas (Tier 0):
  * max_connections: 100 → 200
  * shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
  * effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
  * pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
    + ~16MB work_mem * concurrent sorts + OS cache + overhead)
  * Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
    which rolls the cluster (standby first, then primary failover).

monitoring:
  * New scrape job 'cnpg' on dbaas namespace pods labeled
    cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
    cnpg_cluster + cnpg_role labels for alert grouping.
  * PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
  * PGConnectionsCritical (critical, >95% for 3m) — last call before
    refusing connections.

Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
2026-05-22 14:16:44 +00:00
Viktor Barzin
eb529d60e4 infra/ingress_factory: add auth = "app" mode for self-authed backends
Adds a fourth auth tier alongside required/public/none. "app" is
functionally identical to "none" — no Authentik middleware attached —
but the distinct name records intent at the call site: this backend
has its own user login (NextAuth, Django, OAuth, bearer-token API,
etc.) and Authentik would only break it.

Why the new tier: with only required/none, every "the app has its
own auth so drop Authentik" decision looked identical at the call
site to "this is an OAuth callback / webhook receiver / native-client
API". Future readers couldn't tell whether a stack was intentionally
unauthenticated or relying on backend auth. Now they can.

Migrates the 8 stacks flipped earlier this session (novelapp, immich,
linkwarden, tandoor, freshrss, affine, actualbudget, ebooks/audiobookshelf)
from "none" to "app". Confirmed no-op: `tg plan` on novelapp showed
"No changes" — same middleware chain, same live state.

The variable description and the .claude/CLAUDE.md Auth section now
spell out the anti-exposure rule: only pick "app" or "none" AFTER
verifying the app has its own user auth ("app") or the endpoint is
intentionally public ("none"). Default stays "required" so accidental
omission fails closed.

[ci skip]
2026-05-22 14:16:44 +00:00
root
6b9f5e8027 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:44 +00:00
Viktor Barzin
665b6b2934 actualbudget+monitoring: per-account bank-sync metrics, drop noisy alert
The bank-sync CronJob was posting to /accounts/banksync which fans out to
ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls
per-account per-24h quota, a single rate-limited account would 500 the
whole call, and `bank_sync_success` would flip to 0 even though the data
itself was still flowing through manual UI syncs. Result: BankSyncFailing
fired routinely whenever the user had been active in the UI that day —
a structural false positive.

Fix:
  * CronJob: enumerate accounts via GET /accounts, POST per-account
    /accounts/{id}/banksync, emit bank_sync_account_success and
    bank_sync_account_last_success_timestamp labelled by account name.
    Roll up bank_sync_success = 1 iff any account succeeded.
  * Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale
    at 48h (global drought). Add BankSyncAccountStale at 72h (catches
    single-account auth expiry — the real signal we wanted).

Verified: manual run on bank-sync-viktor pushes 6 per-account success +
timestamp series; roll-up bank_sync_success=1; no firing alerts.
2026-05-22 14:16:44 +00:00
Viktor Barzin
7b6eee49c4 infra: drop Authentik forward-auth from 7 self-authed apps (auth = "none")
Apps with their own user auth + bearer-token APIs were being broken by
Traefik → Authentik forward-auth: every iOS/Android/native client got a
302 to authentik.viktorbarzin.me instead of the JSON they expected.
Authentik's 302+cookie dance can only be followed by a real browser.

Changed:
  - immich         (Immich mobile app + bearer-token /api)
  - linkwarden     (NextAuth + Linkwarden mobile clients)
  - tandoor        (Django auth + Tandoor mobile clients)
  - freshrss       (Fever/GReader API used by Reeder/FeedMe/etc.)
  - affine         (workspace auth + AFFiNE desktop/mobile sync)
  - actualbudget   (server password + Actual mobile/sync clients)
  - ebooks/abs     (Audiobookshelf iOS/Android app)

Each app's own auth is the gate now. CrowdSec + rate-limit + anti-AI
UA filter still front the ingresses. Same pattern as the novelapp
change earlier this session.

[ci skip]
2026-05-22 14:16:44 +00:00
Viktor Barzin
f98c3f2049 infra/novelapp: drop Authentik forward-auth (auth = "none")
novelapp handles its own user auth via NextAuth + Google OAuth, so the
ingress-level Authentik forward-auth was double-gating. Mobile webviews
(iOS/Android) can't follow the Authentik 302/cookie dance — they saw
HTML challenges where they expected JSON. CrowdSec + rate-limit +
anti-AI UA filter remain in front; novelapp's own login handles users.

[ci skip]
2026-05-22 14:16:44 +00:00
root
77492b3131 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:44 +00:00
Viktor Barzin
9be0672aa3 claude-memory / resume: unblock terragrunt apply (var defaults + psql -d postgres)
Two pre-existing apply failures uncovered during the Phase 4 mass apply,
unrelated to the auth refactor but blocking 100% rollout.

claude-memory:
- `var.claude_memory_db_password` had no default and wasn't passed by
  terragrunt → fall back to Vault `secret/claude-memory.db_password` via
  `coalesce(var.x, data.vault.data["db_password"])`.
- db-init Job was failing with `database "root" does not exist` because
  psql defaults the database name to the user when -d is omitted. Added
  `-d postgres` to all five psql invocations.

resume:
- `var.resume_database_url` had no default and wasn't passed → default to
  empty string. Vault carries the real value at `secret/resume.database_url`
  consumed at the deployment env-var level; the variable here just needs
  a value to satisfy the apply.

Also: priority-pass had lost most of its TF state (only 3 of 8 resources
tracked); imported namespace/service/pvc/deployment/ingress/tls-secret to
re-bind state with live K8s resources. No code change needed there.

Verified after re-apply:
- claude-memory.viktorbarzin.me → 200 (auth=none, native MCP responses)
- priority-pass.viktorbarzin.me → 302 → authentik (auth=required)
- resume.viktorbarzin.me → 302 → authentik public outpost (auth=public)
- 6 of 7 previously-failing applies now green; only vault remains, blocked
  by an unrelated helm chart immutable-StatefulSet-field issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:44 +00:00
Viktor Barzin
a168277213 healthcheck: tune noise filters + nvidia-exporter auth=none
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":

1. prometheus_alerts: only count severity=warning|critical. Info-level
   alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
   alert rule itself sets severity; the script should respect it.

2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
   auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
   Lets Encrypt wildcard renews weekly; <14d is the only window where
   human attention is genuinely useful.

3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
   event/image/update domains (transient by design), skip friendly
   names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
   only count entities whose last_changed > 24h. Was 431/1470,
   most of which were "phone in standby" noise.

4. ha_automations: only flag DISABLED automations as abandoned if
   they've also been untouched (last_changed) for >180 days; raise
   stale threshold 30d → 180d. Was flagging seasonal/holiday-only
   automations as broken.

5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
   CronJob retry leftovers (Error/Failed phase pods that K8s keeps
   around for log inspection) aren't problematic at the cluster level.

6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
   shot failures were a recurring false-positive even though the
   service was healthy.

Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
root
8483ca59ba Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:43 +00:00
Viktor Barzin
dc7c19d88e frigate: lan ingress auth=none for HA Sofia integration
The frigate-lan.viktorbarzin.lan ingress had Authentik forward-auth in
front. HA Sofia's frigate integration polls /api/config and only knows
how to use Frigate's own API key (not browser SSO), so every poll got
a 302 to authentik.viktorbarzin.me and the integration entered the
errors-state. Same pattern as idrac-redfish-exporter (5c594291).

allow_local_access_only IP allowlist + Frigate's API key are enough.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dc134011eb fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.

Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.

Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dd2b7de291 fix: HA Sofia REST sensors + PVC drift safety
Two real issues found while triaging HomeAssistantCriticalSensorUnavailable
alerts and the prometheus + technitium PVC Terminating-but-in-use
state from the earlier session.

1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required →
   auth=none. HA Sofia REST sensors scrape these endpoints
   programmatically; with Authentik forward-auth in front, every
   request got a 302 to authentik.viktorbarzin.me and the REST
   sensors parsed the HTML login page instead of metrics — leaving
   the R730, UPS, and ~20 other sensors permanently unavailable.
   The allow_local_access_only IP allowlist (192.168.0.0/16 +
   10.0.0.0/8) already gates external access, so authentik on top
   was breaking machine-to-machine traffic for no security gain.

2. prometheus_server_pvc + technitium primary_config_encrypted:
   add lifecycle.ignore_changes = [spec[0].resources[0].requests].
   The autoresizer expands these PVCs; PVCs can't shrink. Without
   the ignore, every TF apply tried to revert the live size back
   to the TF spec value, hit K8s's shrink-forbidden rule, and
   force-replaced the PVC. Because the pod still mounted it, the
   PVC went into Terminating-but-protected limbo — fine until a
   pod restart would have orphaned the volume. Root cause of the
   2026-05-10 PVC Terminating incident.

Bonus: prometheus_server_pvc threshold was the inverted "90%" (the
same bug the bulk fecfa211 sweep fixed elsewhere; my regex only
matched "80%" so this one slipped through). Now "10%".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
7e69951cb9 state(dbaas): update encrypted state 2026-05-22 14:16:43 +00:00
Viktor Barzin
ee47197f3b vault: enroll audit-vault-0 in pvc-autoresizer (10Gi limit)
audit-vault-0 fills steadily with raft audit logs; without autoresizer
annotations it hits the 2Gi ceiling and Vault stalls on writes
(PVAutoExpanding alert was firing at 81% used). The Vault Helm chart
copies server.auditStorage.annotations onto the PVC at create time.

Live PVC already has the annotations applied via kubectl annotate;
this just keeps TF in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
0fdadcc3dd dbaas: pg-cluster threshold 80%→10% in CNPG inheritedMetadata
Same misconfig as the bulk fecfa211 sweep, but the pg-cluster YAML
is buried inside a null_resource local-exec heredoc so the regex
didn't catch it. CNPG operator inherits these annotations onto each
member PVC (pg-cluster-1, pg-cluster-2), and reapplies them on every
reconcile — patching the live PVCs alone bounces back within seconds.

Live state already patched via kubectl patch cluster, this just keeps
TF in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
3f2b2f9d32 fix: pvc-autoresizer threshold should be 10%, not 80%
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.

Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).

The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
dc4ce46411 k8s-version-upgrade: detection script refresh apt before madison + DRY_RUN_OVERRIDE
Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while
apt-cache madison (without prior apt-get update) was reporting v1.34.5
— so the CronJob would have dispatched the agent against a stale
target. Now do `sudo apt-get update -qq` for just the kubernetes repo
before querying madison.

Also add a DRY_RUN_OVERRIDE env precedence so future test invocations
can override DRY_RUN without an apply cycle — but Job spec env is
immutable post-create, so this is only useful for CronJob spec edits
(suspend, then add env, then resume). Documented in the runbook.
2026-05-22 14:16:43 +00:00
Viktor Barzin
ae6dde45c2 k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC
Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed
PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to
ssh into master and run etcdctl against a non-existent /mnt/main mount.
The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to
10 min, then parses the backup-manage container log for "Backup done"
line + byte count.

Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works
end-to-end at the planning level.

Expanded the claude-agent ServiceAccount's privileges via a sibling
ClusterRole (claude-agent-upgrade-ops):
  - patch namespaces/k8s-upgrade (in-flight annotation)
  - create batch/jobs (trigger etcd snapshot Job)
  - patch nodes (cordon/uncordon)
  - create pods/eviction (drain)
  - delete pods (drain fallback)
2026-05-22 14:16:43 +00:00
Viktor Barzin
e75bcaf394 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-22 14:16:42 +00:00
Viktor Barzin
09f83b4e83 fire-planner / k8s-portal / insta2spotify: revert auth=public to auth=none
The Phase 4 audit promoted three "smoke-test candidates" from `protected = false`
to `auth = "public"`, but all three are XHR / curl-driven endpoints (fetch()
calls, automation scripts) that don't survive the 302+cookie redirect dance
that the public-auto-login flow requires on first visit. fire-planner's SPA
broke immediately — every fetch() to /api/* hit a cross-origin redirect and
CORS preflight rejected it.

Important learning for the `auth = "public"` design:

  `auth = "public"` is functionally equivalent to a normal Authentik forward-auth
  for the FIRST request — it issues a 302 to authentik to set a guest session
  cookie, then 302s back. This is invisible for top-level browser navigation
  but BREAKS:
    - XHR/fetch() under CORS preflight (preflight rejects redirects)
    - curl/automation scripts that don't preserve cookies across requests
    - Mobile / native clients that can't follow OAuth-style redirects

  Use `auth = "public"` only for top-level HTML pages where the user navigates
  via the browser address bar (or links). For XHR APIs, native-client surfaces,
  webhooks, OAuth callbacks — use `auth = "none"`.

The plan's "smoke test 3 candidates" were misjudged on this front. Reverting
all three to `auth = "none"` (their previous behaviour). The end-to-end public
flow IS verified working via curl + flow API — the design is sound, just the
test targets were wrong.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
root
faad99cff3 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:42 +00:00
Viktor Barzin
143413dc0b owntracks: explicit auth = "none" — Phase 5 audit completion
The Phase 4 audit pass missed this site because the previous agent scoped
out owntracks (it overrides the factory's middleware list via
extra_annotations to use its own basic-auth middleware). Adding the explicit
auth = "none" satisfies Phase 5's "every ingress has an explicit decision"
goal and makes the intent visible — mobile OwnTracks clients post location
data via HTTP basic-auth and can't follow Authentik forward-auth 302s.

Closes the loop on Phase 5: 122/122 active ingress_factory call sites now
carry an explicit auth = "..." decision (zero callers rely on the default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
ff5538a667 ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.

ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
  `protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
  middleware → dedicated public outpost → guest auto-bind. Logged-in users
  keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
  client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
  itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
  ingresses don't need anti-AI noise; the auth flow already discourages bots).

Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true`     → `auth = "required"`
- 8 explicit `protected = false`     → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
  reviewed individually:
  * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
    homepage, wrongmove UI, privatebin) → `auth = "none"`
  * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
    handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
    xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
    location ingestion, immich frame kiosk, headscale CP, send anonymous
    drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
    `auth = "none"`
  * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
    UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
  k8s-portal API, insta2spotify callback.

Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.

Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.

Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
88e57fdddb instagram-poster: disable ig-ingest-stories CronJob until /ig-ingest ships
The endpoint exists in the working copy of instagram_poster/app.py
but isn't committed/built/deployed, so every cron fire returned 404
and triggered JobFailed alerts every 30 min.

Set count = 0 to leave the resource declaration in place — re-enable
by removing that line once the endpoint is in a built image.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
fddf168ecb cloudflare: disable AI bot edge-block so x402 can issue payment offers
CF zone was returning 403 to declared AI-bot UAs at the edge
(`ai_bots_protection: "block"`). That meant the in-cluster x402
gateway never saw the request and could never issue an HTTP 402 with
the wallet payment requirements — the bot just bounced.

Adopt `cloudflare_bot_management.zone` via root-module import block,
flip ai_bots_protection to "disabled". Bot Fight Mode (`fight_mode`),
crawler challenge (`crawler_protection`), and managed robots.txt are
unaffected — generic automated traffic still gets the bot fight gate.

End-to-end verified: `User-Agent: Mozilla/5.0 (compatible; ClaudeBot/
1.0;...)` on viktorbarzin.me now returns HTTP 402 (was 403 CF block)
with `payTo=0xCc33...659f`, `amount=10000` micro-USDC, `network=base`.

Trade-off: bots that don't pay still hit origin (instead of CF
blackholing them), so a small bandwidth uptick. Negligible at our
traffic level.
2026-05-22 14:16:42 +00:00
Viktor Barzin
4103ea2ba0 monitoring(prometheus): keep all 4 kubelet_volume_stats_inodes metrics
pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if
all four kubelet_volume_stats metrics (available_bytes, capacity_bytes,
inodes_free, inodes) are retrieved. The keep-list in the
kubernetes-nodes scrape job had available_bytes and capacity_bytes
(post 9d5da4d8) but was missing the two inode metrics, so the
autoresizer's reconcile logged "failed to get volume stats" for every
PVC and never resized anything.

Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free
to the regex.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
6d3308c848 authentik: add public guest auto-login flow + dedicated outpost + traefik public middleware
Phase 1+2 of default-deny ingress plan. Adds the infrastructure for an `auth = "public"`
ingress tier that auto-binds anonymous requests to a `guest` Authentik user (no UI
prompt), so public sites are still recorded as authenticated by Authentik for audit
purposes — but as `guest`, not by leaking the standard catchall flow.

- guest user in `Public Guests` group (NOT `Allow Login Users`).
- `public-auto-login` flow: stage_binding policy sets `pending_user = guest`,
  `evaluate_on_plan = false` + `re_evaluate_policies = true` so flow_plan is
  populated when the policy mutates it; `authentication = none` lets anonymous
  requests enter.
- `Provider for Public` proxy provider (forward_domain, cookie_domain
  viktorbarzin.me) with `authentication_flow = public-auto-login`.
- Dedicated `public` outpost: only the public provider bound, deployed as
  `ak-outpost-public` Deployment+Service in the `authentik` namespace by
  Authentik's K8s controller.
- `public-auth.viktorbarzin.me` ingress exposes the public outpost's
  `/outpost.goauthentik.io/*` so OAuth callbacks land on it (the embedded
  outpost doesn't know about the public provider, so `authentik.viktorbarzin.me`
  callbacks would fail).
- `authentik-forward-auth-public` traefik middleware points at the public
  outpost service (not via the auth-proxy nginx fallback). The plan's
  `?app=public` dispatch idea was tested and rejected — the embedded outpost
  dispatches purely by Host header, so a dedicated outpost was the only way
  to isolate the public flow without conflicts.

No ingresses use the new middleware yet — Phase 3+4 (the ingress_factory
`auth` variable refactor + audit pass) wires it up. This commit is additive
and behaviour-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
ff5416ff40 proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true)
Without this annotation on the StorageClass, pvc-autoresizer's controller
filters the SC out at the index lookup stage and never patches any of its
PVCs, regardless of utilization or per-PVC threshold/increase/storage_limit
annotations. Internal metric pvcautoresizer_loop_seconds_total ticked but
no PVCs were ever evaluated — visible cluster-wide as PVAutoExpanding alerts
firing for forgejo-data-encrypted (82%) and audit-vault-0 (81%) without any
ResizeStarted events ever following.

The Prometheus scrape-config fix in 9d5da4d8 was a prerequisite (autoresizer
reads kubelet_volume_stats_available_bytes) but not sufficient on its own.

Also pinning chart version to 0.5.6 so the next apply doesn't incidentally
bump to 0.5.7.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
ea9b5542d1 x402: flip gateway live with Viktor's wallet + Slack payment notifications
Wires the traefik stack to read two new fields from secret/viktor:
  * x402_wallet_address     -> 0xCc33BD250d39752e0ceaB616f8a05F72274a659f
  * alertmanager_slack_api_url (existing) -> reused as the per-payment
    notification webhook so payment events arrive in the same Slack
    channel as other infra alerts.

Gateway now runs `wallet_set:true, dry_run:false`. Verified end-to-end:
  - Browser UA on all 9 sites -> 200 (passes through to Anubis)
  - python-requests/2.31 + scrapy + ClaudeBot UA -> 402 with
    PaymentRequiredResponse, payTo == Viktor's wallet, amount=10000
    micro-USDC, network=base, asset=Base USDC contract
  - Direct Slack-webhook test from inside cluster -> HTTP 200

Image bumped to forgejo.../x402-gateway:d9b83125 with Slack-format
notification payload (text=..., username=x402-gateway,
icon_emoji=💰; auxiliary fields preserved for richer receivers).

Notifications fire on every successful X-PAYMENT validation; failures
on Slack webhook are logged at WARN, never block the request, never
double-charge the bot.
2026-05-22 14:16:41 +00:00
Viktor Barzin
58789cde8b kured(sentinel-gate): fix auth + write-perm so safety checks actually run
Test 3 validation surfaced two latent bugs in the sentinel-gate
DaemonSet that have been masked since 2026-04-18 (when uu was off,
nothing wrote /var/run/reboot-required, so the gate never had to
fire):

1. automount_service_account_token=false on both the SA and the
   pod spec → kubectl in the script falls back to localhost:8080
   on every call. Each check (`kubectl get nodes`, `kubectl get
   pods -n calico-system`, transition-time read) errors to stderr
   and emits empty stdout. `wc -l` reports 0 → checks "pass" with
   no real data.

2. bitnami/kubectl:latest runs as uid=1001 by default. The hostPath
   /var/run is root:root 0755 → final
   `touch /host/var-run/gated-reboot-required` failed with EACCES.
   Fail-safe by accident — but if anything had ever loosened those
   perms, the broken checks above would have green-lit the gate
   with no real validation.

Fix: enable token mount on the SA + pod, set
securityContext.run_as_user=0 on the container.

Verified post-fix: kubectl returns all 5 nodes, touch succeeds,
sentinel-gate now reports the correct
`BLOCKED: A node transitioned Ready within the last 24 hours
(soak window)` when triggered with k8s-node1's recent reboot
within the cool-down period.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
278ef5f19b monitoring(grafana): swap python3 for jq in folder-ACL local-exec
CI image (ci/Dockerfile) is alpine + jq, no python3. The
grafana_admin_only_folder_acl null_resource was parsing /api/folders
with a python3 oneliner, which crashed every CI apply with
"python3: command not found" and made every monitoring stack apply
fail in CI (worked locally because the dev VM has python3).

jq is already in the CI image and produces the same output.
2026-05-22 14:16:41 +00:00
Viktor Barzin
5c0ea96a91 infra: re-enable unattended-upgrades with kured prometheus-gating
Reverses the March 2026 outage mitigation that disabled unattended-
upgrades cluster-wide. Now re-enables it on the k8s template VM with:

  - Allowed-Origins limited to security/updates pockets
  - Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark
    hold on the cluster-critical components)
  - Automatic-Reboot disabled — kured drives the actual reboots
  - Compatible with the existing kured + sentinel-gate flow

kured side:
  - rebootDelay 30s, concurrency 1
  - Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak
    window from the post-mortem)
  - prometheusUrl + alertFilterRegexp wired so any firing non-ignored
    alert halts the rollout. Ignore-list excludes self-referential
    alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/
    InfoInhibitor) that would otherwise deadlock kured.

Prometheus side (already partly landed in 6c4e0966 — the "Upgrade
Gates" rule group):
  - Refine `KubeQuotaAlmostFull` to include the resourcequota label in
    both the on-clause and the summary, so multi-quota namespaces
    (authentik, beads-server, frigate) report the quota name correctly.

grafana.tf: terraform fmt whitespace only.

Together with the post-mortem 2026-03-22 (memory id=390) the loop is
closed: unattended-upgrades runs again, kernel-class updates can land,
but only when cluster health is green and the reboot window is open.
2026-05-22 14:16:41 +00:00
Viktor Barzin
fe75fad467 monitoring: protect grafana ingress with authentik + disable anonymous
- add traefik-authentik-forward-auth to grafana ingress middleware list
- disable auth.anonymous (was Viewer-by-default for the public)
- enable auth.proxy with X-authentik-username so Authentik users get
  signed in seamlessly (no double-login UX)

Prometheus and Alertmanager already had forward-auth — no change.
2026-05-22 14:16:41 +00:00
Viktor Barzin
6c294d4bb0 authentik: zero-endpoints alert + upgrade-validation checklist
Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on
sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which
is the symptom of the auth-proxy Emergency-Access fallback firing —
in turn caused by zero ready endpoints on the outpost service.

Why this rule and not `kube_endpoint_address_available == 0`:
kube-state-metrics endpoint metrics exist as series names but never
have current values in this Prometheus pipeline (something is dropping
them silently). Detecting the failure at the edge via Traefik is more
reliable than instrumenting the broken middle.

Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex
— the service label is `authentik-ak-outpost-...`, not
`authentik-authentik-outpost-...`, so the alert never matched any
series and never could have fired. Verified in Prometheus before/after
the fix.

Add an "Upgrade Validation Checklist" section to
`.claude/reference/authentik-state.md` with the seven-step smoke test
to run after Authentik chart bumps, provider bumps, or outpost pod
recreation. Covers the brittle surfaces (Service selector, JSON
patches, postgres backend wiring, access_token_validity TTL, edge
auth flow, plan-to-zero).
2026-05-22 14:16:41 +00:00
Viktor Barzin
dc87a9bffe infra/instagram-poster: shared CNPG-backed benchmark DB, no PVC for scores
The instagram_poster.benchmark CLI was writing scores to a sqlite file
on the pod's data PVC. Moving it to the shared CNPG cluster so the
benchmark scoring path is stateless on the pod, scores survive pod
recreation, and the rotation/backup pipeline applies automatically.

- dbaas: null_resource.pg_instagram_poster_db creates role + DB
  (idempotent CREATE IF NOT EXISTS, password placeholder) — same
  shape as pg_postiz_dbs / pg_wealthfolio_sync_db.
- vault: vault_database_secret_backend_static_role.pg_instagram_poster
  + add to allowed_roles. 7d rotation_period.
- instagram-poster: second ExternalSecret (vault-database store) →
  K8s Secret instagram-poster-benchmark-db with BENCHMARK_PG_HOST/
  PORT/USER/PASSWORD/DATABASE. env_from on the deployment.
  reloader.stakater.com/match=true bounces the pod on rotation.

Code-side: instagram_poster/benchmark.py now resolves the DB URL from
BENCHMARK_DB_URL or BENCHMARK_PG_* env vars; falls back to sqlite for
local DevVM scratch runs. Schema bootstraps via Base.metadata.create_all,
no alembic step needed for the benchmark-only side-DB.

Verified end-to-end via DevVM port-forward: ESO synced, K8s Secret has
all 5 fields, pod env shows BENCHMARK_PG_*, smoke-test scoring 3 photos
landed in the new PG table with subject_category populated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
1fcf911269 authentik/pgbouncer: image_pull_policy IfNotPresent -> Always (match live)
The HCL declared `IfNotPresent` since module creation but the live
deployment reconciled to `Always` somewhere along the way (likely a
Helm/operator default). Since the image is `:latest`, `Always` is the
correct value — `IfNotPresent` would skip pulling updated images on
pod restart, defeating the point of the floating tag.

Drops the lone remaining drift in the authentik stack so plan-to-zero
holds across the whole stack, not just the resources I just adopted.
2026-05-22 14:16:41 +00:00
Viktor Barzin
24795ec203 authentik: codify proxy provider TTL + adopt embedded outpost
Bump access_token_validity to weeks=4 (was hours=168, UI-managed in
ignore_changes). Drives the cookie Max-Age and the proxysession.expires
TTL — keeps users logged in for 28d instead of 7d.

Adopt the embedded outpost into Terraform so the postgres-session-backend
fix from earlier today (2026-05-10) is described as code:
  - kubernetes_json_patches.deployment carries dshm 2Gi tmpfs, resource
    requests/limits, the app.kubernetes.io/component=server pod label
    (workaround for goauthentik 2026.2.2 service.py:52 selector mismatch
    on standalone embedded outposts), and AUTHENTIK_POSTGRESQL__* envFrom
    the shared `goauthentik` Secret so the postgres session backend can
    connect to the dbaas cluster.
  - kubernetes_json_patches.service replaces the controller-set selector
    (which targets app.kubernetes.io/name=authentik / the goauthentik-server
    pods) with the outpost's own labels — without this, endpoints are
    empty and auth-proxy falls back to Basic-Auth realm "Emergency Access".

The `managed` field ("goauthentik.io/outposts/embedded") is server-set
and not in the Terraform provider's schema, so TF preserves it across
applies (writes only fields it knows about). Plan-to-zero verified.
2026-05-22 14:16:41 +00:00
Viktor Barzin
6e7fe96a40 infra/llama-cpp: benchmark report + -fa flag fix
Phase 7 of the vision-LLM benchmark plan. Adds:

- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
  per-model analysis, top-N agreement, cost vs cloud APIs, sample
  captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
  100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
  for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
  of the flash-attention flag; without the value llama-server exits
  before serving any request).
- llama-cpp.md architecture doc links the report so future operators
  land on the deployed-and-evaluated model from one entry point.

300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:41 +00:00
Viktor Barzin
3da01e6e1e anubis: only challenge GET requests; allow everything else
PrivateBin's XHR `POST /` (paste creation) was the trigger — Anubis's
catch-all CHALLENGE rule served an HTML challenge page where the JS
expected JSON, breaking paste creation entirely. Same shape will hit
any SPA XHR or CORS preflight on the other 8 Anubis-fronted sites
(homepage actions, kms upload-then-poll, wrongmove search refresh,
jsoncrack share, etc.) the moment it gets exercised.

Add an `ALLOW` rule keyed on `method != "GET"` between the AI/UA-block
imports and the catch-all CHALLENGE. Rationale:

  * AI scrapers consume GET response bodies — they don't POST.
  * State-mutating XHRs and OPTIONS preflight need to bypass the
    challenge or the app breaks.
  * CrowdSec + per-route rate-limit + app-level auth already cover
    abuse on mutating methods, so this gives up nothing.
  * Hard-deny rules for known-bad bots run first, so a declared bad
    bot can't sneak through by sending a POST.

Also added a `checksum/policy` annotation on the Anubis pod template
sourced from `sha256(coalesce(var.policy_yaml, default_policy_yaml))`
so future policy changes auto-roll the deployment instead of needing
a manual `kubectl rollout restart`.

f1-stream had its own policy override (path carve-outs for SvelteKit
asset hashes and JSON data routes); mirrored the new rule there too.

Applied to all 8 Anubis-fronted stacks: blog, kms, f1-stream,
travel_blog, real-estate-crawler, homepage, cyberchef, jsoncrack.
Verified per stack: GET / returns the Anubis challenge page; POST,
PUT, DELETE, OPTIONS pass through to the backend (HTTP 301/405/502
from the upstream app, never the Anubis "not a bot" HTML).
2026-05-22 14:16:40 +00:00
root
ff3d64159a Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:40 +00:00
Viktor Barzin
1f0bd11d3f privatebin: drop Anubis — broke XHR paste creation
PrivateBin's UI POSTs the encrypted blob to `/` via XHR. With Anubis in
front, the catch-all CHALLENGE rule returned an HTML challenge page
where the JS expected JSON, so paste creation failed silently for every
user. The challenge cookie didn't bypass it — Anubis appears to issue a
fresh challenge on POST regardless of cookie state.

Pastes are client-side encrypted; AI scrapers gain nothing from
indexing them, so the default `anti_ai_scraping` middleware is enough
protection. Restoring the ingress to point straight at the privatebin
service. CSP `wasm-unsafe-eval` retained — PrivateBin's zlib.wasm
needs it independent of Anubis.

This matches the rule already documented in infra/.claude/CLAUDE.md:
"DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints — clients
without JS can't solve PoW." A SPA's XHR is the same shape.

Verified: GET / returns PrivateBin HTML (not the Anubis challenge),
POST / returns PrivateBin's own JSON error envelope.
2026-05-22 14:16:40 +00:00
Viktor Barzin
9c617e6d38 infra/llama-cpp: add stack — llama-swap fronting Qwen3-VL + MiniCPM-V
Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three
GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one
OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc.
Idle TTL 10min so models unload between benchmark batches.

Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot
download Job pulls Q4_K_M GGUF + mmproj per model, creates stable
model.gguf / mmproj.gguf symlinks so the llama-swap config is
filename-agnostic, then warms the kernel page cache.

GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml
to 0 during benchmark windows. wait_for_rollout=false so apply
doesn't block on GPU availability.

Initial use case: vision-LLM benchmark for instagram-poster
candidate scoring; future consumers (HA, agentic tooling) hit
the same endpoint via LiteLLM at the gateway.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:40 +00:00
Viktor Barzin
d85b54d89d kms: per-connection state in notifier (vlmcsd is multi-threaded)
Bug found via E2E test against the Windows VM (VMID 300). The single
shared `state` dict in slack-notifier.py worked when vlmcsd processed
one connection at a time, but real Windows KMS activations hold the
connection open ~30 seconds (handshake + keep-alive). During that
window vlmcsd accepts other concurrent connections — most relevantly
the new kubelet TCP readiness probe every 5s — and each new OPEN line
reset the shared state, wiping the in-flight activation's
app/product/host before its CLOSE arrived. Result: real activations
were misclassified as probes (no Slack post, no metric increment).

Fix: state is now a dict keyed by `ip:port` with one sub-dict per
in-flight connection. A `__current` pointer tracks the most recent
OPEN so unkeyed detail lines (Application ID, Workstation name, etc.)
can be attributed correctly — vlmcsd writes detail lines immediately
after the OPEN and before any subsequent OPEN, so the heuristic holds.
Orphan CLOSEs (notifier started mid-conn) are now silently dropped
instead of emitting an empty probe event.

Two new regression tests:
- test_kubelet_probe_during_long_activation: 5s probe interleaved into
  a 31s activation block — exact production failure mode.
- test_orphan_close_no_event: bare CLOSE without prior OPEN.

Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato
on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier
posted to Slack with ip=192.168.1.230 source=external
product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan'
and kms_activations_total{product=Windows 10 Professional,
status=Licensed} 1 — real WAN client IP preserved through the
ETP=Local + dedicated MetalLB IP chain end to end.
2026-05-22 14:16:40 +00:00
Viktor Barzin
4a3ca572e8 fire-planner: imagePullPolicy=Always on alembic-migrate init container
After a rollout-restart, the main container (default Always for :latest)
pulled the new image with alembic 0003, but the init container
defaulted to IfNotPresent and reused a cached old image lacking 0003 →
"Can't locate revision identified by '0003'" → CrashLoopBackOff.

Setting Always on the init container so both containers stay in lockstep
across rollouts. Longer term we should switch the deployment to 8-char
git-SHA tags per the cluster policy in .claude/CLAUDE.md, but this
unblocks the Wave 1 deploy in the meantime.
2026-05-22 14:16:40 +00:00
Viktor Barzin
67b11a964a kms: dedicate MetalLB IP 10.0.20.202 + filter probe noise
Two coupled fixes for the hourly Slack noise + missing client IPs:

1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP
   10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real
   WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips
   kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19.
   Sharing 10.0.20.200 is blocked because all 10 services there are
   ETP=Cluster and MetalLB requires consistent ETP per shared IP.

2. Slack notifier now suppresses Slack posts for bare TCP open/close
   pairs (no Application/Activation block) — these are Uptime Kuma's
   port monitor and the new kubelet readiness/liveness probes. Probe
   counts go to a new metric kms_connection_probes_total{source} where
   source classifies the IP as internal_pod / cluster_node / external.
   Real activations are unaffected.

Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod
Ready on the listener actually being up — required for ETP=Local so
MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving.

pfSense side (applied separately, not codified):
- New alias k8s_kms_lb = 10.0.20.202 (KMS-only)
- WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb
- All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks,
  smtps, etc.) untouched

Runbook updated. Tests added for classify_source / is_probe / process_line.
2026-05-22 14:16:40 +00:00