Compare commits

...

351 commits

Author SHA1 Message Date
Viktor Barzin
1f6facc8e4 Merge forgejo/master — reconcile 18-day divergence with origin
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Origin and forgejo had drifted since 2026-05-05 (merge base b45c45e4).
Each remote was receiving Viktor's commits independently — origin since
2026-05-23 and forgejo from 2026-05-06 to 2026-05-22 14:15. Both had
~30 substantive commits. This merge brings forgejo's work into the
local branch.

13 conflict files resolved as follows (all favoured HEAD = origin/local,
which is newer in every case):

- secrets/{fullchain,privkey}.pem — kept HEAD (renewed 2026-05-24,
  vs forgejo's 2026-05-17 renewal)
- stacks/blog/main.tf — kept HEAD (ingress-www intentionally removed
  today after DNS+monitor cleanup; forgejo had the old block)
- stacks/xray/modules/xray/main.tf — kept HEAD (vless dropped today
  as dead ingress; forgejo had the old 3-port service)
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh — kept HEAD
  (allowlist refactor, master-phase idempotency skip, tigera-operator
  quiesce/restore, IngressTTFBCritical ignore — all newer than forgejo)
- stacks/k8s-version-upgrade/main.tf — kept HEAD (deployments/scale
  RBAC, oldest-kubelet detection — both added 2026-05-23)
- scripts/update_k8s.sh — kept HEAD (--etcd-upgrade=false fallback)
- stacks/llama-cpp/main.tf — kept HEAD (KEEL_LIFECYCLE_V1 ignore_changes
  block added today, commit 0b1282a1)
- stacks/openclaw/main.tf — kept HEAD (nim/meta/llama-3.1-70b primary)
- stacks/trading-bot/main.tf — kept HEAD (claude-haiku-4-5 pin +
  kevin-signal-bridge container)
- stacks/postiz/modules/postiz/main.tf — kept HEAD (memory 2Gi/3Gi
  bump, despite postiz being destroyed today — kept TF intent)
- stacks/nvidia/modules/nvidia/values.yaml — kept HEAD (mem 822Mi)
- stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl —
  kept HEAD (richer alert list + raised StatefulSet `for: 3m`)
- stacks/kyverno/modules/kyverno/security-policies.tf — kept HEAD
  (expanded registry allowlist + comments)
- docs/architecture/security.md — kept HEAD (detailed W1.7 analysis)
- docs/plans/2026-05-21-ha-control-plane-design.md — kept HEAD
  (178-line superset incl. 2026-05-23 deferral rationale)

Auto-merged (no conflict): broker-sync, claude-agent-service,
cloudflared, mailserver, n8n, technitium, traefik, url, proxmox-csi,
xray (deployment portion). Brings in forgejo-only substantive commits:
fire-planner, openclaw v3 flow + recruiter-responder wiring, several
k8s-version-upgrade hardening passes (kill-switch, RecentNodeReboot
ignore, pipefail fixes), HA control plane design, security wave 1
expansion to tier 3+4, alloy file-tail switch, prometheus scrape 2m,
authentik replica cut, forgejo archive disable.

Meta: forgejo and origin drift is a coordination bug. Going forward we
need to either (a) have one CI mirror to the other, or (b) standardize
on one remote. Filed mentally; not addressed in this commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 09:41:36 +00:00
Viktor Barzin
0b1282a13c llama-cpp: ignore_changes for keel/k8s-managed annotations
Every `tg apply` was reverting the annotations that keel patches when it
detects an upstream digest change — `keel.sh/match-tag` (Kyverno-stamped),
`keel.sh/update-time` (on the pod template; what actually triggers the
rollout), plus the K8s-managed `kubernetes.io/change-cause` and
`deployment.kubernetes.io/revision`. The revert forced a rollout, then
the next keel poll re-stamped the annotations, forcing another. With
llama-swap's ~10s cold-load on each pod recreate the user noticed.

Upstream `ghcr.io/mostlygeek/llama-swap:cuda` is a moving nightly tag —
keel still drives one legitimate rollout per day at ~07:25 UTC; this
patch stops the apply-driven extra rollouts on top of that.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 09:01:17 +00:00
67f8be4598 trading-bot: add kevin_signal_bridge container (kill-switch OFF for Phase 1)
5th worker container running in audit-only mode. Writes
kevin_signal_bridge_state rows showing what it WOULD trade but never
publishes to signals:generated. Kill-switch flipped in Phase 2.
2026-05-24 01:22:53 +00:00
Viktor Barzin
6218868ea5 xray: drop dead vless ingress + pin Service target_port
The xray-vless ingress, Service port 6443, and container port 6443 had
no backing listener — xray.config.json only binds 7443 (REALITY), 8443
(WS) and 9443 (XHTTP). The "xray-vless" hostname was returning 502
since the module was created.

Side effect: removing the first Service port slot ("vless"/6443) caused
the kubernetes provider to shift targetPort values on the remaining
two ports (defaulting only worked at create time, not on port removal).
Pinning target_port explicitly makes Service routing deterministic.

End-to-end verified: REALITY via public IP:8080 (pfSense forward 8080
-> 10.0.20.200:7443), WS via Cloudflare, XHTTP via Cloudflare — all
three transports proxied successfully through a test pod, egress IP
correctly resolves to the home WAN.
2026-05-24 01:13:54 +00:00
Viktor Barzin
ae874e028d postiz: bump memory request 512Mi → 2Gi, limit 4Gi → 3Gi (right-size for next deploy)
krr 2026-05-22 flagged postiz-app as critically under-requested when it
was running (gap 2.2 GiB above the 512Mi request). Postiz is currently
uninstalled in the cluster — this change is only for when the stack is
re-deployed later. No apply triggered now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:25 +00:00
Viktor Barzin
b59acbc1db crowdsec/agent: bump memory request 64Mi → 128Mi
krr 2026-05-22 flagged crowdsec-agent DaemonSet (4 pods) as under-
requested by ~588 MiB across the cluster. Live usage around the
80-128 MiB mark for active log parsing — 64 MiB request risked eviction
ahead of more-needed pods. Limit stays at 512 MiB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:16 +00:00
Viktor Barzin
7108843b38 nvidia/driver-daemonset: bump memory request 256Mi → 822Mi
krr 2026-05-22 flagged nvidia-driver-daemonset as critically
under-requested (~566 MiB gap). Live driver process holds ~600-800Mi
once the kernel module is loaded. Limit stays at 2Gi so the DKMS build
during a kernel upgrade still has headroom (documented in values.yaml
to need ~1.4 GiB peak).

May help unblock code-8vr0 (GPU driver crashloop on node1) if the
crashloop was OOM-driven.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:06 +00:00
Viktor Barzin
2711d4af05 monitoring/loki: bump memory request 2Gi → 3Gi (close gap to 4Gi limit)
krr 2026-05-22 flagged loki as under-requested by 1.9 GiB. Live working
set is sitting at ~3 GiB during normal ingestion; the existing 2 GiB
request meant scheduler didn't reserve enough room and the pod risked
eviction. Limit stays at 4 GiB (documented ceiling in loki.yaml).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:55 +00:00
Viktor Barzin
c77984a713 proxmox-csi/node: bump memory request 64Mi → 1Gi (LUKS unlock reservation)
The CSI node plugin's LUKS2 Argon2id key derivation peaks at ~1 GiB
during unlock (memory id=712 + already-documented in the limits=1280Mi).
Request was 64 MiB — meaning the unlock burst ran "best-effort", first in
line for OOM under node pressure. krr 2026-05-22 flagged this as a top
under-request. Bumping request matches the documented requirement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
Viktor Barzin
467460cccd k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check
The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical
because of NAS-side latency that is unrelated to k8s upgrades. The chain
was halting indefinitely waiting for it to clear. Add it alongside
RecentNodeReboot to the per-call ignore regex so the chain can proceed
autonomously without manual silences.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
Viktor Barzin
447bfef507 blog: remove www.viktorbarzin.me ingress
The www subdomain was internal-only (no Cloudflare DNS record) but the
external uptime-kuma monitor still flagged it as down because public DNS
resolution failed. Removing the ingress along with the Technitium CNAME
makes the failure mode disappear and lets the cluster reach an
autonomous-clean state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
root
10ac174627 Woodpecker CI Update TLS Certificates Commit 2026-05-24 00:03:48 +00:00
Viktor Barzin
b4aa8eaf58 technitium: cut memory — primary 2Gi → 1Gi, secondary+tertiary 2Gi → 512Mi
Right-sizing per krr report (2026-05-22). Zone data is ~43 MiB; the rest
was cache headroom. Primary keeps more (1 GiB) since it owns authoritative
zones; replicas get 512 MiB. DNS sanity-checked across CoreDNS and the
MetalLB external IP (10.0.20.201) post-rollout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:03:51 +00:00
Viktor Barzin
931d7b6c9d claude-agent-service: cut memory request 2Gi → 1Gi (limit 4Gi → 2Gi)
Right-sizing per krr report (2026-05-22). Kept Burstable QoS (limit > request)
so an active agent run still has 2 GiB headroom — krr's 100 MiB recommendation
was measured idle and is not safe for an active job.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:03:42 +00:00
Viktor Barzin
d76f4c4827 n8n: cut memory request 1Gi → 512Mi (+ image bump 1.80.0 → 1.80.5)
Right-sizing per krr report (2026-05-22). Image bump syncs main.tf with
the live Keel-managed version to avoid an inadvertent downgrade on apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:03:28 +00:00
Viktor Barzin
17c1ef73be url/shlink: cut memory request 960Mi → 512Mi
Right-sizing per krr report (2026-05-22, memory id=2431-2438). Live pod
working set is ~80 MiB; 512Mi leaves comfortable headroom for the
Symfony+RoadRunner footprint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:02:45 +00:00
Viktor Barzin
02ea5da8dc k8s-version-upgrade: skip phase_master/phase_worker if node already on target
The chain wasn't idempotent — re-running on a partially-upgraded cluster
would re-drain + re-kubeadm + re-apt an already-upgraded node, causing
unnecessary disruption (5-10 min per no-op node) and risking alert
re-fires during the unnecessary drain.

Today's chain hit this twice: after fixing the version-detection bug
(commit a0f3e155), the chain correctly resumed but re-did master AND
node4 even though both were already on v1.34.8. node4 got cordoned,
drained, and is now soaking for 10 min for no reason.

Fix: at the top of phase_master and phase_worker, read the node's
current kubelet version. If it equals TARGET_VERSION, skip the whole
phase (return 0 — spawn_next will fire downstream). Chain advances
without disturbing the already-upgraded node.

In-flight effect: the current node4 worker pod has the old script
mounted from configmap snapshot, so it'll continue. If it fails and
retries, the new pod will see node4 on v1.34.8 and short-circuit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:53:57 +00:00
Viktor Barzin
a0f3e15562 k8s-version-upgrade: version-check uses oldest kubelet, not master
Previous version-check read RUNNING from .items[0].nodeInfo.kubeletVersion
— which is just k8s-master. If master is upgraded but workers aren't
(e.g. a chain that completed master phase but failed mid-worker), the
version-check sees v1.34.8 and decides "no upgrade needed", never
spawning the resume phase. Workers stay behind forever.

Today's chain hit exactly this: master + node4 upgraded to v1.34.8,
worker-node4 Failed mid-soak (alert sensitivity, since loosened),
chain dead. Re-triggering the version-check looked at master only,
decided cluster was "done", and refused to resume worker chain.

Fix: read all node kubelet versions, sort -V, take head -1 (oldest).
A partial chain now correctly reports the un-upgraded version and the
chain resumes.

Trivial change; tested live — chain now correctly reports v1.34.7
(workers' version) and spawns preflight → master → worker chain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:48:50 +00:00
Viktor Barzin
68f8514e61 monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression)
Earlier in this session, commit 503ac4c1 brought the for: from 5m → 2m
based on a brief I wrote inaccurately. The brief said the alert "fires
immediately" but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it — opposite of what we wanted.

10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.

node4 worker phase tripped this alert mid-soak today, aborted the
chain (Job retry), succeeded on the 2nd attempt only because alerts
didn't re-fire fast enough. With 10m the next workers shouldn't need
the retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:32:41 +00:00
Viktor Barzin
503ac4c192 monitoring: tune 4 alerts for transient drain/upgrade blips
Today's worker-phase rolling upgrade tripped MysqlStandaloneDown,
MetalLBSpeakerDown, KubeletRunningContainersDrop, and
IngressErrorRate5xxHigh even though every affected workload
recovered within 30-60s. Loosen `for:` (and one threshold) on each so
they only fire on persistent faults, not on routine drain+kubelet-
restart cycles.

- MysqlStandaloneDown: for 2m -> 3m (single-replica StatefulSet,
  drain re-scheduling routinely takes 1-3m).
- MetalLBSpeakerDown: for 5m -> 2m (kubelet restart drops the
  speaker pod for 30-45s; 2m suppresses that blip).
- KubeletRunningContainersDrop: absolute `< -10` threshold replaced
  with relative `< -0.5` (>50% drop vs. 10m ago); routine drains
  routinely shed 10-30 containers and tripped the old rule.
- IngressErrorRate5xxHigh: for 5m -> 10m (rolling pod migrations
  cause brief 5xx spikes that clear in 1-2m).

Severity, labels, and annotation structure preserved; only `for:`
durations and the one expression changed. Tactical loosening of
four specific alerts -- broader observability audit tracked
separately in beads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:28:53 +00:00
Viktor Barzin
ad9f6c8f41 k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)
Refactored halt_on_alert_query from denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
  - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
  - IngressTTFBHigh (Traefik latency, transient)
  - NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
  - RecentNodeReboot (chain causes this itself)

severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).

Tested end-to-end this session — master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state-repair (apiserver
manifest had been pinned at v1.34.2 from a previous month's rollback;
required a one-time manual edit + kubelet reload to bring back to v1.34.7,
after which the chain ran cleanly).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:14:39 +00:00
0025511b6a docs: Technitium DNS IP — 10.0.20.101 → 10.0.20.201
Stragglers from the same drift as commit b288a59 (monorepo) / the
2026-05-22 viktorbarzin.me apex incident — the `.101` references were
left over from the NodePort exposure era. Technitium's actual MetalLB LB
IP is `.201` (in pool 10.0.20.200-220).

- architecture/vpn.md — Technitium component cell + AdGuard forwarder
  example + nslookup troubleshooting hint
- architecture/networking.md — 502 ingress troubleshooting snippet
- plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers
  example
2026-05-23 08:53:52 +00:00
Viktor Barzin
68a503e29f kyverno: allowlist woodpeckerci/* for CI step pods
Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is
used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and
build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker
pipelines have been failing at the git step since the Audit→Enforce flip
on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes
not building).

Added between viren070/* and zelest/* in the same DockerHub-user-repos
block as the 2026-05-22 batch (commit 2d35d72a).

Closes: code-da4h
2026-05-23 08:52:48 +00:00
000d306542 technitium: add viktorbarzin.me apex DNS drift probe + alerts
Every internal *.viktorbarzin.me hostname (~80 services) chains through the
split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP
rollover, accidental edit), every internal service breaks at once — the
2026-05-22 ha-sofia incident was exactly this.

This adds a backstop probe so the next drift surfaces in <10 min instead
of via user-reported outage:

- CronJob `viktorbarzin-apex-probe` in `technitium` namespace, every 5 min,
  resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201)
  and pushes `viktorbarzin_apex_correct` + `_last_correct_timestamp` to
  Pushgateway. Python+dnspython, ~30 LOC.

- 3 Prometheus alerts:
  - `ViktorBarzinApexDrift` (critical, 10m) — apex resolved to anything
    other than 10.0.20.200.
  - `ViktorBarzinApexProbeStale` (warning, 5m on 15m gap) — probe stopped
    succeeding.
  - `ViktorBarzinApexProbeNeverRun` (warning, 30m absent) — probe never
    reported.

- Added the new alert names to the Slack receiver matcher in both routes
  alongside EmailRoundtrip*.

Verified: rules loaded as inactive (apex is correct), metric flowing, manual
probe job pass observed.
2026-05-23 08:41:14 +00:00
Viktor Barzin
4713c3a6d9 k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore
Three changes unblocking the autonomous chain for k8s patch upgrades:

1. **phase_master quiesces tigera-operator before drain, restores after.**
   Tigera crashes immediately if apiserver is unreachable (no retry logic)
   and crashlooping it during master static-pod swaps generates ~500MB/s
   disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
   limit. Quiesce removes the storm contributor; calico data plane keeps
   running unchanged (data plane is the DaemonSet+Typha, operator is just
   the reconciler).

2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
   For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
   writes an identical manifest, hash doesn't update, watch times out and
   rolls back forever. The skip-etcd retry sidesteps it for the legitimate
   no-change case while still doing a full etcd upgrade on the first
   attempt (correct for minor-version bumps).

3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
   Both are symptoms-not-causes: ingress latency spikes briefly during any
   pod-restart wave; high IOwait is exactly what upgrade activity causes
   (chicken-and-egg). The inline quiet-baseline check (Ready transition
   <10min) is the real cluster-churn gate.

RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.

These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:40:11 +00:00
6f4a569d1c traefik: bump auth-proxy nginx header buffers to handle Authentik cookie pile
Browsers accumulate one authentik_proxy_<random> cookie per Authentik
Proxy Provider under viktorbarzin.me (Path=/). With 30+ services the
combined Cookie header exceeds nginx's default 4 x 8k
large_client_header_buffers and trips '431 Request Header Fields Too
Large' at the forward-auth nginx (traefik/auth-proxy).

Bumped to:
  client_header_buffer_size 8k
  large_client_header_buffers 8 64k

Matches the pattern used on the London Flint 2 router nginx
(memory id=647).
2026-05-23 08:34:33 +00:00
Viktor Barzin
7f63d35d0a docs/plans: HA control plane — design + plan + deferral
Investigated, designed, and planned the 3-master HA control plane
migration triggered by 2026-05-21's autonomous k8s upgrade cascade.

Locked 14 design decisions across two passes:
- 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc)
- 4 challenger-pass amendments (cloud-init template bump, rbac stack
  multi-master refactor, HTTPS /readyz health check, expanded blast
  radius to include /home/wizard/code/infra/config root kubeconfig,
  config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector,
  k8s-version-upgrade chain extension as Phase 7)

Plan covers 11 phases end-to-end including panic-mode rollback.

DEFERRED before execution. PVE host is 98% RAM-committed
(262 GB allocated / 267 GB physical, 1.5 GB swap active); the
planned 3 x 32 GB masters would push allocation to 326 GB and OOM
the host. k8s-master currently uses only 4.6 GB of its 32 GB
allocation (5-6x oversized).

Revisit triggers documented in design doc:
1. Second PVE host added → hardware HA becomes possible.
2. Right-sizing pass OR planning masters at 16 GB each.
3. Cumulative manual upgrade nursing > ~10h.

Standalone candidate worth lifting independently: Phase 1.5's
rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning
to loop over k8s_master_hosts list) — future-proofs the cluster
without committing to the HA migration.

Refs: code-n0ow (open, deferred via bd note).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:32:15 +00:00
70a334e431 trading-bot: pin Meet Kevin LLM model to claude-haiku-4-5
Sonnet-4-5 trips Anthropic per-account rate_limit_error on the OAuth
bearer (sk-ant-oat01) tokens after 5-10 burst calls — sticky multi-hour
quota. Haiku-4-5 has much higher RPM and processes the 16-video
backfill cleanly (~30s/video with inter-call throttle).

Comment above the env line documents the rationale for future re-evaluation.
2026-05-22 20:43:05 +00:00
5258f09230 mailserver: decommission SendGrid
Remove leftover SendGrid references after the Brevo migration was completed:

- Delete TF `cloudflare_record.mail_domainkey` (TXT at `s1._domainkey`,
  SendGrid-era DKIM, hidden behind the SendGrid CNAME but would re-emerge
  once the CNAME is removed).
- Clean up commented-out `smtp.sendgrid.net` relayhost references and the
  `# For sendgrid` comment on `sasl_passwd` in the mailserver module.

DNS records deleted out-of-band (not TF-managed):
- CF: `s1._domainkey CNAME` + `s2._domainkey CNAME` → sendgrid.net (manual entries)
- Technitium internal `viktorbarzin.me`: `em7107`, `s1._domainkey`,
  `s2._domainkey` CNAMEs → sendgrid.net

Verified end-to-end mail flow unaffected (Brevo outbound + IMAP receive,
roundtrip 20.4s — identical to baseline). Active DKIM (`mail._domainkey`
local + `brevo1/brevo2._domainkey` Brevo) untouched.
2026-05-22 20:08:38 +00:00
Viktor Barzin
b233aba710 openclaw: switch primary to nim/meta/llama-3.1-70b-instruct
Auth audit on 2026-05-22 — all the broken paths and the one that works:

- openai-codex OAuth: EXPIRED (ChatGPT Plus, ancaelena98@gmail.com)
- secret/openclaw → openai_api_key (sk-svcacct): insufficient_quota
- openrouter_api_key: "Key limit exceeded (total limit)"
- llama_api_key: region-blocked
- anthropic_api_key: sk-ant-oat-… (OAuth refresh token, not a real
  x-api-key — won't auth via x-api-key header)
- nvidia_api_key (NIM): WORKS. The key was already baked into the
  openclaw.json providers.nim.apiKey from secret/openclaw → nvidia_api_key.

Two NIM models verified end-to-end (call from inside openclaw pod
with tool-call schema, both returned proper {tool_calls:[…]} JSON):
- meta/llama-3.1-70b-instruct      — 0.58s, primary
- meta/llama-4-maverick-17b-128e   — 16s, smarter, fallback

Fallback chain: maverick → openai-codex (auto-promotes once re-authed)
→ modelrelay/auto-fastest (last resort, hallucinates instead of
tool-calling, but at least responds).

Models registered in both `agents.defaults.models` (allowlist) and
`models.providers.nim.models` (capability declarations) so the agent
sees them as available tools. Startup `models set` updated to pin
the new primary across `doctor --fix` runs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:23:17 +00:00
3962513036 security(wave1): W1.7 analysis snapshot — observation data → allowlist plan
First analysis pass over Calico GNP wave1-egress-observe-tier34 data captured
in Loki since 2026-05-19. Pulled ~10000 flow log lines covering 36 source
namespaces (of 82 selected by tier 3+4). Analysis script outputs preserved
on the dev host at /tmp/{analyze_flows2,build_allowlist}.py.

## Findings

**Universal baseline (every observed ns):**
- DNS to kube-system/kube-dns UDP/53
- Often mysql.dbaas TCP/3306 or pg.dbaas TCP/5432
- Often redis.redis TCP/6379

**Rollout tiering by egress fan-out:**
- Tier A (recruiter-responder only): 2 destinations, ideal pilot
- Tier B (29 namespaces): ≤3 external IPs, ≤5 internal — batch rollout
- Tier C (4 namespaces: f1-stream/openclaw/woodpecker/status-page):
  needs per-IP investigation
- Tier D (servarr): 130+ external IPs (BitTorrent P2P) — keep Log+Allow
  permanently or move to dedicated egress proxy

## Caveats blocking immediate enforce
- Observation horizon too short: ~6h dense data, ~24h total. Need ≥7 days
  to catch weekly CronJobs, Vault token rotations, Keel pulls.
- External IPs are dynamic (Cloudflare/AWS rotate). Static IP allowlists
  will break — need DNS-based selectors or CIDR ranges.
- Some intra-namespace traffic bypasses the Calico filter chain.

## Recommended next steps
1. Continue observation through 2026-05-29 (full week). Compare destination
   set day-over-day; if stable, allowlist is ready.
2. First enforce: recruiter-responder (allowlist = kube-dns + telegram CIDR
   + vault/ESO service IPs).
3. Tier B phased rollout at 3-5 ns/day after pilot proves out.

Full analysis: docs/architecture/wave1-egress-observation-2026-05-22.md
Tracked under beads code-8ywc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:22:25 +00:00
2d35d72a53 kyverno(wave1): add 7 missing registries to trusted-registries allowlist
Discovered via W1.5 enforcement when querying live cluster state:
PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook,
hermes-agent, netbox, whisper/piper) trying to admit images from registries
not in the original enumeration.

Added entries:
- amruthpillai/*       (resume — reactive-resume)
- athomasson2/*        (ebook2audiobook)
- netboxcommunity/*    (netbox)
- nousresearch/*       (hermes-agent)
- opentripplanner/*    (osm-routing)
- rhasspy/*            (whisper, piper)
- registry.viktorbarzin.me/*  (legacy private registry — council-complaints
                                still references; should migrate to forgejo)

The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07
per CLAUDE.md but council-complaints still uses it — separate cleanup task.

## Verification
- kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha,
  same as 2026-05-18 inject-keel-annotations)
- Dry-run admission of previously-blocked images now PASS:
  - netboxcommunity/netbox:v4.5.0-beta1 ✓
  - rhasspy/wyoming-whisper:3.1.0 ✓
  - registry.viktorbarzin.me/council-complaints:1c56f8f ✓
- Policy still in Enforce mode

## Observation status (W1.6)
- Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected
- Loki `{job="node-journal"} |~ "calico-packet"` returns ~5000 lines/hour
- No errors from observation infrastructure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:17:16 +00:00
Viktor Barzin
c11ac7d486 cnpg: bump webhook-cert renewal threshold 7d -> 30d
Root cause of the recurring 'cnpg-webhook-cert' TLS expiry warn:

CNPG default 'expiringCheckThreshold = 7' means the operator only
regenerates the self-signed webhook cert when remaining lifetime drops
BELOW 7 days. Our cluster-health check #22 alerts at <30d. Result:
~23 days of WARN before CNPG would even attempt rotation.

Set EXPIRING_CHECK_THRESHOLD=30 via the chart's config.data map so the
operator now regenerates with 30d buffer, aligning with our monitoring
threshold. Cert lifetime stays at chart default 90d.

Verified after apply: operator runtime config shows
'expiringCheckThreshold:30'. Companion in-session action: deleted the
existing soon-to-expire secret and bounced the operator to force an
immediate fresh 90-day cert (notBefore=May 22, notAfter=Aug 20).
2026-05-22 15:00:41 +00:00
Viktor Barzin
96f9db0b13 state(cnpg): update encrypted state 2026-05-22 15:00:04 +00:00
6367b783c7 broker-sync(imap): fix command name + add fsGroup for sync.db writes
Two latent issues found while diagnosing why the May 2026 META vest
didn't land:

1. broker-sync-imap CronJob's command was 'broker-sync imap', but the
   actual CLI subcommand is 'imap-ingest'. Every scheduled run had
   been failing with 'No such command imap' since day-one.

2. Pod runs as uid=10001 gid=999; PVC /data dir is mode 2775
   group=10001. Without fsGroup in the pod's securityContext the
   pod gets only 'other' (r-x) perms on the dir, so sqlite3 can't
   create journal/WAL files next to sync.db -- hits
   'attempt to write a readonly database'. fsGroup=10001 adds the
   matching gid to the pod's supplemental groups so writes work.

Schwab email-sender regex fix is in broker-sync@d860aef.
2026-05-22 14:41:54 +00:00
Viktor Barzin
fa536cc08b ci: retry after Keel rollout cascade settled 2026-05-22 14:41:54 +00:00
28db8fc9d4 fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard
Operational layer for the new col_snapshot cache shipped in
fire-planner@e72fd22:

stacks/fire-planner:
- fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows
  age toward the 1-year TTL boundary (within 7 days). Calls
  python -m fire_planner col-refresh-stale, upserts via cache.upsert.

monitoring/dashboards/cost-of-living.json (Finance folder):
- Two template variables: $city (single-select from col_snapshot),
  $baseline_city (for COL ratio computation, defaults London).
- Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded).
- All-cities ranked table with gradient-gauged total + colored ratio.
- Cache-freshness table flags rows approaching TTL expiry.

Initial population needs a one-shot: post-Keel-rollout,
  kubectl -n fire-planner exec deploy/fire-planner -- \\
    python -m fire_planner col-seed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:15:38 +00:00
Viktor Barzin
c1cb22896a openclaw: revert model swap + document codex re-auth path
The previous commit promoted modelrelay/auto-fastest to primary as a
workaround for the expired openai-codex OAuth token. But modelrelay
routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash)
that hallucinate answers instead of using ssh / curl / etc. — exactly
what the v4 learning loop is supposed to leverage.

Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the
only mini variant the Codex backend accepts for ChatGPT Plus tier),
and inline the re-auth command in the model-block comment so future
sessions know exactly what to do when the OAuth token expires:

  kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \
    -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
    -c openclaw -- node /app/openclaw.mjs models auth login \
    --provider openai-codex

modelrelay/auto-fastest stays in the fallback chain so the agent
remains partially usable while the token is expired.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:12:30 +00:00
Viktor Barzin
247afdb220 cluster-health skill: document tightened #43 thermal threshold (65 C) 2026-05-22 14:09:12 +00:00
Viktor Barzin
4830230984 cluster-health #43: tighten PVE thermal threshold to 65 C
Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a
signal a VM/workload is using too much CPU and warrants investigation.

Previous thresholds were calibrated to the hardware's TjMax (75/83 C) —
that was too lax, since cluster-load-driven elevation arrives a long
time before throttling. The 65 C cutoff matches the live Prometheus
baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the
session-observed correlation: above 65 C means the cluster is doing
sustained work that should be looked at, even if hardware is still
nowhere near its limit.

Updated:
  PASS  < 65 C   (within 55-65 baseline)
  WARN  65-82 C  (elevated; check top kvm processes for the culprit)
  FAIL  >= 83 C  (at/above TjMax — throttling imminent)

Verified live: 67 C now WARN (was PASS under the 75 C threshold).
2026-05-22 14:09:08 +00:00
Viktor Barzin
282d7f6182 openclaw: engrain the learning loop at the identity level
User feedback: "this should work for any task, not just calendar.
this learning flow must be strongly engrained to ensure openclaw
gets better over time."

The v3 rules were buried at the bottom of TOOLS.md and only stated
in workflow language. Three changes to make the rule unavoidable:

1. **SOUL.md** — new marker-delimited section "Learning is your
   identity" inserted before ## Boundaries. AGENTS.md tells the
   agent to read SOUL.md first every session, so this is now the
   FIRST thing the agent loads about itself. Frames learning as
   character, not procedure.

2. **TOOLS.md v4** — section moved from the END of the file to
   right after the `# TOOLS.md` title (first substantive content
   on file load). Title strengthened: "THE FLOW — run this on
   EVERY task. Not just hard ones." Concrete examples explicitly
   call out diverse domains (calendar, frigate restart, disk
   usage, inbox summary, deploys) so the universality is
   unmistakable.

3. **learn-from-tasks skill** — opens with "This is universal.
   EVERY task runs through this flow — not just hard ones, not
   just unfamiliar ones. The save at the end is mandatory."

The actual flow (know → ask devvm → save) is unchanged. What
changed is salience: the rule is now the first thing the agent
encounters in three independent surfaces, with stronger framing
that makes "skipping the save" feel like a violation of identity
rather than a missed optimisation.

Marker bumped v3 → v4. Stripper handles v1-v9 idempotently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 13:18:52 +00:00
66ca8b9e9c trading-bot: revive K8s stack + add meet-kevin-watcher
Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.

Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
  root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
  news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
  viktorbarzin/trading-bot-service:latest, command
  python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
  TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
  secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
  (poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices

vault: re-enable pg-trading static role

- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
  (was disabled 2026-04-06 with the rest of trading-bot stack)

kyverno: add postgres* to trusted-registries allowlist

- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
  alpine*, nginx*, python* which were already there)

Final workers Pod containers (in order):
  [0] signal-generator
  [1] learning-engine
  [2] market-data
  [3] meet-kevin-watcher (NEW)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:23:30 +00:00
Viktor Barzin
60d8d54d6e openclaw: v3 flow — know → ask devvm → (rarely) try yourself
Refines the devvm-fallback into an explicit triage flow that the
agent runs on every task. The default path is to ASK devvm-claude
when uncertain — don't brute-force. Most tasks are solvable there.

## The flow

1. Do I KNOW how? Check `memory_recall` and INDEX.md.
2. If not, SSH devvm and ask claude — and crucially, ask it to
   share the steps + credentials needed so I can do it on my own
   next time. Save the answer in openclaw memory.
3. (RARE) If devvm-claude says no, try in-pod. Most likely fail —
   that's OK.

## Storage moved to memory-indexed location

Learnings now live under
`/workspace/memory/projects/openclaw-learned/` (was
`/workspace/learned/`) so memory-core indexes them and
`memory_recall` surfaces them. Layout:

- `scripts/<task>.md`       runnable recipes
- `knowledge/<topic>.md`    decisions, paths, gotchas
- `credentials/<name>.md`   **POINTERS to Vault, never values**

## Credentials = Vault pointers only

Previous v2 design saved cred values to plaintext NFS files. v3
flips to pointer-only: cred file documents the Vault path + fetch
command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the
consumer, and rotation expectations. The secret stays in Vault.

## Init container also migrates

Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3,
moves any files from the legacy `/workspace/learned/` tree into
the new location, removes the empty legacy dir. User edits
outside the markers always survive.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:20:54 +00:00
Viktor Barzin
8e5d682707 openclaw: explicit "use devvm + learn" default behaviour
Refine the init container's devvm-fallback seeding so the OpenClaw
agent treats devvm as its DEFAULT teacher and saves recipes locally
to become independent over time:

1. TOOLS.md v2 section now has two emphatic CRITICAL rules:
   - "TRY DEVVM before giving up" — when stuck, ssh devvm before
     telling the user "I can't do that".
   - "After every task, introspect → save a faster way" — for any
     non-trivial task (especially recurring ones), save the recipe
     to /workspace/learned/ and update INDEX.md.

2. New cc-skill `learn-from-tasks` at
   /home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises
   both triggers: (A) you're stuck → check INDEX → ask devvm → save;
   (B) you just finished → introspect → save if recurring.

3. /workspace/learned/ scaffold: INDEX.md table-of-contents +
   scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks
   INDEX.md BEFORE reaching for devvm, so saved recipes are
   findable on the next run.

4. Marker migration: strips both v1 and v2 markers before re-inserting
   so user edits outside the markers always survive future restarts.

Security caveat documented inline: credentials in
/workspace/learned/credentials/ are NFS plaintext — acceptable for
home-lab personal scope, NOT for anything more sensitive than what
`ssh devvm` already gives the pod (wizard's access).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:12:33 +00:00
Viktor Barzin
ccf1ccdd1d openclaw: also write devvm section to /workspace/TOOLS.md
The OpenClaw agent reads TOOLS.md on every session per AGENTS.md
("environment-specific notes"), but it does NOT auto-search the
memory-core index for "devvm" before answering. Result: the agent
said "I don't have access to the devvm" even though ssh + the
openclaw-task wrapper were fully wired up (verified e2e in
9ad52dfd).

Updated init 6 (seed-devvm-memory-note) to ALSO append a
marker-delimited section to /workspace/TOOLS.md describing the
devvm SSH capability + openclaw-task usage. Idempotent: strips
any prior v1 section before re-inserting, so user edits outside
the markers survive future pod restarts.

The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md
memory note stays in place — it's still indexed by memory-core
and surfaces for memory_recall queries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:50:42 +00:00
Viktor Barzin
9ad52dfd61 openclaw: SSH + tmux task fallback to devvm
Give the OpenClaw pod two new capabilities:

1. Host-tools bundle. New init container `install-host-tools` extracts
   openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
   friends into /tools/host-tools/, with the bookworm-slim libs the
   binaries need. PATH + LD_LIBRARY_PATH on the main container point
   ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
   marker; smoke test (ldd-based) fails the init at deploy time if any
   binary has unresolved deps. Bundle is ~558 MB on the existing
   /srv/nfs/openclaw/tools NFS.

2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
   id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
   container startup symlinks /home/node/.ssh → there. New
   /usr/local/bin/openclaw-task wrapper on devvm manages long-running
   work as tmux sessions on devvm (sessions and logs survive pod
   restarts — they live on devvm, not in the pod). New init container
   `seed-devvm-memory-note` drops a markdown note teaching the pattern;
   main container startup now runs `openclaw memory index --force` so
   the note is searchable on first boot.

Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.

Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:20:00 +00:00
Viktor Barzin
c7b0ebf6a5 state(vault): update encrypted state 2026-05-22 10:04:55 +00:00
Viktor Barzin
8228171104 cluster-health: add checks 43 + 44 (PVE host thermals + load)
Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL
via the standard healthcheck output + JSON. They run alongside the
existing 42 checks and surface the same alerts the 2026-05-20/21
optimization session had to gather by hand.

#43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps
  Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip.
  Thresholds tuned to the live TjMax=83 / Tcrit=93:
    PASS  < 75 °C package
    WARN  75-82 °C  (approaching max, action time)
    FAIL  >= 83 °C  (at/above TjMax, throttling imminent)
  Reports hottest core label too so a single hot core doesn't hide in
  the package average.

#44 PVE Host Load — load avg vs 44-thread capacity
  Reads /proc/loadavg, compares 5-min to thread count (44):
    PASS  load_5 < 30   (< 70% threads busy)
    WARN  30-37         (oversubscribed but not saturating)
    FAIL  >= 38         (~85%+ threads busy — scheduler saturation)
  Uses 5-min so brief work spikes don't false-fail.

Both gracefully WARN-degrade if SSH BatchMode fails, matching the
existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped
42 -> 44 and the dispatcher updated.
2026-05-22 09:55:11 +00:00
Viktor Barzin
1b21d4819e postiz: disable unused providers + pin temporal vs Keel force-policy
Two changes in one commit because they are coupled — the DISABLED_PROVIDERS
addition cannot land safely without the Keel exclusion on temporal:

1. Add DISABLED_PROVIDERS env on postiz Helm chart. Live DB audit showed
   only 'instagram-standalone' connected; all other Postiz providers
   were idle-polling Temporal task queues. List excludes x, linkedin,
   reddit, threads, youtube, tiktok, pinterest, dribbble, slack,
   discord, mastodon, bluesky, lemmy, warpcast, vk, beehiiv, telegram,
   wordpress, nostr, farcaster. Keeps facebook + instagram + the
   standalone variant active.

2. temporal deployment needs keel.sh/policy=never (set live via kubectl
   annotate). Keel was rolling temporalio/auto-setup 1.28.1 -> 0.20.0
   on every helm reconcile because :0.20.0 is published in the same
   registry path but is a DIFFERENT (legacy Cassandra-based) image
   stream. Memory id 1933 trap; new variant captured in id 2315-2319.

   The annotation is set live (not in TF) because the existing TF block
   has lifecycle.ignore_changes = [keel.sh/policy] so the chart
   reconcile won't reset it. Long-term fix: add temporal to the
   Kyverno keel-mutate-existing exclude list so it survives a
   namespace re-label.
2026-05-21 10:04:22 +00:00
Viktor Barzin
533a89a010 docs: HA control plane design (3 masters)
Captures today's k8s-upgrade-pipeline session findings — root cause
of repeated upgrade failures is the single-master apiserver outage
window cascading into operator crashloops + storm I/O. HA control
plane with 3 masters + apiserver LB removes the cascade entirely.

Tracked in beads code-n0ow. Plan doc to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:41:20 +00:00
Viktor Barzin
2dc7e001bd k8s-version-upgrade: retry kubeadm apply on static-pod-hash timeout
kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap
to be picked up by the kubelet (it polls the pod's
`kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted
master with apiserver-to-kubelet status sync lagging, that 5min isn't
enough — kubeadm declares the upgrade failed and rolls back.

The thing is: the etcd container HAS already been swapped to the new
image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0
when this fires). kubeadm's check is just slow to notice. The 2nd
attempt sees etcd already on target, skips it, and proceeds cleanly.

Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between.
Worker phase doesn't need this — `kubeadm upgrade node` has no
static-pod-hash waits.

Today's autonomous-pipeline session: master phase Failed at 5m on
attempt #1 with this exact error, retried, hit same timeout, gave up
(backoffLimit=1). The wrapper turns this from a fatal pipeline halt
into a "wait a bit, try again" that usually completes on attempt #2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:32:29 +00:00
Viktor Barzin
fc0510aa67 k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window
Three changes from today's autonomous-pipeline validation session:

1. **Kill-switch ConfigMap** — chain checks for `k8s-upgrade-killswitch`
   ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the
   start of version-check. Existence halts the chain (exit 0) with a Slack
   message. Single-command emergency stop:
       kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
           --from-literal=reason="storm response"
   Resume:  kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch
   Role rule for `configmaps` get/list/watch added (resourceName-scoped).

2. **Ignore RecentNodeReboot in halt_on_alert_query everywhere** — the
   chain itself causes reboots. The pre-drain master check, post-upgrade
   worker check, postflight check, and preflight halt-on-alert all now
   pass `RecentNodeReboot` as the extra-ignore. Previously only worker
   phase's post-upgrade gate did this. Master Failed silently this morning
   on the pre-drain check after my own master reboot.

3. **Preflight quiet-baseline 3600s → 600s** — the 1h cooldown after any
   Ready transition meant the chain refused to run for an hour after
   every kured reboot. 10 min is enough for kubelet/control-plane to
   settle; the 24h-between-cluster-reboots invariant lives in
   kured-sentinel-gate, not here.

Validated by running the chain end-to-end: preflight passed in 5s,
master phase now in drain. Today's storm post-mortem (snapshot CoW
amplification + tigera-operator crashloop feedback loop) drove the
kill-switch design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:23:41 +00:00
Viktor Barzin
944cf51f6b authentik: worker replicas 3 -> 2
Workers handle background tasks only (LDAP sync, email, certificate
renewal) — no user-facing traffic, so 2-of-3 redundancy isn't load-
bearing. Reduces sustained CPU by ~100m.

Server replicas unchanged at 3 (PDB minAvailable=2 — user-facing).
PgBouncer pool unchanged at 3 (DB connection pooling).
2026-05-21 09:14:35 +00:00
Viktor Barzin
8c87b77f1b forgejo: disable source archive ZIP/TAR downloads
Bot crawlers were hitting /<owner>/<repo>/archive/<sha>.zip on the
dot_files repo (vim-plugin source trees) — each request synthesised a
fresh ZIP from git history, taking 9.9s and returning 500 under
sustained load. Cost: ~440m sustained forgejo CPU.

Toggle: FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES=true.
/archive/* URLs now 404; git clone / OCI registry / API unaffected.

Measured: forgejo pod 440-573m -> 60m steady-state (~85% drop).

(Pod rollout took ~7min on the new RS due to kubelet's recursive
chown of the 2700+ files in the data PVC — fsGroupChangePolicy is
unset and defaults to Always; could be set to OnRootMismatch later.)
2026-05-21 09:12:20 +00:00
Viktor Barzin
af6aa18b25 monitoring: prometheus global scrape 1m -> 2m + UPS pinned 30s
Halves sample volume on all default-scrape jobs (cAdvisor, node-exporter,
service-endpoints, etc.). Memory id 559's earlier scrape-2m tuning was
applied live but not codified — this restores the Helm template.

Companion changes to keep alerting fidelity:
- evaluation_interval kept at 1m (alerts evaluate every minute)
- snmp-ups job pinned to scrape_interval=30s so PowerOutage /
  LowUPSBattery detect within ~30s instead of 2m
- 3 alerts bumped from for:1m to for:3m (HighGPUTemp, LowUPSBattery,
  PowerOutage) for stability above the new 2m global cadence

Other jobs that already had per-job overrides (snmp-idrac 1m,
redfish-idrac 3m, kubernetes-pods 5m, kubernetes-services 5m) unaffected.

Expected: 50-150m sustained CPU saving on Prometheus + apiserver.
Verification ongoing — apiserver settles ~minutes after Prometheus
config reload due to initial-target-scrape burst.
2026-05-21 08:32:57 +00:00
Viktor Barzin
aba061cf2e alloy: switch pod log shipping from apiserver to file-tail
Replaced 'loki.source.kubernetes' with 'loki.source.file' in alloy DS
config. discovery.relabel.pod_logs already sets __path__ to the kubelet
log path (/var/log/pods/*<uid>/<container>/*.log) and varlog host-mount
was already present, so this is a one-line swap.

Why: apiserver was burning ~700m sustained on 'CONNECT pods/log' streams
(13 req/s, ~2200 sec/s of long-lived TCP connections). Streaming pod
logs through the apiserver instead of tailing kubelet's log files was
the dominant residual cost after the recent Loki/Alloy onboarding.

Measured before/after:
- Alloy DS: ~620m total (5 x ~125m) -> ~92m total (5 x ~18m)
- kube-apiserver: peak 1959m midnight burst, settled 632m

(Stuck-pod recovery: alloy-7zg7t on k8s-master needed --force delete
during rollout — FailedKillPod 'unable to signal init: permission denied'
on runc, transient runtime issue, unrelated to this change.)
2026-05-21 08:27:34 +00:00
Viktor Barzin
b6724a5d48 vault: add pg-matrix + pg-technitium static roles to allowed_roles
Both static-roles existed in Vault state (created out-of-band) but
were missing from the postgresql connection's allowed_roles list. Vault
was logging 'is not an allowed role' rotation errors every 10s for both,
sustained CPU waste ~40-70m.

Adopted both via 'import {}' (import blocks removed after first apply
per the canonical adoption pattern).

- pg-matrix: username=matrix, rotation_period=86400 (1d)
- pg-technitium: username=technitium, rotation_period=604800 (7d)

Verified: 'is not an allowed role' errors stopped in vault-0 logs
immediately after apply.
2026-05-21 08:11:11 +00:00
Viktor Barzin
9247a68514 state(vault): update encrypted state 2026-05-21 08:09:11 +00:00
Viktor Barzin
926d507313 k8s-version-upgrade: grant get/list on apps resources for drain
kubectl drain --ignore-daemonsets needs to GET each pod's owner
reference (DaemonSet/StatefulSet/ReplicaSet/Deployment) to classify
which pods can be drained vs ignored. Without these RBAC verbs, drain
bails with 'cannot delete daemonsets ... is forbidden' for every
daemonset-managed pod on the node.
2026-05-21 08:07:29 +00:00
Viktor Barzin
1617285d23 infra: add kubectl + authentik providers across 6 stacks
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.

- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
  workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
  manage their own Authentik objects
2026-05-21 08:07:22 +00:00
c09230815c openclaw: enable recruiter-api plugin (allowlist + manifest contracts)
Plugin needs three things to load under OpenClaw 2026.5.x:
1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the
   ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin'
   in the startup command after doctor runs).
2. 'openclaw plugins enable recruiter-api' to flip its registry entry.
3. manifest declares contracts.tools (added in recruiter-responder commit
   83ffd9fa).

Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the
plugin's polling loop knows which Telegram chat to deliver into.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:56:11 +00:00
57ab903a0c recruiter-responder: deploy d7892396 — OpenClaw-driven flow
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:14:11 +00:00
18928eb8ac recruiter-responder + openclaw: wire gpt-mini secret keys + VIKTOR_CHAT_ID
recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL
(NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and
not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID
env consumed by the recruiter-api plugin's announcement loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:10:56 +00:00
Viktor Barzin
0c8b46df55 k8s-version-upgrade: fix two more grep-pipefail bugs
Same `grep -v` / `set -o pipefail` interaction as commit 10b261d2,
in two more callsites the previous fix didn't cover:

  Line 354 (phase_master): control-plane Running check —
    `grep -v Running | wc -l` returns 1 when all pods are Running
    (the happy path), aborting the chain right after master upgrades.

  Line 419 (phase_postflight): on-target node check —
    `grep -v ":v$TARGET_VERSION$" | wc -l` returns 1 when all nodes
    are on the target version (the happy path, exactly when postflight
    should succeed). Aborts at the moment of victory.

Forensics on yesterday's master Job failure (see commit message of
10b261d2 for context): the master Job spawned 16s after the previous
fix's TF apply, before configmap propagation completed on the kubelet.
With those two latent bugs also looming, the chain would have died
post-master-upgrade and again at postflight even if propagation had
been timely.

Wrapping each grep in `{ ... || true; }` so a no-matches result
returns success.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 20:59:10 +00:00
Viktor Barzin
10b261d2db k8s-version-upgrade: fix pipefail abort when no alerts are firing
halt_on_alert_query() ends with `grep -vE "$regex" | sort -u`. When
zero alerts are firing (the desired healthy state), grep matches
nothing and exits 1. Under `set -o pipefail`, the whole pipeline
returns 1; under `set -e`, the caller's `alerts=$(...)` assignment
fails and aborts the script in ~1s with no diagnostic output.

The chain effectively required at least one non-meta alert to be
firing to make any forward progress. Today (2026-05-19) the cluster
is fully clean post-MySQL recovery, the daily 12:00 UTC detection
spawned the preflight Job, and it died instantly — blocking the
1.34.7 → 1.34.8 patch chain.

Fix: wrap the grep in `{ ... || true; }` so a no-matches result
returns success. Preflight verified end-to-end after the fix — the
chain is now in flight (preflight ✓, master phase running).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:19:06 +00:00
f5917f0eb3 security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces)
## Change
- Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with
  kubectl_manifest.wave1_egress_observe_tier34
- namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'`
  to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux)
- Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted
  (apply_only=true means TF rename does NOT destroy the live old resource;
  cleanup done manually)
- Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan
  (cluster infra + GPU workloads, deferred)

## Verification (live cluster, 2026-05-19)
- 82 namespaces match `tier in (3-edge,4-aux)`
- Felix translated the new policy into iptables LOG rule in cali-po-* chain
- LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata
  from multiple namespaces with distinct destinations:
  - east-west pod-to-pod (10.10.108.48, 10.10.122.131)
  - in-cluster service VIP (10.96.0.10 — kube-dns)
  - external (149.154.166.110 — Telegram API from recruiter-responder)

## W1.7 next step (calendar-bound, ~1 week)
- Let observation run for ~1 week
- Aggregate distinct destinations per namespace via LogQL
- Build per-namespace egress allowlist module `tier3_egress_baseline`
- Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`
- Phased per-namespace as originally planned

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:14:16 +00:00
e9054e6b1b security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder
Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico
Enterprise-only field, rejected by OSS v3.26) with the supported primitive:
Calico GlobalNetworkPolicy with `action: Log`.

## Mechanics (verified end-to-end on 2026-05-19)
1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder`
   with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`,
   `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in
   `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing
   SRC/DST/PROTO/PORT for every NEW egress connection.

## Verified output sample
`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132
DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`

The Allow rule in the GNP keeps egress functional (recruiter-responder
remained 1/1 Running through the apply — verified Python TCP connections to
1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).

## Wave 1 status
W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7
remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"`
samples, build empirical egress allowlist, flip the GNP rules from
`[Log, Allow]` to `[Allow <specific dests>, Deny]`.

Expand observation to additional namespaces by adding entries to
`spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:10:42 +00:00
Viktor Barzin
7a1751a668 upgrade-state: filter transient registry digest-check errors
Keel polls ~175 image manifests hourly against public registries.
Transient i/o timeouts and registry 5xx responses are inherent at
that scale and auto-recover on the next poll, but they were tripping
the Apps row into ⚠ attn — pure noise.

Extend benign_re to cover:
  - failed to check digest + (i/o timeout | connection refused
    | connection reset | context deadline exceeded | TLS handshake
    timeout | no such host | EOF)
  - failed to check digest + non-successful response (status=5xx)

Real actionable digest-check failures (HTTP 401 auth, 404 removed
tag) still surface. Persistent registry-side 5xx is owned by the
registry's own monitoring (forgejo-integrity-probe +
RegistryCatalogInaccessible), not by Keel logs.

Tested locally: Apps row flips from ⚠ attn → ✓ healthy after the
filter is in place; remaining errors-line drops to "(none in last
24h)".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:06:21 +00:00
Viktor Barzin
359b0277f8 dbaas: opt MySQL out of Keel + add do-not-bump warning
Two changes to make the 8.4.8 pin durable:

1. Add `keel.sh/policy: never` annotation on the mysql-standalone
   StatefulSet. The dbaas namespace was already excluded from the
   Kyverno mutate, but the StatefulSet carried orphan Keel annotations
   (force/poll/match-tag) from an earlier policy version that lacked
   the exclusion list. Keel kept watching :8.4.8 for digest changes.
   Now explicitly opted out; Keel logged "image no longer tracked".

2. Expand the inline comment to a banner pointing at the upgrade plan
   docs and the gating beads task. Anyone touching this line sees the
   warning + the path to do it right.

Closes the loop on the 2026-05-18 outage. Real upgrade tracked in
code-963q + docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:21:03 +00:00
Viktor Barzin
0fab599dbc state(dbaas): update encrypted state 2026-05-19 13:20:39 +00:00
Viktor Barzin
9fd54143c2 docs: design + plan for MySQL 8.4.8 → 8.4.9 upgrade
Captures the wipe+reinit strategy (sidestep the broken DD upgrade
path), the IO config bump (innodb_io_capacity 100→2000), root-cause
analysis with explicit uncertainty, verification gates, and rollback.

Not scheduled yet. Tracked in beads code-963q.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:10:00 +00:00
669ba97078 security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on
  k8s-master node. Verified alloy-7zg7t scheduled on master, tailing
  /var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
  (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist
  (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
  100.64-127 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query
  {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.

## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with explicit allowlist
  built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
  names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)

## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding FelixConfiguration with flowLogsFileEnabled=true via
  kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field
  spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf
  comment block: GNP with action=Log (iptables NFLOG → journal), Cilium
  migration, eBPF tooling, or Tigera Operator adoption.

## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
  W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
  prior session before today's apply)

## Cleanup
- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed
  their job in the 2026-05-18 apply; should not stay in tree per TF docs)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 06:37:54 +00:00
8ef4f06ac0 recruiter-responder: bump image to 444fa58c (header CRLF fix)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 22:55:09 +00:00
Viktor Barzin
6024cfb410 docs: update MySQL restore runbook + CLAUDE.md after 8.4.9 recovery
Runbook rewritten for the standalone setup (InnoDB Cluster gone since
2026-04-16) and now covers the full disaster-recovery flow we just
executed: stop pod, wipe PVC (incl. PV reclaim-policy flip from Retain
→ Delete), re-apply TF, restore via in-namespace Job, drop+create
static users with fresh Vault passwords, restart dependents.

CLAUDE.md MySQL row notes the 8.4.8 pin + links the runbook.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 22:51:52 +00:00
Viktor Barzin
ea475c3d86 dbaas: pin MySQL to 8.4.8, recover from broken 8.4.9 DD upgrade
The mysql:8.4 floating tag let Keel auto-bump to 8.4.9, whose
data-dictionary upgrade got stuck mid-flight on every attempt
(no progress, no CPU, never completing). Pinning to 8.4.8 +
restoring from the 2026-05-18 00:30 UTC mysqldump puts us back
on a known-good binary.

Closes: code-eme8
Closes: code-k40p

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 22:46:54 +00:00
Viktor Barzin
a04bf3a7f3 state(dbaas): update encrypted state 2026-05-18 22:31:52 +00:00
90e074a4a2 kyverno(wave1): swap kubernetes_manifest → kubectl_manifest + flip 3 security policies to Enforce
## Resolves code-e2dp (Kyverno TF apply blocked)
Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of
kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large
CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which
treats manifests as opaque YAML strings.

## Migration mechanics
- Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all
  stacks get it generated in providers.tf.
- stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source
  declaration (required for kubectl_manifest in a child module).
- Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest
  with yaml_body = yamlencode({...}). depends_on chains preserved.
- terraform state rm for all 17 old kubernetes_manifest entries.
- stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each
  kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID.
- One resource (policy_inject_keel_annotations) needed kubectl delete + recreate
  because the kubectl provider couldn't patch it cleanly (resourceVersion=0
  invalid for update — gotcha when adopting a resource previously
  kubernetes_manifest-owned).

## W1.4 — security policies Audit → Enforce (LIVE)
Three policies flipped: deny-privileged-containers, deny-host-namespaces,
restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved.

## Shared exclude list (35 namespaces)
local.security_policy_exclude_namespaces in security-policies.tf.
- 31 critical from memory id=1970 (Keel rollout list)
- + frigate (camera HW transcoding needs host access)
- + kured (privileged DaemonSet for node reboots)
- + default (etcd backup/defrag CronJobs use hostNetwork)
- + changedetection (uses SYS_ADMIN for chromium sandbox)

## W1.5 — require-trusted-registries stays Audit
Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply
chain. Tracked under beads code-8ywc as follow-up.

## TF import-blocks
The imports.tf file should be removed in a follow-up cleanup commit once
verified — TF doesn't auto-clean these.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes: code-e2dp
2026-05-18 20:10:27 +00:00
0560d81f3a monitoring(wave1): re-enable Loki+Alloy, deploy wave1 alert rules, add #security Slack lane
## Loki + Alloy re-enabled (code-146x)
- Uncommented helm_release.loki, helm_release.alloy, kubernetes_daemon_set_v1.sysctl-inotify,
  kubernetes_config_map.loki_alert_rules, kubernetes_config_map.grafana_loki_datasource
- Reverses the documented "operational overhead vs benefit after node2 incident"
  decision. Re-evaluated because wave 1 security detection layer (beads code-8ywc)
  needs Loki + ruler + alert routing.
- SingleBinary mode, 2-4Gi mem, 50Gi proxmox-lvm PVC, 30-day retention, ruler enabled
  pointed at prometheus-alertmanager.monitoring.svc:9093
- Alloy DaemonSet (4 pods on worker nodes) discovers pod logs via K8s API + pushes
  to Loki
- Loki canaries running (4)
- Vault audit-tail sidecar logs now flowing to Loki: queried
  {namespace="vault",container="audit-tail"} returns live audit JSON

## Wave 1 alert rules deployed (W1.3 partial)
Added "Security Wave 1" rule group to loki_alert_rules configmap:
- V1: VaultRootTokenCreated — auth/token/create with policies=[root]
- V2: VaultAuditDeviceModified — sys/audit/* create/delete/update
- V3: VaultSealChanged — sys/seal update
- V4: VaultPolicyModified — sys/policies/acl/* create/update/delete
- V5: VaultAuthFailureSpike — >10 permission denied/min
- V7: VaultViktorFromUnexpectedIP — auth as me@viktorbarzin.me from non-allowlist source IP
  (allowlist: 10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet)
- S1: PVEsshLoginFromUnexpectedIP — sshd "Accepted" from non-allowlist IP (rule defined,
  fires once promtail/Alloy ships sshd journal with job=sshd-pve)

Verified rules visible via /loki/api/v1/rules. K2-K9 (K8s API audit) deferred to W1.1
which needs the audit policy + apiserver log shipping codified.

## #security Slack lane (Alertmanager)
- New `slack-security` receiver in prometheus_chart_values.tpl, channel #security
- Higher-priority route at top of routes list: matchers `lane = security` →
  slack-security, continue: false (so wave 1 alerts never fall through to #alerts)
- Slack message format includes summary + description + runbook link annotation
- All wave 1 rules set `lane = "security"` label

## Resource summary
- 6 added: helm_release.loki, helm_release.alloy, kubernetes_config_map.grafana_loki_datasource,
  kubernetes_config_map.loki_alert_rules, kubernetes_daemon_set_v1.sysctl-inotify,
  + 1 other
- 5 changed: helm_release.prometheus (alertmanager config — new receiver + route),
  4 deployments (image tag drift from Keel-managed images, unrelated)
- 1 destroyed: null_resource grafana_admin_only_folder_acl["Finance (Personal)"]
  (timestamp-triggered always recreates — not destructive)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes: code-146x
2026-05-18 19:51:57 +00:00
Viktor Barzin
34c1d64a88 upgrade-state: suppress known-benign Keel slack-bot-not-configured noise
Keel 1.2.0 registers a Slack socket-mode bot whenever SLACK_BOT_TOKEN is
set, then fails because we don't supply an `xapp-` app-level token:

    bot.slack.Configure(): SLACK_APP_TOKEN must have the prefix "xapp-".
    bot.Run(): can not get configuration for bot [slack]

We don't want the interactive bot — opt-out auto-update + no approval flow
(see stacks/keel/main.tf comment). The Slack NOTIFICATION sender works
independently and continues posting rollout messages to #general fine.

But /upgrade-state's broad `grep level=error` was counting these as real
errors → ⚠ on the Apps row every run. Add a small skip-pattern list so the
two recurring benign lines drop out; any new genuine Keel error still
shows. Reuses `bot.Run()` + `SLACK_APP_TOKEN must have the prev?if|prefix`
(typo in Keel's actual log message preserved as alternation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:45:40 +00:00
82fedf1336 security(wave1): Vault audit-tail sidecar (live) + doc reality-check
## Vault audit-tail sidecar (APPLIED + VERIFIED)
- Added `audit-tail` extraContainer to vault helm chart values: busybox:1.37 with
  `tail -F /vault/audit/vault-audit.log`. Reads the audit PVC (`audit` volume
  from the chart's auditStorage), emits JSON audit events to stdout. kubelet
  captures the stdout; once Loki+Alloy are deployed (blocked on code-146x),
  these logs flow automatically to Loki with `container="audit-tail"`.
- Resources: 5m CPU / 16Mi mem request, 32Mi limit. PVC mount is readOnly.
- Applied via `tg apply -target=helm_release.vault`. All 3 vault pods rolled
  cleanly (OnDelete strategy, manual one-at-a-time, auto-unseal each ~10s).
- Verified: `kubectl logs -n vault vault-2 -c audit-tail` shows live JSON
  audit lines from ESO token issuance, KV reads, etc.

## Doc reality-check
While verifying logs reached Loki, discovered Loki is NOT actually deployed.
`stacks/monitoring/modules/monitoring/loki.tf` defines `helm_release.loki` but
has a self-referencing `depends_on = [helm_release.loki]` that prevented apply.
No `loki` Helm release in the cluster, no Loki pods, no Loki Service. The
monitoring.md "Loki: deployed" claim was aspirational.

- security.md W1.2 row: PENDING → PARTIAL (sidecar live, shipping blocked on
  code-146x)
- security.md W1.3 row: gated on code-146x added
- monitoring.md Loki row: marked NOT DEPLOYED with cross-ref to code-146x

## New beads task
- code-146x P1 — Loki + log shipper missing. Lists the helm_release self-depends_on bug,
  investigation paths, and revised wave 1 sequencing (Loki/Alloy is prereq 0).

## Wave 1 status update
- W1.2: Vault audit device + XFF + audit-tail sidecar all LIVE; Loki shipping blocked on code-146x
- W1.1, W1.3, W1.6, W1.7: still not started (W1.6 also blocked on code-3ad Calico Installation CR)
- W1.4, W1.5: code committed, blocked on code-e2dp (Kyverno provider crash)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:37:36 +00:00
Viktor Barzin
3af3f0507b state(vault): update encrypted state 2026-05-18 19:33:17 +00:00
f30c141270 security(wave1): W1.2 Vault XFF (applied) + W1.4/W1.5 Kyverno code prep (apply blocked on provider crash)
## W1.2 — Vault audit device + X-Forwarded-For (APPLIED + VERIFIED)
- Added `x_forwarded_for_authorized_addrs = "10.10.0.0/16"` to vault listener config.
  Trust X-Forwarded-For from in-cluster sources (pod CIDR). Without this, every
  vault audit log entry shows Traefik's pod IP instead of the real client IP —
  the V7 alert rule (Viktor identity from non-allowlist source IP) needs the
  real client IP to be meaningful.
- Applied via `tg apply -target=helm_release.vault` (vault stack has pre-existing
  for_each unknown issues unrelated to this change; -target documented in error
  message itself as the workaround).
- Rolling restart of vault-{0,1,2} performed manually (StatefulSet uses OnDelete
  update strategy, not RollingUpdate). All 3 pods rejoined Raft + auto-unsealed
  within ~10s each. Verified XFF config visible in pod's
  /vault/config/extraconfig-from-values.hcl.
- The `vault_audit "file"` resource was already in TF at line 287 (writing to
  /vault/audit/vault-audit.log) — no change needed.

## W1.4 + W1.5 — Kyverno enforce flip (CODE ONLY, apply BLOCKED)
- Added shared `local.security_policy_exclude_namespaces` (31 critical namespaces
  from memory id=1970 + `frigate, kured, default, changedetection` discovered
  during the live-cluster pre-flight check for privileged/hostNetwork/SYS_ADMIN
  pods that would be blocked by Enforce).
- Flipped 3 security policies Audit → Enforce: deny-privileged-containers,
  deny-host-namespaces, restrict-sys-admin. failurePolicy=Ignore preserved at
  chart level.
- `require-trusted-registries` STAYS in Audit mode pending allowlist tightening
  (current pattern includes `*/*` which matches anything-with-a-slash, so Enforce
  would be a no-op for supply chain). Tracked under beads `code-8ywc` W1.5.

**Apply blocker**: `tg plan` panics with `terraform-provider-kubernetes_v3.1.0`
crash on the kubernetes_manifest resources (`ElementKeyInt(0): can't use
tftypes.Object...` — provider schema mismatch on Kyverno CRDs). The crash
reproduces on the UNMODIFIED file, so it's a pre-existing provider issue, not
caused by these changes. Resolving it requires either upgrading the provider or
finding a kubernetes_manifest-compatible workaround. Tracked under `code-8ywc`.

## Wave 1 status after this commit
- W1.2: APPLIED + VERIFIED (vault XFF + audit device already in place)
- W1.4 + W1.5: code ready, apply blocked on provider crash
- W1.1, W1.3, W1.6, W1.7: not started in this session

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:26:39 +00:00
b85a5ecebd monitoring(wealth): drop 6y timeFrom override on META vest cadence 2026-05-18 19:25:29 +00:00
Viktor Barzin
0eb5c8c292 state(vault): update encrypted state 2026-05-18 19:17:04 +00:00
Viktor Barzin
4f64f51bba realestate-crawler: dockerhub pull-secret + lift image-pin on ui/api
Companion to the GHA migration in immovika/realestate-crawler@c2acbf5.

Apps row of /upgrade-state was flagging ⚠ because Keel poll on the four
Deployments returned 401 — DockerHub repo viktorbarzin/realestatecrawler
is private, the Deployments had no imagePullSecrets, and Keel's poll-secret
discovery list came up empty. Pods kept running only because the image
landed in containerd cache months ago.

Adds:
- ExternalSecret `dockerhub-pull-secret` synced from Vault
  secret/viktor.dockerhub_registry_password. ESO template renders the
  dockerconfigjson server-side (Sprig b64enc) so the PAT never sits in
  cleartext in any K8s manifest.
- image_pull_secrets { name = "dockerhub-pull-secret" } on all 4
  Deployments (ui, api, celery, celery-beat).
- Lifts `ignore_changes=[container[0].image]` on ui+api so TF re-asserts
  :latest. CI no longer patches the image to a numeric tag — Keel now
  drives rollouts from digest changes on :latest.

Live state after apply: all 4 Deployments on :latest with
imagePullSecrets=dockerhub-pull-secret; ExternalSecret SecretSynced=True.
Once a GHA build pushes a new digest, Keel will roll all four within ~1h.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:11:43 +00:00
0a4ce162f8 monitoring(wealth): keep only FIFO-realized PNL table; pair Positions + vest-cadence side-by-side
- Removed panel 27 (META RSU vest value over time) — superseded by
  vest-cadence chart which carries the same value signal plus the
  share-count overlay.
- Removed panel 28 (per-vest value at vest vs today) — duplicative with
  panel 31's FIFO realized PNL.
- Removed panel 29 (per-sell realized PNL) — same data as panel 31,
  just rolled up by sell date instead of vest date.
- Resized panel 26 (Positions) to w=12 and moved panel 30
  (META vest cadence) to (y=32, x=12, w=12) so they sit side-by-side
  next to the Positions table.
- Moved panel 31 (FIFO realized PNL) to y=118, where the deleted RSU
  chart used to live.
2026-05-18 19:11:13 +00:00
01de3babd6 docs(security): wave 1 plan — Kyverno enforce, NetworkPolicy egress, audit logging, source-IP anomaly
Locked design for wave 1 of cluster security hardening. Plan only — implementation lives in beads
code-8ywc and follow-up commits. Captures:

- security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies
  with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the
  K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7,
  S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy
  Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4.
- monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1)
  and the Loki ruler → Alertmanager → #security routing path.
- runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action
  steps, false-positive triage, and SEV1 escalation.
- .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity
  allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy,
  rationale for not adopting canary tokens.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:10:16 +00:00
20018cd9b4 monitoring(wealth): per-vest realized PNL via FIFO sell-match
New table panel below the per-sell breakdown. For each vest, FIFO-match
its shares against the subsequent sells (shares from earlier vests get
sold first), and aggregate the matched portions:

  realized_pnl = SUM(matched_qty * (sell_price - vest_price))
  pnl_pct      = realized_pnl / SUM(matched_qty * vest_price) * 100
  days_held    = AVG(sell_date - vest_date) per matched portion

Footer reducer sums shares, vest value, sell value, and realized PNL
so the bottom row is the full-portfolio realized take.
2026-05-18 18:43:46 +00:00
14befe3998 monitoring(wealth): META vest cadence chart — value vs shares (dual axis)
Per-vest event line chart. Left Y axis (blue): vest value at the
time = SUM(quantity * unit_price), in USD. Right Y axis (orange):
number of shares vested. One point per vest date (aggregated when
multiple BUY rows share a date, e.g. 2021-05-18 was 18 + 2 shares).

Lets Viktor see how vest sizes ramped (initial 18 shares -> 38 ->
60s) and how the per-vest USD value tracked META's price ride
across 2020-2026. timeFrom='6y' override pins the panel to the full
vesting window.
2026-05-18 18:38:36 +00:00
3d5841a776 monitoring(wealth): META vest + sell PNL tables with FIFO cost basis
Two new bottom-of-dashboard tables:

Panel 28 'META vests — value at vest vs today': one row per BUY
activity. Shows vest-day price * shares + what those same shares
would be worth at today's META quote, plus the hypo P&L if Viktor
had held everything (color-text on the gain columns).

Panel 29 'META sells — realized PNL vs if held until today':
one row per SELL with FIFO-matched cost basis (LEAST/GREATEST
overlap in cumulative-share space). Shows realized P&L, the
counterfactual P&L had he held until today, and the
'missed by' delta = (today_price - sell_price) * shares.

Both pull today_price dynamically from quote_latest via a CTE so
they self-update as Yahoo updates the META quote. Schwab account
is empty so no live activity is expected.
2026-05-18 18:36:00 +00:00
37c7668181 monitoring(wealth): pin META RSU panel to 6y window
Dashboard default time range is now-180d, but the META vesting + sell
arc spans 2020-11 → 2026-02. With the default window the panel just
showed a flat line at $64 (the empty post-sell residual). timeFrom='6y'
override makes panel 27 always render the full vesting curve regardless
of the dashboard-level time selector.
2026-05-18 18:30:10 +00:00
Viktor Barzin
9a06a76883 k8s-version-upgrade: switch detection cron from weekly to daily
Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days
before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC,
still outside kured's 02:00-06:00 London window). Concurrency is
bounded by Forbid + deterministic job-name idempotency (the detection
job exits early if a preflight Job for the same target already exists),
so back-to-back days can't pile up parallel runs.

- stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment
- scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc
  (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label
  to "(daily cron)"
- .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 18:29:08 +00:00
2d6082c724 monitoring(wealth): META RSU vest value panel (Schwab account)
Daily total_value timeseries for the Schwab workplace account
(account_id 72d34e09-...). Single-asset account holding META RSUs
that vested 2020-11 → 2026-02 and were sold opportunistically over
the same window. Currency USD (account_currency). Yahoo quote on
META powers WF's daily mark; the historical DAV mirrored into
wealthfolio_sync via pg-sync gives us ~2k days of vesting curve.
2026-05-18 18:25:34 +00:00
Viktor Barzin
9e045e2c16 upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit
Three autonomous-upgrade pipelines run independently — Keel for apps
(hourly registry polling), unattended-upgrades+kured for OS, and the
k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there
was no single place to see whether each was healthy, what's pending,
or whether anything's stuck. The /upgrade-state skill collapses the
state of all three into one table you can run before each Sunday's
k8s-version-check fires.

- stacks/keel/main.tf: add Prometheus pod-annotation scrape on
  container port 9300. Surfaces pending_approvals,
  poll_trigger_tracked_images, and registries_scanned_total{image}
  so the skill has a real timeseries (also opens the door to a
  future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table
  (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2.
  SSH fan-out (parallel subshells) to all five nodes for apt
  state + reboot-required + uu log; Prometheus query for Keel;
  Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to
  ~/.claude/skills/upgrade-state/SKILL.md so the skill is
  discoverable from both monorepo-rooted and global sessions.

Verification: ran the script, stress-tested the ✗ stalled path by
pushing in_flight=1 + started_timestamp=-100min to Pushgateway and
resetting after — script correctly raised ✗ and exit 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 10:50:43 +00:00
Viktor Barzin
a9cb806e86 beads-server: codify Keel annotations on Dolt deployment (drift cleanup)
Task 1's recovery from the broken `:latest` image rollout left
keel.sh/policy=never set imperatively via `kubectl annotate` — out of
TF, which violates the "all infra via TF" rule. Now codified alongside
match-tag, trigger, pollSchedule. Removed those three keys from
ignore_changes (was the original "Keel manages these" pattern, no
longer correct for this deployment).

Also added KYVERNO_LIFECYCLE_V1 ignore_changes on the presence_schema
migration Job so future applies don't try to replace it over the
Kyverno-injected ndots dns_config.

Verified: 0 added, 3 changed (unrelated pre-existing drift on
beadboard/workbench/service), 0 destroyed. Dolt pod uninterrupted
(revision 13 preserved).
2026-05-17 22:22:40 +00:00
Viktor Barzin
8a6ec72039 RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight
The 24h kubelet-uptime threshold (process_start_time_seconds < 86400)
was a defense-in-depth duplicate of the 24h-since-Ready-transition
check in kured-sentinel-gate Check 4 — but they used different
signals (kubelet process start vs node Ready transition). Whenever
the cluster cycled through reboots, the alert kept firing for a full
day even after sentinel-gate's check passed, and blocked anything
querying halt-on-alert (kured, K8s version-upgrade preflight).

Tightened to 1h (3600s) for "node just rebooted, give it a settle
window". The cluster-wide 24h-between-reboots invariant lives
exclusively in kured-sentinel-gate Check 4 from now on (independent,
uses lastTransitionTime).

Matched the preflight's own 24h-quiet check in upgrade-step.sh
(86400 → 3600) so it doesn't act as a second blocker.

Empirically verified: all 5 kubelets are >10h up, alert cleared on
next eval after the rule reload.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 22:22:01 +00:00
Viktor Barzin
62fb46353c beads-server: add presence_claims table for agent coordination
Adds the schema for the new agent presence board. Live Dolt is updated
via a hashed-named one-shot Job; the ConfigMap entry preserves fresh-PVC
init.

Also pins the Dolt image to 2.0.3 — :latest on dolthub/dolt-sql-server
currently resolves to 0.50.10, whose docker-entrypoint.sh references an
undefined docker_process_sql function and crash-loops on every init
script in /docker-entrypoint-initdb.d. Keel can still bump this tag
in-cluster (image is in lifecycle.ignore_changes).
2026-05-17 21:24:24 +00:00
Viktor Barzin
4f2959866d k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst
Two latent bugs in the K8s-version-upgrade pipeline surfaced when a
real detection run ran post-26.04 upgrade today:

1. **DNS**: pod's CoreDNS search path is `<ns>.svc.cluster.local
   svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation).
   Unqualified `k8s-master` falls through all of those and then queries
   upstream Technitium for the bare name → NXDOMAIN. The FQDN
   `k8s-master.viktorbarzin.lan` is what Technitium actually serves.
   Suffix every node SSH target with `$NODE_DOMAIN`.

2. **envsubst missing**: claude-agent-service image doesn't ship
   `gettext-base`. Replace `envsubst <template | apply` with
   `python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(
   sys.stdin.read()))' <template | apply`. Same semantics, image
   already has python3. Multi-line $SCHEDULING_BLOCK is preserved
   correctly through expandvars.

Verified by manually triggering `k8s-version-check` post-fix:
detection now reads `Latest patch: v1.34.8` (currently running 1.34.7)
and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and
started; killed before it touched the cluster (will land on Sunday
2026-05-24 12:00 UTC like the schedule says).

Root cause of why these bugs lay dormant: yesterday's first
manual-test detection found "no upgrade needed" so neither code path
exercised SSH or envsubst. Today's apt-source restore (do-release-
upgrade had mangled them) unmasked the v1.34.8 candidate, which made
detection finally proceed past the SSH step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 21:10:58 +00:00
Viktor Barzin
dd1ec4f1be docs/plans: add agent presence implementation plan (2026-05-17)
15-task plan for a shared presence board so Claude Code sessions can
see which shared infra resources are being actively mutated by other
sessions. Resource-scoped claims on the existing Dolt server,
heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.
2026-05-17 21:03:17 +00:00
Viktor Barzin
bcf22640b2 keel: enroll 11 more namespaces (operators + critical infra)
Per user decision, removed authentik, kyverno, metallb-system,
external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets,
infra-maintenance from the policy-level exclude list, and added
keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite
being earlier flagged as scaled-to-0) and woodpecker.

Net cluster coverage: 197/227 workloads on safe-force (86%), up from
170/227 (74%). All 197 are paired with match-tag=true (digest-only).

Remaining 7 namespaces in Kyverno exclude list (irreducible):
- keel (self-update)
- calico-system + tigera-operator (operator-managed Installation CR)
- cnpg-system + dbaas (state-coupled)
- nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships
  ubuntu26.04 driver images)
- kube-system (k8s built-ins)

Files:
- stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list
  trimmed from 16 → 7
- stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets,
  servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker —
  added keel.sh/enrolled=true label on kubernetes_namespace resource
- infra-maintenance was in the policy exclude but the namespace doesn't
  actually exist in the cluster; the removal is a no-op there

Applied via kubectl patch on the live ClusterPolicy + kubectl label on
namespaces because the kubernetes provider v3.1.0 panics on Kyverno
ClusterPolicy refresh — TF source has the desired state for next clean
apply on a fixed provider.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 20:59:14 +00:00
Viktor Barzin
1b340ef531 keel: enroll 15 critical-path namespaces for digest-only auto-update
Per user decision today: monitoring, mailserver, vault, descheduler,
metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy,
reloader, headscale, wireguard, xray, cloudflared now participate in
the same `force + match-tag` regime as the rest of the cluster — Keel
watches the deployment's CURRENT tag for digest changes only and rolls
on push, never rewriting tag strings.

Two-part change:

stacks/kyverno/modules/kyverno/keel-annotations.tf
  Trim the policy-level namespace exclude list from 31 → 16. The 16
  remaining exclusions are the irreducible cluster-operator + state-
  coupled set: keel itself, calico-system + tigera-operator (operator
  loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system +
  dbaas (state-coupled), kyverno, metallb-system, external-secrets,
  proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned),
  kube-system, vpa, sealed-secrets, infra-maintenance.

stacks/<each-of-15>/.../main.tf
  Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace`
  resource so the Kyverno mutate policy can target the workloads via
  its namespaceSelector matchLabels.

Note on the apply path: the live ClusterPolicy was patched via
`kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics
during state refresh on Kyverno ClusterPolicy schemas with deeply
nested optional `context.celPreconditions` / `imageRegistry` fields
(see crash dump). The TF source above has the desired state, so any
clean future apply on a fixed provider version will be a no-op against
the live cluster.

Floating-tag workloads in the newly-enrolled set (will roll on every
upstream digest update — acceptable risk per user):
  - wireguard: sclevine/wg:latest (image fixed today via iptables-nft
    postStart shim)
  - xray: teddysun/xray
  - crowdsec-web: viktorbarzin/crowdsec_web
  - monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter
  - traefik: nginx:1-alpine, openresty/openresty:alpine,
    ghcr.io/tarampampam/error-pages:3
  - redis: haproxy:3.1-alpine, redis:8-alpine

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 12:13:22 +00:00
Viktor Barzin
a2d23c1dfb nvidia: bump driver container memory limit 128Mi → 2Gi
After rolling back k8s-node1's kernel to 6.8.0-117 + spoofing
/etc/os-release to 24.04 so the operator picked the matching
ubuntu24.04 driver image (everything per the workaround documented in
docs/known-issues.md), the driver container still went into a restart
loop. Container status:

    lastState.terminated: { reason: "OOMKilled", exitCode: 137 }

The driver-installer was hitting the namespace LimitRange default of
128Mi during `apt-get install linux-headers-6.8.0-117-generic` — the
last log line on every restart was "Installing Linux kernel
headers..." before SIGKILL. 2Gi gives apt + the DKMS compile step
enough headroom; peak observed during a successful compile in a test
container was ~1.4Gi.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 11:23:52 +00:00
Viktor Barzin
d06a34ccc7 docs: known-issues entry for the Ubuntu 26.04 / NVIDIA driver gap
Captures the workaround applied on k8s-node1 today (kernel rolled back
to 6.8.0-117-generic, apt-mark hold on kernel meta-packages,
/etc/os-release spoofed to 24.04 so NFD reports VERSION_ID=24.04 and
the gpu-operator picks an existing ubuntu24.04 driver image), plus the
trigger that lets us un-mitigate: any ubuntu26.04 tag appearing on
nvcr.io/nvidia/driver.

Linked from the post-mortem and from beads code-8vr0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 11:15:26 +00:00
9521bb0b17 paperless-mcp: deploy MCP for AI document search
- New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET,
  HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed.
- In-cluster only egress to paperless-ngx svc; no Cloudflare hop on
  MCP-internal traffic.
- Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser)
  in new `claude-mcp-readers` group with view-only Django perms; existing
  279 docs bulk-granted view perm via /api/documents/bulk_edit/;
  workflow #2 auto-grants the group on new docs (Consumption Added).
- Gateway-level bearer auth via new Traefik plugin
  Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack
  alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth`
  pulls token list from Vault `secret/paperless-mcp/bearer_tokens`.
- Vault `secret/paperless-mcp` holds: paperless_api_token (synced to
  K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens
  (JSON array, read at plan time), bearer_token_viktor_laptop (mirror
  for laptop wiring), paperless_user_password (paperless UI fallback).
- Image auto-update via Keel (semver minor policy, hourly poll).
- Ingress dns_type=proxied → Uptime Kuma external monitor auto-created
  by external-monitor-sync CronJob.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 11:14:35 +00:00
2eb611fd6d recruiter-responder: bump image to 05b95943 (split callback routes)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 11:01:49 +00:00
Viktor Barzin
128cfbbc30 nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images
k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet
(skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04
tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04).

Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 +
driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart
applied cleanly but the v26.3.1 operator auto-detects host OS via NFD
labels and constructs `<version>-ubuntu26.04` image tags, which 404 on
pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
future `terraform apply` doesn't surface the same trap again.

Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) + `apt-mark hold` to
prevent re-upgrade. That step needs explicit user authorization for a
node reboot — left as the next action item on code-8vr0.

Files:
  - stacks/nvidia/modules/nvidia/main.tf — explicit version pin,
    explanatory comment
  - stacks/nvidia/modules/nvidia/values.yaml — comment block
    documenting the situation; driver pinned at 570.195.03
  - docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md —
    full timeline, root causes, recovery procedure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:56:05 +00:00
Viktor Barzin
38602f7974 wireguard: switch to iptables-nft so PostUp MASQUERADE works
Wireguard pod CrashLoopBackOff'd for hours with wg-quick's PostUp failing:

    iptables v1.8.4 (legacy): can't initialize iptables table `nat':
    Table does not exist (do you need to insmod?)

sclevine/wg's default `iptables` symlink points to iptables-legacy, which
talks to the kernel's xt-tables. K8s nodes nowadays initialize their
nat table via nftables (calico-node sets it up), so iptables-legacy in
the container sees "no nat table" and bails. Reproduced by ephemerally
debugging the live pod's namespaces (kubectl debug --copy-to + same
mounts as the real pod) — wg-quick output matched verbatim.

Fix: postStart now calls update-alternatives to point iptables and
ip6tables at iptables-nft/ip6tables-nft (already present in the image)
before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes
to the nftables-backed nat table calico already populated. Verified:
new pod went 2/2 Running with 0 restarts after apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:13:37 +00:00
Viktor Barzin
f4807a22f8 terminal: probe + alerts after Traefik replica routing-table skew
User reported "site loads but failed to connect on the tmux session". Root
cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing
the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only
the IngressRoute CRDs registered. About 1/3 of /token preflight requests
landed on that replica and got 404 with router="-", and WS upgrades
intermittently failed the same way, so the lobby iframe stayed stuck on
"Failed to connect. Retrying...". `kubectl delete pod` on the bad replica
restored the missing router and unblocked the user.

This commit adds the long-term mitigation:

stacks/terminal/main.tf
  - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits
    /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes
    4 gauges to Pushgateway (token_status, ws_status, ttyd_status,
    last_success_timestamp). Verified the probe end-to-end:
      token=302 ws=302 ttyd=200 ok=1

stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
  - Webterminal group: WebterminalTokenDegraded (warning, 10m),
    WebterminalWebsocketDegraded (critical, 10m),
    WebterminalTtydUnreachable (critical, 10m),
    WebterminalProbeStale (warning, 15m).
  - Traefik Router Parity group: TraefikRouterCountSkew fires when any
    Traefik replica's router count diverges from siblings for >10m —
    catches the same class of issue cluster-wide, not just for terminal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:04:26 +00:00
1fad83a805 recruiter-responder: bump image_tag to 50f43004 (backtest --persist)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:57:17 +00:00
Viktor Barzin
a3bebd5c0d nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem)
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.

- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
  the broken chart (defense in depth; nfs-csi namespace is already in the
  Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
  nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
  captures the root cause + recovery procedure (kubelet restart via
  nsenter is the escalation path when crictl rmp -f fails)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:11:09 +00:00
98b7ef40fd broker-sync(fidelity): un-suspend monthly CronJob
The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729)
which is the simple accumulate-gains approach Viktor signed off on:
each monthly scrape captures (current_pot, real_contribs), and we emit
a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape.
dav_corrected handles the dashboard math.

Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via
'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.
2026-05-17 00:36:23 +00:00
Viktor Barzin
32db6760cc keel: use +() anchors on policy/match-tag so per-workload overrides stick
Without the anchor, each policy update fires mutateExistingOnPolicyUpdate,
which OVERWRITES existing keel.sh/policy annotations back to 'force'. That
broke the phased rollout — bulk-setting workloads to 'never' didn't stick
because the next policy update reset them.

With +() anchors, the mutate only adds the annotation if missing. New
workloads (in enrolled namespaces) get force+match-tag; existing workloads
with explicit policy=never (out-of-band, for phased rollout) stay never.

Phase 1 rollout state (2026-05-17):
  - 10 workloads on force+match-tag in 10 namespaces (Phase 1)
    enrolled via keel.sh/enrolled=true namespace label:
      linkwarden, excalidraw, diun, echo, foolery, city-guesser,
      jsoncrack, privatebin, ntfy, speedtest
  - 216 workloads on policy=never (out-of-band kubectl annotate)
  - 31 critical namespaces excluded at policy level

Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true`
and clearing the `never` annotation off their workloads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 00:32:19 +00:00
root
05473ab501 Woodpecker CI Update TLS Certificates Commit 2026-05-17 00:10:55 +00:00
Viktor Barzin
0365ed83ca keel: expand critical-namespace exclude list — protects vault/cnpg/authentik/etc.
2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36.
The force+match-tag pairing should have constrained Keel to digest-only on
the current tag (not switch to a new tag), but a race between Kyverno's
mutate (injecting match-tag) and Keel's hourly poll caused the workload to
still have the old `force`-only annotation when Keel acted. Result: tag
rewrite, pods cycled, pgbouncer connection failures, login broken.

Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back
to 2026.2.2. Auth restored within ~5 min.

Going forward, critical-namespace workloads are excluded at the policy level
so this race can't recur. They get upgraded via TF (Helm chart version bumps)
on a deliberate cadence, never by Keel.

Live state: 36 workloads on policy=never (35 critical + chrome-service pin
+ 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for
opt-out-pure auto-update on the remaining stateless apps.

This matches user direction (2026-05-17): "upgrading is fine as long as we
upgrade correctly and the latest version is healthy" + "keel responsible
for the latest version, phased rollout, graceful".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 00:07:32 +00:00
Viktor Barzin
cdd6781bb9 keel: bump default policy patch → major (user wants latest version)
User: 'i'm happy with occasional breakages. we have alerts.'

Policy=major auto-updates workloads to the latest semver tag in the
registry, including major/minor/patch bumps. Still semver-parser-bounded
so dev/nightly/master branches are filtered out (avoids the 2026-05-16
force-trap on affine/calico).

Live: 217 patch-annotated workloads re-annotated to major. Next Keel
poll (~1h) will pick up any pending major/minor releases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:53:59 +00:00
b37d15fe33 recruiter-responder: bump image_tag to 94b37a9c (follow-up detection)
Replies from recruiters to our sent decline / engage / ignored threads
are now attached to the existing thread, surface with a 🔁 follow-up
marker in Telegram ("you previously sent"), and re-open thread status
to pending so they show up in recruiter_list status=pending.

Smoke-tested live: Rachel-style follow-up referencing our outbound
msgid + the original recruiter msgid in References → correctly
attached to thread #87, status flipped sent→pending, 3 messages
persisted (in/out/in).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:47:16 +00:00
Viktor Barzin
055030fa24 kyverno: bump background-controller memory 384Mi → 2Gi (OOMKilled processing keel URs)
The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced
176 UpdateRequests for the initial bulk scan across enrolled namespaces.
At the existing 384Mi limit, kyverno-background-controller OOMKilled while
processing them — no annotations got injected on existing workloads (count
stuck at 30).

Live state already bumped via kubectl set resources; this commit makes it
durable through Terraform. Also lowered the request to 256Mi (the 384Mi
floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady
state).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:36:16 +00:00
7ef386871f recruiter-responder: bump image_tag to 02a01c9a (Reply-To + quoted body in replies)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:35:52 +00:00
Viktor Barzin
de0e7b9dd6 kyverno: codify aggregated ClusterRole for keel mutate-existing
The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true
to the inject-keel-annotations ClusterPolicy but Kyverno's validate
webhook rejected it: the background-controller SA needs update/patch
on apps/v1 Deployment/StatefulSet/DaemonSet.

Created live via kubectl + now in TF so the next apply is idempotent.
The ClusterRole aggregates into kyverno:background-controller via the
rbac.kyverno.io/aggregate-to-background-controller label.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:30:07 +00:00
fc1c98de69 recruiter-responder: bump image_tag to 59df5f8a (Reply-To honoured)
Reply-To header now extracted on inbound and used for outbound replies.
Verified with a synthetic email From: noreply-careers@megacorp.example
Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and
threaded under the original (Re: subject + In-Reply-To + References).

Alembic 0003 added messages.reply_to_addr column.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:27:47 +00:00
Viktor Barzin
bc714755ea kyverno: add mutateExistingOnPolicyUpdate=true so existing workloads get annotated
Before this, the inject-keel-annotations policy only fired on admission
events. Workloads that existed BEFORE their namespace got labeled
keel.sh/enrolled=true never received the annotation, so Keel didn't
watch them. Live state was 30 of 226 workloads auto-updating.

With mutateExistingOnPolicyUpdate=true and the required mutate.targets
block, Kyverno's BackgroundScan controller applies the mutate to
existing matching Deployments/StatefulSets/DaemonSets on policy update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:27:27 +00:00
Viktor Barzin
8b9727ac3e final wave: enroll immich + status-page, retrigger 17 pending Bucket A
* immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1
    skipped — has non-standard lifecycle from earlier work).
  * status-page: enrolled (was missing from original sweep).
  * v6 retrigger marker on 17 stacks that never reached terragrunt
    apply (#704 exit-1 halted mid-loop).

After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks.
The remaining ~22 are operator/Helm-managed and intentionally excluded
(same fight-loop risk as Calico — bump via Helm chart version, not
Keel).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:19:20 +00:00
root
73af01a277 Woodpecker CI deploy [CI SKIP] 2026-05-16 23:17:44 +00:00
Viktor Barzin
e956b78951 Bucket C: enroll 5 raw-deploy stacks in Keel auto-update
* beads-server: 3 Deployments — extended V1 lifecycle blocks to V2
    + KEEL_IGNORE_IMAGE; namespace label.
  * llama-cpp: 1 Deployment — extended V1→V2; namespace label.
  * novelapp: namespace label only (Deployment has non-standard
    lifecycle without V1 dns_config — drift expected, accept for now).
  * plotting-book: namespace label only (same as novelapp).
  * trading-bot: namespace label only (same as novelapp).

immich deferred — the bulk-add script's brace-counter got confused by
a HEREDOC in the file, inserting a lifecycle block in the wrong
position. Needs manual per-Deployment editing.

The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see
their Deployments mutated by Kyverno but their TF lifecycle doesn't
yet ignore the keel annotations. Expected behavior: drift visible in
terragrunt plan, applied-state oscillates with Kyverno re-injecting.
Acceptable starting point; per-Deployment lifecycle work to fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:14:43 +00:00
Viktor Barzin
eb99ee5635 Bucket A retrigger + Bucket D enrollment (5 module-nested stacks)
After fixing the postgresql-lb MetalLB flap (deleted stuck
ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined
commit:

  * Bucket A (16 stacks): re-append CI retrigger marker so the
    previously-pending applies pick up:
      blog calico cyberchef descheduler f1-stream homepage jsoncrack
      k8s-dashboard k8s-version-upgrade kms local-path osm_routing
      real-estate-crawler travel_blog vault webhook_handler

  * Bucket D (5 module-nested stacks): keel.sh/enrolled label on
    namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module:
      postiz instagram-poster k8s-portal uptime-kuma vaultwarden

Bucket C (raw-deploy apps without V1 marker on their Deployment
lifecycles) deferred — needs per-Deployment lifecycle block additions
that the bulk script can't safely automate:
  beads-server immich llama-cpp novelapp plotting-book trading-bot

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:10:38 +00:00
Viktor Barzin
629fe24305 kyverno: exclude calico-system from inject-keel-annotations
Stop the hourly Keel-vs-tigera-operator fight loop on calico-node
DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system
workloads with keel.sh/policy=never; TF: added calico-system to the
namespaces exclude list so any future mutate run won't re-inject.

The previous calico unenrollment (label removal from namespace)
wasn't enough — once Kyverno had stamped the policy=patch annotation
on the Deployments/DaemonSets, removing the namespace label didn't
strip the annotation, so Keel kept watching them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 22:58:20 +00:00
Viktor Barzin
57e3059f5b ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them) 2026-05-16 14:13:59 +00:00
Viktor Barzin
cbbb394e88 ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
Viktor Barzin
e030750507 openclaw: native MCP servers + daily claude-memory sync
Wire ha-mcp, context7, and the in-pod playwright sidecar as native
MCP servers on OpenClaw via `mcp set` in the container startup
(ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set
entries persist). HA URL pulled from new Vault key
secret/openclaw.ha_sofia_mcp_url and passed via the
HA_SOFIA_MCP_URL env var.

Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw
namespace: pulls all non-sensitive memories from
claude-memory.claude-memory.svc:80/api/memories, groups by category,
writes 18 Markdown files into /workspace/memory/projects/claude-
memory-sync/ (the path memory-core indexes), then triggers
`openclaw memory index --force` via kubectl exec. Reuses the
existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488
memories synced, 25/25 files indexed, search returns hits.

Also drops the legacy /app/extensions entry from
plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env,
and one-shot deletes the stale 2026-02-28 metaclaw-export.json from
the openclaw home volume.

claude_memory MCP intentionally NOT wired — its /mcp/mcp transport
404s on the deployed claude-memory-mcp:17 image (tracked as
code-z1so). Shared knowledge is delivered via the CronJob's REST
sync instead. Adding claude_memory to mcp.servers is a one-line
follow-up once that's fixed.
2026-05-16 14:01:46 +00:00
Viktor Barzin
7bcd3f745c ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
root
a4b5dfd361 Woodpecker CI deploy [CI SKIP] 2026-05-16 13:45:45 +00:00
2841347ec2 wealth: dav_corrected view fixes pension gains-offset miscategorisation
The broker-sync Fidelity provider emits 'unrealised-gains-offset'
DEPOSIT activities to reconcile Wealthfolio's total with the
PlanViewer reported pot, because Wealthfolio doesn't track pension
fund units directly. Wealthfolio's data model treats that DEPOSIT as
a cash contribution, which double-inflates net_contribution and
zeroes out the implied growth.

Add a Postgres view 'dav_corrected' in wealthfolio_sync that
subtracts the cumulative gains-offset from net_contribution per
account per date (re-exporting as 'net_contribution' so it's a
drop-in replacement). All 17 wealth dashboard panels that compute
contribution/growth/ROI now read from the view. Total impact:
portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly
the £35,721.20 Fidelity offset that was previously miscategorised).
2026-05-16 13:43:36 +00:00
Viktor Barzin
70d0623b21 ci: retrigger apply for pending Keel enrollment (~58 stacks)
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.

Idempotent — re-applying a stack whose state already matches is a no-op.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:42:57 +00:00
Viktor Barzin
3ed201f873 ci: retry after Keel rollout cascade settled 2026-05-16 13:36:43 +00:00
Viktor Barzin
175ebc5cd0 enrolled-patch stacks: ignore image drift from Keel auto-update
For Deployments enrolled in Keel with policy=patch, the image tag is
updated by Keel as new patches release upstream. Without
ignore_changes on the image field, terragrunt apply would fight Keel
in an endless loop (TF reverts → Keel re-rolls → repeat — same shape
as the calico/tigera-operator fight from earlier).

Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks.
Image string in TF becomes the initial seed; Keel rolls it forward.

Stacks: actualbudget, broker-sync, changedetection, city-guesser,
coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo,
excalidraw, foolery, forgejo, freedify.

CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest,
recruiter-responder, claude-agent-service, claude-memory) keep TF
ownership of image and policy=never — their image_tag is set by CI
via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes
on those would break the CI deploy flow.

Caveat: only container[0].image is added. Multi-container Deployments
(immich, beads, etc.) will need additional container[N].image lines
for any container Keel rolls. Those stacks are not currently enrolled.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:24:16 +00:00
Viktor Barzin
6d71a91fad calico: unenroll from Keel — tigera-operator owns DaemonSet spec
Keel kept rewriting calico-node + calico-kube-controllers images to
v3.26.5 (proper patch update); tigera-operator immediately reverted
to v3.26.1 because the Installation CR is the source of truth.
Endless churn but no data loss — Calico stayed healthy throughout.

Removing keel.sh/enrolled label and live label from calico-system ns.
Calico upgrades go through the tigera-operator's Installation CR
manually, not Keel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:18:35 +00:00
Viktor Barzin
2b236a1629 keel: default policy → patch (semver-bounded opt-out auto-update)
Move from `never` (no auto-update) to `patch` for the cluster-wide
default. Keel only auto-updates PATCH versions within the current
major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked.
Tag-rewrites that broke calico (v3.26.1 → :master) and affine
(0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch.

Caveats:
  * Patch causes Terraform image drift for semver-pinned services —
    drift-detection pipeline will surface it; lifecycle ignore_changes
    on container[].image can be added per stack later if drift is
    noisy.
  * Tags that aren't parseable as semver (:latest, :11, :nightly,
    SHA tags) are ignored by patch — those workloads stay on their
    current image until promoted to `force` policy individually.

Self-hosted CI-driven services + chrome-service kept on `never`
(deliberate pins / CI controls the tag):
  recruiter-responder, claude-agent-service, claude-memory,
  chrome-service, fire-planner, job-hunter, payslip-ingest

Live state already updated via kubectl apply + per-workload patches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:17:33 +00:00
8bb704bfd1 recruiter-triage: AI culture & tooling section + warm-engage AI ask
- claude-agent-service bumped to 191ed5dd (new AI section in agent
  template — leadership stance, approved tools, usage limits / quotas,
  code-gen safety, product-side AI depth, follow-up questions for the
  recruiter when the web is sparse).
- recruiter-responder bumped to ab59eeab (deep_research prompt asks
  for AI culture; warm_engage template adds a written-only ask for
  IDE assistants, chat tools, per-seat limits, source-to-external
  model policy).

Smoke-tested 2026-05-16: forced fresh research on Datadog, agent
returned full structured AI section with 7 explicit recruiter
questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:14:27 +00:00
Viktor Barzin
d656e38c9d keel: default policy → never (post-incident safe default)
2026-05-16 incident: Keel's `force` policy switched semver-pinned
images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master)
instead of digest-tracking. Force is documented as "always update
to the newest tag in the registry" — only safe on already-mutable
tags like :latest.

Changing the cluster-wide default in inject-keel-annotations to
`never`. The namespace enrollment label + V2 lifecycle suppression
stay in place so opt-in is one annotation per Deployment, but no
service auto-updates until explicitly approved.

To opt in a workload now:
  1. Verify the Deployment image is on a mutable tag (:latest,
     :<major>, or a vendor "stable" tag) — change in Terraform first
     if needed.
  2. Add to the Deployment's metadata.annotations:
       "keel.sh/policy" = "force"   (digest tracking)
       OR
       "keel.sh/policy" = "patch"   (semver patch bumps — also
       requires ignore_changes on the image)

Live policy already updated via kubectl apply + per-workload
override (force → never).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:13:16 +00:00
Viktor Barzin
06f48c73ca keel: enable Slack notifications on every upgrade
Wire Keel's Slack notifier to the existing bot token in Vault
(secret/viktor -> slack_bot_token). Posts to #general by default;
override via slack.channel in the Helm values if you want a dedicated
channel like #keel-notifications.

Notification level is "info" so we get every rollout event, not just
errors. Approval flow is OFF — opt-out-pure means all updates apply
unattended. If we later introduce approvals, add slack.approvalsChannel.

Resolves user request: 'keel should send notifications to slack everytime
it upgrades an app'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:01:35 +00:00
8f4b19565c recruiter-responder: bump image_tag to 189ef901
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:41:05 +00:00
Viktor Barzin
d0f0e10da5 keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist)
The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5,
1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed
silently. Bump to 1.2.0 (app version 0.21.1, latest stable).
2026-05-16 12:30:19 +00:00
Viktor Barzin
48abb7c520 kured: drop Mon-Fri restriction, reboot any day
The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.

Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:29:01 +00:00
Viktor Barzin
8cbfa6856c Phase 1a: enroll 4 self-hosted services in Keel auto-update
Enrolls the cleanest Woodpecker-build-only self-hosted services into
the inject-keel-annotations ClusterPolicy by labeling their namespaces
keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on
each, so Keel will detect the current upstream digest and trigger a
rolling restart when polling starts (1h cadence).

Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress
the annotation drift Kyverno will inject (keel.sh/policy, /trigger,
/pollSchedule).

Services included:
  - fire-planner
  - job-hunter
  - payslip-ingest
  - recruiter-responder

Skipped from Phase 1 for follow-up:
  - claude-agent-service (user has WIP on main.tf)
  - claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs)
  - kms (two Deployments; needs per-resource review)
  - wealthfolio (sync sidecar pattern; needs review)
  - chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label)
  - GHA-migrated repos (10) (need per-repo CI cleanup)
  - beadboard, freedify (no CI)

See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:28:54 +00:00
32e3b09d85 recruiter-responder: bump image_tag to f3cb91ff (180d research_cache TTL)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:26:15 +00:00
95fb859ea1 recruiter-triage v3: Perks & Office Life section + cache-first deep_research
- claude-agent-service bumped to f764fef6 (agent system prompt adds
  the Perks block: food/health/pension/equity/PTO/parental/equipment/
  learning/wellness/amenities/commuter). 1200-word cap.
- recruiter-responder bumped to 38a2cdaa (cache-first deep_research:
  serves cached payload if fetched_at + ttl_seconds > now; cache
  writes upsert; new force flag bypasses).

Verified end-to-end: deep_research on Datadog now returns full Perks
section (~220s, $0.60, 23 turns). Earlier 500 fixed (was
uq_research_company_tier dup-key on re-run).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:21:23 +00:00
Viktor Barzin
910167105e Phase 0: install Keel + Kyverno auto-update annotation injector
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
  Polls registries hourly per design decision #8. Default schedule
  overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
  StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
  with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
  self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
  runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
  too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
  ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:19:34 +00:00
Viktor Barzin
a8302072eb docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16)
Captures the May 10–16 kured-vs-sentinel-gate hostPath mismatch (chart
derived hostPath from configuration.rebootSentinel) and the companion
work to harden the rolling-reboot pipeline against single-replica
PDB deadlocks: Anubis 1→2 replicas with shared Valkey store, kured
drainTimeout=30m, CNPG pg-cluster 2→3 instances. Includes the
mysql-standalone-PDB orphan cleanup and the k8s-node1 containerd-source
drift audit (benign).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:17:26 +00:00
Viktor Barzin
a726e963e3 kured + cnpg: drain-safe defaults ahead of Monday reboot wave
Three defensive moves to make the kured rolling-reboot cycle survive
edge cases without operator intervention:

kured (stacks/kured/main.tf):
  - Set `configuration.drainTimeout = "30m"`. Default is unlimited; if
    a future PDB or finalizer stalls drain, kured retries forever and
    the node stays cordoned silently. 30m caps the silent-failure
    window — after timeout kured logs the abort and waits for the
    next period; the node stays Schedulable so cluster capacity isn't
    lost. Lets us fail closed instead of fail-silent.

CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf):
  - Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the
    failover during a primary-node drain depended on the lone replica
    being caught up; a WAL backlog would stall the drain until the
    replica was current. With 3 instances CNPG always has at least one
    fully-current replica to promote, and the PDB's
    `minAvailable=1` on the primary selector is satisfied throughout
    the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about
    35Gi after autoresize). Memory: +3Gi pod limit.
  - Updated the `triggers.instances` so the null_resource's local-exec
    actually re-applies the YAML (kubectl apply with the new spec). The
    YAML is the source-of-truth but the trigger is what tells terraform
    to re-run the provisioner.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:06:30 +00:00
Viktor Barzin
16a470e950 state(dbaas): update encrypted state 2026-05-16 12:05:55 +00:00
Viktor Barzin
1361dfa994 state(dbaas): update encrypted state 2026-05-16 12:05:16 +00:00
Viktor Barzin
6e920f96af anubis: HA with shared valkey/redis store + replicas=2
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.

Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:

- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
  When set, appends a `store: { backend: valkey, parameters: { url } }`
  block to the rendered policy YAML and defaults replicas to 2 (capped
  at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
  drains can take down one pod at a time. topologySpreadConstraint
  tightened to `DoNotSchedule` so the two replicas land on different
  nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
  travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
  unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
  Redis already runs HA via Sentinel + haproxy, no new infra needed.

Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.

Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:54:54 +00:00
cf5b169cbb claude-agent-service: ship recruiter-triage agent + restore missing terragrunt.hcl
- main.tf: bump image_tag to 1b3350c0 (carries the new agent),
  init container  also copies recruiter-triage.md
  into /home/agent/.claude/agents/.
- terragrunt.hcl: restored (file was missing — apply was blocked).
  Standard root include + platform/vault/external-secrets dependencies.

Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42)
via recruiter-responder REST API → 102.5s, $0.43, structured
markdown report with comp bands vs £600k floor, culture signals,
remote policy, recent news, sources cited. End-to-end Tier-2 is live.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:53:49 +00:00
7e5e0e7080 docs: add CONTEXT.md domain glossary [ci skip]
Adds the per-repo domain glossary that engineering skills
(diagnose, tdd, improve-codebase-architecture, grill-with-docs)
read before working in this repo. Terms only — no implementation
detail. Six clusters (code organization, cluster, networking,
storage, secrets, CI/CD), 22 terms, plus relationships, an example
dialogue, and five flagged ambiguities.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:48:19 +00:00
0bb647342d recruiter-responder: expose Gmail IMAP creds for backtest CLI
Pulls vbarzin@gmail.com app password from secret/recruiter-responder
(seeded from secret/wealthfolio.imap_password — same Gmail credential
that wealthfolio uses for broker-statement ingestion). Env vars
GMAIL_IMAP_USER + GMAIL_IMAP_PASS, consumed by 'backtest gmail'.

Backtest verified 2026-05-16 against folder
'companies-I-dont-take-seriously': 20/20 recruiter, 100% company
extraction (9 stated, 6 subject, 4 sender_domain, 1 body), 30% comp,
avg 12s latency.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:28:51 +00:00
Viktor Barzin
c17d87e179 kured: fix sentinel path mismatch that stalled rolling reboots
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. Previously
rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at
`/sentinel/` (an empty auto-created directory on every host) while the
kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required.

Two different host directories → kured never saw the open gate, even
though the gate's checks were all green every 5 min on every node.
Result: unattended-upgrades has packages waiting on every node since
2026-05-10 (when uu was re-enabled) and kured's hourly log says
"Reboot not required" for the entire period.

Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts
hostPath /var/run — same directory the gate writes to. The in-pod
mountPath (/sentinel) is hardcoded by the chart and doesn't matter,
the symlink chain works out: /sentinel/<file> inside the pod resolves
to /var/run/<file> on the host.

Verified: kured pod can now list /sentinel/gated-reboot-required
(0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15).
First gated reboot will land Mon 2026-05-18 02:00 London.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:19:13 +00:00
4a12ac60b0 recruiter-responder: bump image_tag to 559e5c57
PDF extraction, tech_stack list, aggressive company/comp inference,
no-phone-call drafts, backtest CLI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 11:13:14 +00:00
43a6eb8b38 recruiter-responder: bump image_tag to bbd178da (structured Telegram + comp floor)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 10:58:47 +00:00
Viktor Barzin
b1b14ee370 service-catalog: add aiostreams entry
Stremio stream aggregator now has its own row in the Active Use tier.
Captures the auth model (own UUID+password, not Authentik), monitoring
posture (canary probe + 3 alerts), and backup pipeline (weekly NFS
dumps of both decrypted config and the Stremio account addon
collection).

Follow-up from the 2026-05-15/16 hardening session: 5 commits on
servarr/aiostreams, none previously catalogued.
2026-05-16 10:47:41 +00:00
Viktor Barzin
8a2442f297 aiostreams: weekly backup of Stremio account addon collection
Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from
the AIOStreams config-backup at 03:00):

- Logs into api.strem.io with credentials from Vault
  (secret/viktor.stremio_email + stremio_password, now also synced
  into the aiostreams-probe-secrets ExternalSecret)
- Fetches the full addonCollection via addonCollectionGet
- Writes timestamped JSON to the existing aiostreams-backup PVC
  (NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600)
- 90-day retention, logs out to invalidate the auth key
- Pushgateway metrics: stremio_account_backup_{success,bytes,
  addon_count,duration_seconds,last_run_timestamp}

Protects against: accidental "uninstall all" / API regression / wrong
account login wiping the curated set of 22 addons (Cinemeta + 16
MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local).

Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.
2026-05-15 23:48:41 +00:00
1e8cd542b7 recruiter-responder: public /cb ingress for Telegram URL-button callbacks
- Add ingress_factory module (auth=none, HMAC + expiry are the gate);
  ingress_path=["/cb"] only — /api stays internal, /healthz cluster.
  dns_type=proxied. anti_ai_scraping=false.
- Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret`
  auto-clones the wildcard cert into every namespace.
- Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS
  hostname relax).
- Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me.
- Drop git-crypt-encrypted wildcard cert files into
  stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new
  .gitleaksignore — git-crypt encrypts at rest but the working-tree
  copy is plaintext, so gitleaks can't tell.

Smoke-tested end-to-end 2026-05-15 23:45:
  synthetic email -> Telegram with / buttons ->  tapped via curl
  -> 'Sent' HTML page -> thread.status=sent, decision row recorded
  with decided_via=telegram_button, outbound message threaded correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 23:46:49 +00:00
Viktor Barzin
3d3e749f4d aiostreams: whitelist Vidhin + Tamtaro sync URLs
Adds two env vars on the AIOStreams deployment:
- WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex
  (TRaSH-aligned) so syncedRankedRegexUrls works for the user
- WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions +
  Tamtaro's ISE/PSE/ESE-standard

Gotcha: AIOStreams validates each synced* field against the matching
whitelist — stream-expression files (incl. Vidhin's expressions.json)
go in WHITELISTED_SEL_URLS, not the regex one, even though they live
in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG.

User config: enabled Vidhin's regex + ranked expressions + Tamtaro's
ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering;
can be added later from the same whitelist.
2026-05-15 23:37:47 +00:00
Viktor Barzin
acfc779c45 aiostreams: weekly NFS backup of decrypted user config
Adds aiostreams-config-backup CronJob (Sun 03:00 weekly):
- Pulls /api/v1/user via internal ClusterIP with UUID + password from
  the existing aiostreams-probe-secrets ExternalSecret
- Writes timestamped JSON to nfs-backup PVC mounted at /backup
- 90-day retention, prunes older files
- Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp}

NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite
to Synology via the existing offsite-sync-backup CronJob).

Complements the daily postgresql-backup-per-db pipeline (which dumps
the encrypted blob) by storing the decrypted JSON — usable for human
inspection / disaster recovery even without the AIOStreams password.

Verified: manual job wrote 12931 bytes, file present on NFS.
2026-05-15 23:30:04 +00:00
root
d530cd53b8 Woodpecker CI deploy [CI SKIP] 2026-05-15 23:20:04 +00:00
1ffab190fd recruiter-responder: pin image tag + run plugin installer init as root
- stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3
  (300s LLM timeouts + IMAP BODY.PEEK[] fix).
- stacks/openclaw/main.tf: install-recruiter-plugin init container now
  runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the
  recruiter-responder image otherwise drops to uid 10001 which can't
  write or chown.

Smoke-tested end-to-end 2026-05-15 ~23:15:
  Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage
  (12.1s, JSON output complete with company/role/salary/location/tech)
  -> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK.
Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from
the n8n Postgres workflow_entity table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 23:18:43 +00:00
d3d02342db recruiter-responder: vault DB role + switch proactive push to Telegram
- stacks/vault/main.tf: register pg-recruiter-responder static role on
  the postgresql connection (7d password rotation). Adds the role to
  allowed_roles and creates vault_database_secret_backend_static_role
  for `recruiter_responder` user.
- stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap
  TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID.
  Updated header doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 22:47:45 +00:00
Viktor Barzin
c5ebbc07e4 state(vault): update encrypted state 2026-05-15 22:46:37 +00:00
7b26afa694 recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount
Three coupled changes for the new recruiter-responder pipeline:

1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses
   unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the
   download Job script + cmd renderer to handle text_only=true (skip
   mmproj download + --mmproj flag). The 3 existing vision models stay
   on text_only=false; no behaviour change for them.

2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets
   (app secrets from secret/recruiter-responder, DB creds from Vault DB
   engine static-creds/pg-recruiter-responder), Deployment (replicas=1,
   Recreate -- IMAP IDLE + APScheduler want single leader), Service
   ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder.

3. stacks/openclaw/: add init container `install-recruiter-plugin` that
   uses the recruiter-responder image to copy the .mjs plugin into
   /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin
   version to the recruiter-responder image tag. Also injects
   RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token
   from openclaw-secrets.recruiter_responder_bearer_token, optional).

Pre-apply checklist for recruiter-responder stack:
  - Vault: seed secret/recruiter-responder with webhook_bearer_token,
    imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token,
    task_webhook_token.
  - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as
    above webhook_bearer_token).
  - dbaas: create DB recruiter_responder + role recruiter_responder,
    and Vault DB-engine role static-creds/pg-recruiter-responder.
  - Build + push image via Woodpecker (recruiter-responder repo CI).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 22:38:53 +00:00
Viktor Barzin
b6e334daab aiostreams: 1h stream cache + canary stream-count probe + 3 alerts
Hardening pass following the empty-stream-list incident:

1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 /
   disabled). Default behaviour hit all 5 upstream addons on every
   Stremio request; with a 1h TTL repeat requests for the same title
   are instant, while RD cache invalidations still propagate quickly.

2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's
   encryptedPassword via the internal ClusterIP, runs a canary stream
   search for Breaking Bad S01E01, pushes streams_count + probe_success
   to Pushgateway. Uses an ExternalSecret pulling UUID + password from
   Vault secret/viktor. Same pattern as email-roundtrip-monitor.

3. Three alerts in monitoring's prometheus_chart_values.tpl:
   - AIOStreamsStreamCountLow  (< 50 streams for 30m)
   - AIOStreamsProbeFailing    (probe_success == 0 for 30m)
   - AIOStreamsProbeStale      (last_run_timestamp > 30min for 10m)

Verified: probe returned streams=411 success=1 on first run; all 3
alerts loaded into Prometheus with state=inactive health=ok.
2026-05-15 21:39:33 +00:00
root
e037b160d0 Woodpecker CI deploy [CI SKIP] 2026-05-15 21:29:50 +00:00
Viktor Barzin
06b166202d aiostreams: pin nightly + switch to auth=app
- Pin viren070/aiostreams:nightly → :2026.05.14.1326-nightly (avoid
  stale-pull cache, matches 8-char SHA convention for rolling tags)
- Switch ingress auth tier required → app: Authentik forward-auth
  blocks Stremio clients (cannot follow OAuth 302), and AIOStreams
  already enforces UUID + password on /configure and /api/*, with
  Stremio addon URLs using encryptedPassword as a bearer token.
  Result: empty-stream-list issue fixed for public Stremio clients.

Verified: 410 streams returned via public URL for Breaking Bad S01E01
with no cookies, vs 0 before (502→Authentik OIDC redirect).
2026-05-15 21:28:09 +00:00
2d52b583f5 monitoring(wealth): move Positions table under contrib/growth row
Positions panel now sits at y=32 (immediately below the
contrib-vs-market + growth row at y=22..32), and everything from
the per-account stack down shifts 8 rows lower.
2026-05-14 16:58:38 +00:00
f1c56bf257 wealth: positions table panel (shares + cost basis + unrealised return)
pg-sync sidecar now mirrors three extra views from the wealthfolio
SQLite: assets (id/symbol/name/currency), quote_latest (one row per
asset, preferring YAHOO over MANUAL on same-day collisions), and
positions_latest (currently-held positions extracted from the TOTAL
aggregate row of holdings_snapshots — quantity, average cost,
total cost basis).

Wealth dashboard gets a new bottom Positions table joining the three:
symbol, name, shares, avg cost, last price, market value, cost,
gain, return %. Gain and return % are color-text with red<0, green>=0
thresholds.
2026-05-14 16:01:27 +00:00
db763fbc21 terminal: extract app code to viktor/terminal-lobby on Forgejo
The lobby has grown enough (frontend, two Go services, devvm units +
scripts + config) that it earns its own repo. Code now lives at
https://forgejo.viktorbarzin.me/viktor/terminal-lobby with
scripts/deploy.sh covering the manual deploy until CI activation
lands (Woodpecker forge_id=2 activation still 500s; Forgejo Actions
not yet enabled).

This stack now owns only the K8s side — Services, Endpoints,
IngressRoutes, middlewares. main.tf comment block updated to point
at the new repo and the full DevVM port map.

Removed:
- stacks/terminal/files/        (index.html + DevVM artefacts)
- stacks/terminal/tmux-api/     (Go service)
- stacks/terminal/clipboard-upload/ (Go service)
2026-05-13 21:10:56 +00:00
e2aa2931e4 terminal: make slate the default theme 2026-05-13 20:54:54 +00:00
699acdbd0a terminal: theme picker (carbon/slate/mono/ink) replacing violet
Drops the hardcoded violet/indigo palette. Four themes are defined as
CSS variables on body.theme-{carbon,slate,mono,ink}:

- Carbon (default): warm dark, ivory text, restrained amber accent.
- Slate: cool dark, GitHub/Linear-ish charcoal with electric blue.
- Mono: strict greyscale, off-white accent.
- Ink: warm paper light, deep ink, terracotta accent.

The lobby reads the choice from localStorage and applies the class
before render. The picker lives at the bottom of the sidebar
(margin-top: auto pins it). On change, the iframe is bounced through
about:blank so the inner xterm picks up the new computed CSS vars
(--terminal-bg/fg/cursor/selection) on the next mount.

Picker UI uses native buttons, current theme highlighted with the
accent border + color. No gradients, hairline borders only.
2026-05-13 20:46:21 +00:00
29ecb67d8c terminal: rename sessions + drag-and-drop reorder
Backend: POST /sessions/<name>/rename in tmux-api runs tmux
rename-session as the mapped OS user. 400 on bad name, 404 on missing
source, 409 on duplicate target, 401 on missing auth header.

Frontend:
- Rename button per card → prompt() dialog, validates against the
  shared regex. Updates currentActive + hash + iframe.src if the
  renamed session was active.
- Session order is now user-driven, persisted in localStorage
  keyed per osUser. New sessions append at the bottom. The previous
  sort-by-lastActivity is gone.
- HTML5 drag-and-drop reorders cards live during dragover; dragend
  captures the DOM order into localStorage.
- Polling renderLobby is suppressed while a drag is in flight so the
  5s tick doesn't yank the list out from under the user.
2026-05-13 20:18:15 +00:00
Viktor Barzin
0396fdadb2 terminal: inline session switching via sidebar + iframe
Replace full-page navigation with a two-pane lobby. Sidebar holds the
session list as clickable cards; an iframe in the content pane swaps
its src on click so switching sessions takes one click instead of two
navigations.

- #lobby-shell grid (260px sidebar + iframe pane)
- Cards become role=button, kill button stops propagation
- activateSession/deactivateSession with hash routing
  (location.hash <-> active session, replaceState so back stack stays
  clean)
- Killed active session deactivates the iframe before re-render
- 5s session poll preserves currentActive; deactivates if gone
- Mobile media query collapses to one column

CSP frame-ancestors already permits same-origin embedding
(*.viktorbarzin.me), no infra changes needed. Direct-link
?arg=<name> path is unchanged.
2026-05-13 20:08:01 +00:00
root
b7dfc6fcaa Woodpecker CI deploy [CI SKIP] 2026-05-13 19:56:16 +00:00
Viktor Barzin
3a2d9a9bc5 actualbudget: add enabled flag to factory, disable emo
Emo isn't using the instance and the daily bank-sync CronJob has been
failing because the budget has zero accounts (deleted from the UI),
triggering BankSyncStale. Adds an `enabled` toggle that gates the core
Deployment + Service + Ingress + http-api + CronJob behind a single
plan-time bool while preserving the PVC, so we can flip back to true
later to restore the instance as-was.

Also fixes a latent bug where the http-api Service was always created
even when `enable_http_api=false`.

Apply: 7 resources destroyed (emo deploy/svc/ingress/cf dns/http-api
deploy+svc/cronjob), 0 changes for viktor/anca (moved blocks
migrated their state cleanly to the new [0] addresses). Pushgateway
job bank-sync-emo cleared manually; orphaned external-monitor
synced out by external-monitor-sync.
2026-05-13 19:55:21 +00:00
f63f10f7fa terminal: per-Authentik-user OS-user isolation; deny unmapped users
Restores the kernel-level isolation the pre-cutover ttyd-session.sh had,
but keeps the multi-session lobby UX:

- ttyd.service gets `-H X-authentik-username` back. `tmux-attach.sh` reads
  $TTYD_USER, looks up the local part in /etc/ttyd-user-map, denies the
  connection (no fallback to wizard) if there's no mapping, otherwise
  `sudo -n -H -u <os_user> tmux …`. Each Authentik identity → its own
  Unix user → its own `/tmp/tmux-<uid>/default` socket.
- tmux-api scopes every request to the same OS user via the same header.
  Adds /whoami so the lobby HTML can preflight access and render
  "logged in as <os_user> (<authentik>)" instead of leaving the user to
  discover the deny via a reconnect loop.
- Commits /etc/ttyd-user-map and the matching /etc/sudoers.d/ttyd-users
  fragment under files/devvm/ so future operators see one canonical
  source of truth. Current mappings: vbarzin → wizard, emil.barzin → emo.

Adding a user is now: append a line to ttyd-user-map + a NOPASSWD
sudoers line + `useradd -m`. README walks through it.

No Terraform changes — this is all DevVM-side + lobby JS.
2026-05-13 19:25:55 +00:00
e88ce131f1 terminal: cut over to multi-session lobby on terminal.viktorbarzin.me
Promotes the staged multi-session UX from term.viktorbarzin.me to the
primary terminal.viktorbarzin.me hostname. `ttyd.service` on the DevVM
moves to the same ExecStart that `ttyd-multi.service` was running:
`/usr/local/bin/ttyd -W -a -t enableClipboard=true -I
/usr/local/share/ttyd/index.html -p 7681 /usr/local/bin/tmux-attach.sh`.
The lobby HTML supersedes the old per-user-attach index.html
(ttyd-session.sh wrapper retired alongside).

Terraform: retires the `terminal-multi` Service+Endpoints and the
term.viktorbarzin.me ingress (Cloudflare DNS record for `term` is
released by module deletion). The tmux-api Service+Endpoints stay, but
its IngressRoute now matches terminal.viktorbarzin.me — same path-prefix
specificity wins against the catch-all ingress.

DevVM follow-up (applied manually as before — see files/devvm/README.md):
restart ttyd to pick up the new unit, stop+disable ttyd-multi.service.
2026-05-13 16:34:36 +00:00
root
cb37db4e7e Woodpecker CI deploy [CI SKIP] 2026-05-13 15:25:01 +00:00
Viktor Barzin
a724169b0e terminal: add multi-tmux-session lobby on term.viktorbarzin.me (additive)
New hostname term.viktorbarzin.me serves a session-picker UI that lists,
creates, and kills tmux sessions. Visiting ?arg=<name> attaches to that
session (auto-creates via tmux -A). Builds on a fresh ttyd instance
(7685) plus a tmux-api Go binary (7684) on the DevVM, both running as
User=wizard alongside (not replacing) the existing ttyd.service (7681),
ttyd-ro.service (7682), and clipboard-upload (7683). Cutover of
terminal.viktorbarzin.me to the multi-session setup is deferred.

Terraform diff is purely additive — terminal-multi/tmux-api Service +
Endpoints + ingress_multi (term.viktorbarzin.me, Authentik-gated) + an
IngressRoute that path-prefixes /api/sessions/* to tmux-api with the
matching strip-prefix Middleware.

DevVM-side units ship under files/devvm/ with a README — manual scp +
systemctl install (see files/devvm/README.md). ttyd 1.7.7 already
deployed there (≥1.7 needed for -a).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 15:24:04 +00:00
5754cfb56f monitoring(wealth): paint declining segments red on growth chart
Mirror the panel 5 treatment on panel 7 (Growth = market value −
contribution). Second SQL column emits the growth value only when
the point is part of a declining segment; field override paints it
red with no fill, spanNulls=false.
2026-05-12 19:22:43 +00:00
Viktor Barzin
0f8faca3d0 payslip-ingest, instagram-poster: suspend two chronic-failure cronjobs
Identified during alert-noise review as steady sources of JobFailed.
Suspending them stops the noise; unsuspend after the per-job blocker is
cleared.

* payslip-ingest/actualbudget-payroll-sync — blocked on Vault
  `secret/payslip-ingest` missing `actualbudget_encryption_password`.
  `actualbudget_api_key` and `actualbudget_budget_sync_id` were added
  (copied from `secret/fire-planner`) in the same session; the
  encryption password is not stored anywhere in Vault and needs to be
  populated separately. ExternalSecret sync has been failing since
  2026-04-25.

* instagram-poster/ig-refresh-token — the deployed image (:da5b4191)
  does not contain the `POST /ig-refresh-token` route; the route is
  defined in uncommitted working-copy changes at
  `instagram-poster/instagram_poster/app.py:695`. Unsuspend after the
  new image rolls.

Each `suspend = true` line carries an inline comment with the unsuspend
trigger.
2026-05-12 13:31:33 +00:00
Viktor Barzin
bf2fda87a8 monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h
Three improvements identified in the 7d alert-noise review:

A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures
   node-level pull error rate, which doesn't catch a single pod stuck
   in ImagePullBackOff — council-complaints sat broken for ~10h on
   2026-05-12 without paging. The new rule fires per-pod after 30m.

B. Two new inhibit_rules:
   - PVFillingUp (95% used, critical) suppresses PVPredictedFull
     (linear projection, warning) on the same PVC. Pair was producing
     ~24h of redundant firing per 7d.
   - EmailRoundtripFailing (active probe failure) suppresses
     EmailRoundtripStale (derivative >60min no-success). Same outage
     windows, ~14.5h of duplicate firing per 7d.

C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old
   30-minute window paged on the first failed iteration before the
   next run could recover. 2h means "still failing across at least
   two cron iterations" — much more actionable.

Verified live: rules loaded, inhibitors in alertmanager config,
PodImagePullBackOff is currently inactive (council-complaints
ImagePullBackOff actively detected — see separate fix).
2026-05-12 09:31:46 +00:00
Viktor Barzin
cbb1184ee4 monitoring: TraefikReplicaConfigStale — drop false-positive on stale series
The initial formulation used clamp_min(min(rate[2h]), 0.0001), which
made a recently-deleted pod's lingering rate=0 drive the ratio toward
infinity for up to 2h until the stale series aged out of the rate
window. With for: 2h, this was a near-miss for spurious firing in the
immediate aftermath of restarting the bad replica (our remediation
path).

Tighter formulation:
* 30m rate window — stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod
  ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12
  incident) sits well above it, so true positives still trip
* for: 1h — fast enough to catch the next incident, long enough that
  short rate dips don't flap

Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0
results with the live cluster's tight rate spread (~0.00065–0.0007/s
across all three Traefik replicas).
2026-05-12 08:37:17 +00:00
Viktor Barzin
700f7ae49c monitoring: detect stale Traefik replicas + reduce alert-storm cascading
Two new alertmanager inhibit rules and one new Prometheus alert,
informed by the 2026-05-12 incident where Traefik pod
traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers
vs 119 on healthy peers (stale K8s informer cache) and served 404 for
~1/3 of viktorbarzin.me traffic.

* New alert TraefikReplicaConfigStale: fires when max/min reload-rate
  ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h
  for-clause tolerates legitimate post-restart ramp-up; the bug
  pattern persists indefinitely.
* New inhibit: TraefikReplicaConfigStale suppresses the symptom
  alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical},
  IngressErrorRate5xxHigh, TraefikHighOpenConnections,
  ForwardAuthFallbackActive, AnubisChallengeStoreErrors,
  ExternalAccessDivergence) so only the actionable root cause pages.
* New inhibit: HomeAssistantDown suppresses
  HomeAssistantCriticalSensorUnavailable and
  HomeAssistantMetricsMissing — when HA itself is down, every sensor
  going unavailable is noise (10x firings observed in the last 12h).
* Extend NodeDown and NFSServerUnresponsive target lists to also
  suppress HomeAssistantCriticalSensorUnavailable.
2026-05-12 08:23:18 +00:00
Viktor Barzin
01bc16d592 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 23:54:22 +00:00
root
afda1d12b8 Woodpecker CI deploy [CI SKIP] 2026-05-11 20:46:02 +00:00
Viktor Barzin
df5edd8542 wealthfolio, paperless-ngx: drop migration-leftover -proxmox PVCs
The 2026-04-13 encrypted-PVC migration replaced the wealthfolio and
paperless-ngx data volumes with -encrypted variants but never removed
the original -proxmox PVC blocks from TF — both were sitting orphaned
with no pod mounting them, occupying 1Gi each of LVM thin pool. The
autoresizer also logged repeated "failed to get volume stats" for them
(no kubelet stats without a mounted pod), masking real signal.

  * wealthfolio: removed kubernetes_persistent_volume_claim.data_proxmox
  * paperless-ngx: removed kubernetes_persistent_volume_claim.data_proxmox
  (the paperless PVC turned out to be out-of-TF-state, so deleted via
   kubectl after the TF block removal.)
2026-05-11 20:44:17 +00:00
Viktor Barzin
5ef2642eda claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent
The 10Gi proxmox-lvm-encrypted PVC `claude-agent-workspace-encrypted` was
declared in TF but never wired into the deployment — the `workspace`
volume_mount pointed at an emptyDir, so the PVC sat allocated and idle
from 2026-04-15 to 2026-05-11.

Restructured per the design intent:
  * `workspace` (emptyDir) — fast per-pod ephemeral scratch for git clones.
    Each agent job clones the infra repo fresh, so persistence doesn't
    buy anything and emptyDir avoids RWO contention if the deployment
    is ever scaled past 1 replica.
  * `persistent` (5Gi NFS-backed RWX) — mounted at /persistent for cases
    where the agent needs to write state that should survive pod
    restarts (caches, ad-hoc outputs). RWX so all replicas share it;
    the service's sequential-mutex lock prevents concurrent writes.

Also fixed `fix-perms` init container: the Dockerfile's `WORKDIR
/workspace/infra` causes kubelet to create that path inside the
emptyDir as root:fsGroup with the setgid bit, which uid 1000 can't
write to. Pre-create the path + chmod 0775 to make it writable.

NFS export already exists on the PVE host
(/srv/nfs/claude-agent-persistent, owned 1000:1000).

Verified: pod runs 1/1; `/persistent` writable as agent uid 1000;
git-init successfully clones infra into /workspace/infra.
2026-05-11 20:40:21 +00:00
Viktor Barzin
38d33c98b6 monitoring: drop PVAutoExpanding alert — info-only noise, not actionable
PVAutoExpanding fired at >80% used (info severity), but pvc-autoresizer's
threshold is 10% free (= 90% used) — the alert always fired ~10 points
before any action would have been taken, and there was nothing for an
operator to do during that window either. It was a "heads up" that
didn't surface a problem.

Real failure modes are already covered:
  * PVFillingUp (critical, >95% for 10m) — autoresizer didn't keep up
  * PVPredictedFull (warning, predict_linear 24h) — trend toward exhaustion

Sharpened PVFillingUp's annotation to spell out the likely causes
(storage_limit reached, expansion failing, or missing autoresizer
annotations) so the responder doesn't have to recall the runbook.
2026-05-11 20:13:42 +00:00
858350fa93 monitoring(wealth): paint declining segments red on portfolio chart
Add a second SQL column on panel 5 that returns net_worth only when the
current point's previous or next neighbor is lower — i.e. the point is
part of a declining segment (including the peak and trough endpoints).
A field override draws this 'decline' series in red with no fill and
spanNulls=false, overlaying the green base line so down periods show
up as red on top of the climb.
2026-05-11 20:13:16 +00:00
Viktor Barzin
545cd2b854 healthcheck: probe uptime-kuma via internal Service (port-forward), not public URL
The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which
sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO
handshake the uptime-kuma-api library uses, and the library can't
complete the OAuth flow, so every healthcheck reported "Connection
failed" even though the pod was healthy and serving 225 monitors.

Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the
uptime-kuma namespace for the duration of the check, connect the
library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the
port-forward on the way out. The disown is to suppress bash's "Killed"
job notification on stderr, which corrupted stdout when stderr was
merged for JSON parsing.

Verified end-to-end: healthcheck now reports the real signal —
"external down(3): www, xray-vless, hermes-agent" — the same 3
Cloudflare-facing endpoints flagging in the uptime-kuma logs.
2026-05-11 20:02:57 +00:00
Viktor Barzin
cca7c07050 vault: move audit-PVC autoresizer annotations to kubernetes_annotations
Background: 2026-05-10 someone added `server.auditStorage.annotations`
to vault/main.tf attempting to enable pvc-autoresizer on audit-vault-N
PVCs. The vault helm chart maps that block into the StatefulSet's
volumeClaimTemplates, which is immutable post-creation on existing
StatefulSets. Result: 4 consecutive helm upgrade attempts (rev 16-19)
all rejected with "StatefulSet spec: Forbidden", leaving the release
stuck in failed state since 22:47 UTC that day. Live PVCs were
hand-annotated via `kubectl annotate` as a workaround, but the IaC
declared a path that couldn't be applied — every subsequent tg apply
on the vault stack would re-fail.

Fix:
  * Remove `annotations` block from `server.auditStorage` values
    (with a comment recording why it can't live there).
  * Add `kubernetes_annotations` resources for audit-vault-{0,1,2}
    with `force = true`, so Terraform adopts the existing annotations
    and tracks the desired-state in IaC going forward. The autoresizer
    cares about PVC annotations, not StatefulSet template annotations,
    so this is functionally equivalent.

Done out-of-band before commit (helm state was already corrupted):
  `helm rollback vault 15 -n vault` → revision 20 deployed (clean).

Verified: helm status vault = deployed; audit-vault-0 still has
threshold=10% storage_limit=10Gi annotations; cluster healthcheck
no longer reports vault/vault=failed.
2026-05-11 19:41:35 +00:00
Viktor Barzin
407a17d8cd state(vault): update encrypted state 2026-05-11 19:40:05 +00:00
Viktor Barzin
e692a0a0c5 ci: retrigger image rebuild — prior pipeline aborted during PG outage 2026-05-11 19:30:34 +00:00
Viktor Barzin
205c902de5 docs/auth: sync to current auth enum (required/app/public/none)
Replace the legacy `protected = true` reference with the four-tier
`auth` enum that's been live for weeks. Document the anti-exposure
guard (`scripts/check-ingress-auth-comments.py` + `scripts/tg`)
that enforces the inline-comment convention. Fix two stale paths:

  - `stacks/platform/modules/ingress_factory/` → `modules/kubernetes/ingress_factory/`
  - `stacks/platform/modules/traefik/middleware.tf` → `stacks/traefik/modules/traefik/middleware.tf`

Replace the single `protected = true` example with three: a
default Authentik-gated admin UI, an app-managed backend, and an
intentionally-public webhook receiver. Each example shows the
required comment line above the auth assignment.

[ci skip]
2026-05-11 19:28:42 +00:00
Viktor Barzin
85f1e92ad7 real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed)
Wires celery-beat to fire two periodic scrapes via the existing in-app
SchedulesConfig mechanism. Replaces the empty-string fallback with two
inline schedules expressed as Terraform-managed JSON:

- london-rent-daily: every day at 03:00 UTC, RENT, London, 1-2 bed,
  £1900-4000
- london-buy-weekly: every Sunday at 04:00 UTC, BUY, London, 1-2 bed,
  £400k-1.2M

Schedules live in `local.scrape_schedules` (jsonencode'd) rather than
Vault — they're configuration, not secrets, and benefit from being
version-controlled. The previous Vault-backed lookup
(`local.notification_settings["scrape_schedules"]`) was unused.

Verified live: new celery-beat pod logs
`Registering periodic task: london-rent-daily at 3:0` and
`london-buy-weekly at 4:0` immediately after roll-out.

Also tightens the comment above the wrongmove-api `auth = "none"` line
so it passes the new `scripts/check-ingress-auth-comments.py` guard
(pre-existing tech debt that blocked the apply).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 19:26:57 +00:00
Viktor Barzin
aee434469c ci: add python3 to infra-ci image — unblocks scripts/tg auth-comment check
Commit 0712a1b6 added a Python-based ingress_factory auth-comment check
that runs from scripts/tg on every plan/apply. The CI image
(forgejo.viktorbarzin.me/viktor/infra-ci) doesn't ship python3, so every
CI apply has been failing since with:

  env: can't execute 'python3': No such file or directory

Adding python3 to the apk install line restores CI applies for all stacks.
The build-ci-image.yml pipeline auto-fires on this commit (path filter
on ci/Dockerfile), so the rebuild + retag happens without manual action.
2026-05-11 19:26:55 +00:00
Viktor Barzin
53657d9952 infra: document auth = "app|none" tier on every legacy ingress
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.

Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
  - navidrome   (Subsonic user/password)
  - ntfy        (deny-all default + user.db tokens)
  - nextcloud   (WebDAV/CalDAV/CardDAV app passwords)
  - vaultwarden (Bitwarden-compatible token auth)
  - headscale   (OIDC + preauth keys for Tailscale nodes)
  - paperless-ngx (app-layer login + API tokens)

Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).

real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.

No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
2026-05-11 19:25:48 +00:00
Viktor Barzin
6fb2c1c7ba dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.

dbaas (Tier 0):
  * max_connections: 100 → 200
  * shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
  * effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
  * pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
    + ~16MB work_mem * concurrent sorts + OS cache + overhead)
  * Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
    which rolls the cluster (standby first, then primary failover).

monitoring:
  * New scrape job 'cnpg' on dbaas namespace pods labeled
    cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
    cnpg_cluster + cnpg_role labels for alert grouping.
  * PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
  * PGConnectionsCritical (critical, >95% for 3m) — last call before
    refusing connections.

Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
2026-05-11 19:20:54 +00:00
Viktor Barzin
0712a1b659 infra/scripts/tg: enforce ingress_factory auth-comment convention
Every `tg plan/apply/destroy/refresh` now runs
`scripts/check-ingress-auth-comments.py` against the current stack
before invoking terragrunt. The check fails closed if any
`auth = "app"` or `auth = "none"` line in the stack's .tf files lacks
an immediately-preceding `# auth = "<tier>": ...` comment documenting
what gates the app (for "app") or why the endpoint is intentionally
public (for "none").

Why tg-level (not git pre-commit): tg is the universal entry point
for all infra changes. CI runs it, headless agents run it, humans
run it. A pre-commit hook only catches the human path. Wiring the
check into tg means the anti-exposure guard fires regardless of who
or what is invoking terragrunt.

Stack-scoped: each stack documents itself the next time it's edited.
The 30+ existing `auth = "none"` stacks that predate this guard are
not blocked from operating today; they'll need the comment added the
next time someone runs `tg plan` on them — at which point the gate
forces a conscious "yes, this is intentional" moment before any
state change can land.

Skipped on: init, fmt, validate, output, etc. — anything that doesn't
read or write infra state.
2026-05-11 19:18:27 +00:00
Viktor Barzin
b91268fef4 state(dbaas): update encrypted state 2026-05-11 19:08:05 +00:00
Viktor Barzin
459b00fa74 infra/ingress_factory: add auth = "app" mode for self-authed backends
Adds a fourth auth tier alongside required/public/none. "app" is
functionally identical to "none" — no Authentik middleware attached —
but the distinct name records intent at the call site: this backend
has its own user login (NextAuth, Django, OAuth, bearer-token API,
etc.) and Authentik would only break it.

Why the new tier: with only required/none, every "the app has its
own auth so drop Authentik" decision looked identical at the call
site to "this is an OAuth callback / webhook receiver / native-client
API". Future readers couldn't tell whether a stack was intentionally
unauthenticated or relying on backend auth. Now they can.

Migrates the 8 stacks flipped earlier this session (novelapp, immich,
linkwarden, tandoor, freshrss, affine, actualbudget, ebooks/audiobookshelf)
from "none" to "app". Confirmed no-op: `tg plan` on novelapp showed
"No changes" — same middleware chain, same live state.

The variable description and the .claude/CLAUDE.md Auth section now
spell out the anti-exposure rule: only pick "app" or "none" AFTER
verifying the app has its own user auth ("app") or the endpoint is
intentionally public ("none"). Default stays "required" so accidental
omission fails closed.

[ci skip]
2026-05-11 18:59:20 +00:00
root
dafd7a18bc Woodpecker CI deploy [CI SKIP] 2026-05-11 18:56:29 +00:00
Viktor Barzin
a980f78b58 actualbudget+monitoring: per-account bank-sync metrics, drop noisy alert
The bank-sync CronJob was posting to /accounts/banksync which fans out to
ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls
per-account per-24h quota, a single rate-limited account would 500 the
whole call, and `bank_sync_success` would flip to 0 even though the data
itself was still flowing through manual UI syncs. Result: BankSyncFailing
fired routinely whenever the user had been active in the UI that day —
a structural false positive.

Fix:
  * CronJob: enumerate accounts via GET /accounts, POST per-account
    /accounts/{id}/banksync, emit bank_sync_account_success and
    bank_sync_account_last_success_timestamp labelled by account name.
    Roll up bank_sync_success = 1 iff any account succeeded.
  * Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale
    at 48h (global drought). Add BankSyncAccountStale at 72h (catches
    single-account auth expiry — the real signal we wanted).

Verified: manual run on bank-sync-viktor pushes 6 per-account success +
timestamp series; roll-up bank_sync_success=1; no firing alerts.
2026-05-11 18:55:15 +00:00
Viktor Barzin
5a271e70ab infra: drop Authentik forward-auth from 7 self-authed apps (auth = "none")
Apps with their own user auth + bearer-token APIs were being broken by
Traefik → Authentik forward-auth: every iOS/Android/native client got a
302 to authentik.viktorbarzin.me instead of the JSON they expected.
Authentik's 302+cookie dance can only be followed by a real browser.

Changed:
  - immich         (Immich mobile app + bearer-token /api)
  - linkwarden     (NextAuth + Linkwarden mobile clients)
  - tandoor        (Django auth + Tandoor mobile clients)
  - freshrss       (Fever/GReader API used by Reeder/FeedMe/etc.)
  - affine         (workspace auth + AFFiNE desktop/mobile sync)
  - actualbudget   (server password + Actual mobile/sync clients)
  - ebooks/abs     (Audiobookshelf iOS/Android app)

Each app's own auth is the gate now. CrowdSec + rate-limit + anti-AI
UA filter still front the ingresses. Same pattern as the novelapp
change earlier this session.

[ci skip]
2026-05-11 18:46:36 +00:00
Viktor Barzin
533e7b2a50 infra/novelapp: drop Authentik forward-auth (auth = "none")
novelapp handles its own user auth via NextAuth + Google OAuth, so the
ingress-level Authentik forward-auth was double-gating. Mobile webviews
(iOS/Android) can't follow the Authentik 302/cookie dance — they saw
HTML challenges where they expected JSON. CrowdSec + rate-limit +
anti-AI UA filter remain in front; novelapp's own login handles users.

[ci skip]
2026-05-11 18:32:15 +00:00
root
e1185721ca Woodpecker CI deploy [CI SKIP] 2026-05-10 22:49:51 +00:00
Viktor Barzin
8516dfbbe4 claude-memory / resume: unblock terragrunt apply (var defaults + psql -d postgres)
Two pre-existing apply failures uncovered during the Phase 4 mass apply,
unrelated to the auth refactor but blocking 100% rollout.

claude-memory:
- `var.claude_memory_db_password` had no default and wasn't passed by
  terragrunt → fall back to Vault `secret/claude-memory.db_password` via
  `coalesce(var.x, data.vault.data["db_password"])`.
- db-init Job was failing with `database "root" does not exist` because
  psql defaults the database name to the user when -d is omitted. Added
  `-d postgres` to all five psql invocations.

resume:
- `var.resume_database_url` had no default and wasn't passed → default to
  empty string. Vault carries the real value at `secret/resume.database_url`
  consumed at the deployment env-var level; the variable here just needs
  a value to satisfy the apply.

Also: priority-pass had lost most of its TF state (only 3 of 8 resources
tracked); imported namespace/service/pvc/deployment/ingress/tls-secret to
re-bind state with live K8s resources. No code change needed there.

Verified after re-apply:
- claude-memory.viktorbarzin.me → 200 (auth=none, native MCP responses)
- priority-pass.viktorbarzin.me → 302 → authentik (auth=required)
- resume.viktorbarzin.me → 302 → authentik public outpost (auth=public)
- 6 of 7 previously-failing applies now green; only vault remains, blocked
  by an unrelated helm chart immutable-StatefulSet-field issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 22:47:54 +00:00
Viktor Barzin
2f0e8c88a9 healthcheck: tune noise filters + nvidia-exporter auth=none
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":

1. prometheus_alerts: only count severity=warning|critical. Info-level
   alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
   alert rule itself sets severity; the script should respect it.

2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
   auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
   Lets Encrypt wildcard renews weekly; <14d is the only window where
   human attention is genuinely useful.

3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
   event/image/update domains (transient by design), skip friendly
   names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
   only count entities whose last_changed > 24h. Was 431/1470,
   most of which were "phone in standby" noise.

4. ha_automations: only flag DISABLED automations as abandoned if
   they've also been untouched (last_changed) for >180 days; raise
   stale threshold 30d → 180d. Was flagging seasonal/holiday-only
   automations as broken.

5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
   CronJob retry leftovers (Error/Failed phase pods that K8s keeps
   around for log inspection) aren't problematic at the cluster level.

6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
   shot failures were a recurring false-positive even though the
   service was healthy.

Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 22:27:39 +00:00
root
7d7c7e4b7f Woodpecker CI deploy [CI SKIP] 2026-05-10 22:03:36 +00:00
Viktor Barzin
71e8476e06 frigate: lan ingress auth=none for HA Sofia integration
The frigate-lan.viktorbarzin.lan ingress had Authentik forward-auth in
front. HA Sofia's frigate integration polls /api/config and only knows
how to use Frigate's own API key (not browser SSO), so every poll got
a 302 to authentik.viktorbarzin.me and the integration entered the
errors-state. Same pattern as idrac-redfish-exporter (5c594291).

allow_local_access_only IP allowlist + Frigate's API key are enough.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 22:02:21 +00:00
Viktor Barzin
bf752dffa5 fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.

Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.

Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 21:57:01 +00:00
Viktor Barzin
5c59429182 fix: HA Sofia REST sensors + PVC drift safety
Two real issues found while triaging HomeAssistantCriticalSensorUnavailable
alerts and the prometheus + technitium PVC Terminating-but-in-use
state from the earlier session.

1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required →
   auth=none. HA Sofia REST sensors scrape these endpoints
   programmatically; with Authentik forward-auth in front, every
   request got a 302 to authentik.viktorbarzin.me and the REST
   sensors parsed the HTML login page instead of metrics — leaving
   the R730, UPS, and ~20 other sensors permanently unavailable.
   The allow_local_access_only IP allowlist (192.168.0.0/16 +
   10.0.0.0/8) already gates external access, so authentik on top
   was breaking machine-to-machine traffic for no security gain.

2. prometheus_server_pvc + technitium primary_config_encrypted:
   add lifecycle.ignore_changes = [spec[0].resources[0].requests].
   The autoresizer expands these PVCs; PVCs can't shrink. Without
   the ignore, every TF apply tried to revert the live size back
   to the TF spec value, hit K8s's shrink-forbidden rule, and
   force-replaced the PVC. Because the pod still mounted it, the
   PVC went into Terminating-but-protected limbo — fine until a
   pod restart would have orphaned the volume. Root cause of the
   2026-05-10 PVC Terminating incident.

Bonus: prometheus_server_pvc threshold was the inverted "90%" (the
same bug the bulk fecfa211 sweep fixed elsewhere; my regex only
matched "80%" so this one slipped through). Now "10%".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 21:48:29 +00:00
Viktor Barzin
2db8bdac0d state(dbaas): update encrypted state 2026-05-10 21:00:00 +00:00
Viktor Barzin
5582977e1a vault: enroll audit-vault-0 in pvc-autoresizer (10Gi limit)
audit-vault-0 fills steadily with raft audit logs; without autoresizer
annotations it hits the 2Gi ceiling and Vault stalls on writes
(PVAutoExpanding alert was firing at 81% used). The Vault Helm chart
copies server.auditStorage.annotations onto the PVC at create time.

Live PVC already has the annotations applied via kubectl annotate;
this just keeps TF in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 20:01:06 +00:00
Viktor Barzin
9806c33359 dbaas: pg-cluster threshold 80%→10% in CNPG inheritedMetadata
Same misconfig as the bulk fecfa211 sweep, but the pg-cluster YAML
is buried inside a null_resource local-exec heredoc so the regex
didn't catch it. CNPG operator inherits these annotations onto each
member PVC (pg-cluster-1, pg-cluster-2), and reapplies them on every
reconcile — patching the live PVCs alone bounces back within seconds.

Live state already patched via kubectl patch cluster, this just keeps
TF in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 19:58:11 +00:00
Viktor Barzin
fecfa211fd fix: pvc-autoresizer threshold should be 10%, not 80%
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.

Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).

The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 19:56:16 +00:00
Viktor Barzin
5d0e17b5ba k8s-version-upgrade: detection script refresh apt before madison + DRY_RUN_OVERRIDE
Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while
apt-cache madison (without prior apt-get update) was reporting v1.34.5
— so the CronJob would have dispatched the agent against a stale
target. Now do `sudo apt-get update -qq` for just the kubernetes repo
before querying madison.

Also add a DRY_RUN_OVERRIDE env precedence so future test invocations
can override DRY_RUN without an apply cycle — but Job spec env is
immutable post-create, so this is only useful for CronJob spec edits
(suspend, then add env, then resume). Documented in the runbook.
2026-05-10 19:33:11 +00:00
Viktor Barzin
988bfde45c k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC
Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed
PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to
ssh into master and run etcdctl against a non-existent /mnt/main mount.
The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to
10 min, then parses the backup-manage container log for "Backup done"
line + byte count.

Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works
end-to-end at the planning level.

Expanded the claude-agent ServiceAccount's privileges via a sibling
ClusterRole (claude-agent-upgrade-ops):
  - patch namespaces/k8s-upgrade (in-flight annotation)
  - create batch/jobs (trigger etcd snapshot Job)
  - patch nodes (cordon/uncordon)
  - create pods/eviction (drain)
  - delete pods (drain fallback)
2026-05-10 19:16:12 +00:00
Viktor Barzin
a58d777059 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-10 19:07:42 +00:00
Viktor Barzin
f0ae61358b fire-planner / k8s-portal / insta2spotify: revert auth=public to auth=none
The Phase 4 audit promoted three "smoke-test candidates" from `protected = false`
to `auth = "public"`, but all three are XHR / curl-driven endpoints (fetch()
calls, automation scripts) that don't survive the 302+cookie redirect dance
that the public-auto-login flow requires on first visit. fire-planner's SPA
broke immediately — every fetch() to /api/* hit a cross-origin redirect and
CORS preflight rejected it.

Important learning for the `auth = "public"` design:

  `auth = "public"` is functionally equivalent to a normal Authentik forward-auth
  for the FIRST request — it issues a 302 to authentik to set a guest session
  cookie, then 302s back. This is invisible for top-level browser navigation
  but BREAKS:
    - XHR/fetch() under CORS preflight (preflight rejects redirects)
    - curl/automation scripts that don't preserve cookies across requests
    - Mobile / native clients that can't follow OAuth-style redirects

  Use `auth = "public"` only for top-level HTML pages where the user navigates
  via the browser address bar (or links). For XHR APIs, native-client surfaces,
  webhooks, OAuth callbacks — use `auth = "none"`.

The plan's "smoke test 3 candidates" were misjudged on this front. Reverting
all three to `auth = "none"` (their previous behaviour). The end-to-end public
flow IS verified working via curl + flow API — the design is sound, just the
test targets were wrong.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 19:00:11 +00:00
root
65c4fc6c0b Woodpecker CI deploy [CI SKIP] 2026-05-10 18:57:31 +00:00
Viktor Barzin
77d111f5fc owntracks: explicit auth = "none" — Phase 5 audit completion
The Phase 4 audit pass missed this site because the previous agent scoped
out owntracks (it overrides the factory's middleware list via
extra_annotations to use its own basic-auth middleware). Adding the explicit
auth = "none" satisfies Phase 5's "every ingress has an explicit decision"
goal and makes the intent visible — mobile OwnTracks clients post location
data via HTTP basic-auth and can't follow Authentik forward-auth 302s.

Closes the loop on Phase 5: 122/122 active ingress_factory call sites now
carry an explicit auth = "..." decision (zero callers rely on the default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:55:04 +00:00
Viktor Barzin
e4f806abe3 ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.

ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
  `protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
  middleware → dedicated public outpost → guest auto-bind. Logged-in users
  keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
  client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
  itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
  ingresses don't need anti-AI noise; the auth flow already discourages bots).

Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true`     → `auth = "required"`
- 8 explicit `protected = false`     → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
  reviewed individually:
  * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
    homepage, wrongmove UI, privatebin) → `auth = "none"`
  * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
    handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
    xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
    location ingestion, immich frame kiosk, headscale CP, send anonymous
    drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
    `auth = "none"`
  * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
    UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
  k8s-portal API, insta2spotify callback.

Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.

Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.

Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
Viktor Barzin
317d6aa99f instagram-poster: disable ig-ingest-stories CronJob until /ig-ingest ships
The endpoint exists in the working copy of instagram_poster/app.py
but isn't committed/built/deployed, so every cron fire returned 404
and triggered JobFailed alerts every 30 min.

Set count = 0 to leave the resource declaration in place — re-enable
by removing that line once the endpoint is in a built image.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:49:25 +00:00
Viktor Barzin
c647791774 scripts: timeout rsync + sqlite calls in daily-backup
Per-PVC rsync had no timeout, so any single hung PVC (e.g. on a
corrupted snapshot or a sqlite held open by a writer) blocked the
whole script until systemd's 4h TimeoutStartSec kicked in, leaving
every later PVC silently unbacked. Today's run hung on
mailserver/roundcubemail-enigma-encrypted at 05:09 and didn't recover
— hence WeeklyBackupFailing alert.

Now:
- rsync per PVC: timeout 30 min, exit 124 logged separately
- sqlite3 per database: timeout 5 min
- /etc/pve rsync: timeout 5 min

Each timed-out PVC bumps PVC_FAIL but the loop keeps moving.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:39:07 +00:00
8ff74bb422 cloudflare: disable AI bot edge-block so x402 can issue payment offers
CF zone was returning 403 to declared AI-bot UAs at the edge
(`ai_bots_protection: "block"`). That meant the in-cluster x402
gateway never saw the request and could never issue an HTTP 402 with
the wallet payment requirements — the bot just bounced.

Adopt `cloudflare_bot_management.zone` via root-module import block,
flip ai_bots_protection to "disabled". Bot Fight Mode (`fight_mode`),
crawler challenge (`crawler_protection`), and managed robots.txt are
unaffected — generic automated traffic still gets the bot fight gate.

End-to-end verified: `User-Agent: Mozilla/5.0 (compatible; ClaudeBot/
1.0;...)` on viktorbarzin.me now returns HTTP 402 (was 403 CF block)
with `payTo=0xCc33...659f`, `amount=10000` micro-USDC, `network=base`.

Trade-off: bots that don't pay still hit origin (instead of CF
blackholing them), so a small bandwidth uptick. Negligible at our
traffic level.
2026-05-10 18:37:29 +00:00
Viktor Barzin
02a12f1ae4 monitoring(prometheus): keep all 4 kubelet_volume_stats_inodes metrics
pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if
all four kubelet_volume_stats metrics (available_bytes, capacity_bytes,
inodes_free, inodes) are retrieved. The keep-list in the
kubernetes-nodes scrape job had available_bytes and capacity_bytes
(post 9d5da4d8) but was missing the two inode metrics, so the
autoresizer's reconcile logged "failed to get volume stats" for every
PVC and never resized anything.

Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free
to the regex.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:36:16 +00:00
Viktor Barzin
0e837b57b8 authentik: add public guest auto-login flow + dedicated outpost + traefik public middleware
Phase 1+2 of default-deny ingress plan. Adds the infrastructure for an `auth = "public"`
ingress tier that auto-binds anonymous requests to a `guest` Authentik user (no UI
prompt), so public sites are still recorded as authenticated by Authentik for audit
purposes — but as `guest`, not by leaking the standard catchall flow.

- guest user in `Public Guests` group (NOT `Allow Login Users`).
- `public-auto-login` flow: stage_binding policy sets `pending_user = guest`,
  `evaluate_on_plan = false` + `re_evaluate_policies = true` so flow_plan is
  populated when the policy mutates it; `authentication = none` lets anonymous
  requests enter.
- `Provider for Public` proxy provider (forward_domain, cookie_domain
  viktorbarzin.me) with `authentication_flow = public-auto-login`.
- Dedicated `public` outpost: only the public provider bound, deployed as
  `ak-outpost-public` Deployment+Service in the `authentik` namespace by
  Authentik's K8s controller.
- `public-auth.viktorbarzin.me` ingress exposes the public outpost's
  `/outpost.goauthentik.io/*` so OAuth callbacks land on it (the embedded
  outpost doesn't know about the public provider, so `authentik.viktorbarzin.me`
  callbacks would fail).
- `authentik-forward-auth-public` traefik middleware points at the public
  outpost service (not via the auth-proxy nginx fallback). The plan's
  `?app=public` dispatch idea was tested and rejected — the embedded outpost
  dispatches purely by Host header, so a dedicated outpost was the only way
  to isolate the public flow without conflicts.

No ingresses use the new middleware yet — Phase 3+4 (the ingress_factory
`auth` variable refactor + audit pass) wires it up. This commit is additive
and behaviour-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:26:16 +00:00
Viktor Barzin
1e4eac5386 proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true)
Without this annotation on the StorageClass, pvc-autoresizer's controller
filters the SC out at the index lookup stage and never patches any of its
PVCs, regardless of utilization or per-PVC threshold/increase/storage_limit
annotations. Internal metric pvcautoresizer_loop_seconds_total ticked but
no PVCs were ever evaluated — visible cluster-wide as PVAutoExpanding alerts
firing for forgejo-data-encrypted (82%) and audit-vault-0 (81%) without any
ResizeStarted events ever following.

The Prometheus scrape-config fix in 9d5da4d8 was a prerequisite (autoresizer
reads kubelet_volume_stats_available_bytes) but not sufficient on its own.

Also pinning chart version to 0.5.6 so the next apply doesn't incidentally
bump to 0.5.7.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:22:25 +00:00
c36858eddd x402: flip gateway live with Viktor's wallet + Slack payment notifications
Wires the traefik stack to read two new fields from secret/viktor:
  * x402_wallet_address     -> 0xCc33BD250d39752e0ceaB616f8a05F72274a659f
  * alertmanager_slack_api_url (existing) -> reused as the per-payment
    notification webhook so payment events arrive in the same Slack
    channel as other infra alerts.

Gateway now runs `wallet_set:true, dry_run:false`. Verified end-to-end:
  - Browser UA on all 9 sites -> 200 (passes through to Anubis)
  - python-requests/2.31 + scrapy + ClaudeBot UA -> 402 with
    PaymentRequiredResponse, payTo == Viktor's wallet, amount=10000
    micro-USDC, network=base, asset=Base USDC contract
  - Direct Slack-webhook test from inside cluster -> HTTP 200

Image bumped to forgejo.../x402-gateway:d9b83125 with Slack-format
notification payload (text=..., username=x402-gateway,
icon_emoji=💰; auxiliary fields preserved for richer receivers).

Notifications fire on every successful X-PAYMENT validation; failures
on Slack webhook are logged at WARN, never block the request, never
double-charge the bot.
2026-05-10 18:21:37 +00:00
Viktor Barzin
d1777d6119 kured(sentinel-gate): fix auth + write-perm so safety checks actually run
Test 3 validation surfaced two latent bugs in the sentinel-gate
DaemonSet that have been masked since 2026-04-18 (when uu was off,
nothing wrote /var/run/reboot-required, so the gate never had to
fire):

1. automount_service_account_token=false on both the SA and the
   pod spec → kubectl in the script falls back to localhost:8080
   on every call. Each check (`kubectl get nodes`, `kubectl get
   pods -n calico-system`, transition-time read) errors to stderr
   and emits empty stdout. `wc -l` reports 0 → checks "pass" with
   no real data.

2. bitnami/kubectl:latest runs as uid=1001 by default. The hostPath
   /var/run is root:root 0755 → final
   `touch /host/var-run/gated-reboot-required` failed with EACCES.
   Fail-safe by accident — but if anything had ever loosened those
   perms, the broken checks above would have green-lit the gate
   with no real validation.

Fix: enable token mount on the SA + pod, set
securityContext.run_as_user=0 on the container.

Verified post-fix: kubectl returns all 5 nodes, touch succeeds,
sentinel-gate now reports the correct
`BLOCKED: A node transitioned Ready within the last 24 hours
(soak window)` when triggered with k8s-node1's recent reboot
within the cool-down period.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:16:54 +00:00
Viktor Barzin
64c71615e8 scripts: cluster_healthcheck defaults to ~/.kube/config
The previous default of $(pwd)/config required running the script from
the infra/ directory or always passing --kubeconfig. From a parent
shell or any other working directory, the lookup hit a non-existent
file and kubectl returned a stale-token error, masking real check
results.

Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to
$(pwd)/config for backwards compatibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:12:40 +00:00
Viktor Barzin
a245e6e569 docs: add k8s node auto-upgrade runbook + architecture section
The OS-side counterpart to the service-upgrade pipeline. Covers
the unattended-upgrades + kured + sentinel-gate + Prometheus
halt-on-alert design landed in c0991f7f8.

Runbook: ops procedures (verify health, halt rollout, restore
config to a re-imaged node, roll back a bad upgrade, investigate
which alert is blocking).

Architecture doc: extends the existing service-upgrade flow with
a "K8s Node OS Upgrades" section (stack, sources of truth, day-2
mechanism, why-this-design rationale tied to the March 2026
post-mortem).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 17:26:15 +00:00
Viktor Barzin
acd9438e4f monitoring(grafana): swap python3 for jq in folder-ACL local-exec
CI image (ci/Dockerfile) is alpine + jq, no python3. The
grafana_admin_only_folder_acl null_resource was parsing /api/folders
with a python3 oneliner, which crashed every CI apply with
"python3: command not found" and made every monitoring stack apply
fail in CI (worked locally because the dev VM has python3).

jq is already in the CI image and produces the same output.
2026-05-10 17:09:33 +00:00
Viktor Barzin
016584651e docs/plans: 2026-04-20 infra audit design (post-research, post-challenge)
Adds the infra audit plan: 5 parallel research agents (Reliability,
Declarative, Maintenance, Scalability, Security) → 91 raw findings →
2 independent challengers → filtered/corrected/ranked backlog.

Already incorporates the challenger corrections (drops bad metric
pulls, reframes intentional-by-design items). Source for several
follow-ups already shipped this week (kured-prometheus gating, NFS
fsid post-mortem fixes, Authentik outpost postgres-backend).
2026-05-10 17:07:49 +00:00
Viktor Barzin
c0991f7f8f infra: re-enable unattended-upgrades with kured prometheus-gating
Reverses the March 2026 outage mitigation that disabled unattended-
upgrades cluster-wide. Now re-enables it on the k8s template VM with:

  - Allowed-Origins limited to security/updates pockets
  - Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark
    hold on the cluster-critical components)
  - Automatic-Reboot disabled — kured drives the actual reboots
  - Compatible with the existing kured + sentinel-gate flow

kured side:
  - rebootDelay 30s, concurrency 1
  - Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak
    window from the post-mortem)
  - prometheusUrl + alertFilterRegexp wired so any firing non-ignored
    alert halts the rollout. Ignore-list excludes self-referential
    alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/
    InfoInhibitor) that would otherwise deadlock kured.

Prometheus side (already partly landed in 6c4e0966 — the "Upgrade
Gates" rule group):
  - Refine `KubeQuotaAlmostFull` to include the resourcequota label in
    both the on-clause and the summary, so multi-quota namespaces
    (authentik, beads-server, frigate) report the quota name correctly.

grafana.tf: terraform fmt whitespace only.

Together with the post-mortem 2026-03-22 (memory id=390) the loop is
closed: unattended-upgrades runs again, kernel-class updates can land,
but only when cluster health is green and the reboot window is open.
2026-05-10 17:07:32 +00:00
Viktor Barzin
df435f3daa monitoring: protect grafana ingress with authentik + disable anonymous
- add traefik-authentik-forward-auth to grafana ingress middleware list
- disable auth.anonymous (was Viewer-by-default for the public)
- enable auth.proxy with X-authentik-username so Authentik users get
  signed in seamlessly (no double-login UX)

Prometheus and Alertmanager already had forward-auth — no change.
2026-05-10 17:01:50 +00:00
Viktor Barzin
6c4e096688 authentik: zero-endpoints alert + upgrade-validation checklist
Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on
sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which
is the symptom of the auth-proxy Emergency-Access fallback firing —
in turn caused by zero ready endpoints on the outpost service.

Why this rule and not `kube_endpoint_address_available == 0`:
kube-state-metrics endpoint metrics exist as series names but never
have current values in this Prometheus pipeline (something is dropping
them silently). Detecting the failure at the edge via Traefik is more
reliable than instrumenting the broken middle.

Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex
— the service label is `authentik-ak-outpost-...`, not
`authentik-authentik-outpost-...`, so the alert never matched any
series and never could have fired. Verified in Prometheus before/after
the fix.

Add an "Upgrade Validation Checklist" section to
`.claude/reference/authentik-state.md` with the seven-step smoke test
to run after Authentik chart bumps, provider bumps, or outpost pod
recreation. Covers the brittle surfaces (Service selector, JSON
patches, postgres backend wiring, access_token_validity TTL, edge
auth flow, plan-to-zero).
2026-05-10 16:54:48 +00:00
Viktor Barzin
af9556ca96 infra/instagram-poster: shared CNPG-backed benchmark DB, no PVC for scores
The instagram_poster.benchmark CLI was writing scores to a sqlite file
on the pod's data PVC. Moving it to the shared CNPG cluster so the
benchmark scoring path is stateless on the pod, scores survive pod
recreation, and the rotation/backup pipeline applies automatically.

- dbaas: null_resource.pg_instagram_poster_db creates role + DB
  (idempotent CREATE IF NOT EXISTS, password placeholder) — same
  shape as pg_postiz_dbs / pg_wealthfolio_sync_db.
- vault: vault_database_secret_backend_static_role.pg_instagram_poster
  + add to allowed_roles. 7d rotation_period.
- instagram-poster: second ExternalSecret (vault-database store) →
  K8s Secret instagram-poster-benchmark-db with BENCHMARK_PG_HOST/
  PORT/USER/PASSWORD/DATABASE. env_from on the deployment.
  reloader.stakater.com/match=true bounces the pod on rotation.

Code-side: instagram_poster/benchmark.py now resolves the DB URL from
BENCHMARK_DB_URL or BENCHMARK_PG_* env vars; falls back to sqlite for
local DevVM scratch runs. Schema bootstraps via Base.metadata.create_all,
no alembic step needed for the benchmark-only side-DB.

Verified end-to-end via DevVM port-forward: ESO synced, K8s Secret has
all 5 fields, pod env shows BENCHMARK_PG_*, smoke-test scoring 3 photos
landed in the new PG table with subject_category populated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 16:37:33 +00:00
Viktor Barzin
117b99e28f docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items
Update `.claude/reference/authentik-state.md`:
  - Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session
    Duration table with the gotcha that the gorilla session store binds
    the value once at outpost startup (rollout restart needed).
  - Replace the "session storage moved to Postgres in 2025.10" note that
    falsely implied the migration was automatic — explain that the
    `Outpost.managed` field gates the postgres path and our outpost
    silently stayed on `FilesystemStore` until 2026-05-10.
  - Document the goauthentik 2026.2.2 service-selector bug
    (service.py:52) and the JSON-patch workaround.
  - Document that the standalone embedded-outpost deployment needs
    `AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the
    `app.kubernetes.io/component=server` pod label.
  - Note the "Terraform doesn't expose `Outpost.managed`" assumption
    that holds the `managed=embedded` value in place across applies.

Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`:
  - P2 codify-in-Terraform: DONE.
  - P3 access_token_validity reduce: DONE-alt (we did the opposite —
    bumped to 4 weeks — because postgres backend mooted the storage
    concern).
  - P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses
    the loss-of-state class on the embedded outpost itself).
2026-05-10 16:28:11 +00:00
Viktor Barzin
30cdd05bd8 state(vault): update encrypted state 2026-05-10 16:28:09 +00:00
Viktor Barzin
d4e1b4c71a state(dbaas): update encrypted state 2026-05-10 16:27:51 +00:00
Viktor Barzin
07b943c31c authentik/pgbouncer: image_pull_policy IfNotPresent -> Always (match live)
The HCL declared `IfNotPresent` since module creation but the live
deployment reconciled to `Always` somewhere along the way (likely a
Helm/operator default). Since the image is `:latest`, `Always` is the
correct value — `IfNotPresent` would skip pulling updated images on
pod restart, defeating the point of the floating tag.

Drops the lone remaining drift in the authentik stack so plan-to-zero
holds across the whole stack, not just the resources I just adopted.
2026-05-10 16:22:24 +00:00
Viktor Barzin
59a7349564 authentik: codify proxy provider TTL + adopt embedded outpost
Bump access_token_validity to weeks=4 (was hours=168, UI-managed in
ignore_changes). Drives the cookie Max-Age and the proxysession.expires
TTL — keeps users logged in for 28d instead of 7d.

Adopt the embedded outpost into Terraform so the postgres-session-backend
fix from earlier today (2026-05-10) is described as code:
  - kubernetes_json_patches.deployment carries dshm 2Gi tmpfs, resource
    requests/limits, the app.kubernetes.io/component=server pod label
    (workaround for goauthentik 2026.2.2 service.py:52 selector mismatch
    on standalone embedded outposts), and AUTHENTIK_POSTGRESQL__* envFrom
    the shared `goauthentik` Secret so the postgres session backend can
    connect to the dbaas cluster.
  - kubernetes_json_patches.service replaces the controller-set selector
    (which targets app.kubernetes.io/name=authentik / the goauthentik-server
    pods) with the outpost's own labels — without this, endpoints are
    empty and auth-proxy falls back to Basic-Auth realm "Emergency Access".

The `managed` field ("goauthentik.io/outposts/embedded") is server-set
and not in the Terraform provider's schema, so TF preserves it across
applies (writes only fields it knows about). Plan-to-zero verified.
2026-05-10 16:18:42 +00:00
Viktor Barzin
cd2884ce94 infra/compute: bump k8s-node1 RAM 32 -> 48 GiB
Reason: GPU multi-tenancy (frigate + ytdlp-highlights + llama-swap +
immich-ml) was hitting 94% memory-request saturation on the old size.
The benchmark on 2026-05-10 surfaced this when llama-swap stayed
Pending despite GPU time-slicing being on (nvidia.com/gpu replicas=100)
- the actual constraint was node1 RAM, not GPU.

Procedure: drained node1, qm shutdown 201, qm set 201 --memory 49152,
qm start 201, kubelet picked up new capacity (47 GiB / 45.5 GiB
allocatable), uncordon, restored llama-swap + immich-ml.

Out-of-band qm set is the path here (not Terraform) because VMID 201
is intentionally not managed by TF yet - the telmate/proxmox provider
trips on iSCSI-disked VMs (see infra/stacks/infra/main.tf line 442).
Adopt this VM into TF once we migrate to bpg/proxmox.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 15:24:26 +00:00
Viktor Barzin
764c234b1c infra/llama-cpp: benchmark report + -fa flag fix
Phase 7 of the vision-LLM benchmark plan. Adds:

- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
  per-model analysis, top-N agreement, cost vs cloud APIs, sample
  captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
  100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
  for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
  of the flash-attention flag; without the value llama-server exits
  before serving any request).
- llama-cpp.md architecture doc links the report so future operators
  land on the deployed-and-evaluated model from one entry point.

300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 15:03:16 +00:00
Viktor Barzin
00dc756716 anubis: only challenge GET requests; allow everything else
PrivateBin's XHR `POST /` (paste creation) was the trigger — Anubis's
catch-all CHALLENGE rule served an HTML challenge page where the JS
expected JSON, breaking paste creation entirely. Same shape will hit
any SPA XHR or CORS preflight on the other 8 Anubis-fronted sites
(homepage actions, kms upload-then-poll, wrongmove search refresh,
jsoncrack share, etc.) the moment it gets exercised.

Add an `ALLOW` rule keyed on `method != "GET"` between the AI/UA-block
imports and the catch-all CHALLENGE. Rationale:

  * AI scrapers consume GET response bodies — they don't POST.
  * State-mutating XHRs and OPTIONS preflight need to bypass the
    challenge or the app breaks.
  * CrowdSec + per-route rate-limit + app-level auth already cover
    abuse on mutating methods, so this gives up nothing.
  * Hard-deny rules for known-bad bots run first, so a declared bad
    bot can't sneak through by sending a POST.

Also added a `checksum/policy` annotation on the Anubis pod template
sourced from `sha256(coalesce(var.policy_yaml, default_policy_yaml))`
so future policy changes auto-roll the deployment instead of needing
a manual `kubectl rollout restart`.

f1-stream had its own policy override (path carve-outs for SvelteKit
asset hashes and JSON data routes); mirrored the new rule there too.

Applied to all 8 Anubis-fronted stacks: blog, kms, f1-stream,
travel_blog, real-estate-crawler, homepage, cyberchef, jsoncrack.
Verified per stack: GET / returns the Anubis challenge page; POST,
PUT, DELETE, OPTIONS pass through to the backend (HTTP 301/405/502
from the upstream app, never the Anubis "not a bot" HTML).
2026-05-10 14:56:12 +00:00
root
4aeefd36a4 Woodpecker CI deploy [CI SKIP] 2026-05-10 14:48:36 +00:00
Viktor Barzin
dc286a67d1 privatebin: drop Anubis — broke XHR paste creation
PrivateBin's UI POSTs the encrypted blob to `/` via XHR. With Anubis in
front, the catch-all CHALLENGE rule returned an HTML challenge page
where the JS expected JSON, so paste creation failed silently for every
user. The challenge cookie didn't bypass it — Anubis appears to issue a
fresh challenge on POST regardless of cookie state.

Pastes are client-side encrypted; AI scrapers gain nothing from
indexing them, so the default `anti_ai_scraping` middleware is enough
protection. Restoring the ingress to point straight at the privatebin
service. CSP `wasm-unsafe-eval` retained — PrivateBin's zlib.wasm
needs it independent of Anubis.

This matches the rule already documented in infra/.claude/CLAUDE.md:
"DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints — clients
without JS can't solve PoW." A SPA's XHR is the same shape.

Verified: GET / returns PrivateBin HTML (not the Anubis challenge),
POST / returns PrivateBin's own JSON error envelope.
2026-05-10 14:47:48 +00:00
Viktor Barzin
34acd98785 infra/llama-cpp: add stack — llama-swap fronting Qwen3-VL + MiniCPM-V
Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three
GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one
OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc.
Idle TTL 10min so models unload between benchmark batches.

Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot
download Job pulls Q4_K_M GGUF + mmproj per model, creates stable
model.gguf / mmproj.gguf symlinks so the llama-swap config is
filename-agnostic, then warms the kernel page cache.

GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml
to 0 during benchmark windows. wait_for_rollout=false so apply
doesn't block on GPU availability.

Initial use case: vision-LLM benchmark for instagram-poster
candidate scoring; future consumers (HA, agentic tooling) hit
the same endpoint via LiteLLM at the gateway.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 14:13:40 +00:00
Viktor Barzin
ce65dc2385 kms: document native DNS auto-discovery (no client config needed)
LAN clients with DNS suffix viktorbarzin.lan now activate with zero
configuration — Windows queries _vlmcs._tcp.viktorbarzin.lan SRV by
default and the chain resolves through vlmcs.viktorbarzin.lan to the
new 10.0.20.202 KMS IP.

DNS state (Technitium primary, replicated to secondary+tertiary by the
existing technitium-zone-sync CronJob every 30 min):
- _vlmcs._tcp.viktorbarzin.lan SRV 0 0 1688 vlmcs.viktorbarzin.lan
  (was: target=kms.viktorbarzin.lan)
- vlmcs.viktorbarzin.lan A 10.0.20.202   (added)
- kms.viktorbarzin.lan A 10.0.20.200      (unchanged — still the
  Traefik LB for the user-facing website at kms.viktorbarzin.lan/)

vlmcs.viktorbarzin.lan was added as a dedicated KMS-server hostname
rather than retargeting kms.viktorbarzin.lan so the LAN-direct website
keeps working without depending on hairpin NAT through pfSense.

Verified end-to-end on WIN10Pro-DS32 (192.168.1.230):
slmgr /ckms → slmgr /ato → "Product activated successfully" with
"KMS machine name from DNS: vlmcs.viktorbarzin.lan:1688" and
"KMS machine IP address: 10.0.20.202". Real client IP 192.168.1.230
appears in vlmcsd log and in the slack-notifier sent line; second
activation within the dedup window correctly increments
kms_activations_dedup_skipped_total.
2026-05-10 13:51:45 +00:00
Viktor Barzin
d67c3027bc kms: per-connection state in notifier (vlmcsd is multi-threaded)
Bug found via E2E test against the Windows VM (VMID 300). The single
shared `state` dict in slack-notifier.py worked when vlmcsd processed
one connection at a time, but real Windows KMS activations hold the
connection open ~30 seconds (handshake + keep-alive). During that
window vlmcsd accepts other concurrent connections — most relevantly
the new kubelet TCP readiness probe every 5s — and each new OPEN line
reset the shared state, wiping the in-flight activation's
app/product/host before its CLOSE arrived. Result: real activations
were misclassified as probes (no Slack post, no metric increment).

Fix: state is now a dict keyed by `ip:port` with one sub-dict per
in-flight connection. A `__current` pointer tracks the most recent
OPEN so unkeyed detail lines (Application ID, Workstation name, etc.)
can be attributed correctly — vlmcsd writes detail lines immediately
after the OPEN and before any subsequent OPEN, so the heuristic holds.
Orphan CLOSEs (notifier started mid-conn) are now silently dropped
instead of emitting an empty probe event.

Two new regression tests:
- test_kubelet_probe_during_long_activation: 5s probe interleaved into
  a 31s activation block — exact production failure mode.
- test_orphan_close_no_event: bare CLOSE without prior OPEN.

Verified live: triggered slmgr /upk + /ipk + /skms 10.0.20.202 + /ato
on WIN10Pro-DS32. vlmcsd logged the full activation block, notifier
posted to Slack with ip=192.168.1.230 source=external
product='Windows 10 Professional' host='WIN10Pro-DS32.viktorbarzin.lan'
and kms_activations_total{product=Windows 10 Professional,
status=Licensed} 1 — real WAN client IP preserved through the
ETP=Local + dedicated MetalLB IP chain end to end.
2026-05-10 13:21:38 +00:00
Viktor Barzin
9163909fad fire-planner: imagePullPolicy=Always on alembic-migrate init container
After a rollout-restart, the main container (default Always for :latest)
pulled the new image with alembic 0003, but the init container
defaulted to IfNotPresent and reused a cached old image lacking 0003 →
"Can't locate revision identified by '0003'" → CrashLoopBackOff.

Setting Always on the init container so both containers stay in lockstep
across rollouts. Longer term we should switch the deployment to 8-char
git-SHA tags per the cluster policy in .claude/CLAUDE.md, but this
unblocks the Wave 1 deploy in the meantime.
2026-05-10 13:09:15 +00:00
Viktor Barzin
9d4e2fc4a0 kms: dedicate MetalLB IP 10.0.20.202 + filter probe noise
Two coupled fixes for the hourly Slack noise + missing client IPs:

1. Move windows-kms off shared 10.0.20.200 to a dedicated MetalLB IP
   10.0.20.202 with externalTrafficPolicy=Local, so vlmcsd sees real
   WAN client IPs (pfSense WAN forwards do DNAT-only; ETP=Local skips
   kube-proxy SNAT). Same pattern mailserver used pre-2026-04-19.
   Sharing 10.0.20.200 is blocked because all 10 services there are
   ETP=Cluster and MetalLB requires consistent ETP per shared IP.

2. Slack notifier now suppresses Slack posts for bare TCP open/close
   pairs (no Application/Activation block) — these are Uptime Kuma's
   port monitor and the new kubelet readiness/liveness probes. Probe
   counts go to a new metric kms_connection_probes_total{source} where
   source classifies the IP as internal_pod / cluster_node / external.
   Real activations are unaffected.

Pod fluidity: added TCP readiness/liveness probes on 1688 to gate Pod
Ready on the listener actually being up — required for ETP=Local so
MetalLB only advertises 10.0.20.202 from a node where vlmcsd is serving.

pfSense side (applied separately, not codified):
- New alias k8s_kms_lb = 10.0.20.202 (KMS-only)
- WAN:1688 NAT + filter rule retargeted from k8s_shared_lb to k8s_kms_lb
- All other forwards on k8s_shared_lb (WireGuard, HTTPS, shadowsocks,
  smtps, etc.) untouched

Runbook updated. Tests added for classify_source / is_probe / process_line.
2026-05-10 13:03:19 +00:00
Viktor Barzin
295cfd776c fire-planner: expose actualbudget creds via ExternalSecret
Adds 3 new keys (ACTUALBUDGET_API_URL/KEY/SYNC_ID) sourced from
Vault secret/fire-planner so the FastAPI backend can read viktor's
spending from the in-cluster actualbudget HTTP API and prefill the
Annual spending field on the WhatIf form. Vault keys seeded manually
ahead of this commit; ESO has already synced the K8s Secret.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 13:03:18 +00:00
Viktor Barzin
203a71768d x402: consolidate to a single shared forwardAuth gateway
The per-site `x402_instance` module created one Deployment + Service +
PDB per protected host (9 in total, 9×64Mi). Every pod was running the
exact same logic with the same config — the only thing that varied
was the upstream URL, which we don't even need since the gateway can
return 200 to "allow" and Traefik handles the upstream itself.

Refactor to the same pattern as `ai-bot-block`:
 * single deployment + service in `traefik` namespace, 2 replicas, HA
 * Traefik `Middleware` CRD `x402` (forwardAuth → x402-gateway:8080/auth)
 * each consumer ingress just appends `traefik-x402@kubernetescrd` to
   its middleware chain via `extra_middlewares`

x402-gateway gains a `MODE=forwardauth` env var that returns 200 (allow)
or 402 (with x402 PaymentRequiredResponse body) instead of reverse-
proxying. Image: ghcr ... f4804d62.

Pod count: 9 → 2 (78% memory saved). All 9 sites verified still
serving the Anubis challenge to plain curl with identical TTFB.
DRY_RUN until `var.x402_wallet_address` is set on the traefik stack.

Removes `modules/kubernetes/x402_instance/` (dead code now).
2026-05-10 10:54:38 +00:00
Viktor Barzin
786f0434cb x402: deploy payment gateway in front of Anubis on all 9 public sites
Adds modules/kubernetes/x402_instance/ — a small Go reverse proxy
(forgejo.viktorbarzin.me/viktor/x402-gateway:ce333419) that selectively
issues HTTP 402 Payment Required to declared AI-bot User-Agents and
validates X-PAYMENT headers against a Coinbase x402 facilitator.
Browsers are forwarded transparently to Anubis (which then handles the
JS PoW gate as before).

Wired into all nine Anubis-fronted sites:
  ingress -> x402-X -> anubis-X -> backend

While `wallet_address` is empty the gateway runs in DRY_RUN — every
request is transparent-proxied, no 402s issued. This lets the pod sit
in the request path with zero behavioural impact today; flipping the
wallet variable in the per-stack module call activates payment-required
mode for AI-bot UAs.

Default config: Base mainnet USDC, $0.01/req, x402.org/facilitator,
catch-all UA list (ClaudeBot|GPTBot|Bytespider|meta-externalagent|
PerplexityBot|GoogleOther|cohere-ai|Diffbot|Amazonbot|
Applebot-Extended|FacebookBot|ImagesiftBot|YouBot|anthropic-ai|
Claude-Web|petalbot|spawning-ai|scrapy|python-requests).

Verified post-apply: 9/9 pods Running, all 9 sites still serve the
Anubis challenge to plain curl with identical TTFB, x402 logs confirm
"dry_run":true on every instance.
2026-05-10 02:26:10 +00:00
root
c7b996a558 Woodpecker CI deploy [CI SKIP] 2026-05-10 01:25:35 +00:00
Viktor Barzin
3c340c1796 anubis: re-protect f1 with a per-host policy that allows JSON routes
Earlier f1 revert left the host fully unprotected (no Anubis,
exclude_crowdsec=true on the ingress already). Re-add Anubis with
a custom policy_yaml that:

- ALLOWs /_app/* (SvelteKit immutable JS/CSS chunks loaded before
  any cookie exists), /openapi.json, /docs, /api/* (FastAPI meta).
- ALLOWs the 9 known JSON/proxy routes (schedule, streams,
  embed, embed-asset, extract, extractors, health, proxy, relay)
  so the SvelteKit SPA's XHRs return JSON instead of the challenge
  HTML.
- Catch-all CHALLENGE for everything else — the SPA HTML pages
  (which fall through to FastAPI's `/{path}` catch-all) get the
  PoW gate.

The ALLOWed JSON routes are technically scrapeable by a determined
bot, but the user's stated goal is "avoid accidental scrapes" — the
HTML/SPA is the AI-training target, and that stays gated.

Verified: / → Anubis challenge HTML; /schedule, /streams → JSON;
/_app/.../app.js → text/javascript; ClaudeBot UA → Anubis deny page.
2026-05-10 01:24:50 +00:00
Viktor Barzin
b5f48e7b99 anubis: pull f1 off Anubis (XHR-vs-challenge collision) + add latency alerts
f1.viktorbarzin.me is a SPA whose JS fetches /schedule, /embed,
/embed-asset, … on the same path tree. With Anubis fronting `/`,
those XHRs land on the challenge HTML even when the cookie *should*
be valid, breaking the page with `Unexpected token '<', "<!doctype "
... is not valid JSON`. Removed Anubis from f1 — would need a path
carve-out (the way wrongmove does for /api) to re-enable. Added a
top-of-block comment so future me remembers why.

Plus four new Prometheus alerts in `Slow Ingress Latency` group
(stacks/monitoring/.../prometheus_chart_values.tpl):

- IngressTTFBHigh         (warn, 10m, avg latency >1s)
- IngressTTFBCritical     (crit, 5m,  avg latency >3s)
- IngressErrorRate5xxHigh (crit, 5m,  5xx >5%)
- AnubisChallengeStoreErrors (crit, 5m, any 5xx on *anubis* services
  via Traefik — proxies for the in-pod challenge-store error since
  Anubis itself only exposes Go-runtime metrics)

Notes from the alert author: avg-not-p95 because the existing
Prometheus scrape config drops traefik bucket series; once those
are restored, swap to histogram_quantile(0.95). TraefikDown inhibit
rule extended to suppress these four during a Traefik outage.
2026-05-10 01:01:52 +00:00
Viktor Barzin
efd28ccce5 anubis: fix 500 on multi-replica + roll out to 6 more public sites
Browser visits to viktorbarzin.me started returning HTTP 500 with
`store: key not found: "challenge:..."` in pod logs. Root cause:
each Anubis pod stores in-flight challenges in process memory; with
2 replicas behind a ClusterIP, the PoW-solved request can be
routed to a different pod than the one that issued the challenge.
Anubis upstream documents the same caveat ("when running multiple
instances on the same base domain, the key must be the same across
all instances" — true for the ed25519 signing key, but the
challenge store is still pod-local without a shared backend).

Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on
pod restart. Real fix (Redis-backed challenge store) noted as a
follow-up in CLAUDE.md.

Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json),
privatebin (pb), homepage (home), real-estate-crawler (wrongmove
UI only — `/api` ingress stays direct via path-based ingress carve-
out so XHRs from the SPA bypass the challenge).

End-state: 9 public hosts now Anubis-fronted (blog, www, kms,
travel, f1, cc, json, pb, home, wrongmove). All return the
challenge HTML to bare curl/browser; verified-IP search engines and
/robots.txt + /.well-known still skip via the strict-policy
allowlist.
2026-05-10 00:50:30 +00:00
Viktor Barzin
12fbc404ec anubis: strict bot policy — catch-all CHALLENGE for unmatched UAs
The default upstream policy only WEIGHs Mozilla|Opera UAs and lets
everything else (curl, wget, python-requests, scrapy, headless CLI
scrapers) fall through to the implicit ALLOW. On non-CDN-fronted
hosts (kms, anything dns_type=non-proxied) this meant a plain
`curl https://kms.viktorbarzin.me/` returned the real backend
content with no challenge — defeating the whole point of the
"avoid casual scrapers" intent.

Now the module ships a custom POLICY_FNAME mounted via ConfigMap:
- Imports the upstream deny-pathological / ai-block-aggressive /
  allow-good-crawlers / keep-internet-working snippets unchanged
- Adds a final `path_regex: .*` → action: CHALLENGE catch-all

Result: only IP-verified search engines (Googlebot from Google IPs,
Bingbot, etc.) and well-known paths (robots.txt, .well-known,
favicon, sitemap) skip the challenge. Everything else — including
spoofed-Googlebot-UA-from-random-IP — solves PoW or gets nothing.

Verified post-apply: curl default UA on viktorbarzin.me + kms +
travel returns the Anubis challenge HTML; /robots.txt still 200s
straight through.
2026-05-10 00:21:56 +00:00
Viktor Barzin
c73cd26a73 fire-planner: dual ingress — /api/* unprotected, / behind Authentik
The SPA can't carry an Authentik session on its own fetch() XHRs in
all cases (cross-origin redirect to authentik.viktorbarzin.me on a
stale cookie returns HTML, fetch().json() parse fails). Splitting
the ingress so /api/ paths skip forward-auth lets the React app talk
to its API end-to-end. The browser still has to log in via
Authentik to load the SPA at /.

Verified end-to-end via chrome-service Playwright: dashboard load,
scenario list, what-if run with real Monte Carlo, save-as-scenario
round-trip, run-now on detail, delete — all pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 00:06:40 +00:00
Viktor Barzin
f48da84770 anubis: per-site PoW reverse proxy on blog + kms + travel-blog
Adds modules/kubernetes/anubis_instance/ — a per-site reverse proxy
instance pinned to ghcr.io/techarohq/anubis:v1.25.0. Each instance
issues a 30-day JWT cookie scoped to viktorbarzin.me after a tiny
proof-of-work (difficulty 2 ≈ 250 ms desktop / 700 ms mobile). The
shared ed25519 signing key (Vault: secret/viktor → anubis_ed25519_key)
makes a single solve good across every Anubis-fronted subdomain.

Wired into blog (viktorbarzin.me + www), kms.viktorbarzin.me, and
travel.viktorbarzin.me — each with anti_ai_scraping=false on the
ingress so the redundant ai-bot-block forwardAuth is dropped from the
chain. Skipped forgejo (Git/API clients can't solve PoW) and resume
(replicas=0).

Also tightens bot-block-proxy nginx timeouts (3s/5s → 100ms/200ms) so
any ingress still using the ai-bot-block forwardAuth pays at most
~150 ms when poison-fountain is scaled down, instead of 3 s.

End-to-end TTFB on viktorbarzin.me dropped from ~3.2 s to ~150-200 ms.

Docs: .claude/reference/patterns.md "Anti-AI Scraping" updated to
4 layers; .claude/CLAUDE.md adds the Anubis usage paragraph and
Forgejo/API caveat.
2026-05-10 00:06:21 +00:00
root
27675cb8f1 Woodpecker CI Update TLS Certificates Commit 2026-05-10 00:03:53 +00:00
Viktor Barzin
5ea89ebcd7 postiz: disable signups (DISABLE_REGISTRATION=true)
Admin account already exists; we don't want random users registering
on the public-facing instance. Sign-in only from now on.
2026-05-10 00:02:27 +00:00
Viktor Barzin
ee8dd2f36c fire-planner: ingress port 8080 (was defaulting to 80)
ingress_factory's port var defaults to 80, but fire-planner publishes
on 8080. Traefik logged 'Cannot create service error="service port
not found"' and 404'd every request. Cloudflare's standard
origin-error decoy page (with the noindex meta + cdn-cgi/content
honeypot link) made it look like a bot-block, but it was just the
upstream coming back 404.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 23:33:11 +00:00
root
bb54b176ac Woodpecker CI deploy [CI SKIP] 2026-05-09 22:14:19 +00:00
Viktor Barzin
572d6cd8e0 kms: deploy slack-notifier sidecar with Prometheus metrics + document public exposure
Slack notifier now also exposes /metrics on :9101 with stdlib HTTP — counts
activations and dedup-skips by product, gauges last-activation timestamp.
Pod template gets the standard prometheus.io/scrape annotations so the
cluster-wide kubernetes-pods job picks it up via pod IP. Memory request
bumped to 48Mi to cover counter dicts + HTTPServer.

Plus docs: networking.md footnotes the windows-kms row noting public WAN
exposure with the rate-limited (max-src-conn 50, max-src-conn-rate 10/60,
overload <virusprot> flush) pfSense filter rule, and a new runbook covers
log locations, rate-limit tuning, and how to revoke the WAN forward.

The matching pfSense rule was tightened in place (TCP-only + rate limits)
via SSH; pfSense isn't Terraform-managed.
2026-05-09 22:12:46 +00:00
Viktor Barzin
82dc0f9687 state(dbaas): update encrypted state 2026-05-09 18:05:09 +00:00
Viktor Barzin
cfe969fe43 backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile
daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr
30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount,
which blocked the next run from completing — root cause of the WeeklyBackupStale
alert going silent (the metric never reached its end-of-script push).

Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting
  the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as
  belt-and-braces for any inherited stuck state from a prior crashed run
- TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of
  the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only on success;
  any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert
  threshold expired)

Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node
OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup
that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to
/srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end:
3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql
image is stripped (no curl/wget/python) — switched to docker.io/library/postgres
matching the dbaas/postgresql-backup pattern with apt-installed curl.

Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed
backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp). Updated to match what's actually emitted, and
added a "default-covered" footnote to the Service Protection Matrix so the
~40 services with PVCs not enumerated in the table are no longer ambiguous.

Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the
  loop got SIGTERMed against repeatedly, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
  validate the fix end-to-end

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 17:41:04 +00:00
Viktor Barzin
8f0502230b grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards
Wealth, Payslips, and Job-Hunter Grafana datasources all baked the
rotating PG password into their ConfigMap at TF-apply time, so every
7-day Vault static-role rotation silently broke the panels until a
manual `terragrunt apply`. Same family as the recurring grafana-mysql
backend bug — Grafana caches creds at startup and never picks up the
new ESO-synced password without a restart.

Fix:
- Each source stack now creates an ExternalSecret in `monitoring`
  exposing the rotating password as `<NAME>_PG_PASSWORD` env-var.
- Grafana mounts those via `envFromSecrets` (optional=true so a
  missing source stack doesn't block boot) and the datasource
  ConfigMaps reference `$__env{<NAME>_PG_PASSWORD}` instead of a
  literal password.
- `reloader.stakater.com/auto: "true"` on the Grafana pod restarts
  it whenever any of the four DB-cred Secrets is updated.

Tested end-to-end: forced `vault write -force database/rotate-role/
pg-wealthfolio-sync` → ESO synced (~30s) → reloader fired →
Grafana booted with new env in ~50s total → all three /api/datasources
/uid/*/health endpoints return "Database Connection OK".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 17:38:38 +00:00
Viktor Barzin
f9f19e4c54 mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB
mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit).
innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals
don't fit in 2 Gi. Bumping limit to 4 Gi (request 3 Gi) leaves headroom
without changing the buffer pool config.

/srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV
1 TiB online and ran resize2fs (now 60% used). Triggered by surfacing
during the 2026-05-09 IO-pressure post-mortem; thinpool had ~4.6 TiB
free.

The post-mortem also covers the stale-NFS-client trigger (legacy
/usr/local/bin/weekly-backup pointing at the decommissioned TrueNAS IP)
and the resulting wedged kthread on the PVE host. Script removed and
node_exporter restarted out-of-band; kthread will clear at next PVE
reboot. See docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 17:01:57 +00:00
Viktor Barzin
dd69dff3a9 ig-poster: bump to da5b4191 (auto-curate from recent favorites) 2026-05-09 13:57:08 +00:00
root
f89205a979 Woodpecker CI deploy [CI SKIP] 2026-05-09 13:39:14 +00:00
Viktor Barzin
47bb175a4f
n8n: real-time training loop + decoupled posting
instagram-approval: after every tap, immediately fetch /candidates?limit=1
and send the next photo as a fresh inline-keyboard message — the user's
tap chains back into this same workflow, so the loop is user-paced.
When the pool is exhausted, send an 'all caught up' summary with the
backlog count + cumulative training stats.

instagram-discover: cron throttled from every-30-min to daily 09:00.
The chain handles ongoing training; the daily run only kickstarts a
session if the user hasn't been tapping. Limit reduced from 3 → 1 so
each kickstart sends a single photo (chain takes over).
2026-05-09 13:37:53 +00:00
Viktor Barzin
77a84ae5e0
ig-poster b17a9737 + n8n discover rewritten to use /candidates with CLIP scoring 2026-05-09 13:31:28 +00:00
Viktor Barzin
b57b29f9f6
ig-poster: bump to 3b862fe4 (EXIF orientation + auto-pending /candidates) 2026-05-09 13:23:24 +00:00
Viktor Barzin
a9ee9cee60
ig-poster: 69e395f2 + sync IMMICH_PG_* via ESO for CLIP scoring; postiz publish-notify n8n workflow 2026-05-09 13:16:24 +00:00
Viktor Barzin
f61e7c9bfc
ig-poster: bump to cac6fa97 + sync POSTIZ_INTEGRATION_ID via ESO 2026-05-09 12:40:35 +00:00
Viktor Barzin
36d5cebb5c
postiz: expose /uploads publicly so Meta IG fetcher can pull JPEGs
Stories+feed posts via Postiz failed with state=ERROR and Postiz
mistranslated the cause as 'Invalid Instagram image resolution
max: 1920x1080px'. Real cause: Postiz hands Meta an upload URL
under https://postiz.viktorbarzin.me/uploads/... and Meta gets a
302 to the Authentik login page instead of bytes. Meta returns
error 36001 (image not fetchable) which Postiz maps to that
misleading resolution string.

Split the ingress: /uploads/* on a public ingress (matches the
instagram-poster /image+/original pattern), everything else
remains behind Authentik forward-auth. /uploads contents are
random UUIDs, low blast radius if scraped.
2026-05-09 12:29:39 +00:00
Viktor Barzin
d62a9dcda1 docs: PVC templates need lifecycle.ignore_changes for autoresizer
The canonical proxmox-lvm and proxmox-lvm-encrypted PVC templates were
missing `lifecycle { ignore_changes = [spec[0].resources[0].requests] }`.
Without it, every PVC created from these templates becomes a drift bomb
the moment pvc-autoresizer expands it: the next `tg apply` on that stack
will try to shrink the PVC back to the TF-declared size, K8s rejects the
shrink, and apply fails.

This was latent because pvc-autoresizer was silently broken cluster-wide
(commit 9d5da4d8 fixed it by allow-listing kubelet_volume_stats_available_bytes
in Prometheus). Now that the autoresizer actually works, every existing
proxmox-lvm/encrypted PVC without ignore_changes is at risk.

Sweep needed (separate task): grep for kubernetes_persistent_volume_claim
across stacks/ and add ignore_changes to any with resize.topolvm.io
annotations.
2026-05-09 12:02:18 +00:00
Viktor Barzin
e22c023c8c
postiz: wire INSTAGRAM_APP_ID/SECRET via ESO for IG-standalone provider
Standalone provider (instagram-standalone OAuth flow) is what the user
is trying after the FB-Login path was blocked by their Business Account
ad-policy flag. Uses modern scope names (instagram_business_*), so no
JS patch needed unlike the FB-Login provider.
2026-05-09 11:43:01 +00:00
Viktor Barzin
aa64500bc5 ci(drift-detection): generate kubeconfig from projected SA token
Same fix as default.yml — drift-detection cron also runs terragrunt
plan on every stack, which requires the kubeconfig at <repo>/config
that terragrunt.hcl injects via -var kube_config_path. Pipeline #547
(latest scheduled drift-detection run) failed with the same
'config_path refers to an invalid path' error.
2026-05-09 11:31:53 +00:00
Viktor Barzin
20738efe4e ci(woodpecker): generate kubeconfig from projected SA token
terragrunt.hcl injects -var kube_config_path=${repo_root}/config for
every terraform invocation, but the pipeline never created that file.
Every commit that touched a TF stack since #545 (2026-05-08) failed
with 'config_path refers to an invalid path: \"../../config\": no such
file or directory' followed by the kubernetes provider falling back
to localhost:80.

Add a step that writes a kubeconfig at <repo>/config using the
projected SA token + cluster CA. The woodpecker namespace's default
SA is already cluster-admin (woodpecker-default ClusterRoleBinding),
so the projected token is sufficient for any stack apply. Using
tokenFile (not an inline token) lets the provider re-read it if
kubelet rotates the projected token mid-pipeline.

#545 was the last green run because that commit only changed the
build-cli pipeline — 0 stacks applied so the missing kubeconfig
never mattered.
2026-05-09 11:26:47 +00:00
Viktor Barzin
ff4cca73b9 chore: remove decommissioned registry.viktorbarzin.me ingress
The old port-5050 R/W private registry was decommissioned 2026-05-07
(forgejo-registry-consolidation Phase 4). The reverse-proxy ingress
+ ExternalName service + Cloudflare DNS record kept pointing at the
dead backend, returning 502 to anyone hitting registry.viktorbarzin.me.

This was driving 3 monitoring artifacts that auto-cleared on cleanup:
- Uptime Kuma external monitor #586 (deleted)
- Pushgateway stale registry-integrity-probe metrics (deleted)
- ExternalAccessDivergence + RegistryIntegrityProbeStale alerts
2026-05-09 11:03:51 +00:00
Viktor Barzin
9d5da4d8e0 fix: restore pvc-autoresizer by allow-listing kubelet_volume_stats_available_bytes
The Prometheus scrape config for the kubernetes-nodes job kept
capacity_bytes + used_bytes but dropped available_bytes. pvc-autoresizer
computes utilization from available/capacity, so without that metric it
was silent for every PVC in the cluster — including mailserver, which
filled to 89% (1.7G/2.0G) and started rejecting all inbound mail with
'452 4.3.1 Insufficient system storage' (15+ hours, all real senders:
Brevo, Gmail, Facebook).

Also bumps the floors of mailserver (2Gi -> 5Gi, limit 10Gi) and forgejo
(15Gi -> 30Gi) PVCs to recover from the immediate outage, and adds
ignore_changes on requests.storage so future autoresizer expansions
don't cause TF drift.
2026-05-09 10:49:17 +00:00
Viktor Barzin
352586f711
ig-poster: pivot to Telegram-only delivery (manual IG upload)
User dropped Postiz/Instagram OAuth (Meta Business Account flagged
+ Postiz scope drift). New pipeline ends at Telegram — full-quality
JPEG delivered to the bot chat, manually uploaded to IG by the user.

- Image bumped to 25e46efd: adds /deliver/{asset_id} endpoint that
  multipart-uploads to Telegram (URL-fetch fails through Cloudflare
  for >5MB), then tags 'posted' in Immich.
- ESO now syncs telegram_bot_token + telegram_chat_id from Vault.
- Public ingress paths grow to ['/image', '/original'] (Authentik
  bypass on /original is harmless — files are user-tagged, low blast
  radius — and useful for ad-hoc browser downloads).
- Memory limit 512Mi -> 1500Mi: full-resolution Pillow HEIC decode
  was OOMing on 12MP+ phone photos.
- discover.json simplified to scan -> deliver per item; approval and
  post workflows already deactivated. Telegram bot webhook removed.
2026-05-09 10:45:02 +00:00
Viktor Barzin
c2e61cdf31
postiz: wire FACEBOOK_APP_ID/SECRET via ESO for IG-Business integration 2026-05-09 09:19:43 +00:00
Viktor Barzin
60dd6c61b5
postiz: idempotent Job to drop default Text search attributes (Temporal SQL visibility caps at 3 Text attrs; auto-setup ships with 2, Postiz adds 2 more — gitroomhq/postiz-app#1504) 2026-05-09 09:16:07 +00:00
Viktor Barzin
d3db617b1b
postiz: bump memory limit to 4Gi (was OOMing during NestJS startup) 2026-05-09 09:12:39 +00:00
Viktor Barzin
8c4a370a34
postiz: add Temporal sidecar; lock both stacks behind Authentik
Postiz backend was crashlooping on connect ECONNREFUSED ::1:7233 —
Postiz needs Temporal for cron/scheduled posts and the Helm chart
doesn't bundle it. Added a single-replica temporalio/auto-setup:1.28.1
Deployment in the postiz namespace, backed by the bundled
postiz-postgresql (separate `temporal` + `temporal_visibility`
databases pre-created via init container), ENABLE_ES=false (Postiz
only uses the workflow engine, not visibility search). Skips
DYNAMIC_CONFIG_FILE_PATH because that file isn't bundled in
auto-setup.

Auth audit:
- postiz: ingress now `protected = true` (Authentik forward-auth).
  Postiz also has its own login on top, but registration is no
  longer exposed to the open internet.
- instagram-poster: split into two ingresses on the same host.
  `/image/*` stays public (Meta + Telegram fetch the 9:16
  derivatives). Everything else (/healthz, /queue, /scan,
  /enqueue, /reject, /post-next) sits behind Authentik. The
  protected ingress sets dns_type=none — the public one already
  created the CF DNS record.
2026-05-09 09:08:21 +00:00
Viktor Barzin
7f7698991e
postiz + n8n: real DB URL + webhook-trigger approval
- postiz: set DATABASE_URL/REDIS_URL pointing at the bundled subcharts;
  the chart does NOT auto-wire even when postgresql.enabled=true, so
  the prisma db:push was failing with empty DATABASE_URL.
- n8n approval workflow: swap telegramTrigger -> webhook node so it
  works without an n8n-stored Telegram credential. Telegram bot's
  webhook is set via setWebhook to https://n8n.viktorbarzin.me/webhook/instagram-approval.
  Parse-callback Code node tolerates both shapes ({body:{callback_query:...}}
  vs {callback_query:...}) so a future move back to telegramTrigger doesn't break.
2026-05-09 00:55:19 +00:00
Viktor Barzin
71e3439650
postiz + instagram-poster: deploy fixes after first apply
- postiz: pin chart name to 'postiz-app' (was 'postiz', wrong path)
  and override bundled bitnami subchart images to bitnamilegacy/* —
  Bitnami removed bitnami/postgresql + bitnami/redis from DockerHub
  in Aug 2025 (Broadcom acquisition).
- postiz: enable initial registration (DISABLE_REGISTRATION=false)
  so first admin user can be created in UI; tighten after.
- instagram-poster: add securityContext (fsGroup/runAsUser=10001)
  so kubelet chowns the PVC mount for the non-root 'poster' user;
  was crashing on alembic with 'unable to open database file'.
- instagram-poster: bump image_tag to 24935ab4 (uvicorn now binds
  to port 8000 to match Service contract; was 8080 -> probe 404).
2026-05-09 00:47:14 +00:00
Viktor Barzin
d5a01b6ad2
instagram-poster: pin image tag to 23f8b4ed (initial push) 2026-05-09 00:09:58 +00:00
Viktor Barzin
7a93e09d2f
add postiz + instagram-poster stacks for IG Stories pipeline
New stacks:
- stacks/postiz/ — Postiz scheduler (Helm chart v1.0.5, image v2.21.7)
  with bundled PG/Redis, /uploads PVC on proxmox-lvm, JWT_SECRET
  via ESO from secret/instagram-poster.
- stacks/instagram-poster/ — custom Python service that polls Immich
  for the 'instagram' tag, reformats photos to 9:16 with blurred-bg
  letterbox, exposes /image/<asset_id> publicly so Postiz can fetch.
  Image: forgejo.viktorbarzin.me/viktor/instagram-poster.

n8n: 3 new workflows (discover, approval, post) for the Telegram
inline-button approval UX. Adds ExternalSecret + env vars for
TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, IMMICH_API_KEY, plus static
URLs for the new service.

Vault: seed secret/instagram-poster with telegram_bot_token,
telegram_chat_id, immich_api_key, postiz_api_token,
postiz_jwt_secret before applying.
2026-05-09 00:07:44 +00:00
Viktor Barzin
c97f0497dd openclaw: regenerate kubeconfig at pod start using projected SA tokenFile
The previously-baked kubeconfig at /home/node/.openclaw/kubeconfig retained
a service-account token bound to the original (long-dead) pod, so kubectl
calls from inside the openclaw container failed with "the server has asked
for the client to provide credentials" even though the openclaw SA has
cluster-admin and kubelet projects a fresh token at
/var/run/secrets/kubernetes.io/serviceaccount/token.

Add init-container "setup-kubeconfig" that writes a kubeconfig with
tokenFile + certificate-authority paths pointing at the projected
SA volume — kubelet auto-rotates the token, kubectl always reads
fresh creds, no Vault K8s-creds-engine refresh needed.

Verified end-to-end: agent ran `kubectl get nodes -o wide` inside the
pod and delivered a correct one-line summary to Telegram via
openai-codex/gpt-5.4-mini.
2026-05-08 08:07:38 +00:00
Viktor Barzin
7b8d442a26 [ci] build-cli: drop registry.viktorbarzin.me:5050 push (decommissioned)
The build-cli pipeline was still pushing to the
registry.viktorbarzin.me:5050/infra path that no longer exists
post Phase 4 — failing with 'error authenticating: exit status 1'
on every infra push. Drop the second repo + login; DockerHub +
Forgejo are the canonical destinations now.
2026-05-08 07:45:32 +00:00
Viktor Barzin
de96c54456 [woodpecker] Re-fix null_resource trigger after lint reverted it
The helm provider in this Terraform version doesn't support
list-index access on helm_release.metadata[0]. Switch the
woodpecker_server_host_alias trigger to {helm_version, sha256(values)}
which works regardless of provider quirks. (Original fix landed
2026-05-07; got reverted by a linter pass.)
2026-05-08 07:42:56 +00:00
Viktor Barzin
b91ae22edb f1-stream: register HmembedsExtractor in registry
Companion commit to 92474254 — the new extractor wasn't being
registered, only the file was added. Add the import + register call
in create_registry().

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:47:50 +00:00
Viktor Barzin
92474254e6 f1-stream: hmembeds offline decoder — reverse-engineered the JW Player trap
Four-agent parallel investigation finally pinned down what's happening
with the hmembeds.one streams. The TL;DR is unexpected: there is no
fingerprint check, no decoder failure, no broken JS — the obfuscated
decoder is trivial to reproduce, but the upstream origin is dead.

Findings (saved at /tmp/jwre/{findings.md, blob-analysis.md,
fingerprint-gap.md, trace-summary.md}):

1. **The "ZpQw9XkLmN8c3vR3" blob is decoy.** It's an Adcash adblock-
   bypass config — not the stream URL. The actual stream URL is in a
   different inline `<script>` block of the embed HTML.

2. **The real decoder is base64 + XOR with a hardcoded key**, the key
   appears literally in the HTML (e.g. `var k="bux7ver6mow4trh1"`).
   No browser-derived inputs. We can run it in Python in 50µs.

3. **The decoded URL is JWT-bound to /24 of the requestor's IP**. JWT
   payload: `{stream, ip:"176.12.22.0/24", session_id, exp}`. From our
   cluster (egress 176.12.22.76) the JWT IP-binding is satisfied.

4. **The origin still returns 404 (GET) / 403 (HEAD).** Tested both
   curated embeds (Sky F1 888520f3..., DAZN F1 fc3a5463...) — same
   404. Origin landing page (`/`) returns 200, so the host is up;
   the `/sec/<JWT>/<embed_id>.m3u8` endpoint specifically refuses.

5. **No fingerprint surface trips this.** Runtime trace via
   chrome-service hooks confirmed: decoder reads navigator.userAgent
   (heavy), screen dimensions, and a single WebGL getParameter call.
   No canvas, audio, fonts, fetch-to-fingerprint-API. JW Player setup
   is given a valid file URL — the playlist stays empty because JW
   can't fetch the manifest from the (dead) origin.

Verdict: **the legacy curated hmembeds embeds (`888520f3...` Sky F1,
`fc3a5463...` DAZN F1) are upstream-dead.** No browser-side fix is
possible. The community uses these IDs as "24/7 channels" but they're
in a perpetually-offline state right now.

This commit ships the offline decoder anyway, registered as a new
extractor. Two reasons:
- If those origins come back online, no code change needed.
- Future curated hmembeds IDs (added by hand or discovered via
  subreddit posts) will resolve through the same path.

Files added: `extractors/hmembeds.py` (~120 lines incl. the decoder
and a `decode_embed(html) -> str | None` helper that's reusable).
Registered in `__init__.py`. The existing CuratedExtractor stays
disabled; this replaces its mechanism with one that can absorb new
embed IDs without code changes.

Bonus from the agent work:
- Confirmed our stealth.js is sufficient — the runtime trace showed
  the decoder reads only the surfaces we already cover.
- Identified ~10 fingerprint surfaces we don't spoof (platform,
  userAgentData, hardwareConcurrency, deviceMemory, timezone,
  AudioContext, ICE candidates) but proved they're not what's
  blocking us, so no change needed for now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:47:25 +00:00
Viktor Barzin
02a6a955f5 [woodpecker] Programmatic Forgejo repo registration
Earlier I claimed the OAuth Web UI flow was the only way to onboard
new Forgejo repos in Woodpecker. That's wrong.

Two parts to the actual workaround:
1. Woodpecker session JWTs are HS256 signed with the user's per-user
   `hash` column from the PG `users` table (NOT the global agent
   secret). Mint a session JWT for the Forgejo viktor user (id=2,
   forge_id=2), and you're authenticated as that user.
2. POST /api/repos?forge_remote_id=N as viktor → Woodpecker calls
   Forgejo with viktor's stored OAuth access_token to create the
   webhook + per-repo signing key. Works.

The 500 I saw earlier was from POST'ing as ViktorBarzin (GitHub
admin), whose user row has no Forgejo OAuth token — Woodpecker's
forge-API call fails for that user, surfacing as a 500.

scripts/woodpecker-register-forgejo-repo.sh wraps the whole flow:
extract hash from PG → mint JWT → activate repo. Verified against
viktor/{broker-sync,claude-agent-service,freedify,hmrc-sync} in
this session — all activated cleanly.

Also updated the runbook with the actual mechanism + the
WOODPECKER_FORGE_TIMEOUT=30s tip (the real root cause of the
'context deadline exceeded' failures, NOT the v3.14 upgrade).
2026-05-07 23:33:26 +00:00
Viktor Barzin
bab760b8fb kms: replace inline ConfigMap nginx with custom Hugo image
The kms-web-page deployment now pulls
forgejo.viktorbarzin.me/viktor/kms-website:${var.image_tag} (source
in the new Forgejo repo viktor/kms-website). The ConfigMap-mounted
index.html is gone — the new site is a Hugo build with full GVLK
catalog for every Microsoft KMS-eligible Windows + Office edition,
copy-to-clipboard, dark/light themes.

The container image tag is managed by CI (kubectl set image), so
add lifecycle ignore_changes on container[0].image alongside the
existing dns_config (Kyverno) ignore.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:33:26 +00:00
Viktor Barzin
60aec74acd f1-stream: Stremio addon extractor — TvVoo + StremVerse Sky F1 / DAZN F1
5 parallel research agents surveyed Stremio addons, F1 TV / Sky / DAZN
official APIs, IPTV M3U lists, and free-to-air broadcasters. The clean
finding: two community Stremio addons already index Sky Sports F1 +
DAZN F1 via their public HTTP APIs — no Stremio client required, just
GET /stream/<type>/<id>.json on the addon's hosted instance.

New `stremio.py` extractor pulls from:
- **TvVoo** (`https://tvvoo.hayd.uk/manifest.json`) — wraps Vavoo IPTV.
  Lists Sky Sports F1 UK + Sky Sports F1 HD + Sky Sport F1 IT + Sky
  Sport F1 HD DE + DAZN F1 ES. Returns 2 IP-bound m3u8 URLs per
  channel. Source: github.com/qwertyuiop8899/tvvoo. Vavoo's CDN SSL
  certs are currently expired so most clients fail verification today
  — addon framework is right but delivery is degraded.
- **StremVerse** (`https://stremverse.onrender.com/manifest.json`) —
  Returns 11+ streams per id (`stremevent_591` = F1, `stremevent_866`
  = MotoGP). Mix of DRM-walled DASH, JW-broken-chain JWT URLs, and
  HuggingFace-Space proxies that 404 without a per-instance api_password.

The extractor surfaces 15 candidate URLs per run; verifier filters to
the playable subset. Today that subset is 0 (Vavoo cert expiry + JW
chain + proxy auth), but the wiring is correct: as the addons fix
delivery or rotate to fresh URLs, candidates will start passing.

Other agent findings worth noting (not coded but documented):
- F1 TV Pro live = Widevine DASH; impossible without a CDM. VOD is
  clean HLS but only post-session.
- Sky Go / DAZN / Viaplay / Canal+ = all Widevine + geo-fenced + active
  DMCA enforcement. Pursuing not feasible.
- ServusTV AT (free F1 race weekends) = clean public HLS at
  rbmn-live.akamaized.net/hls/live/2002825/geoSTVATweb/master.m3u8 but
  geo-fenced; needs an Austrian-IP egress proxy/VPN.
- iptv-org/iptv has an F1 Channel (Pluto TV IE) at
  jmp2.uk/plu-6661739641af6400080cd8f1.m3u8 — 24/7 free, BG works,
  but only historic races + shoulder programming. Worth adding as a
  curated entry later.
- boxboxbox.* (community-favourite F1 race-weekend domain) is dead
  across all known TLDs as of today.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:16:39 +00:00
Viktor Barzin
d3d86ce433 [woodpecker] Bump WOODPECKER_FORGE_TIMEOUT 3s → 30s
The default forge-API timeout is 3 seconds. The config-loader makes
4-6 sequential calls per pipeline trigger (probing for .woodpecker dir
then each .woodpecker.{yaml,yml} variant), and Forgejo responses on
this cluster spike to 1-2s under load — easy to trip the cumulative
3s deadline. Result: 'could not load config from forge: context
deadline exceeded' on virtually every pipeline trigger.

This was the actual root cause of the 'Woodpecker forge-API bug'
that v3.13 → v3.14 was supposed to fix — turns out v3.14 didn't
change the timeout default, and the v3.13 successes I saw earlier
were warm-cache flukes.
2026-05-07 23:10:48 +00:00
Viktor Barzin
63ea18d930 [docs] Onboarding runbook for new Forgejo repos in Woodpecker 2026-05-07 22:55:07 +00:00
Viktor Barzin
79caba9904 state(vault): update encrypted state 2026-05-07 22:53:04 +00:00
Viktor Barzin
c5a0424571 f1-stream: subreddit extractor scans r/motorsportsstreams2 (active sub)
User asked specifically for r/motorsportstreams. Reddit banned that sub
years ago; the active 12.5k-subscriber successor is r/motorsportsstreams2.
Added it to SUBREDDITS plus r/f1streams (709 subs, public).

Also extended:
- SEARCH_QUERIES with three Sky Sports F1 / live-stream phrases that
  catch the `[F1 STREAM]` post pattern the community uses on race
  weekends (titles like "[F1 STREAM] Bahrain GP - Live Race | No Buffer
  | Mobile Friendly" linking to boxboxbox.pro/stream-1).
- _INTERESTING_HOSTS allowlist with boxboxbox.{pro,live,lol},
  pitsport.live, ppv.to, streamed.pk, acestrlms/aceztrims, and the
  Super Formula direct CDNs (racelive.jp, cdn.sfgo.jp) — all observed
  in last-50-posts on r/motorsportsstreams2.

Where this leaves us, honestly:
- The r/motorsportsstreams2 megathread "Where to watch every F1 race"
  recommends EXACTLY the four sites we already pull from: pitsport.xyz,
  streamed.pk, ppv.to, acestrlms. The community has the same broken JW
  Player chain we have for Sky Sports F1 24/7 streams. There is no
  free-and-working alternative they know about.
- boxboxbox.pro (the most-promoted F1 stream domain in race-weekend
  posts) is currently NXDOMAIN; .live is parked, .lol unreachable. The
  domain rotates after takedowns; Reddit posts will surface fresh ones
  when posters share them.
- For F1 specifically: extractor surfaces 2 motomundo.net candidates
  (MotoGP wrappers) and lights up to ~6+ during F1 race weekends as
  posters share fresh boxboxbox/equivalent URLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 22:42:51 +00:00
Viktor Barzin
4326f09baa monitoring(wealth): monthly contrib-vs-mkt as line chart, not bars
User asked for two lines instead of side-by-side bars at monthly
granularity. Converts panel 25 from barchart to timeseries:

  * type: barchart -> timeseries
  * format: table -> time_series, SELECT month::timestamp AS time
  * drawStyle line, lineWidth 2, fillOpacity 0, showPoints auto
  * Same blue (contributions) / green (market gain) colour overrides

Where the green line rises above the blue line is the visual cue that
the market out-earned new contributions for that month -- the trend
the user wants to track.

Diff is small (15 ins / 28 del) because the bar-chart-only fields
(barRadius, barWidth, groupWidth, stacking, xField, xTickLabelRotation)
are dropped.
2026-05-07 22:41:46 +00:00
Viktor Barzin
a2e9a0fecf monitoring(wealth): monthly contributions vs market gain bar chart
Goal stated by user: see when monthly market gain starts to exceed
monthly contributions, i.e. the inflection point where the market is
out-earning savings rather than the other way around.

New panel id=25 between the annual decomposition (13) and per-account
ROI (14): bar chart with two side-by-side bars per month --
contributions (blue) and market gain (green). Same calculation as
panel 13 but month-grain instead of year-grain. Months where the
green bar dwarfs the blue one are visible at a glance.

SQL: same endpoints CTE pattern as panel 13, with date_trunc('month',
valuation_date) as the grouping key. Uses max_complete cutoff so
partial-today doesn't skew the latest month.

Layout: panels at y >= 75 shifted down by 11 (chart height). New
chart at y=75; panel 14 (per-account ROI) -> y=86; panel 10
(activity log) -> y=96.

Spot check (recent months from PG):
  2025-07: contrib +£5,601    market +£42,295   <- big market month
  2025-09: contrib +£1,501    market +£24,206
  2026-02: contrib +£35,501   market +£41,382
  2026-03: contrib +£5,501    market -£38,483   <- correction
  2026-04: contrib +£73,267   market +£21,448
2026-05-07 22:38:31 +00:00
Viktor Barzin
f6e764a9b0 [wealthfolio] Flip wealthfolio-sync CronJob image to Forgejo
The CronJob has been broken since registry-private lost the
wealthfolio-sync image (last successful run 36+ days ago). The image
is built from /home/wizard/code/broker-sync (the brokerage data sync —
Trading 212, Schwab, Fidelity, IMAP-CSV → wealthfolio).

Set up: viktor/broker-sync repo on Forgejo with .woodpecker/build.yml
that pushes to forgejo.viktorbarzin.me/viktor/wealthfolio-sync. Until
Woodpecker recognises the new repo's webhook, the image was bootstrapped
via 'docker pull viktorbarzin/broker-sync:latest && docker tag … &&
docker push forgejo.viktorbarzin.me/viktor/wealthfolio-sync:latest' so
the CronJob unblocks immediately.
2026-05-07 22:38:14 +00:00
Viktor Barzin
8f1ee9a55f [woodpecker] Bump server + agent v3.13.0 → v3.14.0
Fixes the 'could not load config from forge: context deadline exceeded'
issue that blocked every Forgejo-triggered pipeline during the
forgejo-registry-consolidation cutover. Helm chart 3.5.1 stays
(no 3.6 yet); only the image tag overrides change.
2026-05-07 22:28:56 +00:00
Viktor Barzin
5030b44535 [forgejo] Phase 4 final decommission: drop registry-private container + port 5050
Image migration completed (forgejo-migrate-orphan-images.sh ran +
all in-scope images now under forgejo.viktorbarzin.me/viktor/) and
the cluster cutover landed in commit 3148d15d. registry-private is
no longer needed.

* infra/modules/docker-registry/docker-compose.yml — registry-private
  service block removed; nginx 5050 port mapping dropped.
* infra/modules/docker-registry/nginx_registry.conf — upstream
  private block + port 5050 server block removed.
* infra/.woodpecker/build-ci-image.yml — drop the dual-push to
  registry.viktorbarzin.me:5050; only push to Forgejo. Verify-
  integrity step removed (the every-15min forgejo-integrity-probe
  in monitoring covers it). Break-glass tarball step still runs but
  pulls from Forgejo (the only registry left).

The registry-config-sync.yml pipeline will pick this commit up and
sync the new compose+nginx to the VM. Manual final step on the VM:
  ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
to actually destroy the registry-private container — compose does
NOT do orphan removal on a normal up -d.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 19:08:17 +00:00
Viktor Barzin
48e504bbb0 [claude-memory] Restore truncated main.tf — apply Phase 3 image flip on full file
The Phase 3 commit 3148d15d ran into a disk-full ENOSPC during edit
of stacks/claude-memory/main.tf, and the file was committed truncated
at line 286 mid-string ('Cor instead of 'Core Platform' / closing
braces). terraform validate failed with 'Unterminated template string'.
Restoring the trailing 2 lines + re-applying the
viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/
claude-memory-mcp:17 cutover that Phase 3 was meant to do.
2026-05-07 19:05:34 +00:00
Viktor Barzin
69de66ed88 chrome-service: open NP for Traefik → noVNC sidecar (port 6080)
Existing NetworkPolicy only admitted port 3000 (Playwright WS) from
labelled client namespaces, blocking Traefik's traffic to the noVNC
sidecar on port 6080. The chrome.viktorbarzin.me ingress would hang
forever — page never loads, eventually times out.

Adds a second ingress rule allowing TCP/6080 from the traefik
namespace only. Authentik forward-auth still gates external access
at the Traefik layer.

Also reconciles the noVNC image to the new Forgejo registry path
(:v4 unchanged) — already declared in TF, just live-state drift from
the Phase 3 registry consolidation.

Updates the architecture doc; the previous text still described the
old nginx static health stub that noVNC replaced.
2026-05-07 18:40:11 +00:00
Viktor Barzin
49f4956ff7 [forgejo] Restore registry-private temporarily until image migration completes
The Phase 4 docker-compose + nginx changes I landed earlier dropped
the registry-private container's port-5050 listener BEFORE migrating
the existing images to Forgejo. The registry-config-sync pipeline
applied the new nginx config, breaking pulls from registry-private —
which is the source of every image we still need to copy to Forgejo.

Restore registry-private + the 5050 listener until the migration
script has finished. Subsequent commit will drop them once images
are confirmed in Forgejo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 18:37:22 +00:00
Viktor Barzin
4a636f3fb7 f1-stream: subreddit extractor finds Reddit '[Watch / Download]' threads
Two fixes for the previously-dormant subreddit extractor + a chrome-browser TARGETS pivot to MotoGP weekend live URLs.

1. **Reddit fetch was 403'd by `Accept: application/json`**. Cluster IP +
   that header trips Reddit's anti-bot fingerprint and returns HTML 403.
   Removing the explicit Accept (default `*/*`) restores HTTP 200 with
   JSON. Confirmed via direct httpx test from the f1-stream pod.

2. **Search the right things**. The community uses a stable
   `[Watch / Download] <Series> <Year> - <Round> | <Event>` post pattern
   with selftext links to admin-curated WordPress sites (motomundo.net
   for MotoGP, sister sites for F1 when active). New extractor:
   - Hits both /new.json and /search.json across r/MotorsportsReplays
     and three smaller motorsport subs.
   - Filters posts where title contains `[watch`, `watch online`, or
     flair = `live`.
   - Extracts URLs from selftext (regex), filters to a positive
     `_INTERESTING_HOSTS` allowlist (motomundo, freemotorsports,
     pitsport, rerace, dd12, etc.) so we don't drown the verifier in
     YouTube/Discord/gofile links.
   - Returns each as embed-type so the chrome-service verifier visits.

3. **chrome_browser.TARGETS pivoted** to the live MotoMundo MotoGP
   French GP iframes (motomundo.top/e/<id> + motomundo.upns.xyz/#<id>)
   while the weekend is on. The previous DD12 NASCAR + Acestrlms F1
   targets were both broken JW Player paths anyway.

State after deploy:
- /streams: 3 verified live (WRC Rally Portugal, NASCAR 24/7, Premier League Darts) — Darts is currently active because UK is mid-match.
- Subreddit extractor surfaces the live MotoMundo URL but the verifier
  marks the WordPress wrapper page playable=False (no top-level <video>
  element; the m3u8 lives in nested iframes). Next iteration: drill the
  verifier into iframe contentDocument and capture from there.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 18:37:11 +00:00
Viktor Barzin
3148d15d5a [forgejo] Phases 3+4+5: cutover, decommission, docs sweep
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.

Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
  fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
  — image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
  (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
  from Forgejo. build-ci-image.yml dual-pushes still until next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.

Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
  registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
  Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
  hosts.toml entries for registry.viktorbarzin.me +
  10.0.20.10:5050. (Existing nodes already had the file removed
  manually by `setup-forgejo-containerd-mirror.sh` rollout — the
  cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
  service block removed; nginx 5050 port mapping dropped. Pull-
  through caches for upstream registries (5000/5010/5020/5030/5040)
  stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
  `private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
  integrity_probe + registry_probe_credentials resources stripped.
  forgejo_integrity_probe is the only manifest probe now.

Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
  through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
  diagram now reflect Forgejo. Pre-migration root-cause sentence
  preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
  row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
  to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
  alert annotation simplified now that only one registry is in
  scope.

Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
   match the new template AND `docker compose up -d --remove-orphans`
   to actually stop the registry-private container. Memory id=1078
   confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
   on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list — at that
   point the post-push integrity check at line 33-107 also needs
   to be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 18:30:02 +00:00
Viktor Barzin
c2f8a300d4 [forgejo] Migration script: exclude empty repos, all-images full mode
Updated to handle the actual situation: wealthfolio-sync and
fire-planner have registry repos but no tags (broken/abandoned
deployments). Skip those with a SKIP marker. Migrate everything
else as a stop-gap until Woodpecker pipelines start producing
Forgejo images on their own.

The image list now covers all private images currently in scope.
2026-05-07 17:21:39 +00:00
Viktor Barzin
39e7409f11 [woodpecker] Persist hostAliases patch via null_resource (chart doesn't expose it)
Helm chart 3.5.1 has no `server.hostAliases` field, so the YAML
addition I made earlier was a no-op. Apply via kubectl patch in a
null_resource keyed on helm revision so it re-asserts on every
chart upgrade. Same pattern as the CoreDNS replicas/affinity patch
in stacks/technitium/.

Without this, every helm upgrade on woodpecker reverts the
hostAliases fix and the Forgejo pipeline triggers start failing
with context-deadline-exceeded again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 17:18:57 +00:00
Viktor Barzin
00fc0cf5bb [woodpecker] Pin forgejo.viktorbarzin.me to in-cluster Traefik LB
Pipeline triggers from Forgejo were failing with "could not load
config from forge: context deadline exceeded" — Woodpecker's
forge-API fetch path was round-tripping through Cloudflare via the
public IP, hitting 30s deadline timeouts on cold connections. The
in-cluster path via the Traefik LB (10.0.20.200) is consistently
sub-100ms.

Same trick we use for the containerd hosts.toml redirect on each
node — Traefik serves the *.viktorbarzin.me wildcard cert so SNI
verification still passes. OAuth callbacks still use the public
hostname (correct, those come from the user's browser).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 17:13:51 +00:00
Viktor Barzin
de56af883d [forgejo] Bump webhook DELIVER_TIMEOUT 5s -> 30s
Forgejo→Woodpecker webhooks were timing out on first request after
pod restart. The default 5s deadline is too tight for the cold
Cloudflare-tunnel TLS handshake (observed 6-8s). 30s comfortably
covers retries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 17:10:13 +00:00
Viktor Barzin
9ecee3b95f [forgejo] Allow webhook delivery to ci.viktorbarzin.me + *.viktorbarzin.me
The Forgejo→Woodpecker webhook (so Woodpecker fires on each push to
viktor/<repo>) was being blocked by the existing ALLOWED_HOST_LIST
of *.svc.cluster.local — ci.viktorbarzin.me resolves to the public IP
because Cloudflare proxying wasn't covering that path. Without this
fix, no Woodpecker pipeline run was triggered on push, the dual-push
bake would never start, and Forgejo's package catalog stays empty.

Add ci.viktorbarzin.me explicitly + *.viktorbarzin.me as a future-
proofing wildcard. The list still excludes arbitrary external hosts,
so this is not a security regression — just unblocking the webhook
to our own CI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 17:03:25 +00:00
Viktor Barzin
8b53d99e9a [forgejo] Add chrome-service-novnc:v4 to orphan-image migrator 2026-05-07 16:52:05 +00:00
Viktor Barzin
e93baf8b36 [forgejo] securityContext.fsGroup=1000 so /data is writable to forgejo
Phase 0 enabled packages but the pod crashloops on
`mkdir /data/tmp: permission denied` — Forgejo loads the chunked
upload path (default /data/tmp/package-upload) before s6-overlay
gets a chance to chown /data. fsGroup tells kubelet to recursively
chown the volume to GID 1000 on mount, which fixes it.

Pre-23-day Forgejo deployed with packages off so this code path
never ran.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 16:50:02 +00:00
Viktor Barzin
880f7d1377 [forgejo] Drop the FORGEJO__packages__CHUNKED_UPLOAD_PATH override
Setting it to /data/tmp/package-upload triggers a CrashLoopBackOff
because /data is the volume mount root and is owned by root, not
the forgejo user (uid 1000) — Forgejo can't `mkdir /data/tmp`.

The default value resolves under the AppDataPath (a subdir Forgejo
itself owns) which works fine. Keep the ENABLED=true override; v11
ships packages on but explicit is safer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 16:44:37 +00:00
Viktor Barzin
50bc77d33f f1-stream: add chrome-browser, subreddit, dd12 extractors; fix streamed.pk
User asked to broaden the source pipeline so f1-stream can find F1 (and
adjacent motorsport) streams from Sky Sports / DAZN / Reddit / etc.,
using the in-cluster chrome-service headed browser where needed. Four
changes:

1. **streamed.py**: BASE_URL streamed.su → streamed.pk. The .su domain
   stopped serving the API host in 2026 (only the marketing page is
   left); .pk hosts the JSON API now. Adds 3 events/round (currently
   all routed through embedsports.top — see #2 caveat).

2. **chrome_browser.py** (new): generic chrome-service-driven extractor.
   Connects to the existing chrome-service WS (CHROME_WS_URL +
   CHROME_WS_TOKEN env), navigates a list of TARGETS, captures any HLS
   playlist URL the page fetches at runtime, returns one ExtractedStream
   per discovery. Uses the same stealth init script as the verifier so
   anti-bot checks don't trip the page. Handles iframes (DD12-style
   /nas → /new-nas/jwplayer) and probes child-frame <video>/source
   elements after settle. Caveat: most aggregator sites (pooembed,
   embedsports, hmembeds, even DD12's JW Player path) use a broken
   runtime decoder that produces no m3u8 in our environment, so the
   TARGETS list is currently 0-yielding; the framework is the
   contribution and concrete sites can be added as they're discovered.

3. **subreddit.py** (new): scans r/MotorsportsReplays, r/motorsports,
   r/formula1, r/motogp via the public old.reddit.com JSON API for
   posts whose flair/title indicates a live stream. Discovered URLs
   are returned as embed-type streams; the verifier visits each via
   chrome-service to confirm playability. Note: Reddit currently HTTP
   403's our cluster outbound IP for anonymous JSON requests; the
   extractor returns 0 in that state and logs a debug message. Will
   work from any IP Reddit isn't blocking.

4. **dd12.py** (new): inline-HTML scraper for DD12Streams. The site
   embeds `playerInstance.setup({file: "..."})` directly in HTML — no
   JS decoder needed. Currently surfaces NASCAR Cup Series 24/7 (clean
   BunnyCDN-hosted HLS at w9329432hnf3h34.b-cdn.net/pdfs/master.m3u8);
   add new `(path, label, title)` tuples to CHANNELS as DD12 expands.

Result: /streams now shows 2 verified live streams (Rally TV via
pitsport + DD12 NASCAR Cup 24/7). When the next F1 weekend (Canadian
GP, May 22-24) goes live, pitsport will surface F1 sessions
automatically via the existing pushembdz path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 16:05:25 +00:00
Viktor Barzin
8e92f5a976 [docs] Forgejo registry image-rebuild runbook
Companion to forgejo-registry-breakglass.md but for the more common
case: the Forgejo registry is healthy as a whole, but one image's
manifest/blob references are broken (orphan child, half-pushed
upload, retention-vs-pull race). The
RegistryManifestIntegrityFailure alert annotation already points
here.

Mirrors registry-rebuild-image.md (the registry-private equivalent)
in structure: confirm via probe + curl, delete broken version
through Forgejo API, rebuild via Woodpecker manual run, force
consumers to re-pull, verify integrity recovery.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 16:03:34 +00:00
Viktor Barzin
40a8e66c58 [ci] Phase 1: infra-ci dual-push + break-glass tarball
Adds Forgejo as a second push target on the build-ci-image pipeline
and saves the just-pushed image as a gzipped tarball on the registry
VM disk (/opt/registry/data/private/_breakglass/) so we can recover
infra-ci with `ctr images import` if both registries are down.

* Dual-push: registry.viktorbarzin.me:5050/infra-ci AND
  forgejo.viktorbarzin.me/viktor/infra-ci, in the same
  woodpeckerci/plugin-docker-buildx step. Same image bytes; the
  Forgejo integrity probe (every 15min) catches any divergence.
* Break-glass step: SSHes to 10.0.20.10, docker pulls + saves +
  gzips, keeps last 5 tarballs (latest symlink). Failure-tolerant
  so a transient registry blip doesn't fail the build pipeline.
* Runbook docs/runbooks/forgejo-registry-breakglass.md documents
  the recovery flow (when to use, scp+ctr import, node cordon,
  underlying-issue fix).

Tarball mirrors to Synology automatically through the existing
daily offsite-sync-backup job — no new sync wiring needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 16:01:20 +00:00
Viktor Barzin
6d5204db10 [forgejo] Tolerate missing Vault keys during Phase 0 bootstrap
Wrap the three new Vault key reads in try(...) so the first apply
succeeds even when forgejo_pull_token / forgejo_cleanup_token /
secret/ci/global haven't been populated yet. Without this, CI
auto-apply blocks on the very push that introduces the references —
chicken-and-egg with the runbook order (which is: apply Forgejo bumps,
then create users + PATs, then apply the rest).

Empty tokens are intentionally visible-broken (auth fails, probe
reports auth failure, cleanup CronJob errors) — that's the signal
to run the bootstrap runbook. Subsequent apply picks up the real
values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 15:53:08 +00:00
Viktor Barzin
5d22b449f9 [forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.

What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
  Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
  v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
  in after the nginx→Traefik migration. Now creates a per-ingress
  Buffering middleware when set; default null = no limit (preserves
  existing behavior). Forgejo ingress sets max_body_size=5g to allow
  multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
  forgejo.viktorbarzin.me, populated from Vault secret/viktor/
  forgejo_pull_token (cluster-puller PAT, read:package). Existing
  Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
  Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
  Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
  for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
  package + always :latest. First 7 days dry-run (DRY_RUN=true);
  flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
  existing registry-integrity-probe. Existing Prometheus alerts
  (RegistryManifestIntegrityFailure et al) made instance-aware so
  they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.

Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.

Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 15:51:34 +00:00
Viktor Barzin
b1c21f78b9 f1-stream: drop broken curated, dedupe streams, accept all pitsport categories
User feedback: every stream on /watch shows ads but the player fails
to load. Three causes, three fixes:

1. CuratedExtractor's two hmembeds 24/7 channels (Sky F1, DAZN F1)
   sat at the top of the list and ALWAYS failed: they load the
   upstream's ad overlay then JW Player throws error 102630 (empty
   playlist; the obfuscated decoder produces no fileURL in our
   environment). Disabled the registration in extractors/__init__.py
   until/unless we find a working bypass — leaving the existing
   `CURATED_BYPASS = {"curated"}` shim in service.py so the swap is
   reversible.

2. Pitsport surfaces every WRC stage / MotoGP session as its own
   /watch UUID, but they all resolve to the same upstream m3u8 URL
   (e.g. RallyTV one master.m3u8 across all 22 Rally de Portugal
   stages). Added URL-keyed dedupe in service.run_extraction so the
   /streams response shows one row per actual stream.

3. The pitsport category filter was still narrowed to motorsport.
   Pitsport.xyz only lists curated sports broadcasts (WRC, MotoGP,
   IndyCar, NASCAR, Premier League Darts, Premier League football…),
   so the site's own selection is the right filter. Replaced the
   hand-maintained MOTORSPORT_KEYWORDS list with `bool(category or
   title)` — anything pitsport returns goes through. Streams that
   aren't actually live get filtered out downstream when the embed
   API returns an empty manifest.

Frontend: hls.js `lowLatencyMode` was on by default but RallyTV (and
most non-LL-HLS providers) don't ship the LL-HLS extensions, which
broke playback in real browsers. Default to `lowLatencyMode: false`.

Result: /streams is now 1 verified live entry (Rally TV WRC stage
currently airing); was 24 with the top 2 always broken + 22 dupes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 15:42:24 +00:00
Viktor Barzin
9db87e5b12 f1-stream: pitsport extractor — broaden categories + new safeStream payload
The previous extractor only surfaced Formula 1/2/3 and never returned
anything outside race weekends. Two fixes:

1. Broadened category filter from {formula 1/2/3} to a motorsport set
   (MotoGP/Moto2/Moto3, WRC/WEC/IndyCar/NASCAR + the F1 series).
   Replaces the NON_F1_KEYWORDS exclusion list with a positive-match
   MOTORSPORT_KEYWORDS set; removes the F1-specific filter on title
   keywords. Old `_is_f1_*` aliases retained as compat shims.

2. Updated `_parse_stream_config` for the current pushembdz.store embed
   payload — Next.js now serves `safeStream` (just title + method) and
   the actual stream URL is fetched at runtime from
   `pushembdz.store/api/stream/<slug>`. Extractor now hits that endpoint
   when the inline link is missing. Treats `method=jwp` as HLS and
   accepts URLs ending in `.css` (pushembdz disguises some HLS playlists
   with a `.css` extension).

End-to-end result: /streams went from 2 (curated, broken JW decoder) to
24 streams marked `is_live=True`. The verifier confirms each via
`manifest_parsed_codec_missing_in_verifier` (Playwright Chromium has no
H.264 — manifest fetch alone is the codec-independent positive signal).
Currently surfaces Rally de Portugal SS1–SS22 (WRC); MotoGP starts
appearing once the French GP weekend goes live tomorrow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 15:25:27 +00:00
Viktor Barzin
867bdba7bb chrome-service: replace static health stub with noVNC view
The static nginx stub at chrome.viktorbarzin.me wasn't useful for
debugging anti-bot interactions. Swap it for a live noVNC HTML5 view
of the headed Chromium session: x11vnc taps Xvfb's :99 over localhost
TCP (added `-listen tcp -ac` to Xvfb), websockify wraps it as a WS
endpoint, and noVNC's vendored web client serves it on :6080.

The ingress chain is unchanged — chrome.viktorbarzin.me stays
Authentik-gated, dns_type=proxied, port 3000 (the Playwright WS) stays
internal-only behind the NetworkPolicy + token. Custom image
`registry.viktorbarzin.me/chrome-service-novnc:v4` (ubuntu:24.04 +
x11vnc + websockify + novnc apt packages) needs imagePullSecrets, so
also added registry-credentials reference to the deployment spec.

x11vnc flags: `-noshm -noxdamage -nopw -shared -forever`. SHM is
disabled because each container has its own /dev/shm so the X server
can't grant access; XDAMAGE isn't compiled into the noble Xvfb. The
sidecar entrypoint waits up to 30s for both Xvfb (:6099) and x11vnc
(:5900) to bind before exec'ing websockify.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 14:17:05 +00:00
Viktor Barzin
d77a02357c chrome-service: in-cluster headed Chromium pool for f1-stream verifier
The f1-stream verifier's in-process headless Chromium kept tripping
hmembeds' disable-devtool.js Performance detector (CDP latency on
console.log vs console.table) and getting redirected to google.com.

This adds a single-replica chrome-service stack running Playwright
launch-server under Xvfb so callers can connect via WS+token to a
shared headed browser. f1-stream's _ensure_browser now prefers
chromium.connect(CHROME_WS_URL/CHROME_WS_TOKEN) and adds a vendored
stealth init script (webdriver/plugins/languages/Permissions/WebGL
spoofs + querySelector hijack to disarm disable-devtool-auto) on
every new context. Falls back to in-process headless if the env
vars aren't set.

Encrypted PVC for profile + npm cache, NetworkPolicy to TCP/3000
gated by client-namespace label, 6h tar.gz backup CronJob to NFS,
Authentik-gated nginx sidecar at chrome.viktorbarzin.me for human
liveness checks. Image pinned to playwright:v1.48.0-noble in
lockstep with the Python client's playwright==1.48.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 10:43:40 +00:00
Viktor Barzin
ae70faf8be openclaw: realtime usage dashboard via Prometheus exporter sidecar
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.

Metrics exported:
  openclaw_codex_messages_total{provider,model,session_kind}    counter
  openclaw_codex_input/output/cache_read/cache_write_tokens_total
  openclaw_codex_message_errors_total{reason}
  openclaw_codex_active_sessions{kind}                          gauge
  openclaw_codex_oauth_expiry_seconds{provider,account,plan}    gauge
  openclaw_codex_last_run_timestamp                             gauge

Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.

Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
2026-05-07 09:04:25 +00:00
Viktor Barzin
4b39cb72da openclaw: switch primary to ChatGPT Plus OAuth (openai-codex/gpt-5.4-mini)
Bumps image 2026.2.26 → 2026.5.4 (openai-codex provider plugin landed in
2026.4.21+). Auth profile is OAuth via the device-pairing flow against the
Codex backend (account ancaelena98@gmail.com); token persists in
/home/node/.openclaw/agents/main/agent/auth-state.json on NFS so it survives
pod restarts. Plus tier accepts gpt-5.4-mini (1,200–7,000 local msgs/5h);
gpt-5-mini and gpt-5.1-codex-mini both return errors on Plus, so we pin
gpt-5.4-mini explicitly. doctor --fix auto-promotes the highest-tier model
(gpt-5-pro) after model discovery, so the container command pins the mini
back as default after doctor runs but before gateway start.
2026-05-06 22:06:32 +00:00
Viktor Barzin
e8fe5d6198 f1-stream: drop demo + landing-page extractors, add fetch-proxy injection
Per user feedback: the demo Big Buck Bunny / Apple test streams aren't
useful in an F1-streams app. Removed DemoExtractor entirely. Tightened
the discord-extractor path filter from "any stream-shaped path" to
"direct embed/player path only" — the previous filter still let
sportsurge `/event/...` landing pages through, which the verifier
mistook for playable because they render player-class divs without a
real player.

Embed proxy now also rewrites window.fetch + XMLHttpRequest.open inside
the upstream HTML so that cross-origin XHRs (e.g. the hmembeds
`/sec/<JWT>` token-binding endpoint) go through our /embed-asset relay.
This avoids the CORS reject that fired when the player JS tried to call
hghndasw.gbgdhdffhf.shop/sec/... from an `f1.viktorbarzin.me` origin.

The verifier now requires a `<video>` element to mark embed streams
playable (not just a player-class div). Curated streams bypass the
verifier — hmembeds aggressively detects headless Chromium (devtool
trap, console-clear timing, automation flags) and won't progress past
JW Player init in our pod, but the user's real browser should clear
those checks. We can't honestly headless-verify hmembeds, so we trust
the curator instead of falsely rejecting them.

Image: viktorbarzin/f1-stream:v6.1.1
2026-05-06 21:50:54 +00:00
Viktor Barzin
89c59ccc80 f1-stream: only show streams confirmed playable by headless browser
Cuts the stream list from 23 mostly-broken entries to ~6 confirmed-playable
ones, and adds an iframe-stripping proxy so embed sources (hmembeds, etc.)
load through our origin without X-Frame-Options / CSP / JS frame-buster
blocks.

Why: the previous list was dominated by Discord-shared news article URLs,
hardcoded aggregator landing pages, and other non-stream URLs that all sat
at is_live=true because embed streams skipped the health check entirely.
Users could not tell which links would actually play.

What:
- backend/playback_verifier.py: new headless-Chromium verifier (Playwright)
  that polls each candidate stream for a codec-independent "playable" signal
  (hls.js MANIFEST_PARSED for m3u8; <video>/player div for embed). Replaces
  the unconditional is_live=True for embed streams in service.py.
- backend/embed_proxy.py: new /embed and /embed-asset routes that fetch
  upstream embed pages, strip X-Frame-Options/CSP/Set-Cookie, and inject a
  <base href> + frame-buster-defeat <script> that locks down window.top,
  document.referrer, console.clear/table, and window.location so the
  hmembeds disable-devtool.js redirect-to-google trap can't fire.
- extractors/curated.py: new always-on extractor with two known-good 24/7
  hmembeds embeds (Sky Sports F1, DAZN F1) so the list isn't empty between
  race weekends.
- extractors/__init__.py: register CuratedExtractor first; drop
  FallbackExtractor (its 10 aggregator landing-pages can't iframe-play).
- extractors/discord_source.py: positive-match path filter (must look like
  /embed/, /stream, /watch, /live, /player, *.m3u8, *.php) plus expanded
  domain blocklist for news sites — was 10 noise URLs, now ~1.
- extractors/service.py: run_extraction now health-checks AND verifier-
  checks both stream types; only verified-playable streams reach is_live.
- main.py: register /embed + /embed-asset routes; defer initial extraction
  by 8s so the verifier can reach the local /embed proxy on 127.0.0.1:8000.
- frontend/lib/api.js + watch/+page.svelte: route embed iframes through
  /embed proxy instead of the upstream URL, so X-Frame-Options/CSP can't
  block them.
- Dockerfile: install Playwright chromium + system codec-runtime libs.
- main.tf: bump pod memory 256Mi → 1Gi for chromium.

Verified end-to-end with Playwright against
https://f1.viktorbarzin.me/watch — 6/6 streams reach a player UI; the 3
demo m3u8s actually play (codec-bearing browser); the 3 embeds (Sky
Sports F1, DAZN F1, sportsurge) render iframes through the proxy.

Image: viktorbarzin/f1-stream:v6.0.5

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 21:00:07 +00:00
Viktor Barzin
1cc745bbec openclaw: switch primary model to qwen3-coder-480b (qwen3.5-397b dead on NIM)
NVIDIA retired nim/qwen/qwen3.5-397b-a17b — modelrelay shows consistent
TIMEOUTs over 24h+ of pings, and nim/nvidia/llama-3.1-nemotron-ultra-253b-v1
returns 404. With both gone the openclaw failover never reached
mistral-large-3 in time, so every message hung until the 120s embedded-run
timeout. Promote qwen3-coder-480b-a35b-instruct (already in models list, UP
~1-2s, 256k ctx) to primary; drop the dead nemotron-ultra fallback.
2026-05-06 20:35:38 +00:00
Viktor Barzin
2d236dc48e monitoring(wealth): delta panels to 2x4 grid (rows = type, cols = window)
Better visual grouping: instead of 8 paired panels in a single row at
w=3 (cramped, hard to scan), arrange as a 2x4 grid at w=6. Top row
("all" — wealth change incl new money), bottom row ("mkt" — pure
market gain). Columns are timeframes 1d / 7d / 30d / 90d.

Reading vertically: same window, two interpretations side by side.
Reading horizontally: same metric across timeframes.

Layout shift: delta row goes from y=4 (4 wide) to y=4..11 (8 high).
All chart/log panels with y >= 8 shift down by another 4 rows
(net-worth chart 8->12, activity log 81->85, etc.).
2026-05-06 20:29:27 +00:00
Viktor Barzin
e383bcd7fb monitoring(wealth): pair every delta panel with market-only twin
User feedback: net-worth delta panels (1d/7d/30d/90d) confused
because +£174k over 90d looked too big against the £271k cumulative
unrealised gain. Decomposition showed the 90d delta was £114k of new
money in (contributions) + £60k of actual market gain.

So now the delta row shows BOTH:
  Δ Nd (all)  — net-worth change incl new money (the original number)
  Δ Nd (mkt)  — pure market gain, contributions stripped out

Pattern for "(mkt)" panels: same now_snap / past_snap CTEs but
selecting both total_value and net_contribution, then computing
(nw_delta - contrib_delta) = market_gain over window.

Layout: 8 panels at w=3 each on the y=4 row, paired by window
(all next to mkt for each timeframe), so you can see "wealth
change vs investment performance" at a glance.

Verified live (90d): all=+£174,612, mkt=+£60,343, contrib=+£114,268.
2026-05-06 20:25:33 +00:00
Viktor Barzin
941c794678 monitoring(wealth): add delta row (1d / 7d / 30d / 90d net-worth changes)
New row at y=4 with 4 stat panels showing net-worth change over the
trailing windows. Each uses the latest-per-account stitching pattern
(skew-resilient against partial-day syncs) and computes:

  delta = SUM(latest per account) - SUM(latest per account at or
                                       before max_complete - N)

Where max_complete is the most recent date all accounts have a row.
For each window: 1d, 7d, 30d, 90d.

Verified live values: +£8,575 / +£22,696 / +£144,633 / +£174,612.

All panels at y >= 4 shifted down by 4 rows to make room (Net worth
chart 4->8, Per-account stacked 24->28, Activity log 77->81, etc.).

Note: this commit also reformats the dashboard JSON from compact-
object form to indented form (json.dump indent=2 side effect from the
Python patch script). No semantic changes outside the new panels and
y-shifts.
2026-05-06 20:16:31 +00:00
Viktor Barzin
77e502587e kms: switch to non-proxied DNS so port 1688 is reachable externally
Cloudflare cannot proxy raw TCP/1688 (KMS protocol). Switch
kms.viktorbarzin.me from CF-proxied CNAME to direct A/AAAA so
clients can reach the vlmcsd LoadBalancer (10.0.20.200) via the
existing pfSense WAN port-forward for 1688.

Verified end-to-end: vlmcs against 176.12.22.76:1688 completes
the KMS V4 handshake for Office Professional Plus 2019.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 20:16:31 +00:00
Viktor Barzin
80e27f69e9
fix: strip conditional headers in bot-block-proxy to fix CalDAV sync
nginx's not_modified_filter evaluated If-Match headers forwarded by
Traefik's forwardAuth, returning 412 and breaking CalDAV VTODO updates
from macOS/iOS Reminders. Switch to OpenResty and clear conditional
headers with Lua before proxy processing.
2026-05-05 22:52:43 +01:00
36 changed files with 1167 additions and 244 deletions

View file

@ -416,7 +416,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
### Ingress Returns 502 Bad Gateway
**Symptoms**: Cloudflared tunnel is up, Traefik logs show `dial tcp: lookup <service> on 10.0.20.101:53: no such host`.
**Symptoms**: Cloudflared tunnel is up, Traefik logs show `dial tcp: lookup <service> on 10.0.20.201:53: no such host`.
**Diagnosis**: DNS resolution failed. Check:
1. Is Technitium pod running? `kubectl get pod -n technitium`

View file

@ -181,7 +181,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
| W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
| W1.7 NetworkPolicy phased enforce | **PENDING** — needs ~1 week of W1.6 observation, then build empirical allowlist from Loki queries, flip GNP rules from `[Log, Allow]` to `[Allow specific dests, Deny rest]`. |
| W1.7 NetworkPolicy phased enforce | **PARTIAL ANALYSIS** — first observation snapshot at `docs/architecture/wave1-egress-observation-2026-05-22.md` (36 source namespaces seen so far, 29 thin-profile candidates). Recommend continuing observation through 2026-05-29 (full week) before any enforce flip. Pilot enforce target: `recruiter-responder` (2 destinations only). `servarr` stays in Log+Allow indefinitely (BitTorrent P2P incompatible with static enforce). |
The block below documents the locked design.

View file

@ -86,7 +86,7 @@ sequenceDiagram
| Authentik | OIDC provider | K8s | SSO authentication for Headscale |
| DERP Relay | Embedded in Headscale | K8s (region 999) | Relay for NAT traversal |
| AdGuard DNS | Container | K8s | Global DNS resolver with ad-blocking |
| Technitium DNS | Container | K8s (10.0.20.101) | Internal .lan domain resolver |
| Technitium DNS | Container | K8s (10.0.20.201) | Internal .lan domain resolver |
## How It Works
@ -224,7 +224,7 @@ dns_config:
- Google: `8.8.8.8`, `8.8.4.4`
**Conditional forwarding**:
- `viktorbarzin.lan``10.0.20.101` (Technitium)
- `viktorbarzin.lan``10.0.20.201` (Technitium)
**Ad-blocking lists**:
- AdGuard DNS filter
@ -377,7 +377,7 @@ dns_config:
**Steps**:
1. Verify AdGuard is running: `kubectl get pod -n adguard`
2. Check AdGuard conditional forwarding: Query AdGuard directly: `nslookup nextcloud.viktorbarzin.lan <adguard-ip>`
3. Check Technitium: `nslookup nextcloud.viktorbarzin.lan 10.0.20.101`
3. Check Technitium: `nslookup nextcloud.viktorbarzin.lan 10.0.20.201`
**Common causes**:
1. **AdGuard not forwarding .lan**: Conditional forwarding rule missing or misconfigured.

View file

@ -0,0 +1,141 @@
# Wave 1 W1.6/W1.7 — Egress Observation Snapshot (2026-05-22)
First analysis pass over the Calico GNP `wave1-egress-observe-tier34` data
captured in Loki via `{job="node-journal"} |~ "calico-packet"`.
**Data scope:** ~10000 flow log lines pulled from Loki over ~6h+24h windows.
Loki caps queries at 5000 records so longer windows are sample-capped.
**Coverage:** 36 source namespaces observed making egress (out of 82 selected
by `tier in {3-edge, 4-aux}`). Namespaces missing from data are either idle,
scaled to 0, or producing only intra-namespace traffic (which Calico Log
captures from-workload but most pods in those namespaces talk locally).
## Egress fan-out per namespace
| Namespace | dests | pod-ns | svc | external |
|---|---:|---:|---:|---:|
| affine | 3 | 2 | 1 | 0 |
| beads-server | 4 | 3 | 1 | 0 |
| cyberchef | 2 | 1 | 1 | 0 |
| dawarich | 3 | 2 | 1 | 0 |
| default | 1 | 0 | 0 | 1 |
| ebooks | 3 | 2 | 1 | 0 |
| f1-stream | 16 | 2 | 1 | 13 |
| forgejo | 2 | 1 | 1 | 0 |
| hackmd | 2 | 1 | 1 | 0 |
| homepage | 2 | 1 | 1 | 0 |
| isponsorblocktv | 2 | 0 | 1 | 1 |
| jsoncrack | 2 | 1 | 1 | 0 |
| kms | 2 | 1 | 1 | 0 |
| mailserver | 2 | 0 | 1 | 1 |
| meshcentral | 2 | 2 | 0 | 0 |
| n8n | 2 | 1 | 1 | 0 |
| nextcloud | 5 | 2 | 1 | 2 |
| onlyoffice | 2 | 1 | 1 | 0 |
| openclaw | 18 | 4 | 1 | 13 |
| paperless-ngx | 3 | 2 | 1 | 0 |
| phpipam | 3 | 2 | 1 | 0 |
| poison-fountain | 2 | 1 | 1 | 0 |
| postiz | 9 | 8 | 1 | 0 |
| realestate-crawler | 2 | 1 | 1 | 0 |
| recruiter-responder | 2 | 0 | 1 | 1 |
| rybbit | 2 | 1 | 1 | 0 |
| send | 2 | 1 | 1 | 0 |
| servarr | 134 | 2 | 2 | 130 |
| speedtest | 2 | 1 | 1 | 0 |
| status-page | 10 | 2 | 1 | 7 |
| tandoor | 2 | 1 | 1 | 0 |
| technitium | 5 | 2 | 1 | 2 |
| trading-bot | 5 | 2 | 1 | 2 |
| url | 2 | 1 | 1 | 0 |
| website | 2 | 1 | 1 | 0 |
| woodpecker | 8 | 2 | 1 | 5 |
## Common patterns
**Universal baseline** (every observed namespace makes these):
- `kube-system/kube-dns` UDP/53 — DNS resolution
- Often `dbaas` TCP/3306 (MySQL) or TCP/5432 (Postgres)
- Often `redis` TCP/6379
**Per-namespace specifics** (the part that varies):
- External HTTPS to specific IPs (CDNs, APIs)
- Internal pod-to-pod for service-specific clients
## W1.7 rollout candidates (sorted by simplicity)
**Tier A — trivial egress (recommend first wave):**
`recruiter-responder` has the simplest profile of all observed:
- `kube-system/kube-dns` :53/UDP
- `99.83.136.103` :443/TCP (Telegram API)
That's it. Two destinations. Perfect first enforce candidate.
**Tier B — small egress (≤3 external + ≤5 internal, 29 namespaces):**
affine, beads-server, cyberchef, dawarich, ebooks, forgejo, hackmd, homepage,
isponsorblocktv, jsoncrack, kms, mailserver, meshcentral, n8n, nextcloud,
onlyoffice, paperless-ngx, phpipam, poison-fountain, realestate-crawler,
rybbit, send, speedtest, tandoor, technitium, trading-bot, url, website.
These can be enforce'd in batches of 3-5/day after the recruiter-responder
pilot proves out.
**Tier C — moderate egress (518 external):**
f1-stream (13 ext), openclaw (13 ext), woodpecker (5 ext), status-page (7 ext).
Need per-IP allowlist or domain-based selectors.
**Tier D — broad egress (do NOT enforce statically):**
`servarr` has 130+ external IPs because it runs BitTorrent peer-to-peer.
Static IP enforcement won't work; either leave in Log+Allow mode permanently
or use a port-only allowlist (TCP+UDP 6881+random high ports outbound).
## Important caveats before flipping to enforce
1. **Observation horizon is too short.** Only ~6h of dense data and ~24h
total. CronJobs that run weekly, periodic Vault token rotations (7d),
external service maintenance windows, Keel auto-rollouts pulling new
image versions — all missed. Recommend collecting **at least 7 days**
before declaring an allowlist complete.
2. **`servarr`** is fundamentally incompatible with static enforce — keep
in Log+Allow (or explicit deny for known-bad CIDRs only).
3. **External IPs are dynamic.** Cloudflare-fronted services rotate IPs.
The recruiter-responder external IP `99.83.136.103` is one of Telegram's
API endpoints — Telegram has a CIDR range. Allowing single IPs will break
when DNS resolves to a different IP. Prefer Calico's `domains:` selector
(Calico OSS supports DNS-based egress allowlists via `dns_policy_resolver`)
OR allow the full Cloudflare/AWS CIDR range OR use a per-app egress
gateway.
4. **The observation didn't capture intra-namespace traffic** by design —
the Calico Log rule fires on egress from workload endpoint, but
pod-to-same-namespace-pod traffic on the same node may bypass the
filter chain (varies). Real-world testing needed after enforce flip.
## Suggested next-session sequencing
1. **Continue observation for at least 7 days** before any enforce flip.
Compare data on 2026-05-29 vs today; if no new destinations show up,
the allowlist is stable.
2. **First enforce: recruiter-responder.** GNP with allowlist =
{kube-dns, telegram CIDR, vault svc, eso svc}. Watch for breakage.
3. **Tier B batch rollout** at 3-5 namespaces/day per Keel-style phased
rollout pattern (memory id=1972).
4. **Tier C requires per-namespace investigation** — what are those
external IPs? Map to known services first.
5. **servarr stays in Log+Allow** indefinitely (or migrate to dedicated
egress proxy).
## Source data location
- Loki LogQL: `{job="node-journal"} |~ "calico-packet"`
- Pod IP → namespace map at observation time saved at
`/tmp/pod-ip-map.txt` on the analysis host (ephemeral).
- Analysis scripts: `/tmp/analyze_flows2.py`, `/tmp/build_allowlist.py`.
- Tracked under beads `code-8ywc` (W1.7).

View file

@ -106,7 +106,7 @@ machine:
- network: 0.0.0.0/0
gateway: 10.0.20.1
nameservers:
- 10.0.20.101 # Technitium
- 10.0.20.201 # Technitium
- 1.1.1.1
registries:
mirrors:

View file

@ -1,9 +1,32 @@
# HA Control Plane (3 masters) — Design
**Date**: 2026-05-21
**Status**: Drafted, NOT scheduled
**Beads**: code-n0ow
**Trigger**: today's k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
**Date**: 2026-05-21 (decisions locked 2026-05-22; **deferred 2026-05-23**)
**Status**: **DEFERRED** — design + plan complete, NOT scheduled. Awaiting either PVE host capacity expansion OR a separate right-sizing pass on the existing master before this becomes affordable. Paired plan: `2026-05-21-ha-control-plane-plan.md`.
**Beads**: code-n0ow (open, deferred — see `bd show code-n0ow`)
**Trigger**: 2026-05-21 k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
## Why deferred (2026-05-23)
Measured during the locking pass:
- **k8s-master uses 4.6 GB of 32 GB allocated** (kube-apiserver 2.6 GB + etcd 660 MB + cm 360 MB + ~1 GB everything else). The 32 GB sizing is ~5-6× oversized vs working set.
- **PVE host is already 98% RAM-committed** — 262 GB allocated to VMs against 267 GB physical, with 1.5 GB of active swap. The planned 3 × 32 GB control plane (+64 GB net) would push allocation to 326 GB → OOM on the host.
- **Software-only HA on a single PVE host has bounded value** — a hypervisor crash still loses all 3 masters. The big resilience wins (kubeadm upgrades, cert rotation, planned reboots) are real but the disaster-recovery angle is limited until a second PVE host exists.
### Revisit triggers — any of:
1. **Second PVE host added** to the lab. Hardware HA becomes possible; HA control plane becomes the natural follow-up. Spread the 3 masters across 2 hosts (2+1).
2. **Cluster-wide right-sizing pass** that frees enough headroom for the original 3 × 32 GB plan, OR pre-agreed amendment to provision 16 GB masters (right-sized to actual usage; 3-4× current working-set headroom).
3. **Storm cascade burns enough hours** that the operational cost outweighs the memory cost — track minutes spent manually nursing kubeadm upgrades; if cumulative > ~10h over a few months, revisit.
### What's still good
The design + plan in this directory remain authoritative. When we revisit:
- All 14 locked decisions stand.
- Challenger amendments (cloud-init template bump, rbac multi-master refactor, HTTPS `/readyz` health check, expanded blast radius, etcd-backup nodeSelector, chain extension as Phase 7) are baked in.
- Only the sizing decision needs revisiting — likely 16 GB per master instead of 32 GB.
- Adding `k8s_master_hosts` list-based refactor to the rbac stack (Phase 1.5) is a **standalone win** that could be done independently of HA — it would future-proof the cluster against the day HA lands. Consider lifting that as its own task.
## Problem statement
@ -50,18 +73,24 @@ The k8s upgrade chain doesn't need to be aware of *any* of this — the
underlying availability of apiserver makes the chain's gates
naturally pass on each iteration.
## Decisions (proposed — to be confirmed)
## Decisions (locked 2026-05-22)
| # | Decision | Notes |
|---|----------|-------|
| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. |
| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). |
| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. |
| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. |
| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | |
| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). |
| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2` (VMID 205, 10.0.20.110), `k8s-master-3` (VMID 206, 10.0.20.111). |
| 3 | **Apiserver LB**: **pfSense HAProxy** — new TCP frontend on `10.0.20.99:6443` mirroring the mailserver pattern. Idempotent via `scripts/pfsense-haproxy-bootstrap.php`. | Pros: no per-node moving parts, mirrors existing mailserver layout. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (gateway/DNS/ingress). |
| 4 | **VIP**: `10.0.20.99` (one below current master `.100`, well clear of MetalLB pool `.200-.220`). Internal-only — external API access stays via Cloudflared. | All kubeconfigs + kubelet.conf entries flip from `10.0.20.100:6443``10.0.20.99:6443`. |
| 5 | **etcd**: kubeadm-managed stacked; `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
| 6 | **kured-sentinel-gate**: extend the bash loop in `stacks/kured/main.tf` with a "≥2 control-plane nodes Ready" check between the existing all-nodes-Ready and calico-Ready checks | Otherwise kured could reboot 2 masters at once and break quorum. |
| 7 | **etcd backup**: `etcdctl snapshot save` from any member is a consistent point-in-time of the full quorum state — but the existing CronJob is pinned `node_name = "k8s-master"`. Phase 4.5 flips this to a control-plane label + toleration so backups don't silently skip when master-1 is drained. | Snapshot CORRECTNESS unchanged; SCHEDULING needs fixing. |
| 8 | **Migration order**: Phase 0 (retrofit existing cluster) → Phase 1 (LB up, single backend, HTTPS health check) → Phase 1.5 (rbac stack refactor) → Phase 2 (cloud-init bump + master-2 join + add to LB) → Phase 3 (master-3 join + add to LB) → Phase 4 (flip clients + workers to VIP) → Phase 4.5 (etcd-backup CronJob fix) → Phase 5 (kured-sentinel-gate quorum check) → Phase 6 (E2E validation) → Phase 7 (k8s-version-upgrade chain extension) | Each kubeadm join is reversible (`kubeadm reset` + `etcdctl member remove`). |
| 9 | **VM provisioning**: cloud-init via `create-template-vm` module, **but the template needs an apt-source bump first** (v1.32 → v1.34) and a control-plane gate on `k8s_join_command` so master VMs don't auto-join as workers. Existing master stays as the legacy manual VM (not rebuilt). | The repo has zero VMs using cloud-init for provisioning today — we're the first user. Update template first, then use it. |
| 10 | **Cert SAN + controlPlaneEndpoint retrofit**: Phase 0, before any new master joins. Patch `kubeadm-config` via `kubeadm init phase upload-config kubeadm --config <file>` (kubeadm-owned write, future-proof against `kubeadm upgrade apply`), regen `apiserver.crt` via `kubeadm init phase certs apiserver`, restart the kube-apiserver pod (~30s outage on the existing master only). | Standard kubeadm retrofit path; `kubeadm join --control-plane` requires controlPlaneEndpoint to be set. |
| 11 | **Multi-master config propagation (Phase 1.5)**: refactor `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf` to loop over a list of master hosts. Apply BEFORE master-2/3 join so they boot with OIDC, audit policy, and etcd tuning already in place. | Today these stacks SSH into a single master and sed into `kube-apiserver.yaml` — if not propagated, Authentik login flaps depending on which master the LB lands on. |
| 12 | **k8s-version-upgrade chain extension (Phase 7)**: extend `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` to discover and iterate over all control-plane nodes (drain → upgrade → uncordon, gated by quorum check). | Without this, chain only upgrades master-1; masters 2/3 drift behind one version per upgrade. Original autonomous-upgrades goal unmet. |
| 13 | **LB health check**: HTTPS `GET /readyz` (with `verify none` for self-signed apiserver cert), NOT plain TCP. | Plain TCP misses apiserver-NotReady states (etcd unreachable, controller-manager flapping). |
| 14 | **VIP DNS name**: add `k8s-apiserver IN A 10.0.20.99` to `config.tfvars` BEFORE Phase 4. Delete stale `kubernetes IN A 10.0.20.100`. Consumers reference the FQDN, not the bare IP — future renumbering is then a single record change. | |
## Out of scope
@ -74,12 +103,15 @@ naturally pass on each iteration.
| Risk | Mitigation |
|---|---|
| Phase 0 cert regen on existing master triggers a brief apiserver outage (~30s) | Already a known cluster behaviour during static-pod restart. Schedule during a low-activity window. Tigera/operators will crash-loop briefly but recover — same blast radius as today's k8s upgrade. **Once HA is up, future restarts won't have this surface at all.** |
| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. |
| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. |
| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. |
| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. |
| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. |
| LB misconfiguration → all kubectl breaks | Smoke-test from each master directly (bypass LB) before flipping clients. Keep a kubeconfig pointing at `10.0.20.100:6443` as fallback. |
| Existing kubeconfigs (Woodpecker pipelines, agents, dev VM, in-cluster RBAC default) need updating | Single Terraform apply touches `stacks/rbac/modules/rbac/apiserver-oidc.tf` (default), `.woodpecker/*.yml` (committed kubeconfigs). Worker `kubelet.conf` files patched in Phase 4 via ssh loop. |
| New masters get scheduled workload pods unintentionally | Verify `node-role.kubernetes.io/control-plane:NoSchedule` taint is applied at join time (default with `--control-plane`). |
| Cert rotation propagation | kubeadm join uses the `--certificate-key` from `kubeadm init phase upload-certs` to fetch existing CA materials. Single short-lived secret in `kube-system/kubeadm-certs` (**2h TTL** — Phases 2 + 3 must complete within the window, or re-upload between them). |
| 32GB per master × 3 = 96GB RAM used for control plane alone | PVE host has 272GB total, 176GB allocated to cluster pre-HA. Post-HA: 240GB allocated, 32GB headroom. Sufficient. |
| Pre-existing kubeadm-config does NOT have `controlPlaneEndpoint` set | Phase 0 patches it. Verify: `kubectl -n kube-system get cm kubeadm-config -o yaml \| grep controlPlaneEndpoint` (absent → `10.0.20.99:6443` post-Phase 0). |
| Existing master cert SANs are `[k8s-master, 10.96.0.1, 10.0.20.100]` only — missing VIP | Phase 0 regens with `--apiserver-cert-extra-sans 10.0.20.99` after patching kubeadm-config. |
## Verification
@ -91,12 +123,23 @@ kubectl get nodes -l node-role.kubernetes.io/control-plane=
# etcd quorum healthy
kubectl -n kube-system exec etcd-k8s-master -- etcdctl \
--endpoints=https://10.0.20.100:2379,https://10.0.20.X:2379,https://10.0.20.Y:2379 \
--endpoints=https://10.0.20.100:2379,https://10.0.20.110:2379,https://10.0.20.111:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --cluster
# Kubeconfig points at VIP
kubectl --kubeconfig ~/.kube/config config view --minify -o jsonpath='{.clusters[0].cluster.server}'
# Expect: https://10.0.20.99:6443
# Worker kubelet.conf points at VIP
for n in k8s-node{1,2,3,4}; do
ssh wizard@$n.viktorbarzin.lan "sudo grep -E '^\s+server:' /etc/kubernetes/kubelet.conf"
done
# Expect: server: https://10.0.20.99:6443 on every node
# Failover test: cordon master-1, reboot it, observe kubectl still works through LB
kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets
ssh wizard@k8s-master.viktorbarzin.lan sudo reboot
@ -110,7 +153,7 @@ kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation
- 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host
- ~+128GB disk usage (2× 64GB master disks)
- ~2-4 hours of operator time end-to-end (VM provisioning + kubeadm join + LB config + smoke)
- **~5-7 hours of operator time end-to-end** (cloud-init template bump + Phase 0 retrofit + LB + Phase 1.5 rbac refactor + 2× kubeadm join + Phase 4 cutover + Phase 4.5 etcd-backup fix + Phase 5 kured-gate + Phase 6 validation + Phase 7 chain extension). Phases 06 can land in one session; Phase 7 can be deferred a few days if needed.
## What's already in place from today's work

View file

@ -0,0 +1,325 @@
# HA Control Plane (3 masters) — Plan
**Date**: 2026-05-21 (locked + revised 2026-05-22 after challenger pass)
**Status**: Drafted, awaiting approval
**Pairs with**: `2026-05-21-ha-control-plane-design.md`
**Beads**: `code-n0ow`
## Goal
Migrate the single-master cluster to a 3-master HA control plane behind
a pfSense HAProxy VIP (`10.0.20.99:6443`), enabling autonomous k8s
upgrades without storm-cascade manual nursing.
## Topology — before / after
```
Before After
┌──────────────────────┐
│ pfSense HAProxy │
│ 10.0.20.99:6443 │
│ TCP, /readyz health │
└──┬───────┬───────┬───┘
┌───────────────┐ │ │ │
│ k8s-master │ ▼ ▼ ▼
│ 10.0.20.100 │ ┌──────────────┐ ┌────────────┐ ┌────────────┐
│ apiserver+etcd│ │k8s-master │ │k8s-master-2│ │k8s-master-3│
│ + workers join│ │10.0.20.100 │ │10.0.20.110 │ │10.0.20.111 │
│ directly │ │(VMID 200) │ │(VMID 205) │ │(VMID 206) │
└───────────────┘ │apiserver+etcd│ │apiserver+e.│ │apiserver+e.│
└──────────────┘ └────────────┘ └────────────┘
▲ ▲ ▲
└────────────────┼────────────────┘
etcd quorum (3 members, tolerates 1 down)
```
## Research decisions (locked — see design doc for full table)
| Decision | Value |
|---|---|
| LB strategy | pfSense HAProxy, TCP mode, HTTPS `/readyz` health check |
| VIP | `10.0.20.99` (FQDN `k8s-apiserver.viktorbarzin.lan`) |
| New master IPs | `10.0.20.110`, `10.0.20.111` |
| New master VMIDs | `205`, `206` |
| Master sizing | 8 vCPU, 32 GB RAM, 64 GB disk (matches existing) |
| VM provisioning | cloud-init via `create-template-vm` (template bumped v1.32 → v1.34 first; `k8s_join_command = ""` for masters) |
| etcd | stacked (kubeadm-managed) |
| Multi-master apiserver flags | rbac stack refactored to loop over master list (Phase 1.5) |
| controlPlaneEndpoint + cert SAN retrofit | Phase 0, before any new master joins |
| k8s-version-upgrade chain | extended to multi-master in Phase 7 |
## Callers / blast radius
| Surface | Path | Phase |
|---|---|---|
| Worker `/etc/kubernetes/kubelet.conf` × 4 | nodes 1-4 | 4.2 |
| `/home/wizard/code/infra/config` (root kubeconfig used by every `tg apply`) | repo root | 4.1 |
| `config.tfvars:115` (`kubernetes IN A 10.0.20.100` zone-file record) | repo root | 1.1 (delete) |
| `config.tfvars:231` (`k8s_join_command` for cloud-init template) | repo root | 4.1 (flip to VIP) |
| `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf` | `var.k8s_master_host` defaults | 1.5 (refactor to list) |
| `.woodpecker/{default,drift-detection,renew-tls,provision-user}.yml` (4 files × 2 refs each — kubeconfig `server:` AND `curl` lines) | repo root | 4.1 |
| `stacks/k8s-portal/.../files/src/routes/{download,setup/script}/+server.ts` (`CLUSTER_SERVER` const used to generate user kubeconfigs) | k8s-portal module | 4.1 |
| `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` (hard-coded `k8s-master` in phase_master) | stack | 7.1 |
| `stacks/infra-maintenance/.../main.tf` lines 98 + 218 (`node_name = "k8s-master"` on etcd-backup + defrag-etcd CronJobs) | stack | 4.5 |
| `kured-sentinel-gate` bash loop | `stacks/kured/main.tf` | 5.1 |
| `docs/architecture/compute.md`, `.claude/skills/uptime-kuma/SKILL.md`, runbooks | docs | 6.3 |
| **No-op surfaces** (confirmed clean): Vault (uses `kubernetes.default.svc`), Cloudflared (no apiserver tunnel), in-cluster `kubernetes.default.svc` / `10.96.0.1`, etcd-backup CORRECTNESS (snapshot is cluster-wide), kubeadm-managed etcd peer certs (auto-generated on join) | | — |
## Edge cases
- **Phase 0 apiserver restart (~30s)** = same blast radius as today's k8s upgrade (tigera/cnpg/gpu-operator briefly crash). The LB doesn't help here because the new cert isn't yet trusted by clients. Accept the brief outage. Schedule during a low-activity window.
- **`kubeadm-certs` secret TTL = 2h** (NOT 24h as initially stated). Phase 2 + 3 must complete within the window, or re-upload between them.
- **pfSense haproxy bootstrap = reset-to-declared-state** on each run (lines 155-158 of the script). Adding master-2 means the apiserver pool is briefly torn down + rebuilt. TCP frontends bounce. Long-poll connections from kubelets break + reconnect. Expect ~2-5s of "kubectl: unable to connect" during pool rewrites.
- **TCP health check is too lax** for apiserver (listener up ≠ ready). Phase 1 uses HTTPS `GET /readyz` with `verify none` — catches NotReady (etcd unreachable, controller-manager flapping).
- **Worker kubelet.conf flip**: kubelet TLS bootstrap re-auths against new endpoint on restart. Expect 5-10s NotReady per node during the Phase 4.2 loop.
- **VIP cannot be the existing master IP**: confirmed `.99` is free (no grep matches, no MetalLB pool conflict — pool is .200-.220).
- **pfSense reboot windows**: pre-Phase-4 OK (clients still on direct IP), post-Phase-4 breaks everything. Don't migrate near a pfSense maintenance window.
## Phased plan
Reversible up to Phase 4. Phase 4+ reverse via the rollback section.
### Phase 0 — Retrofit existing cluster (~30 min, ~30s of apiserver outage)
- [ ] **0.1 Pre-flight**
- [ ] Cluster healthy: `kubectl get nodes` (all Ready), `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded` empty
- [ ] Recent etcd backup valid: `ls -lh /srv/nfs/etcd-backup/ | tail -5`
- [ ] Proxmox VM snapshot of `k8s-master`: `ssh root@192.168.1.127 qm snapshot 200 pre-ha-retrofit`
- [ ] IPs free: `for ip in 99 110 111; do ping -c1 -W1 10.0.20.$ip && echo "BUSY $ip" || echo "free $ip"; done`
- [ ] **0.2 Patch `kubeadm-config` ConfigMap via kubeadm (NOT kubectl apply)**
- [ ] On master: `sudo kubeadm config print init-defaults --component-configs=KubeletConfiguration > /tmp/kubeadm-new.yaml`
- [ ] Hand-edit /tmp/kubeadm-new.yaml: take the existing CM as base, add `controlPlaneEndpoint: 10.0.20.99:6443` under ClusterConfiguration, add `apiServer.certSANs: [10.0.20.99, k8s-apiserver.viktorbarzin.lan]`
- [ ] Apply via kubeadm (kubeadm-owned, future `kubeadm upgrade apply` won't overwrite): `sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-new.yaml`
- [ ] Verify: `kubectl -n kube-system get cm kubeadm-config -o yaml | grep -E 'controlPlaneEndpoint|certSANs'`
- [ ] **0.3 Regen apiserver cert**
- [ ] On master: `sudo mkdir -p /tmp/apiserver-backup && sudo mv /etc/kubernetes/pki/apiserver.{crt,key} /tmp/apiserver-backup/`
- [ ] `sudo kubeadm init phase certs apiserver` (reads patched kubeadm-config)
- [ ] Verify: `sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A2 'Subject Alternative'` — expect `IP Address:10.0.20.99` PLUS existing SANs (kubeadm adds, doesn't replace)
- [ ] **0.4 Restart kube-apiserver static pod**
- [ ] On master: `sudo kubectl -n kube-system delete pod kube-apiserver-k8s-master --force --grace-period=0`
- [ ] Wait: `kubectl wait --for=condition=Ready pod/kube-apiserver-k8s-master -n kube-system --timeout=180s`
- [ ] Verify: `kubectl get nodes` works (apiserver alive on direct IP)
- [ ] **0.5 Panic-mode rollback procedure (DOCUMENTED ONLY — only run if 0.4 fails)**
- [ ] `sudo cp /tmp/apiserver-backup/apiserver.{crt,key} /etc/kubernetes/pki/`
- [ ] `sudo systemctl restart kubelet` (forces static pod re-read)
- [ ] Wait for apiserver Ready; revert kubeadm-config edits via the file backup
- [ ] **0.6 Verify operators recovered from brief outage**
- [ ] `kubectl get pods -n calico-system -l app=tigera-operator -o wide` — Running, restart count incremented by 1 max
- [ ] `kubectl get pods -n gpu-operator -o wide` — same
- [ ] `kubectl get pods -n cnpg-system -o wide` — same
### Phase 1 — pfSense HAProxy + DNS (~30 min)
- [ ] **1.1 Reserve VIP `10.0.20.99` + DNS**
- [ ] Add Virtual IP on pfSense (Firewall → Virtual IPs → IP Alias on VLAN20, `10.0.20.99/24`)
- [ ] Add `k8s-apiserver-vip → 10.0.20.99` host alias (Firewall → Aliases → Hosts)
- [ ] phpIPAM: register `10.0.20.99` under section "K8s cluster"
- [ ] Add DNS A record `k8s-apiserver IN A 10.0.20.99` to `config.tfvars` (and **delete** stale `kubernetes IN A 10.0.20.100` on line 115)
- [ ] `scripts/tg apply -target=module.technitium` — confirm zone reload
- [ ] **1.2 Extend `infra/scripts/pfsense-haproxy-bootstrap.php` for apiserver pool with HTTPS health check**
- [ ] Add `build_pool_https()` helper variant (or add `$use_https_readyz` param to existing `build_pool()`) that emits `check_type='HTTP'`, `monitor_uri='/readyz'`, `httpchk_method='GET'`, `ssl='yes'`, `sslverify='no'`
- [ ] Add `'apiserver_nodes'` to `$POOL_NAMES`; `'apiserver_proxy_6443'` to `$FRONTEND_NAMES`
- [ ] `build_pool_https('apiserver_nodes', '6443', [['k8s-master', '10.0.20.100']])`
- [ ] `build_frontend('apiserver_proxy_6443', 'K8s apiserver VIP', '10.0.20.99', '6443', 'apiserver_nodes')`
- [ ] **1.3 Deploy + validate**
- [ ] `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/ && ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'`
- [ ] `ssh admin@10.0.20.1 'sockstat -l | grep 10.0.20.99:6443'` — expect haproxy listening
- [ ] `ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" | grep apiserver` — backend UP (op_state=2)
- [ ] **1.4 Smoke via VIP**
- [ ] From devvm: `curl --cacert /etc/kubernetes/pki/ca.crt https://10.0.20.99:6443/readyz` — expect `ok`
- [ ] Build a transient kubeconfig pointing at VIP, run `kubectl get nodes` — succeeds
- [ ] **If TLS validation fails: STOP — Phase 0 cert regen didn't include VIP**, rollback Phase 1 and retry Phase 0
### Phase 1.5 — Refactor rbac stack for multi-master (~45 min)
- [ ] **1.5.1 Refactor `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf`**
- [ ] Replace `var.k8s_master_host = "10.0.20.100"` with `var.k8s_master_hosts = list(string)` (default `["10.0.20.100"]`)
- [ ] Wrap each `null_resource` / `provisioner "remote-exec"` block in `for_each = toset(var.k8s_master_hosts)` so the same sed runs on every master
- [ ] In `stacks/rbac/main.tf` set `k8s_master_hosts = ["10.0.20.100"]` (still single-master in this phase — variable is forward-looking, no behaviour change yet)
- [ ] **1.5.2 `scripts/tg apply` rbac stack** — confirm zero diff against today (no-op refactor)
- [ ] **1.5.3 Verify** — sanity: `ssh wizard@k8s-master 'sudo grep oidc-issuer-url /etc/kubernetes/manifests/kube-apiserver.yaml | wc -l'` — expect `1`. Cluster healthy.
### Phase 2 — Cloud-init template bump + master-2 (~75 min)
- [ ] **2.0 Bump cloud-init template (one-time)**
- [ ] Edit `infra/modules/create-template-vm/cloud_init.yaml`:
- line 49: apt source `pkgs.k8s.io/core:/stable:/v1.32/deb/``pkgs.k8s.io/core:/stable:/v1.34/deb/`
- line 135: wrap `${k8s_join_command}` in a conditional via cloud-init `if:` template logic, or simpler: add `${k8s_join_command_or_noop}` and let the module pass `""` for masters and the real worker join command for workers (default)
- [ ] Update `infra/modules/create-template-vm/main.tf` to add `variable "k8s_join_command" { default = "" }` and a conditional in the templatefile to skip the runcmd line when empty
- [ ] Rebuild the template: `scripts/tg apply -target=module.k8s_template` (or whatever the existing template-build target name is in `stacks/infra/main.tf`)
- [ ] Verify new template registered in Proxmox at the same template_id
- [ ] **2.1 Add master-2 VM to Terraform**
- [ ] In `stacks/infra/main.tf`: add `module "k8s-master-2"` using `create-vm` from the (now-v1.34) k8s template, with master sizing (8 vCPU / 32GB / 64GB), VMID 205, IP `10.0.20.110`, unique MAC, `vmbr1/vlan 20`, `use_cloud_init = true`, and explicitly pass `k8s_join_command = ""` (so first-boot does NOT auto-join as worker)
- [ ] `scripts/tg apply -target=module.k8s-master-2`
- [ ] Verify VM booted: `ssh wizard@k8s-master-2.viktorbarzin.lan uname -a` (expect Ubuntu 26.04 LTS, kernel 7.0.x)
- [ ] **2.2 Prep master-2 for kubeadm join**
- [ ] Confirm versions: `ssh wizard@k8s-master-2.viktorbarzin.lan 'kubeadm version; containerd --version'` — expect kubeadm v1.34.x, containerd 2.2.2+
- [ ] DNS resolves: `getent hosts k8s-master-2.viktorbarzin.lan`
- [ ] **2.3 Upload certs on existing master**
- [ ] `sudo kubeadm init phase upload-certs --upload-certs` → records `--certificate-key <KEY>`
- [ ] **2h TTL** — Phase 2 + 3 must complete within window or re-upload
- [ ] **2.4 Generate join command**
- [ ] `sudo kubeadm token create --print-join-command``kubeadm join 10.0.20.99:6443 --token <T> --discovery-token-ca-cert-hash sha256:<H>`
- [ ] Append `--control-plane --certificate-key <KEY>`
- [ ] **2.5 Run join on master-2**
- [ ] `ssh wizard@k8s-master-2.viktorbarzin.lan` → run sudo join command from 2.4
- [ ] Wait for "This node has joined the cluster"
- [ ] **2.6 Update rbac stack to include master-2 (propagates OIDC/audit/etcd tuning to it)**
- [ ] Edit `stacks/rbac/main.tf`: `k8s_master_hosts = ["10.0.20.100", "10.0.20.110"]`
- [ ] `scripts/tg apply` rbac stack
- [ ] Verify: `ssh wizard@k8s-master-2 'sudo grep -c oidc-issuer-url /etc/kubernetes/manifests/kube-apiserver.yaml'` — expect `1`
- [ ] **2.7 Smoke**
- [ ] `kubectl get nodes` — 6 nodes, master-2 Ready control-plane
- [ ] `kubectl -n kube-system get pods -o wide | grep k8s-master-2` — 4 static pods Running
- [ ] etcd member list shows 2 members
- [ ] `kubectl --server=https://10.0.20.110:6443 get nodes` — direct probe works
- [ ] **2.8 Add master-2 to LB pool**
- [ ] Edit `pfsense-haproxy-bootstrap.php`: pool now `[['k8s-master', '10.0.20.100'], ['k8s-master-2', '10.0.20.110']]`
- [ ] Deploy + verify both backends UP
### Phase 3 — master-3 (~45 min) — same pattern as Phase 2
- [ ] **3.1 Add `module.k8s-master-3` to Terraform** (VMID 206, IP `10.0.20.111`, same template, `k8s_join_command = ""`)
- [ ] **3.2 Prep verify**
- [ ] **3.3 Re-upload certs if >2h since Phase 2.3, refresh `--certificate-key`**
- [ ] **3.4 Generate fresh join command**
- [ ] **3.5 Run join on master-3**
- [ ] **3.6 Update rbac stack: `k8s_master_hosts = [".100", ".110", ".111"]`, apply, verify master-3 has OIDC flag**
- [ ] **3.7 Smoke (7 nodes, 3 control-plane, etcd quorum 3/3)**
- [ ] **3.8 Add master-3 to LB pool — all three backends UP**
### Phase 4 — Cut over clients and workers to VIP (~45 min)
- [ ] **4.1 Update in-repo kubeconfig consumers (single commit)**
- [ ] `/home/wizard/code/infra/config` — flip `server:` to `https://10.0.20.99:6443`
- [ ] `config.tfvars:231``k8s_join_command` to `kubeadm join 10.0.20.99:6443 ...`
- [ ] `stacks/rbac/modules/rbac/apiserver-oidc.tf` — variable `default = "10.0.20.99"` (or whatever the multi-master refactor needs)
- [ ] `.woodpecker/default.yml` — flip server: AND curl URL
- [ ] `.woodpecker/drift-detection.yml` — flip server: AND curl URL
- [ ] `.woodpecker/renew-tls.yml` — flip curl URL (line 18)
- [ ] `.woodpecker/provision-user.yml` — flip curl URL (line 41)
- [ ] `stacks/k8s-portal/modules/k8s-portal/files/src/routes/download/+server.ts``CLUSTER_SERVER` const
- [ ] `stacks/k8s-portal/modules/k8s-portal/files/src/routes/setup/script/+server.ts` — same
- [ ] Final sweep: `cd /home/wizard/code/infra && grep -rn '10.0.20.100:6443' --include='*.tf' --include='*.yml' --include='*.yaml' --include='*.ts' --include='*.php' --include='*.sh'` — handle anything remaining
- [ ] `scripts/tg apply` for rbac + k8s-portal (and any other stacks touched)
- [ ] Commit + push (single conventional commit referencing `code-n0ow`)
- [ ] **4.2 Worker `kubelet.conf` flip (one at a time, with 5-10s expected NotReady)**
```bash
for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
echo "=== $n ==="
ssh wizard@$n.viktorbarzin.lan "sudo sed -i.bak 's|server: https://10.0.20.100:6443|server: https://10.0.20.99:6443|' /etc/kubernetes/kubelet.conf"
ssh wizard@$n.viktorbarzin.lan "sudo systemctl restart kubelet"
kubectl wait --for=condition=Ready node/$n --timeout=180s
echo "$n Ready"
sleep 15
done
```
- [ ] **4.3 Existing master's `kubelet.conf`** — same sed + restart on `k8s-master`
- [ ] **4.4 Verify master-2 + master-3 kubelet.conf already at VIP** (cloud-init join used VIP via controlPlaneEndpoint)
- [ ] **4.5 Verify everything**
- [ ] `kubectl get nodes` — all 7 Ready
- [ ] `kubectl --kubeconfig ~/.kube/config config view --minify -o jsonpath='{.clusters[0].cluster.server}'``https://10.0.20.99:6443`
- [ ] Worker loop: `for n in k8s-{master,node1,node2,node3,node4,master-2,master-3}; do ssh wizard@$n.viktorbarzin.lan "sudo grep server: /etc/kubernetes/kubelet.conf"; done` — all show VIP
- [ ] Trigger a no-op Woodpecker pipeline (commit a typo fix in a runbook) — verify the kubeconfig path through the new VIP
### Phase 4.5 — Fix etcd-backup CronJob node pinning (~15 min)
- [ ] **4.5.1 Edit `stacks/infra-maintenance/modules/infra-maintenance/main.tf`**
- [ ] backup-etcd (line 98): replace `node_name = "k8s-master"` with `nodeSelector { "node-role.kubernetes.io/control-plane" = "" }` + the corresponding toleration block
- [ ] defrag-etcd (line 218): same change
- [ ] **4.5.2 `scripts/tg apply` infra-maintenance**
- [ ] **4.5.3 Verify backup runs** — trigger a manual job-from-cronjob, confirm it lands on one of the 3 masters and produces a valid snapshot
### Phase 5 — kured-sentinel-gate quorum check (~15 min)
- [ ] **5.1 Edit `infra/stacks/kured/main.tf`** (insert into the bash heredoc in the sentinel-gate ConfigMap, between all-nodes-Ready and calico-Ready checks)
```bash
# Check 3b: control-plane quorum safety (HA invariant)
CP_READY=$(kubectl get nodes -l node-role.kubernetes.io/control-plane= --no-headers | grep ' Ready ' | wc -l | tr -d ' ')
if [ "$CP_READY" -lt 2 ]; then
echo " BLOCKED: Only $CP_READY control-plane node(s) Ready (need ≥2 for HA)"
rm -f /host/var-run/gated-reboot-required
sleep 300
continue
fi
echo " Control-plane quorum safe ($CP_READY Ready)"
```
- [ ] **5.2 `scripts/tg apply` kured**
- [ ] **5.3 Verify**
- [ ] `kubectl -n kured logs ds/kured-sentinel-gate | tail -50` — expect "Control-plane quorum safe (3 Ready)" line
- [ ] Negative test: cordon `k8s-master-2`, wait for the gate to re-evaluate, confirm block message. Restore.
### Phase 6 — E2E validation (~30 min)
- [ ] **6.1 Failover test**
- [ ] `kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets`
- [ ] `ssh wizard@k8s-master.viktorbarzin.lan sudo reboot`
- [ ] During the 50-90s reboot: tight loop `while true; do kubectl get nodes -o name | wc -l; sleep 2; done` from devvm — line count never drops to 0 (LB transparent)
- [ ] After boot: `kubectl uncordon k8s-master`, verify apiserver static pod re-registers in LB pool (op_state=2)
- [ ] **6.2 All-masters apiserver flag parity**
- [ ] `for h in k8s-master k8s-master-2 k8s-master-3; do echo "=== $h ==="; ssh wizard@$h.viktorbarzin.lan 'sudo grep -E "oidc-issuer-url|audit-policy|auto-compaction-retention|snapshot-count" /etc/kubernetes/manifests/{kube-apiserver,etcd}.yaml | sort'; done`
- [ ] Expect identical flag set across all 3 masters
- [ ] **6.3 Update documentation**
- [ ] Add `docs/architecture/control-plane.md` — HA topology, etcd member list, LB config location
- [ ] Update `.claude/reference/proxmox-inventory.md` — add VMIDs 205, 206
- [ ] Add `docs/runbooks/control-plane-add-remove-master.md`
- [ ] Update `docs/runbooks/restore-etcd.md` to cover 3-member quorum restore (was single-master only)
- [ ] Cross-link `docs/runbooks/mailserver-pfsense-haproxy.md` with the new apiserver_proxy_6443 pool
### Phase 7 — Extend k8s-version-upgrade chain to multi-master (~60 min)
- [ ] **7.1 Edit `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`**
- [ ] phase_master: discover masters dynamically — `MASTERS=$($KUBECTL get nodes -l node-role.kubernetes.io/control-plane= -o name | sed 's|node/||')`
- [ ] Wrap drain → `update_k8s.sh` → uncordon → wait-ready in a `for m in $MASTERS; do ... done` loop
- [ ] Between masters: quorum check — `READY=$($KUBECTL get nodes -l node-role.kubernetes.io/control-plane= --no-headers | grep ' Ready ' | wc -l); [ $READY -ge 2 ] || { slack "ABORT quorum lost"; exit 1; }`
- [ ] Update line 9 + 17 comment block to reflect multi-master phase
- [ ] Update line 326-340 containerd-bump section to loop over masters
- [ ] **7.2 Edit `phase_preflight` and the master phase pin**
- [ ] Line 209-210 (scheduling_block): allow any control-plane node to be the target
- [ ] Line 285 (`kubeadm upgrade plan` check): run against the first master in the list, not specifically `k8s-master`
- [ ] **7.3 `scripts/tg apply` k8s-version-upgrade**
- [ ] **7.4 Dry-run test**
- [ ] `kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)` (no actual upgrade pending — chain should noop the upgrade phase but exercise the discovery loop)
- [ ] Verify logs show 3 masters discovered in correct order
- [ ] **7.5 (Real test on next patch release)** — when 1.34.8 ships:
- [ ] Watch the chain execute drain → upgrade → uncordon across all 3 masters in turn
- [ ] Confirm no manual intervention needed
### Phase 8 — Close out
- [ ] **8.1 Update beads**`bd close code-n0ow` once all 6 acceptance criteria met (see below)
## Rollback plan
### Before Phase 4 (no clients flipped)
- **Phase 0**: restore apiserver cert/key from `/tmp/apiserver-backup/`, edit kubeadm-config back, restart kubelet on master.
- **Phase 1**: remove `apiserver_proxy_6443` + `apiserver_nodes` from `pfsense-haproxy-bootstrap.php`, re-run; revert DNS A record in config.tfvars.
- **Phase 1.5**: revert rbac stack to single `k8s_master_host` var; apply.
- **Phase 2/3**: on failed master `sudo kubeadm reset --force`; from a surviving master `etcdctl member remove <id>`; `tg destroy -target=module.k8s-master-N`.
### After Phase 4 (clients flipped)
- Revert all the Phase 4.1 file changes (single revert commit).
- Reverse the kubelet.conf sed loop (VIP → direct IP) on all 7 nodes.
- Phase 0 controlPlaneEndpoint can stay — harmless even on full rollback.
### Worst case (etcd corruption / multi-master split-brain)
- Restore from latest etcd snapshot via `etcdctl snapshot restore` to a single master.
- Rebuild master VM from the Proxmox snapshot taken in Phase 0.1.
- Cluster back to single-master.
## Acceptance criteria (beads `code-n0ow`)
- [ ] 1. Design doc + plan doc written ✓ (this commit)
- [ ] 2. Plan approved by user
- [ ] 3. 3 masters online, etcd quorum healthy, apiserver LB working
- [ ] 4. k8s upgrade chain runs end-to-end across **all 3 masters** without manual intervention (Phase 7)
- [ ] 5. kured-sentinel-gate respects quorum (Phase 5)
- [ ] 6. etcd backup runs from any control-plane node (Phase 4.5)
## Open questions
None — all locked via 2026-05-22 decision pass + challenger amendment pass.

View file

@ -89,17 +89,26 @@ if [[ "$ROLE" == "master" ]]; then
# sync latency post-master-reboot can exceed it). The etcd image IS
# actually updated by then, so a 2nd attempt sees etcd already on
# target and skips it. Up to 3 attempts with a 30s delay between.
# First attempt: full kubeadm upgrade (incl. etcd). On the static-pod-
# hash 5min-timeout failure, retry with --etcd-upgrade=false. The
# timeout happens reliably for patch upgrades where etcd's image
# doesn't change (kubeadm writes identical manifest → hash doesn't
# change → kubeadm waits forever for a change that will never come).
# Skipping the etcd phase on retry is safe IF etcd is already on the
# right version (which is the only case where this timeout fires).
attempt=1
while ! sudo kubeadm upgrade apply "v$RELEASE" -y; do
extra_flags=""
while ! sudo kubeadm upgrade apply "v$RELEASE" -y $extra_flags; do
if (( attempt >= 3 )); then
echo "ERROR: kubeadm upgrade apply failed after 3 attempts" >&2
exit 1
fi
echo "==> kubeadm apply attempt $attempt failed (likely static-pod-hash 5m timeout). Sleeping 30s then retrying — the previous attempt's manifest writes usually take hold on the 2nd try."
echo "==> kubeadm apply attempt $attempt failed. Retrying with --etcd-upgrade=false (etcd image is unchanged for patch upgrades; kubeadm's static-pod-hash watch is the only thing failing)."
extra_flags="--etcd-upgrade=false"
sleep 30
attempt=$(( attempt + 1 ))
done
echo "==> kubeadm upgrade apply succeeded on attempt $attempt"
echo "==> kubeadm upgrade apply succeeded on attempt $attempt (flags: '$extra_flags')"
else
echo "==> Worker path: kubeadm upgrade node"
sudo kubeadm upgrade node

Binary file not shown.

Binary file not shown.

View file

@ -150,19 +150,6 @@ module "ingress" {
}
}
module "ingress-www" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
namespace = kubernetes_namespace.website.metadata[0].name
name = "blog-www"
service_name = module.anubis.service_name
port = module.anubis.service_port
extra_middlewares = ["traefik-x402@kubernetescrd"]
full_host = "www.viktorbarzin.me"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00

View file

@ -271,10 +271,20 @@ resource "kubernetes_cron_job_v1" "imap" {
}
spec {
restart_policy = "OnFailure"
# The broker image's user is uid=10001 gid=999, but the shared
# data PVC's /data root was created with gid=10001 (legacy from
# an earlier image build). Without fsGroup the pod can't write
# to the directory sqlite3 can't create the journal next to
# sync.db, hits 'attempt to write a readonly database'.
# fsGroup=10001 adds the matching gid to the pod's supplemental
# groups so writes succeed.
security_context {
fs_group = 10001
}
container {
name = "broker-sync"
image = local.broker_sync_image
command = ["broker-sync", "imap"]
command = ["broker-sync", "imap-ingest"]
env {
name = "BROKER_SYNC_DATA_DIR"

View file

@ -454,10 +454,10 @@ resource "kubernetes_deployment" "claude_agent" {
resources {
requests = {
cpu = "500m"
memory = "2Gi"
memory = "1Gi"
}
limits = {
memory = "4Gi"
memory = "2Gi"
}
}
}

View file

@ -145,16 +145,6 @@ resource "cloudflare_record" "mail_mx" {
}
resource "cloudflare_record" "mail_domainkey" {
content = "\"v=DKIM1; k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDIDLB8mhAHNqs1s6GeZMQHOxWweoNKIrqo5tqRM3yFilgfPUX34aTIXNZg9xAmlK+2S/xXO1ymt127ZGMjnoFKOEP8/uZ54iHTCnioHaPZWMfJ7o6TYIXjr+9ShKfoJxZLv7lHJ2wKQK3yOw4lg4cvja5nxQ6fNoGRwo+mQ/mgJQIDAQAB\""
name = "s1._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
priority = 1
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "mail_spf" {
# Brevo replaced Mailgun as the outbound relay on 2026-04-12 (see docs/architecture/mailserver.md).
# Soft-fail (~all) is intentional during cutover revisit once relay delivery is stable.

View file

@ -47,6 +47,16 @@ resource "helm_release" "cnpg" {
memory = "256Mi"
}
}
# Tune webhook-cert renewal threshold. CNPG default is 7 days remaining,
# which leaves no buffer when the cluster-health check (#22) flags
# certs at <30d. Bump to 30 days so the operator rotates well before
# external monitoring notices. Cert lifetime stays at chart default 90d.
config = {
data = {
EXPIRING_CHECK_THRESHOLD = "30"
}
}
})]
}

View file

@ -5,7 +5,7 @@ agent:
resources:
requests:
cpu: 25m
memory: 64Mi
memory: 128Mi
limits:
memory: 512Mi
priorityClassName: "tier-1-cluster"

View file

@ -172,11 +172,22 @@ resource "kubernetes_cluster_role" "k8s_upgrade_job" {
# --ignore-daemonsets` can classify each pod's owner. Without daemonsets
# GET permission, drain bails with "cannot delete daemonsets ... is
# forbidden" for every daemonset-managed pod on the node. (2026-05-20)
#
# `patch` on deployments added 2026-05-23: phase_master scales tigera-operator
# to 0 before drain (operator crashloops during apiserver static-pod swaps,
# generates I/O storm that breaks kubeadm's 5-min watch) and back to 1
# after master is upgraded. Until HA control plane lands (beads code-n0ow),
# this is how we keep autonomous upgrades unblocked.
rule {
api_groups = ["apps"]
resources = ["daemonsets", "statefulsets", "replicasets", "deployments"]
verbs = ["get", "list"]
}
rule {
api_groups = ["apps"]
resources = ["deployments", "deployments/scale"]
verbs = ["patch", "update"]
}
# Chain dispatch create the next Job; reconcile via apply on retry.
# In `default` ns to also create the etcd-snapshot Job from cronjob/backup-etcd.
rule {
@ -359,11 +370,17 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
exit 0
fi
# 1. Detect running version
# 1. Detect running version use the OLDEST kubelet across
# all nodes so partial chains (e.g. master upgraded but
# workers still pending) don't trick the chain into
# thinking the upgrade is complete. Was `.items[0]` (master
# only) which made the chain skip when workers were behind.
# Fixed 2026-05-23 after node4-only chain failure.
RUNNING=$(/usr/local/bin/kubectl get nodes \
-o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' | tr -d v)
-o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
| tr -d v | sort -V | head -1)
RUNNING_MINOR=$(echo "$RUNNING" | awk -F. '{print $1"."$2}')
echo "Running version: v$RUNNING (minor $RUNNING_MINOR)"
echo "Running version (oldest kubelet): v$RUNNING (minor $RUNNING_MINOR)"
# 2. Latest patch within current minor (refresh master's apt cache)
LATEST_PATCH=$($SSH wizard@k8s-master.viktorbarzin.lan \

View file

@ -94,18 +94,41 @@ push() {
halt_on_alert_query() {
local extra_ignore="${1:-}"
local regex='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor'
[ -n "$extra_ignore" ] && regex="$regex|$extra_ignore"
regex="$regex)$"
# ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
# alerts with severity=critical. Any warning/info-level alert is treated
# as informational and doesn't block the chain.
#
# Why this is the right model:
# - The cluster has long-running warning-level alerts that are NOT
# blockers for a k8s patch (e.g. GPU operator crashloop on the GPU
# node, ingress latency spikes, IO-wait warnings).
# - Maintaining a denylist of every "noisy" alert is a losing battle.
# - Critical alerts are the only ones that should actually stop us
# mid-chain (apiserver down, etcd down, node not ready, etc.).
#
# `extra_ignore` is now mostly historical — kept for backwards compat with
# `halt_on_alert_query "RecentNodeReboot|IngressTTFBCritical"`-style calls. With severity-based
# filtering, RecentNodeReboot (severity=info) is filtered automatically.
# We still build the regex for any critical alert the caller wants to
# explicitly ignore (e.g. a known-broken thing we're aware of).
local ignore_regex=""
[ -n "$extra_ignore" ] && ignore_regex="^($extra_ignore)\$"
# `grep -vE` returns 1 when nothing matches, which under `set -o pipefail`
# bubbles up and (via the caller's `alerts=$(...)`) aborts the whole script.
# Trailing `|| true` keeps a no-alerts-firing cluster from looking like a
# script error. Discovered 2026-05-19 when the chain wouldn't fire on a
# genuinely-clean cluster (every alert was Watchdog/RebootRequired/etc.).
curl -sf "$PROM/api/v1/alerts" \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| { grep -vE "$regex" || true; } | sort -u
# `grep` returns 1 when nothing matches → under `set -o pipefail` that
# bubbles up and aborts the script via the caller's `alerts=$(...)`.
# Trailing `|| true` on each grep handles the no-matches case.
local critical_firing
critical_firing=$(curl -sf "$PROM/api/v1/alerts" \
| jq -r '.data.alerts[]
| select(.state == "firing" and .labels.severity == "critical")
| .labels.alertname' 2>/dev/null \
| sort -u || true)
if [ -n "$ignore_regex" ]; then
echo "$critical_firing" | { grep -vE "$ignore_regex" || true; }
else
echo "$critical_firing"
fi
}
wait_for_node_ready() {
@ -257,7 +280,7 @@ phase_preflight() {
# is set, often daily). Now skipped — check 3 is the single source of truth
# for "is the cluster quiet enough to upgrade".
local alerts
alerts=$(halt_on_alert_query RecentNodeReboot)
alerts=$(halt_on_alert_query "RecentNodeReboot|IngressTTFBCritical")
if [ -n "$alerts" ]; then
slack "ABORT preflight — firing alerts:\n$alerts"
exit 1
@ -357,15 +380,46 @@ phase_preflight() {
}
phase_master() {
# Idempotency: skip the whole phase if k8s-master is already on target.
# The chain can re-run after a partial failure (e.g. workers got cut
# short); without this short-circuit we re-drain and re-kubeadm an
# already-upgraded master for no reason. Added 2026-05-23.
local current_v
current_v=$($KUBECTL get node k8s-master -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v)
if [ "$current_v" = "$TARGET_VERSION" ]; then
slack "k8s-master already on v$TARGET_VERSION (kubelet=$current_v) — skipping master phase"
echo "k8s-master already on v$TARGET_VERSION — skipping"
return 0
fi
slack "Draining k8s-master"
# Re-check halt-on-alert before drain. Always ignore RecentNodeReboot —
# the chain itself causes node reboots, so this alert firing is expected
# mid-chain (e.g. master was already upgraded+rebooted before this phase).
local alerts
alerts=$(halt_on_alert_query RecentNodeReboot)
alerts=$(halt_on_alert_query "RecentNodeReboot|IngressTTFBCritical")
[ -n "$alerts" ] && { slack "ABORT master — alerts firing pre-drain: $alerts"; exit 1; }
# Quiesce noisy operators that crashloop when apiserver briefly disappears
# during the static-pod manifest swaps. The crashloop generates a disk-I/O
# storm (~500 MB/s observed from tigera-operator alone) that slows the
# apiserver↔kubelet status sync past kubeadm's hardcoded 5-min watch on
# `kubernetes.io/config.hash`, causing kubeadm to roll back the upgrade.
#
# The data plane (calico-node DaemonSet, calico-typha, calico-kube-controllers)
# keeps running unchanged — only the OPERATOR (a config reconciler) goes away
# briefly. Restored at the end of the phase below.
#
# If the chain dies between quiesce and restore (e.g. kubeadm fails),
# manually restore with:
# kubectl -n tigera-operator scale deploy tigera-operator --replicas=1
#
# Long-term fix: HA control plane (3 masters) so apiserver never goes down
# — see docs/plans/2026-05-21-ha-control-plane-{design,plan}.md (beads code-n0ow).
echo "Quiescing tigera-operator before master upgrade (it crashes on apiserver outage)"
$KUBECTL -n tigera-operator scale deploy tigera-operator --replicas=0 2>&1 || true
drain_node k8s-master
slack "Running update_k8s.sh on k8s-master (--role master --release $TARGET_VERSION)"
@ -387,21 +441,37 @@ phase_master() {
exit 1
fi
alerts=$(halt_on_alert_query RecentNodeReboot)
alerts=$(halt_on_alert_query "RecentNodeReboot|IngressTTFBCritical")
[ -n "$alerts" ] && { slack "ABORT master — alerts firing post-upgrade: $alerts"; exit 1; }
# Restore tigera-operator (quiesced before drain). It reconciles in seconds.
echo "Restoring tigera-operator"
$KUBECTL -n tigera-operator scale deploy tigera-operator --replicas=1 2>&1 || true
slack "Master on v$TARGET_VERSION, control-plane Running. Dispatching worker chain."
}
phase_worker() {
[ -z "$TARGET_NODE" ] && { echo "ERROR: worker phase requires TARGET_NODE"; exit 2; }
# Idempotency: skip if target node is already on target version. Same
# rationale as phase_master — chains re-running after partial completion
# shouldn't re-drain an already-upgraded worker. Added 2026-05-23.
local current_v
current_v=$($KUBECTL get node "$TARGET_NODE" -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v)
if [ "$current_v" = "$TARGET_VERSION" ]; then
slack "$TARGET_NODE already on v$TARGET_VERSION (kubelet=$current_v) — skipping worker phase"
echo "$TARGET_NODE already on v$TARGET_VERSION — skipping"
return 0
fi
slack "Draining $TARGET_NODE"
# Halt-on-alert wait (up to 30 min). Ignore RecentNodeReboot — the chain
# just rebooted a node, that's the cause and is expected.
local attempt alerts
for attempt in $(seq 1 30); do
alerts=$(halt_on_alert_query RecentNodeReboot)
alerts=$(halt_on_alert_query "RecentNodeReboot|IngressTTFBCritical")
[ -z "$alerts" ] && break
echo "Waiting for alerts to clear (attempt $attempt/30): $alerts"
sleep 60
@ -432,7 +502,7 @@ phase_worker() {
# 10-min soak with halt-on-alert (RecentNodeReboot ignored — we know we restarted it)
echo "Soaking $TARGET_NODE for 10 min..."
for i in $(seq 1 10); do
alerts=$(halt_on_alert_query RecentNodeReboot)
alerts=$(halt_on_alert_query "RecentNodeReboot|IngressTTFBCritical")
[ -n "$alerts" ] && { slack "ABORT $TARGET_NODE mid-soak — alerts: $alerts"; exit 1; }
sleep 60
done
@ -458,7 +528,7 @@ phase_postflight() {
# No alerts firing. Ignore RecentNodeReboot — by definition we just
# rebooted every node; this alert clears naturally in <1h.
local alerts
alerts=$(halt_on_alert_query RecentNodeReboot)
alerts=$(halt_on_alert_query "RecentNodeReboot|IngressTTFBCritical")
[ -n "$alerts" ] && slack "Postflight WARN — alerts still firing (cluster on target, please check):\n$alerts"
# Pod-ready ratio

View file

@ -328,25 +328,35 @@ resource "kubectl_manifest" "policy_require_trusted_registries" {
"docker.n8n.io/*", "registry.gitlab.com/*",
# Private
"forgejo.viktorbarzin.me/*", "10.0.20.10*",
# Legacy private registry (decommissioned 2026-05-07 per CLAUDE.md
# but council-complaints still references migrate to Forgejo).
"registry.viktorbarzin.me/*",
# DockerHub library (bare image names without slash)
"alpine*", "busybox*", "kong*", "mysql*", "nginx*", "postgres*", "python*",
# DockerHub user repos (no registry prefix, has slash)
# enumerated from current cluster state.
"actualbudget/*", "afadil/*", "binwiederhier/*", "bitnami/*",
# enumerated from current cluster state. New entries added
# 2026-05-22 after Enforce caught these as unallowlisted:
# amruthpillai (resume), athomasson2 (ebook2audiobook),
# netboxcommunity (netbox), nousresearch (hermes-agent),
# opentripplanner (osm-routing), rhasspy (whisper/piper).
"actualbudget/*", "afadil/*", "amruthpillai/*", "athomasson2/*",
"binwiederhier/*", "bitnami/*",
"clickhouse/*", "cloudflare/*", "coturn/*", "crowdsecurity/*",
"curlimages/*", "deluan/*", "dgtlmoon/*", "dolthub/*",
"dpage/*", "dperson/*", "edoburu/*", "esanchezm/*",
"freikin/*", "freshrss/*", "hackmdio/*", "hashicorp/*",
"headscale/*", "jhonderson/*", "kebe/*", "library/*",
"lissy93/*", "louislam/*", "matrixdotorg/*", "mendhak/*",
"mghee/*", "mindflavor/*", "mpepping/*", "netsampler/*",
"nvidia/*", "onlyoffice/*", "openresty/*", "owntracks/*",
"mghee/*", "mindflavor/*", "mpepping/*", "netboxcommunity/*",
"netsampler/*", "nousresearch/*", "nvidia/*", "onlyoffice/*",
"openresty/*", "opentripplanner/*", "owntracks/*",
"phpipam/*", "phpmyadmin/*", "privatebin/*", "prom/*",
"prompve/*", "rancher/*", "roundcube/*", "sclevine/*",
"prompve/*", "rancher/*", "rhasspy/*", "roundcube/*", "sclevine/*",
"shadowsocks/*", "shlinkio/*", "stirlingtools/*",
"technitium/*", "teddysun/*", "temporalio/*",
"typhonragewind/*", "tzahi12345/*", "vabene1111/*",
"vaultwarden/*", "viktorbarzin/*", "viren070/*", "zelest/*",
"vaultwarden/*", "viktorbarzin/*", "viren070/*",
"woodpeckerci/*", "zelest/*",
])
}]
}

View file

@ -373,10 +373,22 @@ resource "kubernetes_deployment" "llama_swap" {
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
# KEEL_LIFECYCLE_V1 stop the applykeel fight: every keel digest
# update patches `keel.sh/update-time` on the pod template and
# `kubernetes.io/change-cause` + bumps the K8s rollout revision on
# the Deployment. Without these ignore_changes, every `tg apply`
# reverts those, forcing a rollout, which keel then re-patches on
# the next 1h poll llama-swap was rolling several times a day
# (~10s model-load downtime each). Upstream :cuda nightly cadence
# still triggers a legitimate daily rollout.
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
]
}

View file

@ -3,7 +3,7 @@ variable "tier" { type = string }
variable "mailserver_accounts" {}
variable "postfix_account_aliases" {}
variable "opendkim_key" {}
variable "sasl_passwd" {} # For sendgrid i.e relayhost
variable "sasl_passwd" {} # SMTP relay (Brevo) SASL credentials
variable "nfs_server" { type = string }
# Build the virtual-alias map, dropping aliases where BOTH the source and
# target are real mailboxes in var.mailserver_accounts (and are different).
@ -83,7 +83,6 @@ resource "kubernetes_config_map" "mailserver_env_config" {
POSTFIX_MESSAGE_SIZE_LIMIT = 1024 * 1024 * 200 # 200 MB
POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME = "1"
# TLS_LEVEL = "intermediate"
# DEFAULT_RELAY_HOST = "[smtp.sendgrid.net]:587"
DEFAULT_RELAY_HOST = "[smtp-relay.brevo.com]:587"
SPOOF_PROTECTION = "1"
SSL_TYPE = "manual"

View file

@ -2,7 +2,6 @@
# see defaults - https://github.com/docker-mailserver/docker-mailserver/blob/master/target/postfix/main.cf
variable "postfix_cf" {
default = <<EOT
#relayhost = [smtp.sendgrid.net]:587
relayhost = [smtp-relay.brevo.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl/passwd

View file

@ -70,7 +70,7 @@ singleBinary:
resources:
requests:
cpu: 250m
memory: 2Gi
memory: 3Gi
limits:
memory: 4Gi

View file

@ -84,12 +84,12 @@ alertmanager:
- source_matchers:
- alertname = NodeDown
target_matchers:
- alertname =~ "NodeNotReady|NodeConditionBad|PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|NodeLowFreeMemory|PostgreSQLDown|RedisDown|HeadscaleDown|HeadscaleReplicasMismatch|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|NodeExporterDown|DockerRegistryDown|HomeAssistantDown|HomeAssistantCriticalSensorUnavailable|CloudflaredDown|TechnitiumDNSDown|iDRACRedfishMetricsMissing|iDRACSNMPMetricsMissing|HomeAssistantMetricsMissing"
- alertname =~ "NodeNotReady|NodeConditionBad|PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|NodeLowFreeMemory|PostgreSQLDown|RedisDown|HeadscaleDown|HeadscaleReplicasMismatch|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|ViktorBarzinApexDrift|ViktorBarzinApexProbeStale|NodeExporterDown|DockerRegistryDown|HomeAssistantDown|HomeAssistantCriticalSensorUnavailable|CloudflaredDown|TechnitiumDNSDown|iDRACRedfishMetricsMissing|iDRACSNMPMetricsMissing|HomeAssistantMetricsMissing"
# NFS down causes mass pod failures and NFS-dependent service outages
- source_matchers:
- alertname = NFSServerUnresponsive
target_matchers:
- alertname =~ "PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|PostgreSQLDown|RedisDown|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|HomeAssistantDown|HomeAssistantCriticalSensorUnavailable"
- alertname =~ "PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|PostgreSQLDown|RedisDown|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|ViktorBarzinApexDrift|ViktorBarzinApexProbeStale|HomeAssistantDown|HomeAssistantCriticalSensorUnavailable"
# Traefik down makes service-level alerts noise
- source_matchers:
- alertname = TraefikDown
@ -1870,12 +1870,21 @@ serverFiles:
annotations:
summary: "Kubelet {{ $labels.operation_type }} p99: {{ $value | printf \"%.0f\" }}s on {{ $labels.instance }} (threshold: 30s)"
- alert: KubeletRunningContainersDrop
expr: (kubelet_running_containers{container_state="running"} - kubelet_running_containers{container_state="running"} offset 10m) < -10
# Relative >50% drop vs. 10m ago, sustained for 5m.
# Absolute-count threshold removed 2026-05-18: routine drains
# routinely drop 10-30 containers and tripped the old `< -10`
# rule; only a >50% drop that persists 5m+ indicates a real
# node-level fault (kubelet hang, runtime crash, mass eviction).
expr: |
(
(kubelet_running_containers{container_state="running"} - kubelet_running_containers{container_state="running"} offset 10m)
/ kubelet_running_containers{container_state="running"} offset 10m
) < -0.5
for: 5m
labels:
severity: critical
annotations:
summary: "Running containers on {{ $labels.instance }} dropped by {{ $value | printf \"%.0f\" }} in 10m"
summary: "Running containers on {{ $labels.instance }} dropped >50% in 10m ({{ $value | printf \"%.2f\" }} ratio)"
- alert: CalicoNodeNotReady
expr: kube_daemonset_status_number_ready{namespace="calico-system", daemonset="calico-node"} < kube_daemonset_status_desired_number_scheduled{namespace="calico-system", daemonset="calico-node"}
for: 5m
@ -1934,8 +1943,11 @@ serverFiles:
annotations:
summary: "Node {{ $labels.node }} kubelet started {{ $value | humanizeDuration }} ago — 1h settle window halts further reboots"
- alert: MysqlStandaloneDown
# Single-replica StatefulSet: brief drain re-scheduling routinely
# takes 1-3 min during k8s upgrades. 3m suppresses those blips;
# real outages persist longer. Raised from 2m on 2026-05-18.
expr: kube_statefulset_status_replicas_ready{statefulset="mysql-standalone"} < 1
for: 2m
for: 3m
labels:
severity: critical
annotations:
@ -2178,6 +2190,9 @@ serverFiles:
annotations:
summary: "Critically slow ingress on {{ $labels.service }}: avg latency {{ $value | printf \"%.2f\" }}s (threshold: 3s for 5m)"
- alert: IngressErrorRate5xxHigh
# Rolling upgrades / pod migrations cause brief 5xx spikes that
# clear within 1-2 min. Only persistent 5xx indicates a real
# problem. Raised from 5m to 10m on 2026-05-18.
expr: |
(
sum(rate(traefik_service_requests_total{code=~"5..", service!~".*nextcloud.*"}[5m])) by (service)
@ -2186,11 +2201,11 @@ serverFiles:
) > 5
and sum(rate(traefik_service_requests_total{service!~".*nextcloud.*"}[5m])) by (service) > 0.1
and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
for: 5m
for: 10m
labels:
severity: critical
annotations:
summary: "5xx rate on {{ $labels.service }}: {{ $value | printf \"%.1f\" }}% (threshold: 5% for 5m)"
summary: "5xx rate on {{ $labels.service }}: {{ $value | printf \"%.1f\" }}% (threshold: 5% for 10m)"
- alert: AnubisChallengeStoreErrors
# Anubis exposes only Go-runtime metrics on :9090 (no anubis_* /
# challenge_* counters), so we proxy via Traefik 5xx on services
@ -2227,12 +2242,23 @@ serverFiles:
annotations:
summary: "Cloudflared: {{ $value | printf \"%.0f\" }} replica(s) unavailable"
- alert: MetalLBSpeakerDown
# kubelet restart during k8s upgrade briefly takes the speaker
# pod down; typical recovery is 30-45s. The full drain+kubeadm+
# apt+kubelet-restart+uncordon cycle in the chain's worker phase
# can take a single node out of MetalLB rotation for 5-7 min in
# the worst case (depending on PDB stickiness). 10m suppresses
# those upgrade-induced blips while still catching genuine
# speaker-down conditions.
# Reverted from 2m → 10m on 2026-05-23 after node4 upgrade
# tripped it mid-soak and aborted the chain. Previous value was
# 5m (set 2026-05-18) which was already correct; a brief patch
# had tightened it.
expr: |
(
kube_daemonset_status_desired_number_scheduled{namespace="metallb-system", daemonset="metallb-speaker"}
- on(namespace, daemonset) kube_daemonset_status_number_ready{namespace="metallb-system", daemonset="metallb-speaker"}
) > 0
for: 5m
for: 10m
labels:
severity: critical
annotations:
@ -2337,6 +2363,30 @@ serverFiles:
severity: warning
annotations:
summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace"
- alert: ViktorBarzinApexDrift
expr: viktorbarzin_apex_correct{job="viktorbarzin-apex-probe"} == 0
for: 10m
labels:
severity: critical
annotations:
summary: "viktorbarzin.me apex A drifted from expected 10.0.20.200"
description: "Technitium serves the split-horizon apex for ~80 *.viktorbarzin.me CNAMEs. If this is wrong, every internal service (auth, vault, immich, ha-sofia, ...) breaks. Check Technitium primary zone records via API or web console."
- alert: ViktorBarzinApexProbeStale
expr: (time() - viktorbarzin_apex_last_correct_timestamp{job="viktorbarzin-apex-probe"}) > 900
for: 5m
labels:
severity: warning
annotations:
summary: "viktorbarzin.me apex probe has not seen a correct result in >15 min"
description: "Probe may be failing intermittently or apex may be drifting. Check CronJob `viktorbarzin-apex-probe` in `technitium` namespace."
- alert: ViktorBarzinApexProbeNeverRun
expr: absent(viktorbarzin_apex_correct{job="viktorbarzin-apex-probe"})
for: 30m
labels:
severity: warning
annotations:
summary: "viktorbarzin.me apex probe never reported"
description: "Check `kubectl -n technitium get cronjob viktorbarzin-apex-probe` and the most recent job pod logs."
- alert: AIOStreamsStreamCountLow
expr: aiostreams_stream_count{job="aiostreams-stream-probe"} < 50
for: 30m

View file

@ -228,7 +228,7 @@ resource "kubernetes_deployment" "n8n" {
service_account_name = kubernetes_service_account.n8n.metadata[0].name
container {
name = "n8n"
image = "docker.n8n.io/n8nio/n8n:1.80.0"
image = "docker.n8n.io/n8nio/n8n:1.80.5"
env {
name = "N8N_PORT"
value = "5678"
@ -352,10 +352,10 @@ resource "kubernetes_deployment" "n8n" {
resources {
requests = {
cpu = "25m"
memory = "1Gi"
memory = "512Mi"
}
limits = {
memory = "1Gi"
memory = "512Mi"
}
}
}

View file

@ -37,7 +37,7 @@ driver:
resources:
requests:
cpu: "50m"
memory: "256Mi"
memory: "822Mi"
limits:
memory: "2Gi"

View file

@ -132,24 +132,29 @@ resource "kubernetes_config_map" "openclaw_config" {
mode = "off"
}
model = {
# ChatGPT Plus OAuth via openai-codex plugin (account:
# ancaelena98@gmail.com). gpt-5.4-mini is the only mini
# variant the Codex backend accepts for Plus tier;
# gpt-5-mini / gpt-5.1-codex-mini return model_not_found
# / "not supported with ChatGPT account". Plus rate-card:
# 1,2007,000 local msgs / 5h on gpt-5.4-mini.
#
# If you see "No API key found for provider openai-codex"
# / "OAuth refresh failed" in logs, the OAuth token has
# expired. Re-auth:
# kubectl -n openclaw exec -it $(kubectl -n openclaw \
# get pods -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
# -c openclaw -- node /app/openclaw.mjs models auth login \
# --provider openai-codex
# Follow the OAuth URL+code prompt. Tokens persist on the
# openclaw-home PVC so it sticks across pod restarts.
primary = "openai-codex/gpt-5.4-mini"
fallbacks = ["openai-codex/gpt-5.5", "modelrelay/auto-fastest", "nim/qwen/qwen3-coder-480b-a35b-instruct"]
# 2026-05-22: switched primary to nim/meta/llama-3.1-70b-instruct.
# Verified end-to-end with tool calls (sub-second responses,
# proper tool_calls in API response). Auth audit on this date:
# - openai-codex OAuth: EXPIRED (ancaelena98@gmail.com,
# ChatGPT Plus). Re-auth requires interactive TTY:
# kubectl -n openclaw exec -it $(kubectl -n openclaw \
# get pods -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
# -c openclaw -- node /app/openclaw.mjs models auth \
# login --provider openai-codex
# - secret/openclaw openai_api_key (sk-svcacct):
# insufficient_quota (billing exhausted)
# - openrouter_api_key: "Key limit exceeded"
# - llama_api_key: region-blocked
# - anthropic_api_key: sk-ant-oat- (OAuth refresh token,
# NOT a real x-api-key won't auth)
# - nvidia_api_key: WORKS. nim/meta/llama-3.1-70b-instruct
# and nim/meta/llama-4-maverick-17b-128e-instruct both
# tool-call reliably.
# Keep codex as a fallback so it auto-promotes once
# re-authed; modelrelay last because it routes to a
# small model that hallucinates instead of tool-calling.
primary = "nim/meta/llama-3.1-70b-instruct"
fallbacks = ["nim/meta/llama-4-maverick-17b-128e-instruct", "openai-codex/gpt-5.4-mini", "modelrelay/auto-fastest"]
}
models = {
"modelrelay/auto-fastest" = {}
@ -159,6 +164,8 @@ resource "kubernetes_config_map" "openclaw_config" {
"nim/qwen/qwen3-coder-480b-a35b-instruct" = {}
"nim/nvidia/llama-3.1-nemotron-ultra-253b-v1" = {}
"nim/z-ai/glm5" = {}
"nim/meta/llama-3.1-70b-instruct" = {}
"nim/meta/llama-4-maverick-17b-128e-instruct" = {}
"llama-as-openai/Llama-4-Maverick-17B-128E-Instruct-FP8" = {}
"llama-as-openai/Llama-4-Scout-17B-16E-Instruct-FP8" = {}
"openrouter/stepfun/step-3.5-flash:free" = {}
@ -244,6 +251,8 @@ resource "kubernetes_config_map" "openclaw_config" {
{ id = "qwen/qwen3-coder-480b-a35b-instruct", name = "Qwen 3 Coder", reasoning = false, input = ["text"], contextWindow = 262000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
{ id = "nvidia/llama-3.1-nemotron-ultra-253b-v1", name = "Nemotron Ultra 253B", reasoning = true, input = ["text"], contextWindow = 128000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
{ id = "z-ai/glm5", name = "GLM-5", reasoning = false, input = ["text"], contextWindow = 128000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
{ id = "meta/llama-3.1-70b-instruct", name = "Llama 3.1 70B Instruct", reasoning = false, input = ["text"], contextWindow = 128000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
{ id = "meta/llama-4-maverick-17b-128e-instruct", name = "Llama 4 Maverick (NIM)", reasoning = false, input = ["text"], contextWindow = 1000000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
]
}
openrouter = {
@ -1110,7 +1119,7 @@ resource "kubernetes_deployment" "openclaw" {
# at /home/node/.openclaw/.ssh (set up by init 5).
ln -sfn /home/node/.openclaw/.ssh /home/node/.ssh
node openclaw.mjs doctor --fix 2>/dev/null
node openclaw.mjs models set openai-codex/gpt-5.4-mini 2>/dev/null
node openclaw.mjs models set nim/meta/llama-3.1-70b-instruct 2>/dev/null
node openclaw.mjs mcp set ha "{\"url\":\"$HA_SOFIA_MCP_URL\",\"transport\":\"streamable-http\"}" 2>/dev/null
node openclaw.mjs mcp set context7 '{"command":"npx","args":["-y","@upstash/context7-mcp"]}' 2>/dev/null
node openclaw.mjs mcp set playwright '{"url":"http://localhost:3000/mcp","transport":"streamable-http"}' 2>/dev/null

View file

@ -207,10 +207,10 @@ resource "helm_release" "postiz" {
resources = {
requests = {
cpu = "100m"
memory = "512Mi"
memory = "2Gi"
}
limits = {
memory = "4Gi"
memory = "3Gi"
}
}

View file

@ -83,11 +83,13 @@ resource "helm_release" "proxmox_csi" {
}
}
# LUKS2 Argon2id key derivation needs ~1GiB memory
# LUKS2 Argon2id key derivation needs ~1GiB memory (memory id=712).
# Request bumped from 64Mi 1024Mi (2026-05-23) so the pod is reserved
# for the unlock burst instead of risking OOM under node pressure.
node = {
plugin = {
resources = {
requests = { cpu = "10m", memory = "64Mi" }
requests = { cpu = "10m", memory = "1024Mi" }
limits = { memory = "1280Mi" }
}
}

View file

@ -123,10 +123,10 @@ resource "kubernetes_deployment" "technitium_secondary" {
resources {
requests = {
cpu = "100m"
memory = "2Gi"
memory = "512Mi"
}
limits = {
memory = "2Gi"
memory = "512Mi"
}
}
port {
@ -285,10 +285,10 @@ resource "kubernetes_deployment" "technitium_tertiary" {
resources {
requests = {
cpu = "100m"
memory = "2Gi"
memory = "512Mi"
}
limits = {
memory = "2Gi"
memory = "512Mi"
}
}
port {

View file

@ -179,10 +179,10 @@ resource "kubernetes_deployment" "technitium" {
resources {
requests = {
cpu = "100m"
memory = "2Gi"
memory = "1Gi"
}
limits = {
memory = "2Gi"
memory = "1Gi"
}
}
port {
@ -696,3 +696,106 @@ resource "kubernetes_cron_job_v1" "technitium_dns_optimization" {
}
}
# viktorbarzin.me apex DNS drift probe
# Resolves `viktorbarzin.me A` against the Technitium LoadBalancer IP every
# 5 min and pushes a Pushgateway gauge. Backstop for the entire
# split-horizon zone: every internal `*.viktorbarzin.me` CNAME chains through
# this apex, so if it drifts (ISP rollover, accidental edit), this is the
# canary. Alerts: ViktorBarzinApexDrift, ApexProbeStale, ApexProbeNeverRun
# in stacks/monitoring/.
resource "kubernetes_cron_job_v1" "viktorbarzin_apex_probe" {
metadata {
name = "viktorbarzin-apex-probe"
namespace = kubernetes_namespace.technitium.metadata[0].name
}
spec {
concurrency_policy = "Replace"
schedule = "*/5 * * * *"
successful_jobs_history_limit = 1
failed_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 300
template {
metadata {}
spec {
container {
name = "probe"
image = "docker.io/library/python:3.12-alpine"
resources {
requests = {
cpu = "10m"
memory = "48Mi"
}
limits = {
memory = "96Mi"
}
}
command = ["/bin/sh", "-c", <<-EOT
pip install --quiet --disable-pip-version-check dnspython requests && python3 -c '
import dns.resolver, requests, time, sys
EXPECTED = {"10.0.20.200"}
NAMESERVER = "10.0.20.201" # Technitium LB IP
NAME = "viktorbarzin.me"
PUSHGATEWAY = "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/viktorbarzin-apex-probe"
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = [NAMESERVER]
resolver.timeout = 5
resolver.lifetime = 8
correct = 0
observed = "unknown"
try:
answer = resolver.resolve(NAME, "A")
ips = sorted(str(r) for r in answer)
observed = ",".join(ips)
correct = 1 if set(ips) <= EXPECTED and ips else 0
print(f"apex {NAME} -> {observed} (expected one of {EXPECTED}); correct={correct}")
except Exception as e:
observed = f"error:{type(e).__name__}"
print(f"resolve error: {e}", file=sys.stderr)
metric_lines = [
"# HELP viktorbarzin_apex_correct 1 if viktorbarzin.me apex resolves to expected IP, 0 otherwise",
"# TYPE viktorbarzin_apex_correct gauge",
f"viktorbarzin_apex_correct {correct}",
]
if correct:
metric_lines += [
"# HELP viktorbarzin_apex_last_correct_timestamp Unix time of last correct resolution",
"# TYPE viktorbarzin_apex_last_correct_timestamp gauge",
f"viktorbarzin_apex_last_correct_timestamp {int(time.time())}",
]
metrics = "\n".join(metric_lines) + "\n"
try:
r = requests.post(PUSHGATEWAY, data=metrics, timeout=10)
print(f"pushgateway: {r.status_code}")
except Exception as e:
print(f"pushgateway error: {e}", file=sys.stderr)
sys.exit(0 if correct else 1)
'
EOT
]
}
dns_config {
option {
name = "ndots"
value = "2"
}
}
restart_policy = "OnFailure"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

View file

@ -24,7 +24,10 @@ locals {
TRADING_CORS_ORIGINS = "[\"https://trading.viktorbarzin.me\"]"
TRADING_MEET_KEVIN_POLL_INTERVAL_SECONDS = "10800"
TRADING_MEET_KEVIN_DAILY_COST_CAP_USD = "5"
TRADING_MEET_KEVIN_LLM_MODEL = "anthropic/claude-sonnet-4.5"
# Haiku-4-5 used in v1 because sk-ant-oat01 OAuth quota on Enterprise
# trips a sticky multi-hour 429 on Sonnet after 5-10 burst calls.
# Switch to "claude-sonnet-4-5" if/when the Enterprise quota allows.
TRADING_MEET_KEVIN_LLM_MODEL = "claude-haiku-4-5-20251001"
TRADING_MEET_KEVIN_PROMPT_VERSION = "v1"
}
}
@ -71,7 +74,7 @@ resource "kubernetes_manifest" "external_secret" {
TRADING_ALPHA_VANTAGE_API_KEY = "{{ .alpha_vantage_api_key }}"
TRADING_FMP_API_KEY = "{{ .fmp_api_key }}"
DBAAS_ROOT_PASSWORD = "{{ .dbaas_root_password }}"
TRADING_OPENROUTER_API_KEY = "{{ .openrouter_api_key }}"
TRADING_ANTHROPIC_OAUTH_TOKEN = "{{ .anthropic_oauth_token }}"
TRADING_MEET_KEVIN_CHANNEL_ID = "{{ .meet_kevin_channel_id }}"
}
}
@ -85,7 +88,7 @@ resource "kubernetes_manifest" "external_secret" {
{ secretKey = "alpha_vantage_api_key", remoteRef = { key = "trading-bot", property = "alpha_vantage_api_key" } },
{ secretKey = "fmp_api_key", remoteRef = { key = "trading-bot", property = "fmp_api_key" } },
{ secretKey = "dbaas_root_password", remoteRef = { key = "trading-bot", property = "dbaas_root_password" } },
{ secretKey = "openrouter_api_key", remoteRef = { key = "trading-bot", property = "openrouter_api_key" } },
{ secretKey = "anthropic_oauth_token", remoteRef = { key = "trading-bot", property = "anthropic_oauth_token" } },
{ secretKey = "meet_kevin_channel_id", remoteRef = { key = "trading-bot", property = "meet_kevin_channel_id" } },
]
}
@ -507,16 +510,59 @@ resource "kubernetes_deployment" "trading-bot-workers" {
}
}
}
container {
name = "kevin-signal-bridge"
image = "viktorbarzin/trading-bot-service:latest"
image_pull_policy = "Always"
command = ["python", "-m", "services.kevin_signal_bridge.main"]
dynamic "env" {
for_each = local.common_env
content {
name = env.key
value = env.value
}
}
env {
name = "TRADING_OTEL_METRICS_PORT"
value = "9098"
}
# Kill-switch off in Phase 1 bridge writes audit rows only,
# never publishes to signals:generated.
env {
name = "TRADING_KEVIN_ENABLE_TRADING"
value = "false"
}
env_from {
secret_ref {
name = "trading-bot-secrets"
}
}
env_from {
secret_ref {
name = "trading-bot-db-creds"
}
}
resources {
requests = {
cpu = "10m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
}
}
}
lifecycle {
# DRIFT_WORKAROUND: CI pipeline owns image tags for all 4 worker containers. Reviewed 2026-05-22.
# DRIFT_WORKAROUND: CI pipeline owns image tags for all 5 worker containers. Reviewed 2026-05-24.
ignore_changes = [
spec[0].template[0].spec[0].container[0].image,
spec[0].template[0].spec[0].container[1].image,
spec[0].template[0].spec[0].container[2].image,
spec[0].template[0].spec[0].container[3].image,
spec[0].template[0].spec[0].container[4].image,
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
]
}

View file

@ -688,6 +688,14 @@ resource "kubernetes_config_map" "auth_proxy_config" {
server {
listen 9000;
# Browsers accumulate one authentik_proxy_<random> cookie per Authentik
# Proxy Provider on the parent domain. With 30+ services under
# viktorbarzin.me the combined Cookie header exceeds nginx's default
# 4 x 8k large_client_header_buffers and trips "Too big request header"
# (431). Bump to 8 x 64k so the auth check accepts the pile.
client_header_buffer_size 8k;
large_client_header_buffers 8 64k;
location /outpost.goauthentik.io/auth/traefik {
proxy_pass http://authentik;
proxy_connect_timeout 3s;

View file

@ -226,11 +226,11 @@ resource "kubernetes_deployment" "shlink" {
# }
resources {
limits = {
memory = "960Mi"
memory = "512Mi"
}
requests = {
cpu = "25m"
memory = "960Mi"
memory = "512Mi"
}
}
port {

View file

@ -91,10 +91,6 @@ resource "kubernetes_deployment" "xray" {
image = "teddysun/xray"
name = "xray"
image_pull_policy = "IfNotPresent"
port {
container_port = 6443 // vless
protocol = "TCP"
}
port {
container_port = 7443 // reality
protocol = "TCP"
@ -174,19 +170,16 @@ resource "kubernetes_service" "xray" {
app = "xray"
}
port {
name = "vless"
port = 6443
protocol = "TCP"
name = "websocket"
port = 8443
target_port = 8443
protocol = "TCP"
}
port {
name = "websocket"
port = 8443
protocol = "TCP"
}
port {
name = "grpc"
port = 9443
protocol = "TCP"
name = "grpc"
port = 9443
target_port = 9443
protocol = "TCP"
}
}
}
@ -249,16 +242,3 @@ module "ingress_grpc" {
}
}
module "ingress_vless" {
source = "../../../../modules/kubernetes/ingress_factory"
# VPN protocol (VLESS) native xray clients, not browsers.
# auth = "none": VPN protocol (VLESS) native xray clients, not browsers; forward-auth incompatible.
auth = "none"
dns_type = "proxied"
namespace = kubernetes_namespace.xray.metadata[0].name
name = "xray-vless"
service_name = "xray"
host = "xray-vless"
port = 6443
tls_secret_name = var.tls_secret_name
}

File diff suppressed because one or more lines are too long