infra

Author	SHA1	Message	Date
Viktor Barzin	7a649ce7eb	crowdsec: pin image to v1.7.8 + remove ENROLL_KEY, CAPI restored Root cause of today's CAPI 403 crashloop: chart 0.21.0 pins appVersion to v1.7.3, but Keel had auto-bumped the running pods to v1.7.8 on 2026-05-16 and they ran fine with CAPI for 8 days. Today's TF apply (`b59acbc1` agent memory bump) re-rendered the deployment from chart defaults, reverting the image to v1.7.3 — and v1.7.3 has a CAPI watcher-auth bug against the current api.crowdsec.net behaviour, so every fresh replica started 403'ing on startup. Fix: set `image.tag: "v1.7.8"` in values.yaml so the image survives future TF applies independently of the chart's appVersion. Verified CAPI auth succeeds on all 3 fresh pods with v1.7.8. Also dropped the ENROLL_KEY env block — the existing key `cmey5e636…` is single-shot and was already consumed by the first replica; subsequent pods hit 403 on `cscli console enroll`. CAPI works WITHOUT console enrollment (separate flows). Re-enable console reporting by generating a fresh enroll key at app.crowdsec.net (procedure documented in the values.yaml comment block). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 11:11:29 +00:00
Viktor Barzin	f55eaae682	docs/backup-dr: document /srv/nfs/anca-elements offsite-sync exclusion	2026-05-24 11:03:50 +00:00
Viktor Barzin	05f047f290	offsite-sync-backup + nfs-change-tracker: exclude /srv/nfs/anca-elements The 771G under /srv/nfs/anca-elements is a downstream replica synced FROM Synology (/volume1/Backup/Anca/Elements) by anca-elements-sync.sh. The offsite-sync pipeline was copying it back to Synology under /volume1/Backup/Viki/nfs/anca-elements, creating a self-duplicate (~122G already partially copied during the last monthly full sync). - nfs-change-tracker.service: drop anca-elements/ from inotify watch (incremental syncs no longer queue these paths) - offsite-sync-backup.sh: --exclude='anca-elements/' on the monthly full rsync; grep -v on the incremental files-from list Deployed to 192.168.1.127:/usr/local/bin/offsite-sync-backup + /etc/systemd/system/nfs-change-tracker.service; service reloaded.	2026-05-24 11:03:09 +00:00
Viktor Barzin	41786b0fca	crowdsec: DISABLE_ONLINE_API=true — break the recurring 403 crashloop CAPI auth at api.crowdsec.net is rejecting watcher logins from inside the cluster within ~1h of registration, even after rotating creds via `cscli capi register`. The same login successfully authenticates from devvm but fails from cluster pods → IP-throttle or account-state issue at the central API. Until that's resolved with CrowdSec support (or the throttle window resets), running with CAPI on is just chronic crashloops on every fresh replica. `DISABLE_ONLINE_API=true` makes the chart entrypoint `conf_set 'del(.api.server.online_client)'`, removing the online_client block entirely. Pods skip CAPI auth, no 403, no crashloop. Trade-off: no community blocklists. Local scenarios + bouncers continue unchanged. Side-effect of disabling CAPI in this chart (v0.21.0) — `role.yaml` is gated on `IsOnlineAPIDisabled=false` while `cscli-lapi-register-job` is gated on `StoreLAPICscliCredentialsInSecret=true` (orthogonal). So the hook runs without the Role it needs, and atomic apply rolls back. Mitigation: pre-created the `crowdsec-lapi-cscli-credentials` Secret manually (the hook short-circuits when the secret already exists) and re-applied the missing Role for future re-enablement. Re-enable path documented in the comment block. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 10:31:03 +00:00
Viktor Barzin	1f6facc8e4	Merge forgejo/master — reconcile 18-day divergence with origin Origin and forgejo had drifted since 2026-05-05 (merge base `b45c45e4`). Each remote was receiving Viktor's commits independently — origin since 2026-05-23 and forgejo from 2026-05-06 to 2026-05-22 14:15. Both had ~30 substantive commits. This merge brings forgejo's work into the local branch. 13 conflict files resolved as follows (all favoured HEAD = origin/local, which is newer in every case): - secrets/{fullchain,privkey}.pem — kept HEAD (renewed 2026-05-24, vs forgejo's 2026-05-17 renewal) - stacks/blog/main.tf — kept HEAD (ingress-www intentionally removed today after DNS+monitor cleanup; forgejo had the old block) - stacks/xray/modules/xray/main.tf — kept HEAD (vless dropped today as dead ingress; forgejo had the old 3-port service) - stacks/k8s-version-upgrade/scripts/upgrade-step.sh — kept HEAD (allowlist refactor, master-phase idempotency skip, tigera-operator quiesce/restore, IngressTTFBCritical ignore — all newer than forgejo) - stacks/k8s-version-upgrade/main.tf — kept HEAD (deployments/scale RBAC, oldest-kubelet detection — both added 2026-05-23) - scripts/update_k8s.sh — kept HEAD (--etcd-upgrade=false fallback) - stacks/llama-cpp/main.tf — kept HEAD (KEEL_LIFECYCLE_V1 ignore_changes block added today, commit `0b1282a1`) - stacks/openclaw/main.tf — kept HEAD (nim/meta/llama-3.1-70b primary) - stacks/trading-bot/main.tf — kept HEAD (claude-haiku-4-5 pin + kevin-signal-bridge container) - stacks/postiz/modules/postiz/main.tf — kept HEAD (memory 2Gi/3Gi bump, despite postiz being destroyed today — kept TF intent) - stacks/nvidia/modules/nvidia/values.yaml — kept HEAD (mem 822Mi) - stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl — kept HEAD (richer alert list + raised StatefulSet `for: 3m`) - stacks/kyverno/modules/kyverno/security-policies.tf — kept HEAD (expanded registry allowlist + comments) - docs/architecture/security.md — kept HEAD (detailed W1.7 analysis) - docs/plans/2026-05-21-ha-control-plane-design.md — kept HEAD (178-line superset incl. 2026-05-23 deferral rationale) Auto-merged (no conflict): broker-sync, claude-agent-service, cloudflared, mailserver, n8n, technitium, traefik, url, proxmox-csi, xray (deployment portion). Brings in forgejo-only substantive commits: fire-planner, openclaw v3 flow + recruiter-responder wiring, several k8s-version-upgrade hardening passes (kill-switch, RecentNodeReboot ignore, pipefail fixes), HA control plane design, security wave 1 expansion to tier 3+4, alloy file-tail switch, prometheus scrape 2m, authentik replica cut, forgejo archive disable. Meta: forgejo and origin drift is a coordination bug. Going forward we need to either (a) have one CI mirror to the other, or (b) standardize on one remote. Filed mentally; not addressed in this commit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 09:41:36 +00:00
Viktor Barzin	0b1282a13c	llama-cpp: ignore_changes for keel/k8s-managed annotations Every `tg apply` was reverting the annotations that keel patches when it detects an upstream digest change — `keel.sh/match-tag` (Kyverno-stamped), `keel.sh/update-time` (on the pod template; what actually triggers the rollout), plus the K8s-managed `kubernetes.io/change-cause` and `deployment.kubernetes.io/revision`. The revert forced a rollout, then the next keel poll re-stamped the annotations, forcing another. With llama-swap's ~10s cold-load on each pod recreate the user noticed. Upstream `ghcr.io/mostlygeek/llama-swap:cuda` is a moving nightly tag — keel still drives one legitimate rollout per day at ~07:25 UTC; this patch stops the apply-driven extra rollouts on top of that. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 09:01:17 +00:00
Viktor Barzin	67f8be4598	trading-bot: add kevin_signal_bridge container (kill-switch OFF for Phase 1) 5th worker container running in audit-only mode. Writes kevin_signal_bridge_state rows showing what it WOULD trade but never publishes to signals:generated. Kill-switch flipped in Phase 2.	2026-05-24 01:22:53 +00:00
Viktor Barzin	6218868ea5	xray: drop dead vless ingress + pin Service target_port The xray-vless ingress, Service port 6443, and container port 6443 had no backing listener — xray.config.json only binds 7443 (REALITY), 8443 (WS) and 9443 (XHTTP). The "xray-vless" hostname was returning 502 since the module was created. Side effect: removing the first Service port slot ("vless"/6443) caused the kubernetes provider to shift targetPort values on the remaining two ports (defaulting only worked at create time, not on port removal). Pinning target_port explicitly makes Service routing deterministic. End-to-end verified: REALITY via public IP:8080 (pfSense forward 8080 -> 10.0.20.200:7443), WS via Cloudflare, XHTTP via Cloudflare — all three transports proxied successfully through a test pod, egress IP correctly resolves to the home WAN.	2026-05-24 01:13:54 +00:00
Viktor Barzin	ae874e028d	postiz: bump memory request 512Mi → 2Gi, limit 4Gi → 3Gi (right-size for next deploy) krr 2026-05-22 flagged postiz-app as critically under-requested when it was running (gap 2.2 GiB above the 512Mi request). Postiz is currently uninstalled in the cluster — this change is only for when the stack is re-deployed later. No apply triggered now. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:11:25 +00:00
Viktor Barzin	b59acbc1db	crowdsec/agent: bump memory request 64Mi → 128Mi krr 2026-05-22 flagged crowdsec-agent DaemonSet (4 pods) as under-requested by ~588 MiB across the cluster. Live usage around the 80-128 MiB mark for active log parsing — 64 MiB request risked eviction ahead of more-needed pods. Limit stays at 512 MiB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:11:16 +00:00
Viktor Barzin	7108843b38	nvidia/driver-daemonset: bump memory request 256Mi → 822Mi krr 2026-05-22 flagged nvidia-driver-daemonset as critically under-requested (~566 MiB gap). Live driver process holds ~600-800Mi once the kernel module is loaded. Limit stays at 2Gi so the DKMS build during a kernel upgrade still has headroom (documented in values.yaml to need ~1.4 GiB peak). May help unblock code-8vr0 (GPU driver crashloop on node1) if the crashloop was OOM-driven. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:11:06 +00:00
Viktor Barzin	2711d4af05	monitoring/loki: bump memory request 2Gi → 3Gi (close gap to 4Gi limit) krr 2026-05-22 flagged loki as under-requested by 1.9 GiB. Live working set is sitting at ~3 GiB during normal ingestion; the existing 2 GiB request meant scheduler didn't reserve enough room and the pod risked eviction. Limit stays at 4 GiB (documented ceiling in loki.yaml). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:55 +00:00
Viktor Barzin	c77984a713	proxmox-csi/node: bump memory request 64Mi → 1Gi (LUKS unlock reservation) The CSI node plugin's LUKS2 Argon2id key derivation peaks at ~1 GiB during unlock (memory id=712 + already-documented in the limits=1280Mi). Request was 64 MiB — meaning the unlock burst ran "best-effort", first in line for OOM under node pressure. krr 2026-05-22 flagged this as a top under-request. Bumping request matches the documented requirement. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:44 +00:00
Viktor Barzin	467460cccd	k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical because of NAS-side latency that is unrelated to k8s upgrades. The chain was halting indefinitely waiting for it to clear. Add it alongside RecentNodeReboot to the per-call ignore regex so the chain can proceed autonomously without manual silences. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:44 +00:00
Viktor Barzin	447bfef507	blog: remove www.viktorbarzin.me ingress The www subdomain was internal-only (no Cloudflare DNS record) but the external uptime-kuma monitor still flagged it as down because public DNS resolution failed. Removing the ingress along with the Technitium CNAME makes the failure mode disappear and lets the cluster reach an autonomous-clean state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:44 +00:00
root	10ac174627	Woodpecker CI Update TLS Certificates Commit	2026-05-24 00:03:48 +00:00
Viktor Barzin	b4aa8eaf58	technitium: cut memory — primary 2Gi → 1Gi, secondary+tertiary 2Gi → 512Mi Right-sizing per krr report (2026-05-22). Zone data is ~43 MiB; the rest was cache headroom. Primary keeps more (1 GiB) since it owns authoritative zones; replicas get 512 MiB. DNS sanity-checked across CoreDNS and the MetalLB external IP (10.0.20.201) post-rollout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:03:51 +00:00
Viktor Barzin	931d7b6c9d	claude-agent-service: cut memory request 2Gi → 1Gi (limit 4Gi → 2Gi) Right-sizing per krr report (2026-05-22). Kept Burstable QoS (limit > request) so an active agent run still has 2 GiB headroom — krr's 100 MiB recommendation was measured idle and is not safe for an active job. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:03:42 +00:00
Viktor Barzin	d76f4c4827	n8n: cut memory request 1Gi → 512Mi (+ image bump 1.80.0 → 1.80.5) Right-sizing per krr report (2026-05-22). Image bump syncs main.tf with the live Keel-managed version to avoid an inadvertent downgrade on apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:03:28 +00:00
Viktor Barzin	17c1ef73be	url/shlink: cut memory request 960Mi → 512Mi Right-sizing per krr report (2026-05-22, memory id=2431-2438). Live pod working set is ~80 MiB; 512Mi leaves comfortable headroom for the Symfony+RoadRunner footprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:02:45 +00:00
Viktor Barzin	02ea5da8dc	k8s-version-upgrade: skip phase_master/phase_worker if node already on target The chain wasn't idempotent — re-running on a partially-upgraded cluster would re-drain + re-kubeadm + re-apt an already-upgraded node, causing unnecessary disruption (5-10 min per no-op node) and risking alert re-fires during the unnecessary drain. Today's chain hit this twice: after fixing the version-detection bug (commit `a0f3e155`), the chain correctly resumed but re-did master AND node4 even though both were already on v1.34.8. node4 got cordoned, drained, and is now soaking for 10 min for no reason. Fix: at the top of phase_master and phase_worker, read the node's current kubelet version. If it equals TARGET_VERSION, skip the whole phase (return 0 — spawn_next will fire downstream). Chain advances without disturbing the already-upgraded node. In-flight effect: the current node4 worker pod has the old script mounted from configmap snapshot, so it'll continue. If it fails and retries, the new pod will see node4 on v1.34.8 and short-circuit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 09:53:57 +00:00
Viktor Barzin	a0f3e15562	k8s-version-upgrade: version-check uses oldest kubelet, not master Previous version-check read RUNNING from .items[0].nodeInfo.kubeletVersion — which is just k8s-master. If master is upgraded but workers aren't (e.g. a chain that completed master phase but failed mid-worker), the version-check sees v1.34.8 and decides "no upgrade needed", never spawning the resume phase. Workers stay behind forever. Today's chain hit exactly this: master + node4 upgraded to v1.34.8, worker-node4 Failed mid-soak (alert sensitivity, since loosened), chain dead. Re-triggering the version-check looked at master only, decided cluster was "done", and refused to resume worker chain. Fix: read all node kubelet versions, sort -V, take head -1 (oldest). A partial chain now correctly reports the un-upgraded version and the chain resumes. Trivial change; tested live — chain now correctly reports v1.34.7 (workers' version) and spawns preflight → master → worker chain. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 09:48:50 +00:00
Viktor Barzin	68f8514e61	monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression) Earlier in this session, commit `503ac4c1` brought the for: from 5m → 2m based on a brief I wrote inaccurately. The brief said the alert "fires immediately" but it was actually already at 5m. The subagent followed the explicit "2m" target and tightened it — opposite of what we wanted. 10m is the right value for our chain: a full drain + kubeadm + apt + kubelet restart + uncordon cycle can take a worker out of MetalLB rotation for 5-7 min in the worst case (PDB stickiness on some pods). 10m suppresses upgrade-induced blips while still catching real speaker-down conditions. node4 worker phase tripped this alert mid-soak today, aborted the chain (Job retry), succeeded on the 2nd attempt only because alerts didn't re-fire fast enough. With 10m the next workers shouldn't need the retry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 09:32:41 +00:00
Viktor Barzin	503ac4c192	monitoring: tune 4 alerts for transient drain/upgrade blips Today's worker-phase rolling upgrade tripped MysqlStandaloneDown, MetalLBSpeakerDown, KubeletRunningContainersDrop, and IngressErrorRate5xxHigh even though every affected workload recovered within 30-60s. Loosen `for:` (and one threshold) on each so they only fire on persistent faults, not on routine drain+kubelet-restart cycles. - MysqlStandaloneDown: for 2m -> 3m (single-replica StatefulSet, drain re-scheduling routinely takes 1-3m). - MetalLBSpeakerDown: for 5m -> 2m (kubelet restart drops the speaker pod for 30-45s; 2m suppresses that blip). - KubeletRunningContainersDrop: absolute `< -10` threshold replaced with relative `< -0.5` (>50% drop vs. 10m ago); routine drains routinely shed 10-30 containers and tripped the old rule. - IngressErrorRate5xxHigh: for 5m -> 10m (rolling pod migrations cause brief 5xx spikes that clear in 1-2m). Severity, labels, and annotation structure preserved; only `for:` durations and the one expression changed. Tactical loosening of four specific alerts -- broader observability audit tracked separately in beads. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 09:28:53 +00:00
Viktor Barzin	ad9f6c8f41	k8s-version-upgrade: halt_on_alert allowlist (severity=critical only) Refactored halt_on_alert_query from denylist ("ignore these noisy alerts") to an allowlist ("only halt on severity=critical"). Today's blocking alerts were all warning/info-level and not actual upgrade blockers: - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing) - IngressTTFBHigh (Traefik latency, transient) - NodeHighIOWait (chicken-and-egg with our own upgrade I/O) - RecentNodeReboot (chain causes this itself) severity=critical filtering is more robust than maintaining a denylist of every noisy alert that crops up. extra_ignore parameter kept for backwards compatibility but is rarely needed now (critical alerts are the only ones that should actually halt the chain). Tested end-to-end this session — master successfully upgraded to v1.34.8 via the autonomous chain after the apiserver state-repair (apiserver manifest had been pinned at v1.34.2 from a previous month's rollback; required a one-time manual edit + kubelet reload to bring back to v1.34.7, after which the chain ran cleanly). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 09:14:39 +00:00
Viktor Barzin	0025511b6a	docs: Technitium DNS IP — 10.0.20.101 → 10.0.20.201 Stragglers from the same drift as commit b288a59 (monorepo) / the 2026-05-22 viktorbarzin.me apex incident — the `.101` references were left over from the NodePort exposure era. Technitium's actual MetalLB LB IP is `.201` (in pool 10.0.20.200-220). - architecture/vpn.md — Technitium component cell + AdGuard forwarder example + nslookup troubleshooting hint - architecture/networking.md — 502 ingress troubleshooting snippet - plans/2026-02-22-talos-linux-migration-evaluation.md — nameservers example	2026-05-23 08:53:52 +00:00
Viktor Barzin	68a503e29f	kyverno: allowlist woodpeckerci/* for CI step pods Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker pipelines have been failing at the git step since the Audit→Enforce flip on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes not building). Added between viren070/* and zelest/* in the same DockerHub-user-repos block as the 2026-05-22 batch (commit `2d35d72a`). Closes: code-da4h	2026-05-23 08:52:48 +00:00
Viktor Barzin	000d306542	technitium: add viktorbarzin.me apex DNS drift probe + alerts Every internal .viktorbarzin.me hostname (~80 services) chains through the split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP rollover, accidental edit), every internal service breaks at once — the 2026-05-22 ha-sofia incident was exactly this. This adds a backstop probe so the next drift surfaces in <10 min instead of via user-reported outage: - CronJob `viktorbarzin-apex-probe` in `technitium` namespace, every 5 min, resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201) and pushes `viktorbarzin_apex_correct` + `_last_correct_timestamp` to Pushgateway. Python+dnspython, ~30 LOC. - 3 Prometheus alerts: - `ViktorBarzinApexDrift` (critical, 10m) — apex resolved to anything other than 10.0.20.200. - `ViktorBarzinApexProbeStale` (warning, 5m on 15m gap) — probe stopped succeeding. - `ViktorBarzinApexProbeNeverRun` (warning, 30m absent) — probe never reported. - Added the new alert names to the Slack receiver matcher in both routes alongside EmailRoundtrip. Verified: rules loaded as inactive (apex is correct), metric flowing, manual probe job pass observed.	2026-05-23 08:41:14 +00:00
Viktor Barzin	4713c3a6d9	k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore Three changes unblocking the autonomous chain for k8s patch upgrades: 1. phase_master quiesces tigera-operator before drain, restores after. Tigera crashes immediately if apiserver is unreachable (no retry logic) and crashlooping it during master static-pod swaps generates ~500MB/s disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its limit. Quiesce removes the storm contributor; calico data plane keeps running unchanged (data plane is the DaemonSet+Typha, operator is just the reconciler). 2. update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt. For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm writes an identical manifest, hash doesn't update, watch times out and rolls back forever. The skip-etcd retry sidesteps it for the legitimate no-change case while still doing a full etcd upgrade on the first attempt (correct for minor-version bumps). 3. halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait. Both are symptoms-not-causes: ingress latency spikes briefly during any pod-restart wave; high IOwait is exactly what upgrade activity causes (chicken-and-egg). The inline quiet-baseline check (Ready transition <10min) is the real cluster-churn gate. RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale subresource so the chain can do the scale-to-0/back-to-1 on tigera. These three together get the chain past the cascade that's been blocking 1.34.7→1.34.8 for a week. Long-term fix is still HA control plane (beads code-n0ow); these are the bridge. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 08:40:11 +00:00
Viktor Barzin	6f4a569d1c	traefik: bump auth-proxy nginx header buffers to handle Authentik cookie pile Browsers accumulate one authentik_proxy_<random> cookie per Authentik Proxy Provider under viktorbarzin.me (Path=/). With 30+ services the combined Cookie header exceeds nginx's default 4 x 8k large_client_header_buffers and trips '431 Request Header Fields Too Large' at the forward-auth nginx (traefik/auth-proxy). Bumped to: client_header_buffer_size 8k large_client_header_buffers 8 64k Matches the pattern used on the London Flint 2 router nginx (memory id=647).	2026-05-23 08:34:33 +00:00
Viktor Barzin	7f63d35d0a	docs/plans: HA control plane — design + plan + deferral Investigated, designed, and planned the 3-master HA control plane migration triggered by 2026-05-21's autonomous k8s upgrade cascade. Locked 14 design decisions across two passes: - 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc) - 4 challenger-pass amendments (cloud-init template bump, rbac stack multi-master refactor, HTTPS /readyz health check, expanded blast radius to include /home/wizard/code/infra/config root kubeconfig, config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector, k8s-version-upgrade chain extension as Phase 7) Plan covers 11 phases end-to-end including panic-mode rollback. DEFERRED before execution. PVE host is 98% RAM-committed (262 GB allocated / 267 GB physical, 1.5 GB swap active); the planned 3 x 32 GB masters would push allocation to 326 GB and OOM the host. k8s-master currently uses only 4.6 GB of its 32 GB allocation (5-6x oversized). Revisit triggers documented in design doc: 1. Second PVE host added → hardware HA becomes possible. 2. Right-sizing pass OR planning masters at 16 GB each. 3. Cumulative manual upgrade nursing > ~10h. Standalone candidate worth lifting independently: Phase 1.5's rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning to loop over k8s_master_hosts list) — future-proofs the cluster without committing to the HA migration. Refs: code-n0ow (open, deferred via bd note). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 08:32:15 +00:00
Viktor Barzin	70a334e431	trading-bot: pin Meet Kevin LLM model to claude-haiku-4-5 Sonnet-4-5 trips Anthropic per-account rate_limit_error on the OAuth bearer (sk-ant-oat01) tokens after 5-10 burst calls — sticky multi-hour quota. Haiku-4-5 has much higher RPM and processes the 16-video backfill cleanly (~30s/video with inter-call throttle). Comment above the env line documents the rationale for future re-evaluation.	2026-05-22 20:43:05 +00:00
Viktor Barzin	5258f09230	mailserver: decommission SendGrid Remove leftover SendGrid references after the Brevo migration was completed: - Delete TF `cloudflare_record.mail_domainkey` (TXT at `s1._domainkey`, SendGrid-era DKIM, hidden behind the SendGrid CNAME but would re-emerge once the CNAME is removed). - Clean up commented-out `smtp.sendgrid.net` relayhost references and the `# For sendgrid` comment on `sasl_passwd` in the mailserver module. DNS records deleted out-of-band (not TF-managed): - CF: `s1._domainkey CNAME` + `s2._domainkey CNAME` → sendgrid.net (manual entries) - Technitium internal `viktorbarzin.me`: `em7107`, `s1._domainkey`, `s2._domainkey` CNAMEs → sendgrid.net Verified end-to-end mail flow unaffected (Brevo outbound + IMAP receive, roundtrip 20.4s — identical to baseline). Active DKIM (`mail._domainkey` local + `brevo1/brevo2._domainkey` Brevo) untouched.	2026-05-22 20:08:38 +00:00
Viktor Barzin	b233aba710	openclaw: switch primary to nim/meta/llama-3.1-70b-instruct Auth audit on 2026-05-22 — all the broken paths and the one that works: - openai-codex OAuth: EXPIRED (ChatGPT Plus, ancaelena98@gmail.com) - secret/openclaw → openai_api_key (sk-svcacct): insufficient_quota - openrouter_api_key: "Key limit exceeded (total limit)" - llama_api_key: region-blocked - anthropic_api_key: sk-ant-oat-… (OAuth refresh token, not a real x-api-key — won't auth via x-api-key header) - nvidia_api_key (NIM): WORKS. The key was already baked into the openclaw.json providers.nim.apiKey from secret/openclaw → nvidia_api_key. Two NIM models verified end-to-end (call from inside openclaw pod with tool-call schema, both returned proper {tool_calls:[…]} JSON): - meta/llama-3.1-70b-instruct — 0.58s, primary - meta/llama-4-maverick-17b-128e — 16s, smarter, fallback Fallback chain: maverick → openai-codex (auto-promotes once re-authed) → modelrelay/auto-fastest (last resort, hallucinates instead of tool-calling, but at least responds). Models registered in both `agents.defaults.models` (allowlist) and `models.providers.nim.models` (capability declarations) so the agent sees them as available tools. Startup `models set` updated to pin the new primary across `doctor --fix` runs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 15:23:17 +00:00
Viktor Barzin	3962513036	security(wave1): W1.7 analysis snapshot — observation data → allowlist plan First analysis pass over Calico GNP wave1-egress-observe-tier34 data captured in Loki since 2026-05-19. Pulled ~10000 flow log lines covering 36 source namespaces (of 82 selected by tier 3+4). Analysis script outputs preserved on the dev host at /tmp/{analyze_flows2,build_allowlist}.py. ## Findings Universal baseline (every observed ns): - DNS to kube-system/kube-dns UDP/53 - Often mysql.dbaas TCP/3306 or pg.dbaas TCP/5432 - Often redis.redis TCP/6379 Rollout tiering by egress fan-out: - Tier A (recruiter-responder only): 2 destinations, ideal pilot - Tier B (29 namespaces): ≤3 external IPs, ≤5 internal — batch rollout - Tier C (4 namespaces: f1-stream/openclaw/woodpecker/status-page): needs per-IP investigation - Tier D (servarr): 130+ external IPs (BitTorrent P2P) — keep Log+Allow permanently or move to dedicated egress proxy ## Caveats blocking immediate enforce - Observation horizon too short: ~6h dense data, ~24h total. Need ≥7 days to catch weekly CronJobs, Vault token rotations, Keel pulls. - External IPs are dynamic (Cloudflare/AWS rotate). Static IP allowlists will break — need DNS-based selectors or CIDR ranges. - Some intra-namespace traffic bypasses the Calico filter chain. ## Recommended next steps 1. Continue observation through 2026-05-29 (full week). Compare destination set day-over-day; if stable, allowlist is ready. 2. First enforce: recruiter-responder (allowlist = kube-dns + telegram CIDR + vault/ESO service IPs). 3. Tier B phased rollout at 3-5 ns/day after pilot proves out. Full analysis: docs/architecture/wave1-egress-observation-2026-05-22.md Tracked under beads code-8ywc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 15:22:25 +00:00
Viktor Barzin	2d35d72a53	kyverno(wave1): add 7 missing registries to trusted-registries allowlist Discovered via W1.5 enforcement when querying live cluster state: PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook, hermes-agent, netbox, whisper/piper) trying to admit images from registries not in the original enumeration. Added entries: - amruthpillai/* (resume — reactive-resume) - athomasson2/* (ebook2audiobook) - netboxcommunity/* (netbox) - nousresearch/* (hermes-agent) - opentripplanner/* (osm-routing) - rhasspy/* (whisper, piper) - registry.viktorbarzin.me/* (legacy private registry — council-complaints still references; should migrate to forgejo) The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07 per CLAUDE.md but council-complaints still uses it — separate cleanup task. ## Verification - kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha, same as 2026-05-18 inject-keel-annotations) - Dry-run admission of previously-blocked images now PASS: - netboxcommunity/netbox:v4.5.0-beta1 ✓ - rhasspy/wyoming-whisper:3.1.0 ✓ - registry.viktorbarzin.me/council-complaints:1c56f8f ✓ - Policy still in Enforce mode ## Observation status (W1.6) - Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected - Loki `{job="node-journal"} \|~ "calico-packet"` returns ~5000 lines/hour - No errors from observation infrastructure Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 15:17:16 +00:00
Viktor Barzin	c11ac7d486	cnpg: bump webhook-cert renewal threshold 7d -> 30d Root cause of the recurring 'cnpg-webhook-cert' TLS expiry warn: CNPG default 'expiringCheckThreshold = 7' means the operator only regenerates the self-signed webhook cert when remaining lifetime drops BELOW 7 days. Our cluster-health check #22 alerts at <30d. Result: ~23 days of WARN before CNPG would even attempt rotation. Set EXPIRING_CHECK_THRESHOLD=30 via the chart's config.data map so the operator now regenerates with 30d buffer, aligning with our monitoring threshold. Cert lifetime stays at chart default 90d. Verified after apply: operator runtime config shows 'expiringCheckThreshold:30'. Companion in-session action: deleted the existing soon-to-expire secret and bounced the operator to force an immediate fresh 90-day cert (notBefore=May 22, notAfter=Aug 20).	2026-05-22 15:00:41 +00:00
Viktor Barzin	96f9db0b13	state(cnpg): update encrypted state	2026-05-22 15:00:04 +00:00
Viktor Barzin	6367b783c7	broker-sync(imap): fix command name + add fsGroup for sync.db writes Two latent issues found while diagnosing why the May 2026 META vest didn't land: 1. broker-sync-imap CronJob's command was 'broker-sync imap', but the actual CLI subcommand is 'imap-ingest'. Every scheduled run had been failing with 'No such command imap' since day one. 2. Pod runs as uid=10001 gid=999; PVC /data dir is mode 2775 group=10001. Without fsGroup in the pod's securityContext the pod gets only 'other' (r-x) perms on the dir, so sqlite3 can't create journal/WAL files next to sync.db -- hits 'attempt to write a readonly database'. fsGroup=10001 adds the matching gid to the pod's supplemental groups so writes work. Schwab email-sender regex fix is in broker-sync@d860aef.	2026-05-22 14:41:54 +00:00
Viktor Barzin	fa536cc08b	ci: retry after Keel rollout cascade settled	2026-05-22 14:41:54 +00:00
Viktor Barzin	a3bcb5e12f	fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard Operational layer for the new col_snapshot cache shipped in fire-planner@e72fd22: stacks/fire-planner: - fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows age toward the 1-year TTL boundary (within 7 days). Calls python -m fire_planner col-refresh-stale, upserts via cache.upsert. monitoring/dashboards/cost-of-living.json (Finance folder): - Two template variables: $city (single-select from col_snapshot), $baseline_city (for COL ratio computation, defaults London). - Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded). - All-cities ranked table with gradient-gauged total + colored ratio. - Cache-freshness table flags rows approaching TTL expiry. Initial population needs a one-shot: post-Keel-rollout, kubectl -n fire-planner exec deploy/fire-planner -- \ python -m fire_planner col-seed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	d4c76a07a2	openclaw: revert model swap + document codex re-auth path The previous commit promoted modelrelay/auto-fastest to primary as a workaround for the expired openai-codex OAuth token. But modelrelay routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash) that hallucinate answers instead of using ssh / curl / etc. — exactly what the v4 learning loop is supposed to leverage. Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the only mini variant the Codex backend accepts for ChatGPT Plus tier), and inline the re-auth command in the model-block comment so future sessions know exactly what to do when the OAuth token expires: kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \ -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \ -c openclaw -- node /app/openclaw.mjs models auth login \ --provider openai-codex modelrelay/auto-fastest stays in the fallback chain so the agent remains partially usable while the token is expired. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	6457aa6d8f	cluster-health skill: document tightened #43 thermal threshold (65 C)	2026-05-22 14:17:01 +00:00
Viktor Barzin	6950b8f197	cluster-health #43: tighten PVE thermal threshold to 65 C Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a signal a VM/workload is using too much CPU and warrants investigation. Previous thresholds were calibrated to the hardware's TjMax (75/83 C) — that was too lax, since cluster-load-driven elevation arrives a long time before throttling. The 65 C cutoff matches the live Prometheus baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the session-observed correlation: above 65 C means the cluster is doing sustained work that should be looked at, even if hardware is still nowhere near its limit. Updated thresholds: PASS < 65 C (within 55-65 baseline); WARN 65-82 C (elevated; check top kvm processes for the culprit); FAIL >= 83 C (at/above TjMax — throttling imminent). Verified live: 67 C now WARN (was PASS under the 75 C threshold).	2026-05-22 14:17:01 +00:00
Viktor Barzin	dbb3dc04d3	openclaw: engrain the learning loop at the identity level User feedback: "this should work for any task, not just calendar. this learning flow must be strongly engrained to ensure openclaw gets better over time." The v3 rules were buried at the bottom of TOOLS.md and only stated in workflow language. Three changes to make the rule unavoidable: 1. SOUL.md — new marker-delimited section "Learning is your identity" inserted before ## Boundaries. AGENTS.md tells the agent to read SOUL.md first every session, so this is now the FIRST thing the agent loads about itself. Frames learning as character, not procedure. 2. TOOLS.md v4 — section moved from the END of the file to right after the `# TOOLS.md` title (first substantive content on file load). Title strengthened: "THE FLOW — run this on EVERY task. Not just hard ones." Concrete examples explicitly call out diverse domains (calendar, frigate restart, disk usage, inbox summary, deploys) so the universality is unmistakable. 3. learn-from-tasks skill — opens with "This is universal. EVERY task runs through this flow — not just hard ones, not just unfamiliar ones. The save at the end is mandatory." The actual flow (know → ask devvm → save) is unchanged. What changed is salience: the rule is now the first thing the agent encounters in three independent surfaces, with stronger framing that makes "skipping the save" feel like a violation of identity rather than a missed optimisation. Marker bumped v3 → v4. Stripper handles v1-v9 idempotently. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	854817e2e3	trading-bot: revive K8s stack + add meet-kevin-watcher Uncomment the trading-bot stack (disabled 2026-04-06 due to resource consumption) and add the new meet_kevin_watcher service container. Changes: - Uncomment the /* ... */ block enclosing the entire stack - Fix db_init job: add -d postgres to psql commands (root user has no root-named database — matches pattern used in claude-memory + others) - Remove 3 disabled containers from trading-bot-workers Pod spec: news-fetcher, sentiment-analyzer, trade-executor - Add new meet-kevin-watcher container (image viktorbarzin/trading-bot-service:latest, command python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi) - Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault secret/trading-bot) - Add 4 common_env entries for the Meet Kevin pipeline (poll interval, daily cost cap, model slug, prompt version) - Update lifecycle.ignore_changes to 4 image indices vault: re-enable pg-trading static role - Add pg-trading to vault_database_secret_backend_connection allowed_roles - Uncomment vault_database_secret_backend_static_role.pg_trading (was disabled 2026-04-06 with the rest of trading-bot stack) kyverno: add postgres to trusted-registries allowlist - trading-bot db_init uses postgres:16-alpine (Docker Hub library image) - postgres* was not in the DockerHub bare-name allowlist (unlike mysql, alpine, nginx, python which were already there) Final workers Pod containers (in order): [0] signal-generator [1] learning-engine [2] market-data [3] meet-kevin-watcher (NEW) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	d0a4876825	openclaw: v3 flow — know → ask devvm → (rarely) try yourself Refines the devvm-fallback into an explicit triage flow that the agent runs on every task. The default path is to ASK devvm-claude when uncertain — don't brute-force. Most tasks are solvable there. ## The flow 1. Do I KNOW how? Check `memory_recall` and INDEX.md. 2. If not, SSH devvm and ask claude — and crucially, ask it to share the steps + credentials needed so I can do it on my own next time. Save the answer in openclaw memory. 3. (RARE) If devvm-claude says no, try in-pod. Most likely fail — that's OK. ## Storage moved to memory-indexed location Learnings now live under `/workspace/memory/projects/openclaw-learned/` (was `/workspace/learned/`) so memory-core indexes them and `memory_recall` surfaces them. Layout: - `scripts/<task>.md` runnable recipes - `knowledge/<topic>.md` decisions, paths, gotchas - `credentials/<name>.md` POINTERS to Vault, never values ## Credentials = Vault pointers only Previous v2 design saved cred values to plaintext NFS files. v3 flips to pointer-only: cred file documents the Vault path + fetch command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the consumer, and rotation expectations. The secret stays in Vault. ## Init container also migrates Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3, moves any files from the legacy `/workspace/learned/` tree into the new location, removes the empty legacy dir. User edits outside the markers always survive. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	ef67a53676	openclaw: explicit "use devvm + learn" default behaviour Refine the init container's devvm-fallback seeding so the OpenClaw agent treats devvm as its DEFAULT teacher and saves recipes locally to become independent over time: 1. TOOLS.md v2 section now has two emphatic CRITICAL rules: - "TRY DEVVM before giving up" — when stuck, ssh devvm before telling the user "I can't do that". - "After every task, introspect → save a faster way" — for any non-trivial task (especially recurring ones), save the recipe to /workspace/learned/ and update INDEX.md. 2. New cc-skill `learn-from-tasks` at /home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises both triggers: (A) you're stuck → check INDEX → ask devvm → save; (B) you just finished → introspect → save if recurring. 3. /workspace/learned/ scaffold: INDEX.md table-of-contents + scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks INDEX.md BEFORE reaching for devvm, so saved recipes are findable on the next run. 4. Marker migration: strips both v1 and v2 markers before re-inserting so user edits outside the markers always survive future restarts. Security caveat documented inline: credentials in /workspace/learned/credentials/ are NFS plaintext — acceptable for home-lab personal scope, NOT for anything more sensitive than what `ssh devvm` already gives the pod (wizard's access). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	43802d2452	openclaw: also write devvm section to /workspace/TOOLS.md The OpenClaw agent reads TOOLS.md on every session per AGENTS.md ("environment-specific notes"), but it does NOT auto-search the memory-core index for "devvm" before answering. Result: the agent said "I don't have access to the devvm" even though ssh + the openclaw-task wrapper were fully wired up (verified e2e in `9ad52dfd`). Updated init 6 (seed-devvm-memory-note) to ALSO append a marker-delimited section to /workspace/TOOLS.md describing the devvm SSH capability + openclaw-task usage. Idempotent: strips any prior v1 section before re-inserting, so user edits outside the markers survive future pod restarts. The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md memory note stays in place — it's still indexed by memory-core and surfaces for memory_recall queries. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
Viktor Barzin	7e558de8f0	openclaw: SSH + tmux task fallback to devvm Give the OpenClaw pod two new capabilities: 1. Host-tools bundle. New init container `install-host-tools` extracts openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq + friends into /tools/host-tools/, with the bookworm-slim libs the binaries need. PATH + LD_LIBRARY_PATH on the main container point ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1 marker; smoke test (ldd-based) fails the init at deploy time if any binary has unresolved deps. Bundle is ~558 MB on the existing /srv/nfs/openclaw/tools NFS. 2. devvm SSH + async task pattern. New init `setup-ssh-config` writes id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main container startup symlinks /home/node/.ssh → there. New /usr/local/bin/openclaw-task wrapper on devvm manages long-running work as tmux sessions on devvm (sessions and logs survive pod restarts — they live on devvm, not in the pod). New init container `seed-devvm-memory-note` drops a markdown note teaching the pattern; main container startup now runs `openclaw memory index --force` so the note is searchable on first boot. Design + verified E2E flow in docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test green: spawned a 50s task from pod A, deleted pod A, new pod B saw the task finish and read its full log. Pre-existing keel.sh annotation drift on openclaw/{openlobster, task_webhook} cleaned up in the same apply. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:01 +00:00
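
The PASS/WARN/FAIL bands described in commit 6950b8f197 (cluster-health #43) can be read as a simple three-way classifier. The sketch below is only an illustration of those bands, not the actual check script: the function name `classify_pve_temp` is hypothetical, and treating fractional readings between 82 and 83 C as WARN is an assumption, since the commit message only lists whole-degree ranges.

```python
def classify_pve_temp(temp_c: float) -> str:
    """Map a PVE CPU temperature (Celsius) to a status band.

    Bands follow the commit message: PASS below 65 C, WARN from 65 C
    up to (but not including) 83 C, FAIL at 83 C and above.
    Handling of fractional values between 82 and 83 is an assumption.
    """
    if temp_c >= 83:
        return "FAIL"  # at or above TjMax
    if temp_c >= 65:
        return "WARN"  # elevated: sustained cluster load worth a look
    return "PASS"      # within or below the 55-65 C baseline


if __name__ == "__main__":
    # Sample readings, including the 67 C value mentioned in the commit.
    for reading in (61, 67, 83):
        print(reading, classify_pve_temp(reading))
```

Under these bands the 67 C reading from the commit's live verification lands in WARN, matching the behaviour the message describes.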


3733 commits
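
The "~23 days of WARN" figure in commit c11ac7d486 (cnpg webhook-cert threshold) follows from the gap between the monitoring alert threshold and the operator's renewal threshold. A minimal sketch of that arithmetic, assuming both thresholds are expressed in whole days of remaining certificate lifetime; the function name `warn_days_before_rotation` is hypothetical and is not part of CNPG or the cluster-health check.

```python
def warn_days_before_rotation(alert_threshold_days: int,
                              renew_threshold_days: int) -> int:
    """Days during which monitoring warns before the operator would rotate.

    Monitoring warns once remaining lifetime drops below
    alert_threshold_days; the operator only regenerates the cert once it
    drops below renew_threshold_days. The gap is the warning window.
    """
    return max(0, alert_threshold_days - renew_threshold_days)


if __name__ == "__main__":
    # Before the change: alert at <30d, renew at <7d.
    print(warn_days_before_rotation(30, 7))
    # After the change: both thresholds at 30d, so the window closes.
    print(warn_days_before_rotation(30, 30))
```

With the thresholds aligned at 30 days, the window in which the health check warns while the operator is not yet rotating shrinks to zero, which is the effect the commit aims for.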