infra

Author	SHA1	Message	Date
Viktor Barzin	5482f46125	RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight The 24h kubelet-uptime threshold (process_start_time_seconds < 86400) was a defense-in-depth duplicate of the 24h-since-Ready-transition check in kured-sentinel-gate Check 4 — but they used different signals (kubelet process start vs node Ready transition). Whenever the cluster cycled through reboots, the alert kept firing for a full day even after sentinel-gate's check passed, and blocked anything querying halt-on-alert (kured, K8s version-upgrade preflight). Tightened to 1h (3600s) for "node just rebooted, give it a settle window". The cluster-wide 24h-between-reboots invariant lives exclusively in kured-sentinel-gate Check 4 from now on (independent, uses lastTransitionTime). Matched the preflight's own 24h-quiet check in upgrade-step.sh (86400 → 3600) so it doesn't act as a second blocker. Empirically verified: all 5 kubelets are >10h up, alert cleared on next eval after the rule reload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	d1dcc5d12d	beads-server: add presence_claims table for agent coordination Adds the schema for the new agent presence board. Live Dolt is updated via a hashed-named one-shot Job; the ConfigMap entry preserves fresh-PVC init. Also pins the Dolt image to 2.0.3 — :latest on dolthub/dolt-sql-server currently resolves to 0.50.10, whose docker-entrypoint.sh references an undefined docker_process_sql function and crash-loops on every init script in /docker-entrypoint-initdb.d. Keel can still bump this tag in-cluster (image is in lifecycle.ignore_changes).	2026-05-22 14:16:57 +00:00
Viktor Barzin	e4e2babd6a	k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst Two latent bugs in the K8s-version-upgrade pipeline surfaced when a real detection run ran post-26.04 upgrade today: 1. DNS: pod's CoreDNS search path is `<ns>.svc.cluster.local svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation). Unqualified `k8s-master` falls through all of those and then queries upstream Technitium for the bare name → NXDOMAIN. The FQDN `k8s-master.viktorbarzin.lan` is what Technitium actually serves. Suffix every node SSH target with `$NODE_DOMAIN`. 2. envsubst missing: claude-agent-service image doesn't ship `gettext-base`. Replace `envsubst <template | apply` with `python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(sys.stdin.read()))' <template | apply`. Same semantics, image already has python3. Multi-line $SCHEDULING_BLOCK is preserved correctly through expandvars. Verified by manually triggering `k8s-version-check` post-fix: detection now reads `Latest patch: v1.34.8` (currently running 1.34.7) and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and started; killed before it touched the cluster (will land on Sunday 2026-05-24 12:00 UTC like the schedule says). Root cause of why these bugs lay dormant: yesterday's first manual-test detection found "no upgrade needed" so neither code path exercised SSH or envsubst. Today's apt-source restore (do-release-upgrade had mangled them) unmasked the v1.34.8 candidate, which made detection finally proceed past the SSH step. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	6de4549a96	docs/plans: add agent presence implementation plan (2026-05-17) 15-task plan for a shared presence board so Claude Code sessions can see which shared infra resources are being actively mutated by other sessions. Resource-scoped claims on the existing Dolt server, heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.	2026-05-22 14:16:56 +00:00
Viktor Barzin	23d8aa89c4	keel: enroll 11 more namespaces (operators + critical infra) Per user decision, removed authentik, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets, infra-maintenance from the policy-level exclude list, and added keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite being earlier flagged as scaled-to-0) and woodpecker. Net cluster coverage: 197/227 workloads on safe-force (86%), up from 170/227 (74%). All 197 are paired with match-tag=true (digest-only). Remaining 7 namespaces in Kyverno exclude list (irreducible): - keel (self-update) - calico-system + tigera-operator (operator-managed Installation CR) - cnpg-system + dbaas (state-coupled) - nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships ubuntu26.04 driver images) - kube-system (k8s built-ins) Files: - stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list trimmed from 16 → 7 - stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets, servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker — added keel.sh/enrolled=true label on kubernetes_namespace resource - infra-maintenance was in the policy exclude but the namespace doesn't actually exist in the cluster; the removal is a no-op there Applied via kubectl patch on the live ClusterPolicy + kubectl label on namespaces because the kubernetes provider v3.1.0 panics on Kyverno ClusterPolicy refresh — TF source has the desired state for next clean apply on a fixed provider. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	3bdba9f388	keel: enroll 15 critical-path namespaces for digest-only auto-update Per user decision today: monitoring, mailserver, vault, descheduler, metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy, reloader, headscale, wireguard, xray, cloudflared now participate in the same `force + match-tag` regime as the rest of the cluster — Keel watches the deployment's CURRENT tag for digest changes only and rolls on push, never rewriting tag strings. Two-part change: stacks/kyverno/modules/kyverno/keel-annotations.tf Trim the policy-level namespace exclude list from 31 → 16. The 16 remaining exclusions are the irreducible cluster-operator + state-coupled set: keel itself, calico-system + tigera-operator (operator loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system + dbaas (state-coupled), kyverno, metallb-system, external-secrets, proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned), kube-system, vpa, sealed-secrets, infra-maintenance. stacks/<each-of-15>/.../main.tf Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace` resource so the Kyverno mutate policy can target the workloads via its namespaceSelector matchLabels. Note on the apply path: the live ClusterPolicy was patched via `kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics during state refresh on Kyverno ClusterPolicy schemas with deeply nested optional `context.celPreconditions` / `imageRegistry` fields (see crash dump). The TF source above has the desired state, so any clean future apply on a fixed provider version will be a no-op against the live cluster. Floating-tag workloads in the newly-enrolled set (will roll on every upstream digest update — acceptable risk per user): - wireguard: sclevine/wg:latest (image fixed today via iptables-nft postStart shim) - xray: teddysun/xray - crowdsec-web: viktorbarzin/crowdsec_web - monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter - traefik: nginx:1-alpine, openresty/openresty:alpine, ghcr.io/tarampampam/error-pages:3 - redis: haproxy:3.1-alpine, redis:8-alpine Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	f5cf6ec051	nvidia: bump driver container memory limit 128Mi → 2Gi After rolling back k8s-node1's kernel to 6.8.0-117 + spoofing /etc/os-release to 24.04 so the operator picked the matching ubuntu24.04 driver image (everything per the workaround documented in docs/known-issues.md), the driver container still went into a restart loop. Container status: lastState.terminated: { reason: "OOMKilled", exitCode: 137 } The driver-installer was hitting the namespace LimitRange default of 128Mi during `apt-get install linux-headers-6.8.0-117-generic` — the last log line on every restart was "Installing Linux kernel headers..." before SIGKILL. 2Gi gives apt + the DKMS compile step enough headroom; peak observed during a successful compile in a test container was ~1.4Gi. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	63cbd0aba5	docs: known-issues entry for the Ubuntu 26.04 / NVIDIA driver gap Captures the workaround applied on k8s-node1 today (kernel rolled back to 6.8.0-117-generic, apt-mark hold on kernel meta-packages, /etc/os-release spoofed to 24.04 so NFD reports VERSION_ID=24.04 and the gpu-operator picks an existing ubuntu24.04 driver image), plus the trigger that lets us un-mitigate: any ubuntu26.04 tag appearing on nvcr.io/nvidia/driver. Linked from the post-mortem and from beads code-8vr0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	22c18dc061	paperless-mcp: deploy MCP for AI document search - New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET, HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed. - In-cluster only egress to paperless-ngx svc; no Cloudflare hop on MCP-internal traffic. - Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser) in new `claude-mcp-readers` group with view-only Django perms; existing 279 docs bulk-granted view perm via /api/documents/bulk_edit/; workflow #2 auto-grants the group on new docs (Consumption Added). - Gateway-level bearer auth via new Traefik plugin Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth` pulls token list from Vault `secret/paperless-mcp/bearer_tokens`. - Vault `secret/paperless-mcp` holds: paperless_api_token (synced to K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens (JSON array, read at plan time), bearer_token_viktor_laptop (mirror for laptop wiring), paperless_user_password (paperless UI fallback). - Image auto-update via Keel (semver minor policy, hourly poll). - Ingress dns_type=proxied → Uptime Kuma external monitor auto-created by external-monitor-sync CronJob. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	d1e7121115	recruiter-responder: bump image to 05b95943 (split callback routes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	c72b839a2f	nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some point. NVIDIA has NOT published ubuntu26.04 driver images yet (skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04 tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04). Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 + driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart applied cleanly but the v26.3.1 operator auto-detects host OS via NFD labels and constructs `<version>-ubuntu26.04` image tags, which 404 on pull. Rolled back to chart v25.10.1 and pinned it explicitly here so future `terraform apply` doesn't surface the same trap again. Note: chart rollback alone does NOT restore GPU functionality on k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the ubuntu26.04 suffix (the NFD label is sticky once detected). The actual recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver images, or (b) rolling the host kernel back to 6.8.0-117-generic (still installed in /boot, headers in /usr/src) + `apt-mark hold` to prevent re-upgrade. That step needs explicit user authorization for a node reboot — left as the next action item on code-8vr0. Files: - stacks/nvidia/modules/nvidia/main.tf — explicit version pin, explanatory comment - stacks/nvidia/modules/nvidia/values.yaml — comment block documenting the situation; driver pinned at 570.195.03 - docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md — full timeline, root causes, recovery procedure Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	62efded1b6	wireguard: switch to iptables-nft so PostUp MASQUERADE works Wireguard pod CrashLoopBackOff'd for hours with wg-quick's PostUp failing: iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?) sclevine/wg's default `iptables` symlink points to iptables-legacy, which talks to the kernel's xt-tables. K8s nodes nowadays initialize their nat table via nftables (calico-node sets it up), so iptables-legacy in the container sees "no nat table" and bails. Reproduced by ephemerally debugging the live pod's namespaces (kubectl debug --copy-to + same mounts as the real pod) — wg-quick output matched verbatim. Fix: postStart now calls update-alternatives to point iptables and ip6tables at iptables-nft/ip6tables-nft (already present in the image) before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes to the nftables-backed nat table calico already populated. Verified: new pod went 2/2 Running with 0 restarts after apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	45c8e88e89	terminal: probe + alerts after Traefik replica routing-table skew User reported "site loads but failed to connect on the tmux session". Root cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only the IngressRoute CRDs registered. About 1/3 of /token preflight requests landed on that replica and got 404 with router="-", and WS upgrades intermittently failed the same way, so the lobby iframe stayed stuck on "Failed to connect. Retrying...". `kubectl delete pod` on the bad replica restored the missing router and unblocked the user. This commit adds the long-term mitigation: stacks/terminal/main.tf - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes 4 gauges to Pushgateway (token_status, ws_status, ttyd_status, last_success_timestamp). Verified the probe end-to-end: token=302 ws=302 ttyd=200 ok=1 stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl - Webterminal group: WebterminalTokenDegraded (warning, 10m), WebterminalWebsocketDegraded (critical, 10m), WebterminalTtydUnreachable (critical, 10m), WebterminalProbeStale (warning, 15m). - Traefik Router Parity group: TraefikRouterCountSkew fires when any Traefik replica's router count diverges from siblings for >10m — catches the same class of issue cluster-wide, not just for terminal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	d828b51670	recruiter-responder: bump image_tag to 50f43004 (backtest --persist) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	0480477f44	nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped control-plane exclusion from the controller Deployment, so both replicas landed on k8s-master, fought for hostNetwork ports 19809/29653, and one went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes holding the ports — only a kubelet restart on master cleared them. - Pin helm_release.version = "4.13.1" so terraform apply can't drift to the broken chart (defense in depth; nfs-csi namespace is already in the Kyverno-Keel exclude list) - Add controller.affinity: podAntiAffinity between replicas + nodeAffinity excluding node-role.kubernetes.io/control-plane - docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md captures the root cause + recovery procedure (kubelet restart via nsenter is the escalation path when crictl rmp -f fails) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	e398e717f1	broker-sync(fidelity): un-suspend monthly CronJob The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729) which is the simple accumulate-gains approach Viktor signed off on: each monthly scrape captures (current_pot, real_contribs), and we emit a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape. dav_corrected handles the dashboard math. Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via 'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.	2026-05-22 14:16:56 +00:00
Viktor Barzin	195b5e4061	keel: use +() anchors on policy/match-tag so per-workload overrides stick Without the anchor, each policy update fires mutateExistingOnPolicyUpdate, which OVERWRITES existing keel.sh/policy annotations back to 'force'. That broke the phased rollout — bulk-setting workloads to 'never' didn't stick because the next policy update reset them. With +() anchors, the mutate only adds the annotation if missing. New workloads (in enrolled namespaces) get force+match-tag; existing workloads with explicit policy=never (out-of-band, for phased rollout) stay never. Phase 1 rollout state (2026-05-17): - 10 workloads on force+match-tag in 10 namespaces (Phase 1) enrolled via keel.sh/enrolled=true namespace label: linkwarden, excalidraw, diun, echo, foolery, city-guesser, jsoncrack, privatebin, ntfy, speedtest - 216 workloads on policy=never (out-of-band kubectl annotate) - 31 critical namespaces excluded at policy level Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true` and clearing the `never` annotation off their workloads. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
root	b8ab4613e4	Woodpecker CI Update TLS Certificates Commit	2026-05-22 14:16:55 +00:00
Viktor Barzin	25fcf80651	keel: expand critical-namespace exclude list — protects vault/cnpg/authentik/etc. 2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36. The force+match-tag pairing should have constrained Keel to digest-only on the current tag (not switch to a new tag), but a race between Kyverno's mutate (injecting match-tag) and Keel's hourly poll caused the workload to still have the old `force`-only annotation when Keel acted. Result: tag rewrite, pods cycled, pgbouncer connection failures, login broken. Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back to 2026.2.2. Auth restored within ~5 min. Going forward, critical-namespace workloads are excluded at the policy level so this race can't recur. They get upgraded via TF (Helm chart version bumps) on a deliberate cadence, never by Keel. Live state: 36 workloads on policy=never (35 critical + chrome-service pin + 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for opt-out-pure auto-update on the remaining stateless apps. This matches user direction (2026-05-17): "upgrading is fine as long as we upgrade correctly and the latest version is healthy" + "keel responsible for the latest version, phased rollout, graceful". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	74ecda6cc3	keel: bump default policy patch → major (user wants latest version) User: 'i'm happy with occasional breakages. we have alerts.' Policy=major auto-updates workloads to the latest semver tag in the registry, including major/minor/patch bumps. Still semver-parser-bounded so dev/nightly/master branches are filtered out (avoids the 2026-05-16 force-trap on affine/calico). Live: 217 patch-annotated workloads re-annotated to major. Next Keel poll (~1h) will pick up any pending major/minor releases. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	6c8546bb84	recruiter-responder: bump image_tag to 94b37a9c (follow-up detection) Replies from recruiters to our sent decline / engage / ignored threads are now attached to the existing thread, surface with a 🔁 follow-up marker in Telegram ("you previously sent"), and re-open thread status to pending so they show up in recruiter_list status=pending. Smoke-tested live: Rachel-style follow-up referencing our outbound msgid + the original recruiter msgid in References → correctly attached to thread #87, status flipped sent→pending, 3 messages persisted (in/out/in). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	7e540292ad	kyverno: bump background-controller memory 384Mi → 2Gi (OOMKilled processing keel URs) The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced 176 UpdateRequests for the initial bulk scan across enrolled namespaces. At the existing 384Mi limit, kyverno-background-controller OOMKilled while processing them — no annotations got injected on existing workloads (count stuck at 30). Live state already bumped via kubectl set resources; this commit makes it durable through Terraform. Also lowered the request to 256Mi (the 384Mi floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady state). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	257679166b	recruiter-responder: bump image_tag to 02a01c9a (Reply-To + quoted body in replies) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	7e1ecaf74c	kyverno: codify aggregated ClusterRole for keel mutate-existing The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true to the inject-keel-annotations ClusterPolicy but Kyverno's validate webhook rejected it: the background-controller SA needs update/patch on apps/v1 Deployment/StatefulSet/DaemonSet. Created live via kubectl + now in TF so the next apply is idempotent. The ClusterRole aggregates into kyverno:background-controller via the rbac.kyverno.io/aggregate-to-background-controller label. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	bede247e98	recruiter-responder: bump image_tag to 59df5f8a (Reply-To honoured) Reply-To header now extracted on inbound and used for outbound replies. Verified with a synthetic email From: noreply-careers@megacorp.example Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and threaded under the original (Re: subject + In-Reply-To + References). Alembic 0003 added messages.reply_to_addr column. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	a2e8afc3ed	kyverno: add mutateExistingOnPolicyUpdate=true so existing workloads get annotated Before this, the inject-keel-annotations policy only fired on admission events. Workloads that existed BEFORE their namespace got labeled keel.sh/enrolled=true never received the annotation, so Keel didn't watch them. Live state was 30 of 226 workloads auto-updating. With mutateExistingOnPolicyUpdate=true and the required mutate.targets block, Kyverno's BackgroundScan controller applies the mutate to existing matching Deployments/StatefulSets/DaemonSets on policy update. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	cdeb89d5f1	final wave: enroll immich + status-page, retrigger 17 pending Bucket A * immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1 skipped — has non-standard lifecycle from earlier work). * status-page: enrolled (was missing from original sweep). * v6 retrigger marker on 17 stacks that never reached terragrunt apply (#704 exit-1 halted mid-loop). After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks. The remaining ~22 are operator/Helm-managed and intentionally excluded (same fight-loop risk as Calico — bump via Helm chart version, not Keel). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
root	caba0e811f	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:55 +00:00
Viktor Barzin	4944e508aa	Bucket C: enroll 5 raw-deploy stacks in Keel auto-update * beads-server: 3 Deployments — extended V1 lifecycle blocks to V2 + KEEL_IGNORE_IMAGE; namespace label. * llama-cpp: 1 Deployment — extended V1→V2; namespace label. * novelapp: namespace label only (Deployment has non-standard lifecycle without V1 dns_config — drift expected, accept for now). * plotting-book: namespace label only (same as novelapp). * trading-bot: namespace label only (same as novelapp). immich deferred — the bulk-add script's brace-counter got confused by a HEREDOC in the file, inserting a lifecycle block in the wrong position. Needs manual per-Deployment editing. The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see their Deployments mutated by Kyverno but their TF lifecycle doesn't yet ignore the keel annotations. Expected behavior: drift visible in terragrunt plan, applied-state oscillates with Kyverno re-injecting. Acceptable starting point; per-Deployment lifecycle work to fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	b57596d930	Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) After fixing the postgresql-lb MetalLB flap (deleted stuck ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined commit: * Bucket A (16 stacks): re-append CI retrigger marker so the previously-pending applies pick up: blog calico cyberchef descheduler f1-stream homepage jsoncrack k8s-dashboard k8s-version-upgrade kms local-path osm_routing real-estate-crawler travel_blog vault webhook_handler * Bucket D (5 module-nested stacks): keel.sh/enrolled label on namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module: postiz instagram-poster k8s-portal uptime-kuma vaultwarden Bucket C (raw-deploy apps without V1 marker on their Deployment lifecycles) deferred — needs per-Deployment lifecycle block additions that the bulk script can't safely automate: beads-server immich llama-cpp novelapp plotting-book trading-bot Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	ec60af5fd4	kyverno: exclude calico-system from inject-keel-annotations Stop the hourly Keel-vs-tigera-operator fight loop on calico-node DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system workloads with keel.sh/policy=never; TF: added calico-system to the namespaces exclude list so any future mutate run won't re-inject. The previous calico unenrollment (label removal from namespace) wasn't enough — once Kyverno had stamped the policy=patch annotation on the Deployments/DaemonSets, removing the namespace label didn't strip the annotation, so Keel kept watching them. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:54 +00:00
Viktor Barzin	b48ddc09d6	ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them)	2026-05-22 14:16:54 +00:00
Viktor Barzin	978237441e	ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks	2026-05-22 14:16:54 +00:00
Viktor Barzin	1dd8f4e2bf	openclaw: native MCP servers + daily claude-memory sync Wire ha-mcp, context7, and the in-pod playwright sidecar as native MCP servers on OpenClaw via `mcp set` in the container startup (ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set entries persist). HA URL pulled from new Vault key secret/openclaw.ha_sofia_mcp_url and passed via the HA_SOFIA_MCP_URL env var. Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw namespace: pulls all non-sensitive memories from claude-memory.claude-memory.svc:80/api/memories, groups by category, writes 18 Markdown files into /workspace/memory/projects/claude-memory-sync/ (the path memory-core indexes), then triggers `openclaw memory index --force` via kubectl exec. Reuses the existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488 memories synced, 25/25 files indexed, search returns hits. Also drops the legacy /app/extensions entry from plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env, and one-shot deletes the stale 2026-02-28 metaclaw-export.json from the openclaw home volume. claude_memory MCP intentionally NOT wired — its /mcp/mcp transport 404s on the deployed claude-memory-mcp:17 image (tracked as code-z1so). Shared knowledge is delivered via the CronJob's REST sync instead. Adding claude_memory to mcp.servers is a one-line follow-up once that's fixed.	2026-05-22 14:16:53 +00:00
Viktor Barzin	0c73974362	ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698)	2026-05-22 14:16:53 +00:00
root	87f7b25a13	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:53 +00:00
Viktor Barzin	126cfb7022	wealth: dav_corrected view fixes pension gains-offset miscategorisation The broker-sync Fidelity provider emits 'unrealised-gains-offset' DEPOSIT activities to reconcile Wealthfolio's total with the PlanViewer reported pot, because Wealthfolio doesn't track pension fund units directly. Wealthfolio's data model treats that DEPOSIT as a cash contribution, which double-inflates net_contribution and zeroes out the implied growth. Add a Postgres view 'dav_corrected' in wealthfolio_sync that subtracts the cumulative gains-offset from net_contribution per account per date (re-exporting as 'net_contribution' so it's a drop-in replacement). All 17 wealth dashboard panels that compute contribution/growth/ROI now read from the view. Total impact: portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly the £35,721.20 Fidelity offset that was previously miscategorised).	2026-05-22 14:16:52 +00:00
Viktor Barzin	6769526e1e	ci: retrigger apply for pending Keel enrollment (~58 stacks) Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before terragrunt apply ran. The enrollment label + V2 lifecycle changes are in master but never reached the cluster. Appending a one-line marker to each pending stack's main.tf so Woodpecker's diff-detection picks them up and applies them serially. Idempotent — re-applying a stack whose state already matches is a no-op. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:52 +00:00
Viktor Barzin	137b4cbcf7	ci: retry after Keel rollout cascade settled	2026-05-22 14:16:51 +00:00
Viktor Barzin	e5f6d16b2e	enrolled-patch stacks: ignore image drift from Keel auto-update For Deployments enrolled in Keel with policy=patch, the image tag is updated by Keel as new patches release upstream. Without ignore_changes on the image field, terragrunt apply would fight Keel in an endless loop (TF reverts → Keel re-rolls → repeat — same shape as the calico/tigera-operator fight from earlier). Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks. Image string in TF becomes the initial seed; Keel rolls it forward. Stacks: actualbudget, broker-sync, changedetection, city-guesser, coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo, excalidraw, foolery, forgejo, freedify. CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest, recruiter-responder, claude-agent-service, claude-memory) keep TF ownership of image and policy=never — their image_tag is set by CI via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes on those would break the CI deploy flow. Caveat: only container[0].image is added. Multi-container Deployments (immich, beads, etc.) will need additional container[N].image lines for any container Keel rolls. Those stacks are not currently enrolled. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:51 +00:00
Viktor Barzin	8519e37975	calico: unenroll from Keel — tigera-operator owns DaemonSet spec Keel kept rewriting calico-node + calico-kube-controllers images to v3.26.5 (proper patch update); tigera-operator immediately reverted to v3.26.1 because the Installation CR is the source of truth. Endless churn but no data loss — Calico stayed healthy throughout. Removing keel.sh/enrolled label and live label from calico-system ns. Calico upgrades go through the tigera-operator's Installation CR manually, not Keel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	8f18621dd5	keel: default policy → patch (semver-bounded opt-out auto-update) Move from `never` (no auto-update) to `patch` for the cluster-wide default. Keel only auto-updates PATCH versions within the current major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked. Tag-rewrites that broke calico (v3.26.1 → :master) and affine (0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch. Caveats: * Patch causes Terraform image drift for semver-pinned services — drift-detection pipeline will surface it; lifecycle ignore_changes on container[].image can be added per stack later if drift is noisy. * Tags that aren't parseable as semver (:latest, :11, :nightly, SHA tags) are ignored by patch — those workloads stay on their current image until promoted to `force` policy individually. Self-hosted CI-driven services + chrome-service kept on `never` (deliberate pins / CI controls the tag): recruiter-responder, claude-agent-service, claude-memory, chrome-service, fire-planner, job-hunter, payslip-ingest Live state already updated via kubectl apply + per-workload patches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	662695908a	recruiter-triage: AI culture & tooling section + warm-engage AI ask - claude-agent-service bumped to 191ed5dd (new AI section in agent template — leadership stance, approved tools, usage limits / quotas, code-gen safety, product-side AI depth, follow-up questions for the recruiter when the web is sparse). - recruiter-responder bumped to ab59eeab (deep_research prompt asks for AI culture; warm_engage template adds a written-only ask for IDE assistants, chat tools, per-seat limits, source-to-external model policy). Smoke-tested 2026-05-16: forced fresh research on Datadog, agent returned full structured AI section with 7 explicit recruiter questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	0a6b2489f7	keel: default policy → never (post-incident safe default) 2026-05-16 incident: Keel's `force` policy switched semver-pinned images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master) instead of digest-tracking. Force is documented as "always update to the newest tag in the registry" — only safe on already-mutable tags like :latest. Changing the cluster-wide default in inject-keel-annotations to `never`. The namespace enrollment label + V2 lifecycle suppression stay in place so opt-in is one annotation per Deployment, but no service auto-updates until explicitly approved. To opt in a workload now: 1. Verify the Deployment image is on a mutable tag (:latest, :<major>, or a vendor "stable" tag) — change in Terraform first if needed. 2. Add to the Deployment's metadata.annotations: "keel.sh/policy" = "force" (digest tracking) OR "keel.sh/policy" = "patch" (semver patch bumps — also requires ignore_changes on the image) Live policy already updated via kubectl apply + per-workload override (force → never). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	9765f6b9a4	keel: enable Slack notifications on every upgrade Wire Keel's Slack notifier to the existing bot token in Vault (secret/viktor -> slack_bot_token). Posts to #general by default; override via slack.channel in the Helm values if you want a dedicated channel like #keel-notifications. Notification level is "info" so we get every rollout event, not just errors. Approval flow is OFF — opt-out-pure means all updates apply unattended. If we later introduce approvals, add slack.approvalsChannel. Resolves user request: 'keel should send notifications to slack everytime it upgrades an app'. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	3027ab85a8	recruiter-responder: bump image_tag to 189ef901 OpenClaw can now answer 'what do we know about <company>?' from cache via the new recruiter_company_research tool, and recruiter_get embeds the cached research payload inline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:49 +00:00
Viktor Barzin	be3b94da85	keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist) The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5, 1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed silently. Bump to 1.2.0 (app version 0.21.1, latest stable).	2026-05-22 14:16:48 +00:00
Viktor Barzin	411524a10d	kured: drop Mon-Fri restriction, reboot any day The weekday-only schedule was a 2026-03-16-incident-era guardrail when the rest of the safety net was thin. Today's gates — halt-on-alert, sentinel-gate Check 4 (24h soak via node Ready transitions), the K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the sentinel-path fix from earlier today — make weekend reboots safe and just clear the backlog faster. Effect: 5 pending node reboots clear in 5 calendar days instead of queueing up over weekends. The K8s version-upgrade detection at Sun 12:00 UTC self-defers if a Sunday-morning kured reboot fires (the RecentNodeReboot alert is in the Upgrade Gates ignore-less list for the version-upgrade preflight — same mechanism kured uses). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	2e52583abd	Phase 1a: enroll 4 self-hosted services in Keel auto-update Enrolls the cleanest Woodpecker-build-only self-hosted services into the inject-keel-annotations ClusterPolicy by labeling their namespaces keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on each, so Keel will detect the current upstream digest and trigger a rolling restart when polling starts (1h cadence). Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress the annotation drift Kyverno will inject (keel.sh/policy, /trigger, /pollSchedule). Services included: - fire-planner - job-hunter - payslip-ingest - recruiter-responder Skipped from Phase 1 for follow-up: - claude-agent-service (user has WIP on main.tf) - claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs) - kms (two Deployments; needs per-resource review) - wealthfolio (sync sidecar pattern; needs review) - chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label) - GHA-migrated repos (10) (need per-repo CI cleanup) - beadboard, freedify (no CI) See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	5acfab5bb9	recruiter-responder: bump image_tag to f3cb91ff (180d research_cache TTL) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00

3319 commits
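
Note on commit 8f18621dd5 (keel: default policy → patch): the message describes `patch` as accepting only a higher PATCH version within the current major.minor and ignoring tags that do not parse as semver. The sketch below restates that rule in Python to make the accepted and rejected cases concrete. It is not Keel's implementation. The function names and the strict `MAJOR.MINOR.PATCH` regex (with an optional leading `v`) are assumptions made for illustration, and Keel's real parser may accept other tag forms.

```python
import re
from typing import Optional, Tuple

# Assumption: only plain MAJOR.MINOR.PATCH tags, optionally prefixed with "v",
# count as semver here. Pre-release/build suffixes are deliberately not handled.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_semver(tag: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch), or None if the tag is not plain semver."""
    match = _SEMVER.match(tag)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)), int(match.group(3)))


def patch_allows(current: str, candidate: str) -> bool:
    """Hypothetical check: would a patch-only policy move current -> candidate?"""
    cur = parse_semver(current)
    cand = parse_semver(candidate)
    if cur is None or cand is None:
        # Unparseable tags (:latest, :11, :nightly-latest, SHA tags) are ignored.
        return False
    # Same major.minor, strictly newer patch.
    return cur[:2] == cand[:2] and cand[2] > cur[2]


if __name__ == "__main__":
    pairs = [
        ("0.26.6", "0.26.7"),
        ("0.26.6", "nightly-latest"),
        ("v3.26.1", "v3.26.5"),
        ("v3.26.1", "master"),
    ]
    for current, candidate in pairs:
        print(current, "->", candidate, patch_allows(current, candidate))
```

Under these assumptions, the tag rewrites from the 2026-05-16 incident (0.26.6 → :nightly-latest, v3.26.1 → :master) are rejected because the candidate tag does not parse as semver, while 0.26.6 → 0.26.7 is accepted.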