infra

Author	SHA1	Message	Date
Viktor Barzin	62efded1b6	wireguard: switch to iptables-nft so PostUp MASQUERADE works Wireguard pod CrashLoopBackOff'd for hours with wg-quick's PostUp failing: iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?) sclevine/wg's default `iptables` symlink points to iptables-legacy, which talks to the kernel's xt-tables. K8s nodes nowadays initialize their nat table via nftables (calico-node sets it up), so iptables-legacy in the container sees "no nat table" and bails. Reproduced by ephemerally debugging the live pod's namespaces (kubectl debug --copy-to + same mounts as the real pod) — wg-quick output matched verbatim. Fix: postStart now calls update-alternatives to point iptables and ip6tables at iptables-nft/ip6tables-nft (already present in the image) before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes to the nftables-backed nat table calico already populated. Verified: new pod went 2/2 Running with 0 restarts after apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	45c8e88e89	terminal: probe + alerts after Traefik replica routing-table skew User reported "site loads but failed to connect on the tmux session". Root cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only the IngressRoute CRDs registered. About 1/3 of /token preflight requests landed on that replica and got 404 with router="-", and WS upgrades intermittently failed the same way, so the lobby iframe stayed stuck on "Failed to connect. Retrying...". `kubectl delete pod` on the bad replica restored the missing router and unblocked the user. This commit adds the long-term mitigation: stacks/terminal/main.tf - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes 4 gauges to Pushgateway (token_status, ws_status, ttyd_status, last_success_timestamp). Verified the probe end-to-end: token=302 ws=302 ttyd=200 ok=1 stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl - Webterminal group: WebterminalTokenDegraded (warning, 10m), WebterminalWebsocketDegraded (critical, 10m), WebterminalTtydUnreachable (critical, 10m), WebterminalProbeStale (warning, 15m). - Traefik Router Parity group: TraefikRouterCountSkew fires when any Traefik replica's router count diverges from siblings for >10m — catches the same class of issue cluster-wide, not just for terminal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	d828b51670	recruiter-responder: bump image_tag to 50f43004 (backtest --persist) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	0480477f44	nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped control-plane exclusion from the controller Deployment, so both replicas landed on k8s-master, fought for hostNetwork ports 19809/29653, and one went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes holding the ports — only a kubelet restart on master cleared them. - Pin helm_release.version = "4.13.1" so terraform apply can't drift to the broken chart (defense in depth; nfs-csi namespace is already in the Kyverno-Keel exclude list) - Add controller.affinity: podAntiAffinity between replicas + nodeAffinity excluding node-role.kubernetes.io/control-plane - docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md captures the root cause + recovery procedure (kubelet restart via nsenter is the escalation path when crictl rmp -f fails) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	e398e717f1	broker-sync(fidelity): un-suspend monthly CronJob The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729) which is the simple accumulate-gains approach Viktor signed off on: each monthly scrape captures (current_pot, real_contribs), and we emit a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape. dav_corrected handles the dashboard math. Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via 'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.	2026-05-22 14:16:56 +00:00
Viktor Barzin	195b5e4061	keel: use +() anchors on policy/match-tag so per-workload overrides stick Without the anchor, each policy update fires mutateExistingOnPolicyUpdate, which OVERWRITES existing keel.sh/policy annotations back to 'force'. That broke the phased rollout — bulk-setting workloads to 'never' didn't stick because the next policy update reset them. With +() anchors, the mutate only adds the annotation if missing. New workloads (in enrolled namespaces) get force+match-tag; existing workloads with explicit policy=never (out-of-band, for phased rollout) stay never. Phase 1 rollout state (2026-05-17): - 10 workloads on force+match-tag in 10 namespaces (Phase 1) enrolled via keel.sh/enrolled=true namespace label: linkwarden, excalidraw, diun, echo, foolery, city-guesser, jsoncrack, privatebin, ntfy, speedtest - 216 workloads on policy=never (out-of-band kubectl annotate) - 31 critical namespaces excluded at policy level Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true` and clearing the `never` annotation off their workloads. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	25fcf80651	keel: expand critical-namespace exclude list — protects vault/cnpg/authentik/etc. 2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36. The force+match-tag pairing should have constrained Keel to digest-only on the current tag (not switch to a new tag), but a race between Kyverno's mutate (injecting match-tag) and Keel's hourly poll caused the workload to still have the old `force`-only annotation when Keel acted. Result: tag rewrite, pods cycled, pgbouncer connection failures, login broken. Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back to 2026.2.2. Auth restored within ~5 min. Going forward, critical-namespace workloads are excluded at the policy level so this race can't recur. They get upgraded via TF (Helm chart version bumps) on a deliberate cadence, never by Keel. Live state: 36 workloads on policy=never (35 critical + chrome-service pin + 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for opt-out-pure auto-update on the remaining stateless apps. This matches user direction (2026-05-17): "upgrading is fine as long as we upgrade correctly and the latest version is healthy" + "keel responsible for the latest version, phased rollout, graceful". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	74ecda6cc3	keel: bump default policy patch → major (user wants latest version) User: 'i'm happy with occasional breakages. we have alerts.' Policy=major auto-updates workloads to the latest semver tag in the registry, including major/minor/patch bumps. Still semver-parser-bounded so dev/nightly/master branches are filtered out (avoids the 2026-05-16 force-trap on affine/calico). Live: 217 patch-annotated workloads re-annotated to major. Next Keel poll (~1h) will pick up any pending major/minor releases. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	6c8546bb84	recruiter-responder: bump image_tag to 94b37a9c (follow-up detection) Replies from recruiters to our sent decline / engage / ignored threads are now attached to the existing thread, surface with a 🔁 follow-up marker in Telegram ("you previously sent"), and re-open thread status to pending so they show up in recruiter_list status=pending. Smoke-tested live: Rachel-style follow-up referencing our outbound msgid + the original recruiter msgid in References → correctly attached to thread #87, status flipped sent→pending, 3 messages persisted (in/out/in). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	7e540292ad	kyverno: bump background-controller memory 384Mi → 2Gi (OOMKilled processing keel URs) The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced 176 UpdateRequests for the initial bulk scan across enrolled namespaces. At the existing 384Mi limit, kyverno-background-controller OOMKilled while processing them — no annotations got injected on existing workloads (count stuck at 30). Live state already bumped via kubectl set resources; this commit makes it durable through Terraform. Also lowered the request to 256Mi (the 384Mi floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady state). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	257679166b	recruiter-responder: bump image_tag to 02a01c9a (Reply-To + quoted body in replies) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	7e1ecaf74c	kyverno: codify aggregated ClusterRole for keel mutate-existing The previous commit (`bc714755`) added mutateExistingOnPolicyUpdate=true to the inject-keel-annotations ClusterPolicy but Kyverno's validate webhook rejected it: the background-controller SA needs update/patch on apps/v1 Deployment/StatefulSet/DaemonSet. Created live via kubectl + now in TF so the next apply is idempotent. The ClusterRole aggregates into kyverno:background-controller via the rbac.kyverno.io/aggregate-to-background-controller label. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	bede247e98	recruiter-responder: bump image_tag to 59df5f8a (Reply-To honoured) Reply-To header now extracted on inbound and used for outbound replies. Verified with a synthetic email From: noreply-careers@megacorp.example Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and threaded under the original (Re: subject + In-Reply-To + References). Alembic 0003 added messages.reply_to_addr column. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	a2e8afc3ed	kyverno: add mutateExistingOnPolicyUpdate=true so existing workloads get annotated Before this, the inject-keel-annotations policy only fired on admission events. Workloads that existed BEFORE their namespace got labeled keel.sh/enrolled=true never received the annotation, so Keel didn't watch them. Live state was 30 of 226 workloads auto-updating. With mutateExistingOnPolicyUpdate=true and the required mutate.targets block, Kyverno's BackgroundScan controller applies the mutate to existing matching Deployments/StatefulSets/DaemonSets on policy update. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	cdeb89d5f1	final wave: enroll immich + status-page, retrigger 17 pending Bucket A * immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1 skipped — has non-standard lifecycle from earlier work). * status-page: enrolled (was missing from original sweep). * v6 retrigger marker on 17 stacks that never reached terragrunt apply (#704 exit-1 halted mid-loop). After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks. The remaining ~22 are operator/Helm-managed and intentionally excluded (same fight-loop risk as Calico — bump via Helm chart version, not Keel). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
root	caba0e811f	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:55 +00:00
Viktor Barzin	4944e508aa	Bucket C: enroll 5 raw-deploy stacks in Keel auto-update * beads-server: 3 Deployments — extended V1 lifecycle blocks to V2 + KEEL_IGNORE_IMAGE; namespace label. * llama-cpp: 1 Deployment — extended V1→V2; namespace label. * novelapp: namespace label only (Deployment has non-standard lifecycle without V1 dns_config — drift expected, accept for now). * plotting-book: namespace label only (same as novelapp). * trading-bot: namespace label only (same as novelapp). immich deferred — the bulk-add script's brace-counter got confused by a HEREDOC in the file, inserting a lifecycle block in the wrong position. Needs manual per-Deployment editing. The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see their Deployments mutated by Kyverno but their TF lifecycle doesn't yet ignore the keel annotations. Expected behavior: drift visible in terragrunt plan, applied-state oscillates with Kyverno re-injecting. Acceptable starting point; per-Deployment lifecycle work to fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	b57596d930	Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) After fixing the postgresql-lb MetalLB flap (deleted stuck ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined commit: * Bucket A (16 stacks): re-append CI retrigger marker so the previously-pending applies pick up: blog calico cyberchef descheduler f1-stream homepage jsoncrack k8s-dashboard k8s-version-upgrade kms local-path osm_routing real-estate-crawler travel_blog vault webhook_handler * Bucket D (5 module-nested stacks): keel.sh/enrolled label on namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module: postiz instagram-poster k8s-portal uptime-kuma vaultwarden Bucket C (raw-deploy apps without V1 marker on their Deployment lifecycles) deferred — needs per-Deployment lifecycle block additions that the bulk script can't safely automate: beads-server immich llama-cpp novelapp plotting-book trading-bot Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	ec60af5fd4	kyverno: exclude calico-system from inject-keel-annotations Stop the hourly Keel-vs-tigera-operator fight loop on calico-node DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system workloads with keel.sh/policy=never; TF: added calico-system to the namespaces exclude list so any future mutate run won't re-inject. The previous calico unenrollment (label removal from namespace) wasn't enough — once Kyverno had stamped the policy=patch annotation on the Deployments/DaemonSets, removing the namespace label didn't strip the annotation, so Keel kept watching them. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:54 +00:00
Viktor Barzin	b48ddc09d6	ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them)	2026-05-22 14:16:54 +00:00
Viktor Barzin	978237441e	ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks	2026-05-22 14:16:54 +00:00
Viktor Barzin	1dd8f4e2bf	openclaw: native MCP servers + daily claude-memory sync Wire ha-mcp, context7, and the in-pod playwright sidecar as native MCP servers on OpenClaw via `mcp set` in the container startup (ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set entries persist). HA URL pulled from new Vault key secret/openclaw.ha_sofia_mcp_url and passed via the HA_SOFIA_MCP_URL env var. Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw namespace: pulls all non-sensitive memories from claude-memory.claude-memory.svc:80/api/memories, groups by category, writes 18 Markdown files into /workspace/memory/projects/claude-memory-sync/ (the path memory-core indexes), then triggers `openclaw memory index --force` via kubectl exec. Reuses the existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488 memories synced, 25/25 files indexed, search returns hits. Also drops the legacy /app/extensions entry from plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env, and one-shot deletes the stale 2026-02-28 metaclaw-export.json from the openclaw home volume. claude_memory MCP intentionally NOT wired — its /mcp/mcp transport 404s on the deployed claude-memory-mcp:17 image (tracked as code-z1so). Shared knowledge is delivered via the CronJob's REST sync instead. Adding claude_memory to mcp.servers is a one-line follow-up once that's fixed.	2026-05-22 14:16:53 +00:00
Viktor Barzin	0c73974362	ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698)	2026-05-22 14:16:53 +00:00
root	87f7b25a13	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:53 +00:00
Viktor Barzin	126cfb7022	wealth: dav_corrected view fixes pension gains-offset miscategorisation The broker-sync Fidelity provider emits 'unrealised-gains-offset' DEPOSIT activities to reconcile Wealthfolio's total with the PlanViewer reported pot, because Wealthfolio doesn't track pension fund units directly. Wealthfolio's data model treats that DEPOSIT as a cash contribution, which double-inflates net_contribution and zeroes out the implied growth. Add a Postgres view 'dav_corrected' in wealthfolio_sync that subtracts the cumulative gains-offset from net_contribution per account per date (re-exporting as 'net_contribution' so it's a drop-in replacement). All 17 wealth dashboard panels that compute contribution/growth/ROI now read from the view. Total impact: portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly the £35,721.20 Fidelity offset that was previously miscategorised).	2026-05-22 14:16:52 +00:00
Viktor Barzin	6769526e1e	ci: retrigger apply for pending Keel enrollment (~58 stacks) Bulk enrollment commit `8f4b1956` had its CI pipeline #689 killed before terragrunt apply ran. The enrollment label + V2 lifecycle changes are in master but never reached the cluster. Appending a one-line marker to each pending stack's main.tf so Woodpecker's diff-detection picks them up and applies them serially. Idempotent — re-applying a stack whose state already matches is a no-op. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:52 +00:00
Viktor Barzin	e5f6d16b2e	enrolled-patch stacks: ignore image drift from Keel auto-update For Deployments enrolled in Keel with policy=patch, the image tag is updated by Keel as new patches release upstream. Without ignore_changes on the image field, terragrunt apply would fight Keel in an endless loop (TF reverts → Keel re-rolls → repeat — same shape as the calico/tigera-operator fight from earlier). Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks. Image string in TF becomes the initial seed; Keel rolls it forward. Stacks: actualbudget, broker-sync, changedetection, city-guesser, coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo, excalidraw, foolery, forgejo, freedify. CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest, recruiter-responder, claude-agent-service, claude-memory) keep TF ownership of image and policy=never — their image_tag is set by CI via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes on those would break the CI deploy flow. Caveat: only container[0].image is added. Multi-container Deployments (immich, beads, etc.) will need additional container[N].image lines for any container Keel rolls. Those stacks are not currently enrolled. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:51 +00:00
Viktor Barzin	8519e37975	calico: unenroll from Keel — tigera-operator owns DaemonSet spec Keel kept rewriting calico-node + calico-kube-controllers images to v3.26.5 (proper patch update); tigera-operator immediately reverted to v3.26.1 because the Installation CR is the source of truth. Endless churn but no data loss — Calico stayed healthy throughout. Removing keel.sh/enrolled label and live label from calico-system ns. Calico upgrades go through the tigera-operator's Installation CR manually, not Keel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	8f18621dd5	keel: default policy → patch (semver-bounded opt-out auto-update) Move from `never` (no auto-update) to `patch` for the cluster-wide default. Keel only auto-updates PATCH versions within the current major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked. Tag-rewrites that broke calico (v3.26.1 → :master) and affine (0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch. Caveats: * Patch causes Terraform image drift for semver-pinned services — drift-detection pipeline will surface it; lifecycle ignore_changes on container[].image can be added per stack later if drift is noisy. * Tags that aren't parseable as semver (:latest, :11, :nightly, SHA tags) are ignored by patch — those workloads stay on their current image until promoted to `force` policy individually. Self-hosted CI-driven services + chrome-service kept on `never` (deliberate pins / CI controls the tag): recruiter-responder, claude-agent-service, claude-memory, chrome-service, fire-planner, job-hunter, payslip-ingest Live state already updated via kubectl apply + per-workload patches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	662695908a	recruiter-triage: AI culture & tooling section + warm-engage AI ask - claude-agent-service bumped to 191ed5dd (new AI section in agent template — leadership stance, approved tools, usage limits / quotas, code-gen safety, product-side AI depth, follow-up questions for the recruiter when the web is sparse). - recruiter-responder bumped to ab59eeab (deep_research prompt asks for AI culture; warm_engage template adds a written-only ask for IDE assistants, chat tools, per-seat limits, source-to-external model policy). Smoke-tested 2026-05-16: forced fresh research on Datadog, agent returned full structured AI section with 7 explicit recruiter questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	0a6b2489f7	keel: default policy → never (post-incident safe default) 2026-05-16 incident: Keel's `force` policy switched semver-pinned images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master) instead of digest-tracking. Force is documented as "always update to the newest tag in the registry" — only safe on already-mutable tags like :latest. Changing the cluster-wide default in inject-keel-annotations to `never`. The namespace enrollment label + V2 lifecycle suppression stay in place so opt-in is one annotation per Deployment, but no service auto-updates until explicitly approved. To opt in a workload now: 1. Verify the Deployment image is on a mutable tag (:latest, :<major>, or a vendor "stable" tag) — change in Terraform first if needed. 2. Add to the Deployment's metadata.annotations: "keel.sh/policy" = "force" (digest tracking) OR "keel.sh/policy" = "patch" (semver patch bumps — also requires ignore_changes on the image) Live policy already updated via kubectl apply + per-workload override (force → never). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	9765f6b9a4	keel: enable Slack notifications on every upgrade Wire Keel's Slack notifier to the existing bot token in Vault (secret/viktor -> slack_bot_token). Posts to #general by default; override via slack.channel in the Helm values if you want a dedicated channel like #keel-notifications. Notification level is "info" so we get every rollout event, not just errors. Approval flow is OFF — opt-out-pure means all updates apply unattended. If we later introduce approvals, add slack.approvalsChannel. Resolves user request: 'keel should send notifications to slack everytime it upgrades an app'. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	3027ab85a8	recruiter-responder: bump image_tag to 189ef901 OpenClaw can now answer 'what do we know about <company>?' from cache via the new recruiter_company_research tool, and recruiter_get embeds the cached research payload inline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:49 +00:00
Viktor Barzin	be3b94da85	keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist) The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5, 1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed silently. Bump to 1.2.0 (app version 0.21.1, latest stable).	2026-05-22 14:16:48 +00:00
Viktor Barzin	411524a10d	kured: drop Mon-Fri restriction, reboot any day The weekday-only schedule was a 2026-03-16-incident-era guardrail when the rest of the safety net was thin. Today's gates — halt-on-alert, sentinel-gate Check 4 (24h soak via node Ready transitions), the K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the sentinel-path fix from earlier today — make weekend reboots safe and just clear the backlog faster. Effect: 5 pending node reboots clear in 5 calendar days instead of queueing up over weekends. The K8s version-upgrade detection at Sun 12:00 UTC self-defers if a Sunday-morning kured reboot fires (the RecentNodeReboot alert is in the Upgrade Gates ignore-less list for the version-upgrade preflight — same mechanism kured uses). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	2e52583abd	Phase 1a: enroll 4 self-hosted services in Keel auto-update Enrolls the cleanest Woodpecker-build-only self-hosted services into the inject-keel-annotations ClusterPolicy by labeling their namespaces keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on each, so Keel will detect the current upstream digest and trigger a rolling restart when polling starts (1h cadence). Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress the annotation drift Kyverno will inject (keel.sh/policy, /trigger, /pollSchedule). Services included: - fire-planner - job-hunter - payslip-ingest - recruiter-responder Skipped from Phase 1 for follow-up: - claude-agent-service (user has WIP on main.tf) - claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs) - kms (two Deployments; needs per-resource review) - wealthfolio (sync sidecar pattern; needs review) - chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label) - GHA-migrated repos (10) (need per-repo CI cleanup) - beadboard, freedify (no CI) See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	5acfab5bb9	recruiter-responder: bump image_tag to f3cb91ff (180d research_cache TTL) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	e5a65c11a9	recruiter-triage v3: Perks & Office Life section + cache-first deep_research - claude-agent-service bumped to f764fef6 (agent system prompt adds the Perks block: food/health/pension/equity/PTO/parental/equipment/learning/wellness/amenities/commuter). 1200-word cap. - recruiter-responder bumped to 38a2cdaa (cache-first deep_research: serves cached payload if fetched_at + ttl_seconds > now; cache writes upsert; new force flag bypasses). Verified end-to-end: deep_research on Datadog now returns full Perks section (~220s, $0.60, 23 turns). Earlier 500 fixed (was uq_research_company_tier dup-key on re-run). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	020f62555b	Phase 0: install Keel + Kyverno auto-update annotation injector Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. - New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation. - New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h. - Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set. - Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins). - Keel namespace excluded from the mutate — supervisor self-update has too bad a failure mode (decision #11). - AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need. - .claude/CLAUDE.md: docker-images rule flagged as transitional. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	3ef860b2be	kured + cnpg: drain-safe defaults ahead of Monday reboot wave Three defensive moves to make the kured rolling-reboot cycle survive edge cases without operator intervention: kured (stacks/kured/main.tf): - Set `configuration.drainTimeout = "30m"`. Default is unlimited; if a future PDB or finalizer stalls drain, kured retries forever and the node stays cordoned silently. 30m caps the silent-failure window — after timeout kured logs the abort and waits for the next period; the node stays Schedulable so cluster capacity isn't lost. Lets us fail closed instead of fail-silent. CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf): - Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the failover during a primary-node drain depended on the lone replica being caught up; a WAL backlog would stall the drain until the replica was current. With 3 instances CNPG always has at least one fully-current replica to promote, and the PDB's `minAvailable=1` on the primary selector is satisfied throughout the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about 35Gi after autoresize). Memory: +3Gi pod limit. - Updated the `triggers.instances` so the null_resource's local-exec actually re-applies the YAML (kubectl apply with the new spec). The YAML is the source-of-truth but the trigger is what tells terraform to re-run the provisioner. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	5768216d0e	anubis: HA with shared valkey/redis store + replicas=2 Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge state lived in process memory — a challenge issued by pod A wouldn't be verifiable by pod B (HTTP 500 "store: key not found"). The PDB at `minAvailable=1` made this worse: with replicas=1 the eviction API can NEVER satisfy the constraint, so every drain on a node hosting an Anubis pod looped forever. This is what stalled the manual K8s upgrade on 2026-05-11 (had to delete pods directly to bypass eviction) and was about to block kured on Monday 2026-05-18 once the kured sentinel fix landed. Anubis upstream has first-class support for a Valkey/Redis-protocol shared store (documented as the "Kubernetes worker pool" pattern). Wire it up: - modules/kubernetes/anubis_instance: add `shared_store_url` variable. When set, appends a `store: { backend: valkey, parameters: { url } }` block to the rendered policy YAML and defaults replicas to 2 (capped at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so drains can take down one pod at a time. topologySpreadConstraint tightened to `DoNotSchedule` so the two replicas land on different nodes — a single node loss never takes a whole Anubis instance down. - All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog, travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster Redis already runs HA via Sentinel + haproxy, no new infra needed. Verified: every Anubis Deployment now 2/2 Ready with pods on different nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated by live traffic post-apply; Palo Alto Networks scanner hit blog right after apply and the challenge log shows the new state path. Drain on any worker now succeeds without a `predrain_unstick` workaround — eviction API is satisfied because at most one pod is unavailable at a time, and the other replica keeps serving. Monday's kured reboot wave should roll through cleanly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	3025879478	claude-agent-service: ship recruiter-triage agent + restore missing terragrunt.hcl - main.tf: bump image_tag to 1b3350c0 (carries the new agent), init container also copies recruiter-triage.md into /home/agent/.claude/agents/. - terragrunt.hcl: restored (file was missing — apply was blocked). Standard root include + platform/vault/external-secrets dependencies. Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42) via recruiter-responder REST API → 102.5s, $0.43, structured markdown report with comp bands vs £600k floor, culture signals, remote policy, recent news, sources cited. End-to-end Tier-2 is live. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	ce5f3ec209	recruiter-responder: expose Gmail IMAP creds for backtest CLI Pulls vbarzin@gmail.com app password from secret/recruiter-responder (seeded from secret/wealthfolio.imap_password — same Gmail credential that wealthfolio uses for broker-statement ingestion). Env vars GMAIL_IMAP_USER + GMAIL_IMAP_PASS, consumed by 'backtest gmail'. Backtest verified 2026-05-16 against folder 'companies-I-dont-take-seriously': 20/20 recruiter, 100% company extraction (9 stated, 6 subject, 4 sender_domain, 1 body), 30% comp, avg 12s latency. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	065982d978	kured: fix sentinel path mismatch that stalled rolling reboots The kured Helm chart derives the sentinel hostPath from `dirname(configuration.rebootSentinel)`. Previously rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at `/sentinel/` (an empty auto-created directory on every host) while the kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required. Two different host directories → kured never saw the open gate, even though the gate's checks were all green every 5 min on every node. Result: unattended-upgrades has had packages waiting on every node since 2026-05-10 (when uu was re-enabled) and kured's hourly log has said "Reboot not required" for the entire period. Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts hostPath /var/run, the same directory the gate writes to. The in-pod mountPath (/sentinel) is hardcoded by the chart and doesn't matter; the symlink chain works out: /sentinel/<file> inside the pod resolves to /var/run/<file> on the host. Verified: kured pod can now list /sentinel/gated-reboot-required (0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15). First gated reboot will land Mon 2026-05-18 02:00 London. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	80e6314bf0	recruiter-responder: bump image_tag to 559e5c57 PDF extraction, tech_stack list, aggressive company/comp inference, no-phone-call drafts, backtest CLI. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	8e11caff8d	recruiter-responder: bump image_tag to bbd178da (structured Telegram + comp floor) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	24ce3e267d	aiostreams: weekly backup of Stremio account addon collection Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from the AIOStreams config-backup at 03:00): - Logs into api.strem.io with credentials from Vault (secret/viktor.stremio_email + stremio_password, now also synced into the aiostreams-probe-secrets ExternalSecret) - Fetches the full addonCollection via addonCollectionGet - Writes timestamped JSON to the existing aiostreams-backup PVC (NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600) - 90-day retention, logs out to invalidate the auth key - Pushgateway metrics: stremio_account_backup_{success,bytes, addon_count,duration_seconds,last_run_timestamp} Protects against: accidental "uninstall all" / API regression / wrong account login wiping the curated set of 22 addons (Cinemeta + 16 MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local). Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.	2026-05-22 14:16:47 +00:00
Viktor Barzin	aa6e9b0242	recruiter-responder: public /cb ingress for Telegram URL-button callbacks - Add ingress_factory module (auth=none, HMAC + expiry are the gate); ingress_path=["/cb"] only — /api stays internal, /healthz cluster. dns_type=proxied. anti_ai_scraping=false. - Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret` auto-clones the wildcard cert into every namespace. - Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS hostname relax). - Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me. - Drop git-crypt-encrypted wildcard cert files into stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new .gitleaksignore — git-crypt encrypts at rest but the working-tree copy is plaintext, so gitleaks can't tell. Smoke-tested end-to-end 2026-05-15 23:45: synthetic email -> Telegram with ✅/❌ buttons -> ✅ tapped via curl -> 'Sent' HTML page -> thread.status=sent, decision row recorded with decided_via=telegram_button, outbound message threaded correctly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	77010b769a	aiostreams: whitelist Vidhin + Tamtaro sync URLs Adds two env vars on the AIOStreams deployment: - WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex (TRaSH-aligned) so syncedRankedRegexUrls works for the user - WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions + Tamtaro's ISE/PSE/ESE-standard. Gotcha: AIOStreams validates each synced* field against the matching whitelist: stream-expression files (incl. Vidhin's expressions.json) go in WHITELISTED_SEL_URLS, not the regex one, even though they live in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG. User config: enabled Vidhin's regex + ranked expressions + Tamtaro's ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering; can be added later from the same whitelist.	2026-05-22 14:16:47 +00:00
Viktor Barzin	c396092c86	aiostreams: weekly NFS backup of decrypted user config Adds aiostreams-config-backup CronJob (Sun 03:00 weekly): - Pulls /api/v1/user via internal ClusterIP with UUID + password from the existing aiostreams-probe-secrets ExternalSecret - Writes timestamped JSON to nfs-backup PVC mounted at /backup - 90-day retention, prunes older files - Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp} NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite to Synology via the existing offsite-sync-backup CronJob). Complements the daily postgresql-backup-per-db pipeline (which dumps the encrypted blob) by storing the decrypted JSON — usable for human inspection / disaster recovery even without the AIOStreams password. Verified: manual job wrote 12931 bytes, file present on NFS.	2026-05-22 14:16:47 +00:00
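
The two weekly backup CronJobs above (c396092c86 and 24ce3e267d) both describe a 90-day retention step that prunes older files. The log does not show how the jobs implement it, so the following is only a minimal Python sketch of that kind of pruning. The function name `prune_old_backups`, the `*.json` pattern, and the use of file modification time as the age signal are assumptions made for illustration, not the repository's actual code.

```python
import time
from pathlib import Path

RETENTION_DAYS = 90  # retention window stated in the commit messages


def prune_old_backups(backup_dir, pattern="*.json",
                      retention_days=RETENTION_DAYS, now=None):
    """Delete files in backup_dir matching pattern whose mtime is older
    than retention_days. Returns the sorted names of the deleted files.

    Hypothetical helper: name, pattern, and mtime-based ageing are
    illustrative assumptions, not taken from the repository.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    deleted = []
    for path in sorted(Path(backup_dir).glob(pattern)):
        # Only regular files older than the cutoff are removed.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return deleted
```

Passing `now` explicitly keeps the function easy to test; a real job would more likely call it with the default and the mounted backup path (for example `/backup`, as mentioned in c396092c86).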

960 commits