infra

Author	SHA1	Message	Date
Viktor Barzin	cdeb89d5f1	final wave: enroll immich + status-page, retrigger 17 pending Bucket A * immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1 skipped — has non-standard lifecycle from earlier work). * status-page: enrolled (was missing from original sweep). * v6 retrigger marker on 17 stacks that never reached terragrunt apply (#704 exit-1 halted mid-loop). After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks. The remaining ~22 are operator/Helm-managed and intentionally excluded (same fight-loop risk as Calico — bump via Helm chart version, not Keel). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
root	caba0e811f	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:55 +00:00
Viktor Barzin	4944e508aa	Bucket C: enroll 5 raw-deploy stacks in Keel auto-update * beads-server: 3 Deployments — extended V1 lifecycle blocks to V2 + KEEL_IGNORE_IMAGE; namespace label. * llama-cpp: 1 Deployment — extended V1→V2; namespace label. * novelapp: namespace label only (Deployment has non-standard lifecycle without V1 dns_config — drift expected, accept for now). * plotting-book: namespace label only (same as novelapp). * trading-bot: namespace label only (same as novelapp). immich deferred — the bulk-add script's brace-counter got confused by a HEREDOC in the file, inserting a lifecycle block in the wrong position. Needs manual per-Deployment editing. The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see their Deployments mutated by Kyverno but their TF lifecycle doesn't yet ignore the keel annotations. Expected behavior: drift visible in terragrunt plan, applied-state oscillates with Kyverno re-injecting. Acceptable starting point; per-Deployment lifecycle work to fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	b57596d930	Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) After fixing the postgresql-lb MetalLB flap (deleted stuck ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined commit: * Bucket A (16 stacks): re-append CI retrigger marker so the previously-pending applies pick up: blog calico cyberchef descheduler f1-stream homepage jsoncrack k8s-dashboard k8s-version-upgrade kms local-path osm_routing real-estate-crawler travel_blog vault webhook_handler * Bucket D (5 module-nested stacks): keel.sh/enrolled label on namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module: postiz instagram-poster k8s-portal uptime-kuma vaultwarden Bucket C (raw-deploy apps without V1 marker on their Deployment lifecycles) deferred — needs per-Deployment lifecycle block additions that the bulk script can't safely automate: beads-server immich llama-cpp novelapp plotting-book trading-bot Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:55 +00:00
Viktor Barzin	ec60af5fd4	kyverno: exclude calico-system from inject-keel-annotations Stop the hourly Keel-vs-tigera-operator fight loop on calico-node DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system workloads with keel.sh/policy=never; TF: added calico-system to the namespaces exclude list so any future mutate run won't re-inject. The previous calico unenrollment (label removal from namespace) wasn't enough — once Kyverno had stamped the policy=patch annotation on the Deployments/DaemonSets, removing the namespace label didn't strip the annotation, so Keel kept watching them. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:54 +00:00
Viktor Barzin	b48ddc09d6	ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them)	2026-05-22 14:16:54 +00:00
Viktor Barzin	978237441e	ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks	2026-05-22 14:16:54 +00:00
Viktor Barzin	1dd8f4e2bf	openclaw: native MCP servers + daily claude-memory sync Wire ha-mcp, context7, and the in-pod playwright sidecar as native MCP servers on OpenClaw via `mcp set` in the container startup (ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set entries persist). HA URL pulled from new Vault key secret/openclaw.ha_sofia_mcp_url and passed via the HA_SOFIA_MCP_URL env var. Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw namespace: pulls all non-sensitive memories from claude-memory.claude-memory.svc:80/api/memories, groups by category, writes 18 Markdown files into /workspace/memory/projects/claude-memory-sync/ (the path memory-core indexes), then triggers `openclaw memory index --force` via kubectl exec. Reuses the existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488 memories synced, 25/25 files indexed, search returns hits. Also drops the legacy /app/extensions entry from plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env, and one-shot deletes the stale 2026-02-28 metaclaw-export.json from the openclaw home volume. claude_memory MCP intentionally NOT wired: its /mcp/mcp transport 404s on the deployed claude-memory-mcp:17 image (tracked as code-z1so). Shared knowledge is delivered via the CronJob's REST sync instead. Adding claude_memory to mcp.servers is a one-line follow-up once that's fixed.	2026-05-22 14:16:53 +00:00
Viktor Barzin	0c73974362	ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698)	2026-05-22 14:16:53 +00:00
root	87f7b25a13	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:53 +00:00
Viktor Barzin	126cfb7022	wealth: dav_corrected view fixes pension gains-offset miscategorisation The broker-sync Fidelity provider emits 'unrealised-gains-offset' DEPOSIT activities to reconcile Wealthfolio's total with the PlanViewer reported pot, because Wealthfolio doesn't track pension fund units directly. Wealthfolio's data model treats that DEPOSIT as a cash contribution, which double-inflates net_contribution and zeroes out the implied growth. Add a Postgres view 'dav_corrected' in wealthfolio_sync that subtracts the cumulative gains-offset from net_contribution per account per date (re-exporting as 'net_contribution' so it's a drop-in replacement). All 17 wealth dashboard panels that compute contribution/growth/ROI now read from the view. Total impact: portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly the £35,721.20 Fidelity offset that was previously miscategorised).	2026-05-22 14:16:52 +00:00
Viktor Barzin	6769526e1e	ci: retrigger apply for pending Keel enrollment (~58 stacks) Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before terragrunt apply ran. The enrollment label + V2 lifecycle changes are in master but never reached the cluster. Appending a one-line marker to each pending stack's main.tf so Woodpecker's diff-detection picks them up and applies them serially. Idempotent — re-applying a stack whose state already matches is a no-op. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:52 +00:00
Viktor Barzin	137b4cbcf7	ci: retry after Keel rollout cascade settled	2026-05-22 14:16:51 +00:00
Viktor Barzin	e5f6d16b2e	enrolled-patch stacks: ignore image drift from Keel auto-update For Deployments enrolled in Keel with policy=patch, the image tag is updated by Keel as new patches release upstream. Without ignore_changes on the image field, terragrunt apply would fight Keel in an endless loop (TF reverts → Keel re-rolls → repeat — same shape as the calico/tigera-operator fight from earlier). Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks. Image string in TF becomes the initial seed; Keel rolls it forward. Stacks: actualbudget, broker-sync, changedetection, city-guesser, coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo, excalidraw, foolery, forgejo, freedify. CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest, recruiter-responder, claude-agent-service, claude-memory) keep TF ownership of image and policy=never — their image_tag is set by CI via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes on those would break the CI deploy flow. Caveat: only container[0].image is added. Multi-container Deployments (immich, beads, etc.) will need additional container[N].image lines for any container Keel rolls. Those stacks are not currently enrolled. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:51 +00:00
Viktor Barzin	8519e37975	calico: unenroll from Keel — tigera-operator owns DaemonSet spec Keel kept rewriting calico-node + calico-kube-controllers images to v3.26.5 (proper patch update); tigera-operator immediately reverted to v3.26.1 because the Installation CR is the source of truth. Endless churn but no data loss — Calico stayed healthy throughout. Removing keel.sh/enrolled label and live label from calico-system ns. Calico upgrades go through the tigera-operator's Installation CR manually, not Keel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	8f18621dd5	keel: default policy → patch (semver-bounded opt-out auto-update) Move from `never` (no auto-update) to `patch` for the cluster-wide default. Keel only auto-updates PATCH versions within the current major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked. Tag-rewrites that broke calico (v3.26.1 → :master) and affine (0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch. Caveats: * Patch causes Terraform image drift for semver-pinned services — drift-detection pipeline will surface it; lifecycle ignore_changes on container[].image can be added per stack later if drift is noisy. * Tags that aren't parseable as semver (:latest, :11, :nightly, SHA tags) are ignored by patch — those workloads stay on their current image until promoted to `force` policy individually. Self-hosted CI-driven services + chrome-service kept on `never` (deliberate pins / CI controls the tag): recruiter-responder, claude-agent-service, claude-memory, chrome-service, fire-planner, job-hunter, payslip-ingest Live state already updated via kubectl apply + per-workload patches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	662695908a	recruiter-triage: AI culture & tooling section + warm-engage AI ask - claude-agent-service bumped to 191ed5dd (new AI section in agent template — leadership stance, approved tools, usage limits / quotas, code-gen safety, product-side AI depth, follow-up questions for the recruiter when the web is sparse). - recruiter-responder bumped to ab59eeab (deep_research prompt asks for AI culture; warm_engage template adds a written-only ask for IDE assistants, chat tools, per-seat limits, source-to-external model policy). Smoke-tested 2026-05-16: forced fresh research on Datadog, agent returned full structured AI section with 7 explicit recruiter questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	0a6b2489f7	keel: default policy → never (post-incident safe default) 2026-05-16 incident: Keel's `force` policy switched semver-pinned images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master) instead of digest-tracking. Force is documented as "always update to the newest tag in the registry" — only safe on already-mutable tags like :latest. Changing the cluster-wide default in inject-keel-annotations to `never`. The namespace enrollment label + V2 lifecycle suppression stay in place so opt-in is one annotation per Deployment, but no service auto-updates until explicitly approved. To opt in a workload now: 1. Verify the Deployment image is on a mutable tag (:latest, :<major>, or a vendor "stable" tag) — change in Terraform first if needed. 2. Add to the Deployment's metadata.annotations: "keel.sh/policy" = "force" (digest tracking) OR "keel.sh/policy" = "patch" (semver patch bumps — also requires ignore_changes on the image) Live policy already updated via kubectl apply + per-workload override (force → never). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	9765f6b9a4	keel: enable Slack notifications on every upgrade Wire Keel's Slack notifier to the existing bot token in Vault (secret/viktor -> slack_bot_token). Posts to #general by default; override via slack.channel in the Helm values if you want a dedicated channel like #keel-notifications. Notification level is "info" so we get every rollout event, not just errors. Approval flow is OFF — opt-out-pure means all updates apply unattended. If we later introduce approvals, add slack.approvalsChannel. Resolves user request: 'keel should send notifications to slack everytime it upgrades an app'. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:50 +00:00
Viktor Barzin	3027ab85a8	recruiter-responder: bump image_tag to 189ef901 OpenClaw can now answer 'what do we know about <company>?' from cache via the new recruiter_company_research tool, and recruiter_get embeds the cached research payload inline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:49 +00:00
Viktor Barzin	be3b94da85	keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist) The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5, 1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed silently. Bump to 1.2.0 (app version 0.21.1, latest stable).	2026-05-22 14:16:48 +00:00
Viktor Barzin	411524a10d	kured: drop Mon-Fri restriction, reboot any day The weekday-only schedule was a 2026-03-16-incident-era guardrail when the rest of the safety net was thin. Today's gates — halt-on-alert, sentinel-gate Check 4 (24h soak via node Ready transitions), the K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the sentinel-path fix from earlier today — make weekend reboots safe and just clear the backlog faster. Effect: 5 pending node reboots clear in 5 calendar days instead of queueing up over weekends. The K8s version-upgrade detection at Sun 12:00 UTC self-defers if a Sunday-morning kured reboot fires (the RecentNodeReboot alert is in the Upgrade Gates ignore-less list for the version-upgrade preflight — same mechanism kured uses). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	2e52583abd	Phase 1a: enroll 4 self-hosted services in Keel auto-update Enrolls the cleanest Woodpecker-build-only self-hosted services into the inject-keel-annotations ClusterPolicy by labeling their namespaces keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on each, so Keel will detect the current upstream digest and trigger a rolling restart when polling starts (1h cadence). Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress the annotation drift Kyverno will inject (keel.sh/policy, /trigger, /pollSchedule). Services included: - fire-planner - job-hunter - payslip-ingest - recruiter-responder Skipped from Phase 1 for follow-up: - claude-agent-service (user has WIP on main.tf) - claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs) - kms (two Deployments; needs per-resource review) - wealthfolio (sync sidecar pattern; needs review) - chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label) - GHA-migrated repos (10) (need per-repo CI cleanup) - beadboard, freedify (no CI) See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	5acfab5bb9	recruiter-responder: bump image_tag to f3cb91ff (180d research_cache TTL) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	e5a65c11a9	recruiter-triage v3: Perks & Office Life section + cache-first deep_research - claude-agent-service bumped to f764fef6 (agent system prompt adds the Perks block: food/health/pension/equity/PTO/parental/equipment/learning/wellness/amenities/commuter). 1200-word cap. - recruiter-responder bumped to 38a2cdaa (cache-first deep_research: serves cached payload if fetched_at + ttl_seconds > now; cache writes upsert; new force flag bypasses). Verified end-to-end: deep_research on Datadog now returns full Perks section (~220s, $0.60, 23 turns). Earlier 500 fixed (was uq_research_company_tier dup-key on re-run). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	020f62555b	Phase 0: install Keel + Kyverno auto-update annotation injector Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. - New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation. - New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h. - Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set. - Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins). - Keel namespace excluded from the mutate — supervisor self-update has too-bad a failure mode (decision #11). - AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need. - .claude/CLAUDE.md: docker-images rule flagged as transitional. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	9476649539	docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16) Captures the May 10–16 kured-vs-sentinel-gate hostPath mismatch (chart derived hostPath from configuration.rebootSentinel) and the companion work to harden the rolling-reboot pipeline against single-replica PDB deadlocks: Anubis 1→2 replicas with shared Valkey store, kured drainTimeout=30m, CNPG pg-cluster 2→3 instances. Includes the mysql-standalone-PDB orphan cleanup and the k8s-node1 containerd-source drift audit (benign). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	3ef860b2be	kured + cnpg: drain-safe defaults ahead of Monday reboot wave Three defensive moves to make the kured rolling-reboot cycle survive edge cases without operator intervention: kured (stacks/kured/main.tf): - Set `configuration.drainTimeout = "30m"`. Default is unlimited; if a future PDB or finalizer stalls drain, kured retries forever and the node stays cordoned silently. 30m caps the silent-failure window — after timeout kured logs the abort and waits for the next period; the node stays Schedulable so cluster capacity isn't lost. Lets us fail closed instead of fail-silent. CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf): - Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the failover during a primary-node drain depended on the lone replica being caught up; a WAL backlog would stall the drain until the replica was current. With 3 instances CNPG always has at least one fully-current replica to promote, and the PDB's `minAvailable=1` on the primary selector is satisfied throughout the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about 35Gi after autoresize). Memory: +3Gi pod limit. - Updated the `triggers.instances` so the null_resource's local-exec actually re-applies the YAML (kubectl apply with the new spec). The YAML is the source-of-truth but the trigger is what tells terraform to re-run the provisioner. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	4ff3638065	state(dbaas): update encrypted state	2026-05-22 14:16:48 +00:00
Viktor Barzin	08bf5e47b7	state(dbaas): update encrypted state	2026-05-22 14:16:48 +00:00
Viktor Barzin	5768216d0e	anubis: HA with shared valkey/redis store + replicas=2 Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge state lived in process memory: a challenge issued by pod A wouldn't be verifiable by pod B (HTTP 500 "store: key not found"). The PDB at `minAvailable=1` made this worse: with replicas=1 the eviction API can NEVER satisfy the constraint, so every drain on a node hosting an Anubis pod looped forever. This is what stalled the manual K8s upgrade on 2026-05-11 (had to delete pods directly to bypass eviction) and was about to block kured on Monday 2026-05-18 once the kured sentinel fix landed. Anubis upstream has first-class support for a Valkey/Redis-protocol shared store (documented as the "Kubernetes worker pool" pattern). Wire it up: - modules/kubernetes/anubis_instance: add `shared_store_url` variable. When set, appends a `store: { backend: valkey, parameters: { url } }` block to the rendered policy YAML and defaults replicas to 2 (capped at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so drains can take down one pod at a time. topologySpreadConstraint tightened to `DoNotSchedule` so the two replicas land on different nodes; a single node loss never takes a whole Anubis instance down. - All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog, travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster Redis already runs HA via Sentinel + haproxy, no new infra needed. Verified: every Anubis Deployment now 2/2 Ready with pods on different nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated by live traffic post-apply; Palo Alto Networks scanner hit blog right after apply and the challenge log shows the new state path. Drain on any worker now succeeds without a `predrain_unstick` workaround: the eviction API is satisfied because at most one pod is unavailable at a time, and the other replica keeps serving. Monday's kured reboot wave should roll through cleanly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	3025879478	claude-agent-service: ship recruiter-triage agent + restore missing terragrunt.hcl - main.tf: bump image_tag to 1b3350c0 (carries the new agent), init container also copies recruiter-triage.md into /home/agent/.claude/agents/. - terragrunt.hcl: restored (file was missing — apply was blocked). Standard root include + platform/vault/external-secrets dependencies. Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42) via recruiter-responder REST API → 102.5s, $0.43, structured markdown report with comp bands vs £600k floor, culture signals, remote policy, recent news, sources cited. End-to-end Tier-2 is live. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	ea2342b8e2	docs: add CONTEXT.md domain glossary [ci skip] Adds the per-repo domain glossary that engineering skills (diagnose, tdd, improve-codebase-architecture, grill-with-docs) read before working in this repo. Terms only — no implementation detail. Six clusters (code organization, cluster, networking, storage, secrets, CI/CD), 22 terms, plus relationships, an example dialogue, and five flagged ambiguities. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	ce5f3ec209	recruiter-responder: expose Gmail IMAP creds for backtest CLI Pulls vbarzin@gmail.com app password from secret/recruiter-responder (seeded from secret/wealthfolio.imap_password — same Gmail credential that wealthfolio uses for broker-statement ingestion). Env vars GMAIL_IMAP_USER + GMAIL_IMAP_PASS, consumed by 'backtest gmail'. Backtest verified 2026-05-16 against folder 'companies-I-dont-take-seriously': 20/20 recruiter, 100% company extraction (9 stated, 6 subject, 4 sender_domain, 1 body), 30% comp, avg 12s latency. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	065982d978	kured: fix sentinel path mismatch that stalled rolling reboots The kured Helm chart derives the sentinel hostPath from `dirname(configuration.rebootSentinel)`. Previously rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at `/sentinel/` (an empty auto-created directory on every host) while the kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required. Two different host directories → kured never saw the open gate, even though the gate's checks were all green every 5 min on every node. Result: unattended-upgrades has had packages waiting on every node since 2026-05-10 (when uu was re-enabled) and kured's hourly log has said "Reboot not required" for the entire period. Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts hostPath /var/run, the same directory the gate writes to. The in-pod mountPath (/sentinel) is hardcoded by the chart and doesn't matter; the symlink chain works out: /sentinel/<file> inside the pod resolves to /var/run/<file> on the host. Verified: kured pod can now list /sentinel/gated-reboot-required (0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15). First gated reboot will land Mon 2026-05-18 02:00 London. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	80e6314bf0	recruiter-responder: bump image_tag to 559e5c57 PDF extraction, tech_stack list, aggressive company/comp inference, no-phone-call drafts, backtest CLI. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	8e11caff8d	recruiter-responder: bump image_tag to bbd178da (structured Telegram + comp floor) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	391c002f9a	service-catalog: add aiostreams entry Stremio stream aggregator now has its own row in the Active Use tier. Captures the auth model (own UUID+password, not Authentik), monitoring posture (canary probe + 3 alerts), and backup pipeline (weekly NFS dumps of both decrypted config and the Stremio account addon collection). Follow-up from the 2026-05-15/16 hardening session: 5 commits on servarr/aiostreams, none previously catalogued.	2026-05-22 14:16:47 +00:00
Viktor Barzin	24ce3e267d	aiostreams: weekly backup of Stremio account addon collection Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from the AIOStreams config-backup at 03:00): - Logs into api.strem.io with credentials from Vault (secret/viktor.stremio_email + stremio_password, now also synced into the aiostreams-probe-secrets ExternalSecret) - Fetches the full addonCollection via addonCollectionGet - Writes timestamped JSON to the existing aiostreams-backup PVC (NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600) - 90-day retention, logs out to invalidate the auth key - Pushgateway metrics: stremio_account_backup_{success,bytes,addon_count,duration_seconds,last_run_timestamp} Protects against: accidental "uninstall all" / API regression / wrong account login wiping the curated set of 22 addons (Cinemeta + 16 MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local). Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.	2026-05-22 14:16:47 +00:00
Viktor Barzin	aa6e9b0242	recruiter-responder: public /cb ingress for Telegram URL-button callbacks - Add ingress_factory module (auth=none, HMAC + expiry are the gate); ingress_path=["/cb"] only — /api stays internal, /healthz cluster. dns_type=proxied. anti_ai_scraping=false. - Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret` auto-clones the wildcard cert into every namespace. - Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS hostname relax). - Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me. - Drop git-crypt-encrypted wildcard cert files into stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new .gitleaksignore — git-crypt encrypts at rest but the working-tree copy is plaintext, so gitleaks can't tell. Smoke-tested end-to-end 2026-05-15 23:45: synthetic email -> Telegram with ✅/❌ buttons -> ✅ tapped via curl -> 'Sent' HTML page -> thread.status=sent, decision row recorded with decided_via=telegram_button, outbound message threaded correctly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	77010b769a	aiostreams: whitelist Vidhin + Tamtaro sync URLs Adds two env vars on the AIOStreams deployment: - WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex (TRaSH-aligned) so syncedRankedRegexUrls works for the user - WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions + Tamtaro's ISE/PSE/ESE-standard Gotcha: AIOStreams validates each synced* field against the matching whitelist — stream-expression files (incl. Vidhin's expressions.json) go in WHITELISTED_SEL_URLS, not the regex one, even though they live in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG. User config: enabled Vidhin's regex + ranked expressions + Tamtaro's ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering; can be added later from the same whitelist.	2026-05-22 14:16:47 +00:00
Viktor Barzin	c396092c86	aiostreams: weekly NFS backup of decrypted user config Adds aiostreams-config-backup CronJob (Sun 03:00 weekly): - Pulls /api/v1/user via internal ClusterIP with UUID + password from the existing aiostreams-probe-secrets ExternalSecret - Writes timestamped JSON to nfs-backup PVC mounted at /backup - 90-day retention, prunes older files - Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp} NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite to Synology via the existing offsite-sync-backup CronJob). Complements the daily postgresql-backup-per-db pipeline (which dumps the encrypted blob) by storing the decrypted JSON — usable for human inspection / disaster recovery even without the AIOStreams password. Verified: manual job wrote 12931 bytes, file present on NFS.	2026-05-22 14:16:47 +00:00
root	1177a82452	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:47 +00:00
Viktor Barzin	a98b00324d	recruiter-responder: pin image tag + run plugin installer init as root - stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3 (300s LLM timeouts + IMAP BODY.PEEK[] fix). - stacks/openclaw/main.tf: install-recruiter-plugin init container now runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the recruiter-responder image otherwise drops to uid 10001 which can't write or chown. Smoke-tested end-to-end 2026-05-15 ~23:15: Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage (12.1s, JSON output complete with company/role/salary/location/tech) -> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK. Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from the n8n Postgres workflow_entity table. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	a72590db7d	recruiter-responder: vault DB role + switch proactive push to Telegram - stacks/vault/main.tf: register pg-recruiter-responder static role on the postgresql connection (7d password rotation). Adds the role to allowed_roles and creates vault_database_secret_backend_static_role for `recruiter_responder` user. - stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID. Updated header doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:46 +00:00
Viktor Barzin	89e9471e87	state(vault): update encrypted state	2026-05-22 14:16:46 +00:00
Viktor Barzin	7e1580ba8c	recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount Three coupled changes for the new recruiter-responder pipeline: 1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the download Job script + cmd renderer to handle text_only=true (skip mmproj download + --mmproj flag). The 3 existing vision models stay on text_only=false; no behaviour change for them. 2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets (app secrets from secret/recruiter-responder, DB creds from Vault DB engine static-creds/pg-recruiter-responder), Deployment (replicas=1, Recreate -- IMAP IDLE + APScheduler want single leader), Service ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder. 3. stacks/openclaw/: add init container `install-recruiter-plugin` that uses the recruiter-responder image to copy the .mjs plugin into /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin version to the recruiter-responder image tag. Also injects RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token from openclaw-secrets.recruiter_responder_bearer_token, optional). Pre-apply checklist for recruiter-responder stack: - Vault: seed secret/recruiter-responder with webhook_bearer_token, imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token, task_webhook_token. - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as above webhook_bearer_token). - dbaas: create DB recruiter_responder + role recruiter_responder, and Vault DB-engine role static-creds/pg-recruiter-responder. - Build + push image via Woodpecker (recruiter-responder repo CI). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:46 +00:00
Viktor Barzin	95b9f7bc89	aiostreams: 1h stream cache + canary stream-count probe + 3 alerts Hardening pass following the empty-stream-list incident: 1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 / disabled). Default behaviour hit all 5 upstream addons on every Stremio request; with a 1h TTL repeat requests for the same title are instant, while RD cache invalidations still propagate quickly. 2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's encryptedPassword via the internal ClusterIP, runs a canary stream search for Breaking Bad S01E01, pushes streams_count + probe_success to Pushgateway. Uses an ExternalSecret pulling UUID + password from Vault secret/viktor. Same pattern as email-roundtrip-monitor. 3. Three alerts in monitoring's prometheus_chart_values.tpl: - AIOStreamsStreamCountLow (< 50 streams for 30m) - AIOStreamsProbeFailing (probe_success == 0 for 30m) - AIOStreamsProbeStale (last_run_timestamp > 30min for 10m) Verified: probe returned streams=411 success=1 on first run; all 3 alerts loaded into Prometheus with state=inactive health=ok.	2026-05-22 14:16:46 +00:00
root	fba5ee2df4	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:46 +00:00
Viktor Barzin	c73234982f	aiostreams: pin nightly + switch to auth=app - Pin viren070/aiostreams:nightly → :2026.05.14.1326-nightly (avoid stale-pull cache, matches 8-char SHA convention for rolling tags) - Switch ingress auth tier required → app: Authentik forward-auth blocks Stremio clients (cannot follow OAuth 302), and AIOStreams already enforces UUID + password on /configure and /api/*, with Stremio addon URLs using encryptedPassword as a bearer token. Result: empty-stream-list issue fixed for public Stremio clients. Verified: 410 streams returned via public URL for Breaking Bad S01E01 with no cookies, vs 0 before (502→Authentik OIDC redirect).	2026-05-22 14:16:46 +00:00
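
The `patch` policy described in commit 8f18621dd5 (only PATCH bumps within the current major.minor; non-semver tags ignored) can be sketched as below. This is a minimal illustration of the rule as the commit message states it, not Keel's actual implementation; the function names and the tag regex are assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Assumption for this sketch: a tag counts as semver only if it is plain
# MAJOR.MINOR.PATCH with an optional leading "v". Tags such as "latest",
# "11", "nightly-latest", "master" or SHA tags do not match.
_SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch), or None for a non-semver tag."""
    match = _SEMVER_TAG.match(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def patch_update_allowed(current: str, candidate: str) -> bool:
    """True only when candidate is a newer PATCH of the same major.minor."""
    cur = parse_tag(current)
    new = parse_tag(candidate)
    if cur is None or new is None:
        # Non-semver on either side: the patch rule never fires.
        return False
    return cur[:2] == new[:2] and new[2] > cur[2]
```

Under these assumptions, `patch_update_allowed("0.26.6", "0.26.7")` is true, while `patch_update_allowed("0.26.6", "nightly-latest")` and `patch_update_allowed("v3.26.1", "master")` are false, matching the two 2026-05-16 tag rewrites the commit says cannot recur.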


3293 commits
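
The PodDisruptionBudget reasoning in commit 5768216d0e (a single replica with `minAvailable=1` can never be evicted, while two replicas with `maxUnavailable=1` allow one eviction at a time) comes down to simple arithmetic. The sketch below is a simplified integer-only model of that calculation; the function name is hypothetical and percentage values are not handled.

```python
from typing import Optional


def allowed_disruptions(
    healthy: int,
    expected: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Number of pods the eviction API may remove right now.

    healthy  -- pods currently Ready
    expected -- desired replica count of the workload
    Exactly one of min_available / max_unavailable must be given.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    # Evictions are allowed only while healthy pods exceed the required floor.
    return max(0, healthy - desired_healthy)


# Before the change: 1 replica, minAvailable=1, so drains can never proceed.
before = allowed_disruptions(healthy=1, expected=1, min_available=1)
# After the change: 2 replicas, maxUnavailable=1, so one pod can be evicted.
after = allowed_disruptions(healthy=2, expected=2, max_unavailable=1)
```

Here `before` evaluates to 0 and `after` to 1, which is consistent with the commit's "PDBs allow 1 disruption" verification.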