Commit graph

960 commits

Author SHA1 Message Date
Viktor Barzin
62efded1b6 wireguard: switch to iptables-nft so PostUp MASQUERADE works
Wireguard pod CrashLoopBackOff'd for hours with wg-quick's PostUp failing:

    iptables v1.8.4 (legacy): can't initialize iptables table `nat':
    Table does not exist (do you need to insmod?)

sclevine/wg's default `iptables` symlink points to iptables-legacy, which
talks to the kernel's xt-tables. K8s nodes nowadays initialize their
nat table via nftables (calico-node sets it up), so iptables-legacy in
the container sees "no nat table" and bails. Reproduced by ephemerally
debugging the live pod's namespaces (kubectl debug --copy-to + same
mounts as the real pod) — wg-quick output matched verbatim.

Fix: postStart now calls update-alternatives to point iptables and
ip6tables at iptables-nft/ip6tables-nft (already present in the image)
before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes
to the nftables-backed nat table calico already populated. Verified:
new pod went 2/2 Running with 0 restarts after apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
45c8e88e89 terminal: probe + alerts after Traefik replica routing-table skew
User reported "site loads but failed to connect on the tmux session". Root
cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing
the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only
the IngressRoute CRDs registered. About 1/3 of /token preflight requests
landed on that replica and got 404 with router="-", and WS upgrades
intermittently failed the same way, so the lobby iframe stayed stuck on
"Failed to connect. Retrying...". `kubectl delete pod` on the bad replica
restored the missing router and unblocked the user.

This commit adds the long-term mitigation:

stacks/terminal/main.tf
  - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits
    /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes
    4 gauges to Pushgateway (token_status, ws_status, ttyd_status,
    last_success_timestamp). Verified the probe end-to-end:
      token=302 ws=302 ttyd=200 ok=1

stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
  - Webterminal group: WebterminalTokenDegraded (warning, 10m),
    WebterminalWebsocketDegraded (critical, 10m),
    WebterminalTtydUnreachable (critical, 10m),
    WebterminalProbeStale (warning, 15m).
  - Traefik Router Parity group: TraefikRouterCountSkew fires when any
    Traefik replica's router count diverges from siblings for >10m —
    catches the same class of issue cluster-wide, not just for terminal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
d828b51670 recruiter-responder: bump image_tag to 50f43004 (backtest --persist)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
0480477f44 nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem)
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.

- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
  the broken chart (defense in depth; nfs-csi namespace is already in the
  Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
  nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
  captures the root cause + recovery procedure (kubelet restart via
  nsenter is the escalation path when crictl rmp -f fails)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
e398e717f1 broker-sync(fidelity): un-suspend monthly CronJob
The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729)
which is the simple accumulate-gains approach Viktor signed off on:
each monthly scrape captures (current_pot, real_contribs), and we emit
a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape.
dav_corrected handles the dashboard math.

Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via
'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.
2026-05-22 14:16:56 +00:00
Viktor Barzin
195b5e4061 keel: use +() anchors on policy/match-tag so per-workload overrides stick
Without the anchor, each policy update fires mutateExistingOnPolicyUpdate,
which OVERWRITES existing keel.sh/policy annotations back to 'force'. That
broke the phased rollout — bulk-setting workloads to 'never' didn't stick
because the next policy update reset them.

With +() anchors, the mutate only adds the annotation if missing. New
workloads (in enrolled namespaces) get force+match-tag; existing workloads
with explicit policy=never (out-of-band, for phased rollout) stay never.

Phase 1 rollout state (2026-05-17):
  - 10 workloads on force+match-tag in 10 namespaces (Phase 1)
    enrolled via keel.sh/enrolled=true namespace label:
      linkwarden, excalidraw, diun, echo, foolery, city-guesser,
      jsoncrack, privatebin, ntfy, speedtest
  - 216 workloads on policy=never (out-of-band kubectl annotate)
  - 31 critical namespaces excluded at policy level

Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true`
and clearing the `never` annotation off their workloads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
25fcf80651 keel: expand critical-namespace exclude list — protects vault/cnpg/authentik/etc.
2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36.
The force+match-tag pairing should have constrained Keel to digest-only on
the current tag (not switch to a new tag), but a race between Kyverno's
mutate (injecting match-tag) and Keel's hourly poll caused the workload to
still have the old `force`-only annotation when Keel acted. Result: tag
rewrite, pods cycled, pgbouncer connection failures, login broken.

Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back
to 2026.2.2. Auth restored within ~5 min.

Going forward, critical-namespace workloads are excluded at the policy level
so this race can't recur. They get upgraded via TF (Helm chart version bumps)
on a deliberate cadence, never by Keel.

Live state: 36 workloads on policy=never (35 critical + chrome-service pin
+ 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for
opt-out-pure auto-update on the remaining stateless apps.

This matches user direction (2026-05-17): "upgrading is fine as long as we
upgrade correctly and the latest version is healthy" + "keel responsible
for the latest version, phased rollout, graceful".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
74ecda6cc3 keel: bump default policy patch → major (user wants latest version)
User: 'i'm happy with occasional breakages. we have alerts.'

Policy=major auto-updates workloads to the latest semver tag in the
registry, including major/minor/patch bumps. Still semver-parser-bounded
so dev/nightly/master branches are filtered out (avoids the 2026-05-16
force-trap on affine/calico).

Live: 217 patch-annotated workloads re-annotated to major. Next Keel
poll (~1h) will pick up any pending major/minor releases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
6c8546bb84 recruiter-responder: bump image_tag to 94b37a9c (follow-up detection)
Replies from recruiters to our sent decline / engage / ignored threads
are now attached to the existing thread, surface with a 🔁 follow-up
marker in Telegram ("you previously sent"), and re-open thread status
to pending so they show up in recruiter_list status=pending.

Smoke-tested live: Rachel-style follow-up referencing our outbound
msgid + the original recruiter msgid in References → correctly
attached to thread #87, status flipped sent→pending, 3 messages
persisted (in/out/in).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
7e540292ad kyverno: bump background-controller memory 384Mi → 2Gi (OOMKilled processing keel URs)
The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced
176 UpdateRequests for the initial bulk scan across enrolled namespaces.
At the existing 384Mi limit, kyverno-background-controller OOMKilled while
processing them — no annotations got injected on existing workloads (count
stuck at 30).

Live state already bumped via kubectl set resources; this commit makes it
durable through Terraform. Also lowered the request to 256Mi (the 384Mi
floor was tight against limit; 2Gi headroom for bulk scans, 256Mi steady
state).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
257679166b recruiter-responder: bump image_tag to 02a01c9a (Reply-To + quoted body in replies)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
7e1ecaf74c kyverno: codify aggregated ClusterRole for keel mutate-existing
The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true
to the inject-keel-annotations ClusterPolicy but Kyverno's validate
webhook rejected it: the background-controller SA needs update/patch
on apps/v1 Deployment/StatefulSet/DaemonSet.

Created live via kubectl + now in TF so the next apply is idempotent.
The ClusterRole aggregates into kyverno:background-controller via the
rbac.kyverno.io/aggregate-to-background-controller label.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
bede247e98 recruiter-responder: bump image_tag to 59df5f8a (Reply-To honoured)
Reply-To header now extracted on inbound and used for outbound replies.
Verified with a synthetic email From: noreply-careers@megacorp.example
Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and
threaded under the original (Re: subject + In-Reply-To + References).

Alembic 0003 added messages.reply_to_addr column.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
a2e8afc3ed kyverno: add mutateExistingOnPolicyUpdate=true so existing workloads get annotated
Before this, the inject-keel-annotations policy only fired on admission
events. Workloads that existed BEFORE their namespace got labeled
keel.sh/enrolled=true never received the annotation, so Keel didn't
watch them. Live state was 30 of 226 workloads auto-updating.

With mutateExistingOnPolicyUpdate=true and the required mutate.targets
block, Kyverno's BackgroundScan controller applies the mutate to
existing matching Deployments/StatefulSets/DaemonSets on policy update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
cdeb89d5f1 final wave: enroll immich + status-page, retrigger 17 pending Bucket A
* immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1
    skipped — has non-standard lifecycle from earlier work).
  * status-page: enrolled (was missing from original sweep).
  * v6 retrigger marker on 17 stacks that never reached terragrunt
    apply (#704 exit-1 halted mid-loop).

After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks.
The remaining ~22 are operator/Helm-managed and intentionally excluded
(same fight-loop risk as Calico — bump via Helm chart version, not
Keel).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
root
caba0e811f Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:55 +00:00
Viktor Barzin
4944e508aa Bucket C: enroll 5 raw-deploy stacks in Keel auto-update
* beads-server: 3 Deployments — extended V1 lifecycle blocks to V2
    + KEEL_IGNORE_IMAGE; namespace label.
  * llama-cpp: 1 Deployment — extended V1→V2; namespace label.
  * novelapp: namespace label only (Deployment has non-standard
    lifecycle without V1 dns_config — drift expected, accept for now).
  * plotting-book: namespace label only (same as novelapp).
  * trading-bot: namespace label only (same as novelapp).

immich deferred — the bulk-add script's brace-counter got confused by
a HEREDOC in the file, inserting a lifecycle block in the wrong
position. Needs manual per-Deployment editing.

The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see
their Deployments mutated by Kyverno but their TF lifecycle doesn't
yet ignore the keel annotations. Expected behavior: drift visible in
terragrunt plan, applied-state oscillates with Kyverno re-injecting.
Acceptable starting point; per-Deployment lifecycle work to fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
b57596d930 Bucket A retrigger + Bucket D enrollment (5 module-nested stacks)
After fixing the postgresql-lb MetalLB flap (deleted stuck
ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined
commit:

  * Bucket A (16 stacks): re-append CI retrigger marker so the
    previously-pending applies pick up:
      blog calico cyberchef descheduler f1-stream homepage jsoncrack
      k8s-dashboard k8s-version-upgrade kms local-path osm_routing
      real-estate-crawler travel_blog vault webhook_handler

  * Bucket D (5 module-nested stacks): keel.sh/enrolled label on
    namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module:
      postiz instagram-poster k8s-portal uptime-kuma vaultwarden

Bucket C (raw-deploy apps without V1 marker on their Deployment
lifecycles) deferred — needs per-Deployment lifecycle block additions
that the bulk script can't safely automate:
  beads-server immich llama-cpp novelapp plotting-book trading-bot

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
ec60af5fd4 kyverno: exclude calico-system from inject-keel-annotations
Stop the hourly Keel-vs-tigera-operator fight loop on calico-node
DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system
workloads with keel.sh/policy=never; TF: added calico-system to the
namespaces exclude list so any future mutate run won't re-inject.

The previous calico unenrollment (label removal from namespace)
wasn't enough — once Kyverno had stamped the policy=patch annotation
on the Deployments/DaemonSets, removing the namespace label didn't
strip the annotation, so Keel kept watching them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:54 +00:00
Viktor Barzin
b48ddc09d6 ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them) 2026-05-22 14:16:54 +00:00
Viktor Barzin
978237441e ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-22 14:16:54 +00:00
Viktor Barzin
1dd8f4e2bf openclaw: native MCP servers + daily claude-memory sync
Wire ha-mcp, context7, and the in-pod playwright sidecar as native
MCP servers on OpenClaw via `mcp set` in the container startup
(ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set
entries persist). HA URL pulled from new Vault key
secret/openclaw.ha_sofia_mcp_url and passed via the
HA_SOFIA_MCP_URL env var.

Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw
namespace: pulls all non-sensitive memories from
claude-memory.claude-memory.svc:80/api/memories, groups by category,
writes 18 Markdown files into /workspace/memory/projects/claude-
memory-sync/ (the path memory-core indexes), then triggers
`openclaw memory index --force` via kubectl exec. Reuses the
existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488
memories synced, 25/25 files indexed, search returns hits.

Also drops the legacy /app/extensions entry from
plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env,
and one-shot deletes the stale 2026-02-28 metaclaw-export.json from
the openclaw home volume.

claude_memory MCP intentionally NOT wired — its /mcp/mcp transport
404s on the deployed claude-memory-mcp:17 image (tracked as
code-z1so). Shared knowledge is delivered via the CronJob's REST
sync instead. Adding claude_memory to mcp.servers is a one-line
follow-up once that's fixed.
2026-05-22 14:16:53 +00:00
Viktor Barzin
0c73974362 ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-22 14:16:53 +00:00
root
87f7b25a13 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:53 +00:00
126cfb7022 wealth: dav_corrected view fixes pension gains-offset miscategorisation
The broker-sync Fidelity provider emits 'unrealised-gains-offset'
DEPOSIT activities to reconcile Wealthfolio's total with the
PlanViewer reported pot, because Wealthfolio doesn't track pension
fund units directly. Wealthfolio's data model treats that DEPOSIT as
a cash contribution, which double-inflates net_contribution and
zeroes out the implied growth.

Add a Postgres view 'dav_corrected' in wealthfolio_sync that
subtracts the cumulative gains-offset from net_contribution per
account per date (re-exporting as 'net_contribution' so it's a
drop-in replacement). All 17 wealth dashboard panels that compute
contribution/growth/ROI now read from the view. Total impact:
portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly
the £35,721.20 Fidelity offset that was previously miscategorised).
2026-05-22 14:16:52 +00:00
Viktor Barzin
6769526e1e ci: retrigger apply for pending Keel enrollment (~58 stacks)
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.

Idempotent — re-applying a stack whose state already matches is a no-op.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:52 +00:00
Viktor Barzin
e5f6d16b2e enrolled-patch stacks: ignore image drift from Keel auto-update
For Deployments enrolled in Keel with policy=patch, the image tag is
updated by Keel as new patches release upstream. Without
ignore_changes on the image field, terragrunt apply would fight Keel
in an endless loop (TF reverts → Keel re-rolls → repeat — same shape
as the calico/tigera-operator fight from earlier).

Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks.
Image string in TF becomes the initial seed; Keel rolls it forward.

Stacks: actualbudget, broker-sync, changedetection, city-guesser,
coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo,
excalidraw, foolery, forgejo, freedify.

CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest,
recruiter-responder, claude-agent-service, claude-memory) keep TF
ownership of image and policy=never — their image_tag is set by CI
via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes
on those would break the CI deploy flow.

Caveat: only container[0].image is added. Multi-container Deployments
(immich, beads, etc.) will need additional container[N].image lines
for any container Keel rolls. Those stacks are not currently enrolled.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:51 +00:00
Viktor Barzin
8519e37975 calico: unenroll from Keel — tigera-operator owns DaemonSet spec
Keel kept rewriting calico-node + calico-kube-controllers images to
v3.26.5 (proper patch update); tigera-operator immediately reverted
to v3.26.1 because the Installation CR is the source of truth.
Endless churn but no data loss — Calico stayed healthy throughout.

Removing keel.sh/enrolled label and live label from calico-system ns.
Calico upgrades go through the tigera-operator's Installation CR
manually, not Keel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
Viktor Barzin
8f18621dd5 keel: default policy → patch (semver-bounded opt-out auto-update)
Move from `never` (no auto-update) to `patch` for the cluster-wide
default. Keel only auto-updates PATCH versions within the current
major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked.
Tag-rewrites that broke calico (v3.26.1 → :master) and affine
(0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch.

Caveats:
  * Patch causes Terraform image drift for semver-pinned services —
    drift-detection pipeline will surface it; lifecycle ignore_changes
    on container[].image can be added per stack later if drift is
    noisy.
  * Tags that aren't parseable as semver (:latest, :11, :nightly,
    SHA tags) are ignored by patch — those workloads stay on their
    current image until promoted to `force` policy individually.

Self-hosted CI-driven services + chrome-service kept on `never`
(deliberate pins / CI controls the tag):
  recruiter-responder, claude-agent-service, claude-memory,
  chrome-service, fire-planner, job-hunter, payslip-ingest

Live state already updated via kubectl apply + per-workload patches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
662695908a recruiter-triage: AI culture & tooling section + warm-engage AI ask
- claude-agent-service bumped to 191ed5dd (new AI section in agent
  template — leadership stance, approved tools, usage limits / quotas,
  code-gen safety, product-side AI depth, follow-up questions for the
  recruiter when the web is sparse).
- recruiter-responder bumped to ab59eeab (deep_research prompt asks
  for AI culture; warm_engage template adds a written-only ask for
  IDE assistants, chat tools, per-seat limits, source-to-external
  model policy).

Smoke-tested 2026-05-16: forced fresh research on Datadog, agent
returned full structured AI section with 7 explicit recruiter
questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
Viktor Barzin
0a6b2489f7 keel: default policy → never (post-incident safe default)
2026-05-16 incident: Keel's `force` policy switched semver-pinned
images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master)
instead of digest-tracking. Force is documented as "always update
to the newest tag in the registry" — only safe on already-mutable
tags like :latest.

Changing the cluster-wide default in inject-keel-annotations to
`never`. The namespace enrollment label + V2 lifecycle suppression
stay in place so opt-in is one annotation per Deployment, but no
service auto-updates until explicitly approved.

To opt in a workload now:
  1. Verify the Deployment image is on a mutable tag (:latest,
     :<major>, or a vendor "stable" tag) — change in Terraform first
     if needed.
  2. Add to the Deployment's metadata.annotations:
       "keel.sh/policy" = "force"   (digest tracking)
       OR
       "keel.sh/policy" = "patch"   (semver patch bumps — also
       requires ignore_changes on the image)

Live policy already updated via kubectl apply + per-workload
override (force → never).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
Viktor Barzin
9765f6b9a4 keel: enable Slack notifications on every upgrade
Wire Keel's Slack notifier to the existing bot token in Vault
(secret/viktor -> slack_bot_token). Posts to #general by default;
override via slack.channel in the Helm values if you want a dedicated
channel like #keel-notifications.

Notification level is "info" so we get every rollout event, not just
errors. Approval flow is OFF — opt-out-pure means all updates apply
unattended. If we later introduce approvals, add slack.approvalsChannel.

Resolves user request: 'keel should send notifications to slack everytime
it upgrades an app'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:50 +00:00
3027ab85a8 recruiter-responder: bump image_tag to 189ef901
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:49 +00:00
Viktor Barzin
be3b94da85 keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist)
The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5,
1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed
silently. Bump to 1.2.0 (app version 0.21.1, latest stable).
2026-05-22 14:16:48 +00:00
Viktor Barzin
411524a10d kured: drop Mon-Fri restriction, reboot any day
The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.

Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
2e52583abd Phase 1a: enroll 4 self-hosted services in Keel auto-update
Enrolls the cleanest Woodpecker-build-only self-hosted services into
the inject-keel-annotations ClusterPolicy by labeling their namespaces
keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on
each, so Keel will detect the current upstream digest and trigger a
rolling restart when polling starts (1h cadence).

Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress
the annotation drift Kyverno will inject (keel.sh/policy, /trigger,
/pollSchedule).

Services included:
  - fire-planner
  - job-hunter
  - payslip-ingest
  - recruiter-responder

Skipped from Phase 1 for follow-up:
  - claude-agent-service (user has WIP on main.tf)
  - claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs)
  - kms (two Deployments; needs per-resource review)
  - wealthfolio (sync sidecar pattern; needs review)
  - chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label)
  - GHA-migrated repos (10) (need per-repo CI cleanup)
  - beadboard, freedify (no CI)

See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
5acfab5bb9 recruiter-responder: bump image_tag to f3cb91ff (180d research_cache TTL)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
e5a65c11a9 recruiter-triage v3: Perks & Office Life section + cache-first deep_research
- claude-agent-service bumped to f764fef6 (agent system prompt adds
  the Perks block: food/health/pension/equity/PTO/parental/equipment/
  learning/wellness/amenities/commuter). 1200-word cap.
- recruiter-responder bumped to 38a2cdaa (cache-first deep_research:
  serves cached payload if fetched_at + ttl_seconds > now; cache
  writes upsert; new force flag bypasses).

Verified end-to-end: deep_research on Datadog now returns full Perks
section (~220s, $0.60, 23 turns). Earlier 500 fixed (was
uq_research_company_tier dup-key on re-run).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
020f62555b Phase 0: install Keel + Kyverno auto-update annotation injector
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
  Polls registries hourly per design decision #8. Default schedule
  overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
  StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
  with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
  self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
  runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
  too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
  ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
3ef860b2be kured + cnpg: drain-safe defaults ahead of Monday reboot wave
Three defensive moves to make the kured rolling-reboot cycle survive
edge cases without operator intervention:

kured (stacks/kured/main.tf):
  - Set `configuration.drainTimeout = "30m"`. Default is unlimited; if
    a future PDB or finalizer stalls drain, kured retries forever and
    the node stays cordoned silently. 30m caps the silent-failure
    window — after timeout kured logs the abort and waits for the
    next period; the node stays Schedulable so cluster capacity isn't
    lost. Lets us fail closed instead of fail-silent.

CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf):
  - Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the
    failover during a primary-node drain depended on the lone replica
    being caught up; a WAL backlog would stall the drain until the
    replica was current. With 3 instances CNPG always has at least one
    fully-current replica to promote, and the PDB's
    `minAvailable=1` on the primary selector is satisfied throughout
    the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about
    35Gi after autoresize). Memory: +3Gi pod limit.
  - Updated the `triggers.instances` so the null_resource's local-exec
    actually re-applies the YAML (kubectl apply with the new spec). The
    YAML is the source-of-truth but the trigger is what tells terraform
    to re-run the provisioner.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
5768216d0e anubis: HA with shared valkey/redis store + replicas=2
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.

Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:

- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
  When set, appends a `store: { backend: valkey, parameters: { url } }`
  block to the rendered policy YAML and defaults replicas to 2 (capped
  at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
  drains can take down one pod at a time. topologySpreadConstraint
  tightened to `DoNotSchedule` so the two replicas land on different
  nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
  travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
  unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
  Redis already runs HA via Sentinel + haproxy, no new infra needed.

Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.

Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
3025879478 claude-agent-service: ship recruiter-triage agent + restore missing terragrunt.hcl
- main.tf: bump image_tag to 1b3350c0 (carries the new agent),
  init container  also copies recruiter-triage.md
  into /home/agent/.claude/agents/.
- terragrunt.hcl: restored (file was missing — apply was blocked).
  Standard root include + platform/vault/external-secrets dependencies.

Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42)
via recruiter-responder REST API → 102.5s, $0.43, structured
markdown report with comp bands vs £600k floor, culture signals,
remote policy, recent news, sources cited. End-to-end Tier-2 is live.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
ce5f3ec209 recruiter-responder: expose Gmail IMAP creds for backtest CLI
Pulls vbarzin@gmail.com app password from secret/recruiter-responder
(seeded from secret/wealthfolio.imap_password — same Gmail credential
that wealthfolio uses for broker-statement ingestion). Env vars
GMAIL_IMAP_USER + GMAIL_IMAP_PASS, consumed by 'backtest gmail'.

Backtest verified 2026-05-16 against folder
'companies-I-dont-take-seriously': 20/20 recruiter, 100% company
extraction (9 stated, 6 subject, 4 sender_domain, 1 body), 30% comp,
avg 12s latency.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
065982d978 kured: fix sentinel path mismatch that stalled rolling reboots
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. Previously
rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at
`/sentinel/` (an empty auto-created directory on every host) while the
kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required.

Two different host directories → kured never saw the open gate, even
though the gate's checks were all green every 5 min on every node.
Result: unattended-upgrades has packages waiting on every node since
2026-05-10 (when uu was re-enabled) and kured's hourly log says
"Reboot not required" for the entire period.

Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts
hostPath /var/run — same directory the gate writes to. The in-pod
mountPath (/sentinel) is hardcoded by the chart and doesn't matter,
the symlink chain works out: /sentinel/<file> inside the pod resolves
to /var/run/<file> on the host.

Verified: kured pod can now list /sentinel/gated-reboot-required
(0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15).
First gated reboot will land Mon 2026-05-18 02:00 London.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
80e6314bf0 recruiter-responder: bump image_tag to 559e5c57
PDF extraction, tech_stack list, aggressive company/comp inference,
no-phone-call drafts, backtest CLI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
8e11caff8d recruiter-responder: bump image_tag to bbd178da (structured Telegram + comp floor)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
24ce3e267d aiostreams: weekly backup of Stremio account addon collection
Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from
the AIOStreams config-backup at 03:00):

- Logs into api.strem.io with credentials from Vault
  (secret/viktor.stremio_email + stremio_password, now also synced
  into the aiostreams-probe-secrets ExternalSecret)
- Fetches the full addonCollection via addonCollectionGet
- Writes timestamped JSON to the existing aiostreams-backup PVC
  (NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600)
- 90-day retention, logs out to invalidate the auth key
- Pushgateway metrics: stremio_account_backup_{success,bytes,
  addon_count,duration_seconds,last_run_timestamp}

Protects against: accidental "uninstall all" / API regression / wrong
account login wiping the curated set of 22 addons (Cinemeta + 16
MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local).

Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.
2026-05-22 14:16:47 +00:00
aa6e9b0242 recruiter-responder: public /cb ingress for Telegram URL-button callbacks
- Add ingress_factory module (auth=none, HMAC + expiry are the gate);
  ingress_path=["/cb"] only — /api stays internal, /healthz cluster.
  dns_type=proxied. anti_ai_scraping=false.
- Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret`
  auto-clones the wildcard cert into every namespace.
- Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS
  hostname relax).
- Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me.
- Drop git-crypt-encrypted wildcard cert files into
  stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new
  .gitleaksignore — git-crypt encrypts at rest but the working-tree
  copy is plaintext, so gitleaks can't tell.

Smoke-tested end-to-end 2026-05-15 23:45:
  synthetic email -> Telegram with / buttons ->  tapped via curl
  -> 'Sent' HTML page -> thread.status=sent, decision row recorded
  with decided_via=telegram_button, outbound message threaded correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
77010b769a aiostreams: whitelist Vidhin + Tamtaro sync URLs
Adds two env vars on the AIOStreams deployment:
- WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex
  (TRaSH-aligned) so syncedRankedRegexUrls works for the user
- WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions +
  Tamtaro's ISE/PSE/ESE-standard

Gotcha: AIOStreams validates each synced* field against the matching
whitelist — stream-expression files (incl. Vidhin's expressions.json)
go in WHITELISTED_SEL_URLS, not the regex one, even though they live
in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG.

User config: enabled Vidhin's regex + ranked expressions + Tamtaro's
ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering;
can be added later from the same whitelist.
2026-05-22 14:16:47 +00:00
Viktor Barzin
c396092c86 aiostreams: weekly NFS backup of decrypted user config
Adds aiostreams-config-backup CronJob (Sun 03:00 weekly):
- Pulls /api/v1/user via internal ClusterIP with UUID + password from
  the existing aiostreams-probe-secrets ExternalSecret
- Writes timestamped JSON to nfs-backup PVC mounted at /backup
- 90-day retention, prunes older files
- Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp}

NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite
to Synology via the existing offsite-sync-backup CronJob).

Complements the daily postgresql-backup-per-db pipeline (which dumps
the encrypted blob) by storing the decrypted JSON — usable for human
inspection / disaster recovery even without the AIOStreams password.

Verified: manual job wrote 12931 bytes, file present on NFS.
2026-05-22 14:16:47 +00:00