WireGuard pod sat in CrashLoopBackOff for hours with wg-quick's PostUp failing:
iptables v1.8.4 (legacy): can't initialize iptables table `nat':
Table does not exist (do you need to insmod?)
sclevine/wg's default `iptables` symlink points to iptables-legacy, which
talks to the kernel's legacy x_tables interface. K8s nodes now initialize
their nat table via nftables (calico-node sets it up), so iptables-legacy
in the container finds no nat table and bails. Reproduced by debugging a
copy of the live pod (kubectl debug --copy-to with the same mounts as the
real pod); wg-quick output matched verbatim.
Fix: postStart now calls update-alternatives to point iptables and
ip6tables at iptables-nft/ip6tables-nft (already present in the image)
before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes
to the nftables-backed nat table calico already populated. Verified:
new pod went 2/2 Running with 0 restarts after apply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User reported "site loads but failed to connect on the tmux session". Root
cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing
the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only
the IngressRoute CRDs registered. About 1/3 of /token preflight requests
landed on that replica and got 404 with router="-", and WS upgrades
intermittently failed the same way, so the lobby iframe stayed stuck on
"Failed to connect. Retrying...". `kubectl delete pod` on the bad replica
restored the missing router and unblocked the user.
This commit adds the long-term mitigation:
stacks/terminal/main.tf
- kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits
/token + /ws via Cloudflare and the in-cluster ttyd Service, pushes
4 gauges to Pushgateway (token_status, ws_status, ttyd_status,
last_success_timestamp). Verified the probe end-to-end:
token=302 ws=302 ttyd=200 ok=1
stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
- Webterminal group: WebterminalTokenDegraded (warning, 10m),
WebterminalWebsocketDegraded (critical, 10m),
WebterminalTtydUnreachable (critical, 10m),
WebterminalProbeStale (warning, 15m).
- Traefik Router Parity group: TraefikRouterCountSkew fires when any
Traefik replica's router count diverges from siblings for >10m —
catches the same class of issue cluster-wide, not just for terminal.
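For reference, the comparison behind TraefikRouterCountSkew can be
sketched in Python as below. This is an illustration only, not the alert
expression in prometheus_chart_values.tpl: the replica names traefik-a and
traefik-b and all router counts are made up, and treating the most common
count as the expected value is an assumption.

```python
from collections import Counter


def skewed_replicas(router_counts: dict[str, int]) -> list[str]:
    """Return replicas whose router count differs from the most common count."""
    if len(router_counts) < 2:
        return []  # nothing to compare against
    # Assumption: the count shared by most replicas is the correct one.
    expected, _ = Counter(router_counts.values()).most_common(1)[0]
    return sorted(pod for pod, n in router_counts.items() if n != expected)


# Hypothetical counts: one replica is missing its ingress-derived routers.
counts = {"traefik-a": 120, "traefik-b": 120, "traefik-db7696fbf-ktjjz": 87}
print(skewed_replicas(counts))
```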
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.
- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
the broken chart (defense in depth; nfs-csi namespace is already in the
Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
captures the root cause + recovery procedure (kubelet restart via
nsenter is the escalation path when crictl rmp -f fails)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729),
the simple accumulate-gains approach Viktor signed off on:
each monthly scrape captures (current_pot, real_contribs), and we emit
a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape.
dav_corrected handles the dashboard math.
Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via
'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.
Without the anchor, each policy update fires mutateExistingOnPolicyUpdate,
which OVERWRITES existing keel.sh/policy annotations back to 'force'. That
broke the phased rollout — bulk-setting workloads to 'never' didn't stick
because the next policy update reset them.
With +() anchors, the mutate only adds the annotation if missing. New
workloads (in enrolled namespaces) get force+match-tag; existing workloads
with explicit policy=never (out-of-band, for phased rollout) stay never.
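The difference between the two mutate behaviours can be modelled with
plain dicts. This is a sketch of the semantics only, not Kyverno code;
the annotation key keel.sh/match-tag and its value are assumptions about
how match-tag is spelled on the workload.

```python
def mutate_overwrite(annotations: dict[str, str], defaults: dict[str, str]) -> dict[str, str]:
    """Plain mutate: policy values replace whatever is already set."""
    return {**annotations, **defaults}


def mutate_add_if_absent(annotations: dict[str, str], defaults: dict[str, str]) -> dict[str, str]:
    """+() anchor behaviour: a key is only added when it is missing."""
    return {**defaults, **annotations}


# Assumed annotation set injected by the policy.
defaults = {"keel.sh/policy": "force", "keel.sh/match-tag": "true"}
# Workload pinned out-of-band for the phased rollout.
pinned = {"keel.sh/policy": "never"}

print(mutate_overwrite(pinned, defaults)["keel.sh/policy"])
print(mutate_add_if_absent(pinned, defaults)["keel.sh/policy"])
```

The first call models the pre-anchor behaviour that reset `never` back to
`force`; the second leaves the explicit `never` in place.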
Phase 1 rollout state (2026-05-17):
- 10 workloads on force+match-tag in 10 namespaces (Phase 1)
enrolled via keel.sh/enrolled=true namespace label:
linkwarden, excalidraw, diun, echo, foolery, city-guesser,
jsoncrack, privatebin, ntfy, speedtest
- 216 workloads on policy=never (out-of-band kubectl annotate)
- 31 critical namespaces excluded at policy level
Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true`
and clearing the `never` annotation off their workloads.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36.
The force+match-tag pairing should have constrained Keel to digest-only on
the current tag (not switch to a new tag), but a race between Kyverno's
mutate (injecting match-tag) and Keel's hourly poll caused the workload to
still have the old `force`-only annotation when Keel acted. Result: tag
rewrite, pods cycled, pgbouncer connection failures, login broken.
Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back
to 2026.2.2. Auth restored within ~5 min.
Going forward, critical-namespace workloads are excluded at the policy level
so this race can't recur. They get upgraded via TF (Helm chart version bumps)
on a deliberate cadence, never by Keel.
Live state: 36 workloads on policy=never (35 critical + chrome-service pin
+ 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for
opt-out-pure auto-update on the remaining stateless apps.
This matches user direction (2026-05-17): "upgrading is fine as long as we
upgrade correctly and the latest version is healthy" + "keel responsible
for the latest version, phased rollout, graceful".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User: 'i'm happy with occasional breakages. we have alerts.'
Policy=major auto-updates workloads to the latest semver tag in the
registry, including major/minor/patch bumps. Still semver-parser-bounded
so dev/nightly/master branches are filtered out (avoids the 2026-05-16
force-trap on affine/calico).
Live: 217 patch-annotated workloads re-annotated to major. Next Keel
poll (~1h) will pick up any pending major/minor releases.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replies from recruiters to our sent decline / engage / ignored threads
are now attached to the existing thread, surface with a 🔁 follow-up
marker in Telegram ("you previously sent"), and re-open thread status
to pending so they show up in recruiter_list status=pending.
Smoke-tested live: Rachel-style follow-up referencing our outbound
msgid + the original recruiter msgid in References → correctly
attached to thread #87, status flipped sent→pending, 3 messages
persisted (in/out/in).
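A minimal sketch of the thread lookup, assuming threads are matched by
message ID from the In-Reply-To and References headers. The function
name, the addresses and the message IDs are hypothetical; the real
matching in recruiter-responder may differ.

```python
from email import message_from_string
from typing import Optional


def find_thread(raw_email: str, msgid_to_thread: dict[str, int]) -> Optional[int]:
    """Return the thread id of the first known message ID the email refers to."""
    msg = message_from_string(raw_email)
    # Both headers carry whitespace-separated <message-id> tokens.
    referenced = (msg.get("In-Reply-To", "") + " " + msg.get("References", "")).split()
    for msgid in referenced:
        if msgid in msgid_to_thread:
            return msgid_to_thread[msgid]
    return None


raw = (
    "From: rachel@agency.example\n"
    "Subject: Re: Senior role\n"
    "In-Reply-To: <out-1@viktorbarzin.me>\n"
    "References: <in-1@agency.example> <out-1@viktorbarzin.me>\n"
    "\n"
    "Following up on my earlier note.\n"
)
known = {"<in-1@agency.example>": 87, "<out-1@viktorbarzin.me>": 87}
print(find_thread(raw, known))
```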
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced
176 UpdateRequests for the initial bulk scan across enrolled namespaces.
At the existing 384Mi limit, kyverno-background-controller OOMKilled while
processing them — no annotations got injected on existing workloads (count
stuck at 30).
Live state was already bumped via kubectl set resources; this commit makes
it durable through Terraform. It also lowers the memory request to 256Mi:
the 384Mi floor was tight against the limit, and the new values give 2Gi
of headroom for bulk scans over a 256Mi steady state.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true
to the inject-keel-annotations ClusterPolicy but Kyverno's validate
webhook rejected it: the background-controller SA needs update/patch
on apps/v1 Deployment/StatefulSet/DaemonSet.
Created live via kubectl + now in TF so the next apply is idempotent.
The ClusterRole aggregates into kyverno:background-controller via the
rbac.kyverno.io/aggregate-to-background-controller label.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reply-To header now extracted on inbound and used for outbound replies.
Verified with a synthetic email From: noreply-careers@megacorp.example
Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and
threaded under the original (Re: subject + In-Reply-To + References).
Alembic 0003 added messages.reply_to_addr column.
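The address selection can be sketched with the standard library as
follows. The function name is hypothetical and the fallback to From when
Reply-To is absent is an assumption about the intended behaviour.

```python
from email import message_from_string
from email.utils import parseaddr


def reply_address(raw_email: str) -> str:
    """Pick the address a reply should go to: Reply-To first, else From."""
    msg = message_from_string(raw_email)
    header = msg.get("Reply-To") or msg.get("From", "")
    # parseaddr returns (display name, address); only the address is needed.
    return parseaddr(header)[1]


synthetic = (
    "From: noreply-careers@megacorp.example\n"
    "Reply-To: spam@viktorbarzin.me\n"
    "Subject: Role\n"
    "\n"
    "Hi\n"
)
print(reply_address(synthetic))
```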
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Before this, the inject-keel-annotations policy only fired on admission
events. Workloads that existed BEFORE their namespace got labeled
keel.sh/enrolled=true never received the annotation, so Keel didn't
watch them. Live state was 30 of 226 workloads auto-updating.
With mutateExistingOnPolicyUpdate=true and the required mutate.targets
block, Kyverno's BackgroundScan controller applies the mutate to
existing matching Deployments/StatefulSets/DaemonSets on policy update.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1
skipped — has non-standard lifecycle from earlier work).
* status-page: enrolled (was missing from original sweep).
* v6 retrigger marker on 17 stacks that never reached terragrunt
apply (#704 exit-1 halted mid-loop).
After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks.
The remaining ~22 are operator/Helm-managed and intentionally excluded
(same fight-loop risk as Calico — bump via Helm chart version, not
Keel).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* beads-server: 3 Deployments — extended V1 lifecycle blocks to V2
+ KEEL_IGNORE_IMAGE; namespace label.
* llama-cpp: 1 Deployment — extended V1→V2; namespace label.
* novelapp: namespace label only (Deployment has non-standard
lifecycle without V1 dns_config — drift expected, accept for now).
* plotting-book: namespace label only (same as novelapp).
* trading-bot: namespace label only (same as novelapp).
immich deferred — the bulk-add script's brace-counter got confused by
a HEREDOC in the file, inserting a lifecycle block in the wrong
position. Needs manual per-Deployment editing.
The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see
their Deployments mutated by Kyverno but their TF lifecycle doesn't
yet ignore the keel annotations. Expected behavior: drift visible in
terragrunt plan, applied-state oscillates with Kyverno re-injecting.
Acceptable starting point; per-Deployment lifecycle work is needed to fix it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stop the hourly Keel-vs-tigera-operator fight loop on calico-node
DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system
workloads with keel.sh/policy=never; TF: added calico-system to the
namespaces exclude list so any future mutate run won't re-inject.
The previous calico unenrollment (label removal from namespace)
wasn't enough — once Kyverno had stamped the policy=patch annotation
on the Deployments/DaemonSets, removing the namespace label didn't
strip the annotation, so Keel kept watching them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire ha-mcp, context7, and the in-pod playwright sidecar as native
MCP servers on OpenClaw via `mcp set` in the container startup
(ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set
entries persist). HA URL pulled from new Vault key
secret/openclaw.ha_sofia_mcp_url and passed via the
HA_SOFIA_MCP_URL env var.
Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw
namespace: pulls all non-sensitive memories from
claude-memory.claude-memory.svc:80/api/memories, groups by category,
writes 18 Markdown files into
/workspace/memory/projects/claude-memory-sync/ (the path memory-core
indexes), then triggers
`openclaw memory index --force` via kubectl exec. Reuses the
existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488
memories synced, 25/25 files indexed, search returns hits.
Also drops the legacy /app/extensions entry from
plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env,
and one-shot deletes the stale 2026-02-28 metaclaw-export.json from
the openclaw home volume.
claude_memory MCP intentionally NOT wired — its /mcp/mcp transport
404s on the deployed claude-memory-mcp:17 image (tracked as
code-z1so). Shared knowledge is delivered via the CronJob's REST
sync instead. Adding claude_memory to mcp.servers is a one-line
follow-up once that's fixed.
The broker-sync Fidelity provider emits 'unrealised-gains-offset'
DEPOSIT activities to reconcile Wealthfolio's total with the
PlanViewer reported pot, because Wealthfolio doesn't track pension
fund units directly. Wealthfolio's data model treats that DEPOSIT as
a cash contribution, which double-inflates net_contribution and
zeroes out the implied growth.
Add a Postgres view 'dav_corrected' in wealthfolio_sync that
subtracts the cumulative gains-offset from net_contribution per
account per date (re-exporting as 'net_contribution' so it's a
drop-in replacement). All 17 wealth dashboard panels that compute
contribution/growth/ROI now read from the view. Total impact:
portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly
the £35,721.20 Fidelity offset that was previously miscategorised).
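The correction the view applies can be sketched in Python. This is not
the SQL of dav_corrected: the row layout and the sample figures are
hypothetical, and only the final arithmetic check uses numbers from this
message.

```python
from decimal import Decimal
from itertools import accumulate


def corrected_contributions(rows):
    """rows: (date, net_contribution, gains_offset_booked_that_date), sorted by date.

    Subtracts the running total of gains-offset from net_contribution.
    """
    cumulative = accumulate(offset for _, _, offset in rows)
    return [(day, net - cum) for (day, net, _), cum in zip(rows, cumulative)]


# Hypothetical account history.
rows = [
    ("2026-03-01", Decimal("1000"), Decimal("0")),
    ("2026-04-01", Decimal("1150"), Decimal("100")),
    ("2026-05-01", Decimal("1400"), Decimal("50")),
]
print(corrected_contributions(rows))

# Figures from this commit: removing the offset from contributions
# raises growth by the same amount.
growth_before = Decimal("301753.19")
fidelity_offset = Decimal("35721.20")
print(growth_before + fidelity_offset)
```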
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.
Idempotent — re-applying a stack whose state already matches is a no-op.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
For Deployments enrolled in Keel with policy=patch, the image tag is
updated by Keel as new patches release upstream. Without
ignore_changes on the image field, terragrunt apply would fight Keel
in an endless loop (TF reverts → Keel re-rolls → repeat — same shape
as the calico/tigera-operator fight from earlier).
Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks.
Image string in TF becomes the initial seed; Keel rolls it forward.
Stacks: actualbudget, broker-sync, changedetection, city-guesser,
coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo,
excalidraw, foolery, forgejo, freedify.
CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest,
recruiter-responder, claude-agent-service, claude-memory) keep TF
ownership of image and policy=never — their image_tag is set by CI
via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes
on those would break the CI deploy flow.
Caveat: only container[0].image is added. Multi-container Deployments
(immich, beads, etc.) will need additional container[N].image lines
for any container Keel rolls. Those stacks are not currently enrolled.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel kept rewriting calico-node + calico-kube-controllers images to
v3.26.5 (proper patch update); tigera-operator immediately reverted
to v3.26.1 because the Installation CR is the source of truth.
Endless churn but no data loss — Calico stayed healthy throughout.
Removing keel.sh/enrolled label and live label from calico-system ns.
Calico upgrades go through the tigera-operator's Installation CR
manually, not Keel.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move from `never` (no auto-update) to `patch` for the cluster-wide
default. Keel only auto-updates PATCH versions within the current
major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked.
Tag-rewrites that broke calico (v3.26.1 → :master) and affine
(0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch.
Caveats:
* Patch causes Terraform image drift for semver-pinned services —
drift-detection pipeline will surface it; lifecycle ignore_changes
on container[].image can be added per stack later if drift is
noisy.
* Tags that aren't parseable as semver (:latest, :11, :nightly,
SHA tags) are ignored by patch — those workloads stay on their
current image until promoted to `force` policy individually.
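To make the patch rule concrete, here is a simplified model of the tag
selection. It is a sketch, not Keel's implementation: the regex accepts
only plain MAJOR.MINOR.PATCH tags with an optional leading v, which is an
assumption that is stricter than a full semver parser.

```python
import re

SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(tag):
    """Return (major, minor, patch) or None if the tag is not plain semver."""
    m = SEMVER.match(tag)
    return tuple(int(p) for p in m.groups()) if m else None


def patch_candidate(current, available):
    """Pick the highest tag with the same major.minor and a higher patch."""
    cur = parse(current)
    if cur is None:
        return None  # non-semver current tag: patch policy does nothing
    newer = []
    for tag in available:
        ver = parse(tag)
        if ver is not None and ver[:2] == cur[:2] and ver > cur:
            newer.append((ver, tag))
    return max(newer)[1] if newer else None


print(patch_candidate("0.26.6", ["0.26.7", "0.27.0", "nightly-latest"]))
print(patch_candidate("v3.26.1", ["master", "v3.26.5", "v3.27.0"]))
print(patch_candidate("latest", ["0.1.0"]))
```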
Self-hosted CI-driven services + chrome-service kept on `never`
(deliberate pins / CI controls the tag):
recruiter-responder, claude-agent-service, claude-memory,
chrome-service, fire-planner, job-hunter, payslip-ingest
Live state already updated via kubectl apply + per-workload patches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- claude-agent-service bumped to 191ed5dd (new AI section in agent
template — leadership stance, approved tools, usage limits / quotas,
code-gen safety, product-side AI depth, follow-up questions for the
recruiter when the web is sparse).
- recruiter-responder bumped to ab59eeab (deep_research prompt asks
for AI culture; warm_engage template adds a written-only ask for
IDE assistants, chat tools, per-seat limits, source-to-external
model policy).
Smoke-tested 2026-05-16: forced fresh research on Datadog, agent
returned full structured AI section with 7 explicit recruiter
questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 incident: Keel's `force` policy switched semver-pinned
images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master)
instead of digest-tracking. Force is documented as "always update
to the newest tag in the registry" — only safe on already-mutable
tags like :latest.
Changing the cluster-wide default in inject-keel-annotations to
`never`. The namespace enrollment label + V2 lifecycle suppression
stay in place so opt-in is one annotation per Deployment, but no
service auto-updates until explicitly approved.
To opt in a workload now:
1. Verify the Deployment image is on a mutable tag (:latest,
:<major>, or a vendor "stable" tag) — change in Terraform first
if needed.
2. Add to the Deployment's metadata.annotations:
"keel.sh/policy" = "force" (digest tracking)
OR
"keel.sh/policy" = "patch" (semver patch bumps — also
requires ignore_changes on the image)
Live policy already updated via kubectl apply + per-workload
override (force → never).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire Keel's Slack notifier to the existing bot token in Vault
(secret/viktor -> slack_bot_token). Posts to #general by default;
override via slack.channel in the Helm values if you want a dedicated
channel like #keel-notifications.
Notification level is "info" so we get every rollout event, not just
errors. Approval flow is OFF — opt-out-pure means all updates apply
unattended. If we later introduce approvals, add slack.approvalsChannel.
Resolves user request: 'keel should send notifications to slack everytime
it upgrades an app'.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5,
1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed
silently. Bump to 1.2.0 (app version 0.21.1, latest stable).
The weekday-only schedule was a 2026-03-16-incident-era guardrail from when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.
Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enrolls the cleanest Woodpecker-build-only self-hosted services into
the inject-keel-annotations ClusterPolicy by labeling their namespaces
keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on
each, so Keel will detect the current upstream digest and trigger a
rolling restart when polling starts (1h cadence).
Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress
the annotation drift Kyverno will inject (keel.sh/policy, /trigger,
/pollSchedule).
Services included:
- fire-planner
- job-hunter
- payslip-ingest
- recruiter-responder
Skipped from Phase 1 for follow-up:
- claude-agent-service (user has WIP on main.tf)
- claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs)
- kms (two Deployments; needs per-resource review)
- wealthfolio (sync sidecar pattern; needs review)
- chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label)
- GHA-migrated repos (10) (need per-repo CI cleanup)
- beadboard, freedify (no CI)
See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- claude-agent-service bumped to f764fef6 (agent system prompt adds
the Perks block: food/health/pension/equity/PTO/parental/equipment/
learning/wellness/amenities/commuter). 1200-word cap.
- recruiter-responder bumped to 38a2cdaa (cache-first deep_research:
serves cached payload if fetched_at + ttl_seconds > now; cache
writes upsert; new force flag bypasses).
Verified end-to-end: deep_research on Datadog now returns full Perks
section (~220s, $0.60, 23 turns). Earlier 500 fixed (was
uq_research_company_tier dup-key on re-run).
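The cache-first flow can be sketched as below. The dict stands in for the
research table, and the function names, the 86400s default TTL and the
fake fetcher are all hypothetical; only the freshness rule
(fetched_at + ttl_seconds > now), the upsert and the force flag come from
this change.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(fetched_at, ttl_seconds, now):
    """A cached payload is servable while fetched_at + ttl_seconds > now."""
    return fetched_at + timedelta(seconds=ttl_seconds) > now


def deep_research(company, cache, fetch, now, ttl_seconds=86400, force=False):
    entry = cache.get(company)
    if entry is not None and not force and is_fresh(entry["fetched_at"], entry["ttl_seconds"], now):
        return entry["payload"]
    payload = fetch(company)
    # Upsert: replace the existing entry instead of inserting a duplicate.
    cache[company] = {"payload": payload, "fetched_at": now, "ttl_seconds": ttl_seconds}
    return payload


calls = []


def fake_fetch(company):
    calls.append(company)
    return f"report for {company} #{len(calls)}"


cache = {}
t0 = datetime(2026, 5, 16, 12, 0, tzinfo=timezone.utc)
first = deep_research("Datadog", cache, fake_fetch, now=t0)
cached = deep_research("Datadog", cache, fake_fetch, now=t0 + timedelta(hours=1))
forced = deep_research("Datadog", cache, fake_fetch, now=t0 + timedelta(hours=2), force=True)
```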
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate: a supervisor updating itself
has too severe a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three defensive moves to make the kured rolling-reboot cycle survive
edge cases without operator intervention:
kured (stacks/kured/main.tf):
- Set `configuration.drainTimeout = "30m"`. Default is unlimited; if
a future PDB or finalizer stalls drain, kured retries forever and
the node stays cordoned silently. 30m caps the silent-failure
window — after timeout kured logs the abort and waits for the
next period; the node stays Schedulable so cluster capacity isn't
lost. Lets us fail closed instead of fail-silent.
CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf):
- Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the
failover during a primary-node drain depended on the lone replica
being caught up; a WAL backlog would stall the drain until the
replica was current. With 3 instances CNPG always has at least one
fully-current replica to promote, and the PDB's
`minAvailable=1` on the primary selector is satisfied throughout
the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about
35Gi after autoresize). Memory: +3Gi pod limit.
- Updated the `triggers.instances` so the null_resource's local-exec
actually re-applies the YAML (kubectl apply with the new spec). The
YAML is the source-of-truth but the trigger is what tells terraform
to re-run the provisioner.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.
Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:
- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
When set, appends a `store: { backend: valkey, parameters: { url } }`
block to the rendered policy YAML and defaults replicas to 2 (capped
at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
drains can take down one pod at a time. topologySpreadConstraint
tightened to `DoNotSchedule` so the two replicas land on different
nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
Redis already runs HA via Sentinel + haproxy, no new infra needed.
Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.
Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.
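The eviction arithmetic behind both PDB shapes can be sketched as
follows. It covers integer values only (no percentages) and is a
simplified model of how allowed disruptions are derived, not the
controller's code; the function name is hypothetical.

```python
def disruptions_allowed(healthy, expected, min_available=None, max_unavailable=None):
    """Number of pods the eviction API may remove right now."""
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    return max(0, healthy - desired_healthy)


# Old shape: replicas=1 with minAvailable=1 never allows an eviction.
print(disruptions_allowed(healthy=1, expected=1, min_available=1))
# New shape: replicas=2 with maxUnavailable=1 allows one pod at a time.
print(disruptions_allowed(healthy=2, expected=2, max_unavailable=1))
# While one replica is down, the second eviction is held back.
print(disruptions_allowed(healthy=1, expected=2, max_unavailable=1))
```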
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- main.tf: bump image_tag to 1b3350c0 (carries the new agent),
init container also copies recruiter-triage.md
into /home/agent/.claude/agents/.
- terragrunt.hcl: restored (file was missing — apply was blocked).
Standard root include + platform/vault/external-secrets dependencies.
Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42)
via recruiter-responder REST API → 102.5s, $0.43, structured
markdown report with comp bands vs £600k floor, culture signals,
remote policy, recent news, sources cited. End-to-end Tier-2 is live.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. Previously
rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at
`/sentinel/` (an empty auto-created directory on every host) while the
kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required.
Two different host directories → kured never saw the open gate, even
though the gate's checks were all green every 5 min on every node.
Result: unattended-upgrades has had packages waiting on every node since
2026-05-10 (when uu was re-enabled) and kured's hourly log has said
"Reboot not required" for the entire period.
Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts
hostPath /var/run, the same directory the gate writes to. The in-pod
mountPath (/sentinel) is hardcoded by the chart and doesn't matter; the
symlink chain works out: /sentinel/<file> inside the pod resolves to
/var/run/<file> on the host.
Verified: kured pod can now list /sentinel/gated-reboot-required
(0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15).
First gated reboot will land Mon 2026-05-18 02:00 London.
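The path mismatch is easy to reproduce with a dirname call. The Python
below only mirrors the dirname derivation described above; the function
name is hypothetical and this is not the chart's template code.

```python
import posixpath


def sentinel_host_dir(reboot_sentinel: str) -> str:
    """Host directory the chart would mount for a given rebootSentinel value."""
    return posixpath.dirname(reboot_sentinel)


GATE_FILE = "/var/run/gated-reboot-required"  # where the gate DaemonSet writes

old = sentinel_host_dir("/sentinel/gated-reboot-required")
new = sentinel_host_dir("/var/run/gated-reboot-required")

print(old, old == posixpath.dirname(GATE_FILE))
print(new, new == posixpath.dirname(GATE_FILE))
```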
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from
the AIOStreams config-backup at 03:00):
- Logs into api.strem.io with credentials from Vault
(secret/viktor.stremio_email + stremio_password, now also synced
into the aiostreams-probe-secrets ExternalSecret)
- Fetches the full addonCollection via addonCollectionGet
- Writes timestamped JSON to the existing aiostreams-backup PVC
(NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600)
- 90-day retention, logs out to invalidate the auth key
- Pushgateway metrics: stremio_account_backup_{success,bytes,
addon_count,duration_seconds,last_run_timestamp}
Protects against: accidental "uninstall all" / API regression / wrong
account login wiping the curated set of 22 addons (Cinemeta + 16
MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local).
Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.
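The 90-day retention rule can be sketched as a pure selection function.
How the CronJob actually prunes is not shown here, so the function, the
use of modification time as the file age and the sample file names are
all assumptions.

```python
RETENTION_DAYS = 90
SECONDS_PER_DAY = 86400


def expired_backups(mtimes: dict[str, float], now: float) -> list[str]:
    """Return backup file names strictly older than the retention window."""
    cutoff = now - RETENTION_DAYS * SECONDS_PER_DAY
    return sorted(name for name, mtime in mtimes.items() if mtime < cutoff)


now = 1_780_000_000.0  # arbitrary reference timestamp
files = {
    "stremio-collection-old.json": now - 91 * SECONDS_PER_DAY,
    "stremio-collection-recent.json": now - 7 * SECONDS_PER_DAY,
}
print(expired_backups(files, now))
```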
- Add ingress_factory module (auth=none; HMAC + expiry are the gate).
ingress_path=["/cb"] only: /api stays internal and /healthz stays
cluster-only. dns_type=proxied. anti_ai_scraping=false.
- Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret`
auto-clones the wildcard cert into every namespace.
- Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS
hostname relax).
- Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me.
- Drop git-crypt-encrypted wildcard cert files into
stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new
.gitleaksignore: git-crypt encrypts at rest but the working-tree
copy is plaintext, so gitleaks can't tell the key is encrypted.
Smoke-tested end-to-end 2026-05-15 23:45:
synthetic email -> Telegram with ✅/❌ buttons -> ✅ tapped via curl
-> 'Sent' HTML page -> thread.status=sent, decision row recorded
with decided_via=telegram_button, outbound message threaded correctly.
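For context, an HMAC + expiry gate on an unauthenticated callback
generally looks like the sketch below. The signed fields, their order,
the SHA-256 choice and the function names are assumptions for
illustration; the actual /cb token format is not described in this
message.

```python
import hashlib
import hmac


def sign(secret: bytes, thread_id: int, action: str, expires_at: int) -> str:
    """Sign the callback parameters so they cannot be altered or forged."""
    message = f"{thread_id}:{action}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, thread_id: int, action: str, expires_at: int,
           signature: str, now: int) -> bool:
    """Accept only unexpired links whose signature matches."""
    if now >= expires_at:
        return False
    expected = sign(secret, thread_id, action, expires_at)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())


secret = b"hypothetical-secret"
sig = sign(secret, 42, "send", expires_at=2000)
print(verify(secret, 42, "send", 2000, sig, now=1000))
print(verify(secret, 42, "send", 2000, sig, now=2000))
```

Because the expiry is part of the signed message, a caller cannot extend
a link's lifetime by editing the expiry parameter.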
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds two env vars on the AIOStreams deployment:
- WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex
(TRaSH-aligned) so syncedRankedRegexUrls works for the user
- WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions +
Tamtaro's ISE/PSE/ESE-standard
Gotcha: AIOStreams validates each synced* field against the matching
whitelist — stream-expression files (incl. Vidhin's expressions.json)
go in WHITELISTED_SEL_URLS, not the regex one, even though they live
in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG.
User config: enabled Vidhin's regex + ranked expressions + Tamtaro's
ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering;
can be added later from the same whitelist.
Adds aiostreams-config-backup CronJob (Sun 03:00 weekly):
- Pulls /api/v1/user via internal ClusterIP with UUID + password from
the existing aiostreams-probe-secrets ExternalSecret
- Writes timestamped JSON to nfs-backup PVC mounted at /backup
- 90-day retention, prunes older files
- Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp}
NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite
to Synology via the existing offsite-sync-backup CronJob).
Complements the daily postgresql-backup-per-db pipeline (which dumps
the encrypted blob) by storing the decrypted JSON — usable for human
inspection / disaster recovery even without the AIOStreams password.
Verified: manual job wrote 12931 bytes, file present on NFS.