Two latent bugs in the K8s-version-upgrade pipeline surfaced when a
real detection run executed after today's 26.04 upgrade:
1. **DNS**: pod's CoreDNS search path is `<ns>.svc.cluster.local
svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation).
Unqualified `k8s-master` falls through all of those and then queries
upstream Technitium for the bare name → NXDOMAIN. The FQDN
`k8s-master.viktorbarzin.lan` is what Technitium actually serves.
Suffix every node SSH target with `$NODE_DOMAIN`.
2. **envsubst missing**: the claude-agent-service image doesn't ship
`gettext-base`. Replace `envsubst <template | apply` with
`python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(
sys.stdin.read()))' <template | apply`. Same result for variables that
are set (expandvars leaves unset ones untouched, whereas envsubst
blanks them), and the image already has python3. Multi-line
$SCHEDULING_BLOCK is preserved correctly through expandvars.
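A minimal sketch of the expandvars substitution from item 2, for checking the multi-line and unset-variable behaviour locally. The template text, the `NODE` variable, and the `render_template` helper are illustrative assumptions, not the pipeline's real template.

```python
import os
from unittest import mock


def render_template(template: str, variables: dict[str, str]) -> str:
    """Expand $VAR and ${VAR} using only the given variables."""
    # patch.dict swaps the process environment for the duration of the
    # call and restores it afterwards, so the caller's env is untouched.
    with mock.patch.dict(os.environ, variables, clear=True):
        return os.path.expandvars(template)


# Hypothetical template; only $SCHEDULING_BLOCK is a name from the commit.
template = "nodeName: ${NODE}\nspec:\n$SCHEDULING_BLOCK\n"
rendered = render_template(
    template,
    {
        "NODE": "k8s-master",
        "SCHEDULING_BLOCK": "  tolerations:\n    - operator: Exists",
    },
)
print(rendered)

# An unset variable is left as literal text rather than blanked.
print(render_template("value: $NOT_SET", {}))
```

If the template ever references a variable that may be unset, the second call shows what to expect: the reference stays in the output verbatim.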
Verified by manually triggering `k8s-version-check` post-fix:
detection now reads `Latest patch: v1.34.8` (currently running 1.34.7)
and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod was scheduled and
started, then was killed before it touched the cluster (the real run will
land on Sunday 2026-05-24 12:00 UTC, per the schedule).
Why these bugs lay dormant: yesterday's first manual-test detection
found "no upgrade needed", so neither the SSH nor the envsubst code
path was exercised. Today's apt-source restore (do-release-upgrade had
mangled the sources) unmasked the v1.34.8 candidate, which made
detection finally proceed past the SSH step.
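The DNS fall-through from item 1 can be sketched as the list of names a resolver tries. This models glibc-style ordering (search list first when the name has fewer dots than ndots); other resolver libraries differ, and the `default` namespace and the `candidate_queries` helper are assumptions for illustration.

```python
def candidate_queries(name: str, search: list[str], ndots: int) -> list[str]:
    """Return the fully qualified names tried, in order, for one lookup."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list.
        return [name]
    absolute = name + "."
    via_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + via_search
    # Too few dots: walk the search list, then fall back to the bare name.
    return via_search + [absolute]


# "default" stands in for the pod's namespace (hypothetical).
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("k8s-master", SEARCH, ndots=2):
    print(query)

print(candidate_queries("k8s-master.viktorbarzin.lan", SEARCH, ndots=2)[0])
```

The bare name only reaches upstream as the last attempt, while the FQDN has two dots and is tried as given first, which is why suffixing with `$NODE_DOMAIN` works.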
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
15-task plan for a shared presence board so Claude Code sessions can
see which shared infra resources are being actively mutated by other
sessions. Resource-scoped claims on the existing Dolt server,
heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.
Per user decision, removed authentik, kyverno, metallb-system,
external-secrets, proxmox-csi, nfs-csi, vpa, sealed-secrets,
infra-maintenance from the policy-level exclude list, and added
keel.sh/enrolled=true to aiostreams (alive — 1/1 Running, despite
being earlier flagged as scaled-to-0) and woodpecker.
Net cluster coverage: 197/227 workloads on safe-force (86%), up from
170/227 (74%). All 197 are paired with match-tag=true (digest-only).
Remaining 7 namespaces in Kyverno exclude list (irreducible):
- keel (self-update)
- calico-system + tigera-operator (operator-managed Installation CR)
- cnpg-system + dbaas (state-coupled)
- nvidia (chart-pinned at 570.195.03 per code-8vr0 until NVIDIA ships
ubuntu26.04 driver images)
- kube-system (k8s built-ins)
Files:
- stacks/kyverno/modules/kyverno/keel-annotations.tf — exclude list
trimmed from 16 → 7
- stacks/authentik, kyverno, proxmox-csi, nfs-csi, vpa, sealed-secrets,
servarr/aiostreams, metallb (creates ns "metallb-system"), woodpecker —
added keel.sh/enrolled=true label on kubernetes_namespace resource
- infra-maintenance was in the policy exclude but the namespace doesn't
actually exist in the cluster; the removal is a no-op there
Applied via kubectl patch on the live ClusterPolicy + kubectl label on
namespaces because the kubernetes provider v3.1.0 panics on Kyverno
ClusterPolicy refresh — TF source has the desired state for next clean
apply on a fixed provider.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per user decision today: monitoring, mailserver, vault, descheduler,
metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy,
reloader, headscale, wireguard, xray, cloudflared now participate in
the same `force + match-tag` regime as the rest of the cluster — Keel
watches the deployment's CURRENT tag for digest changes only and rolls
on push, never rewriting tag strings.
Two-part change:
stacks/kyverno/modules/kyverno/keel-annotations.tf
Trim the policy-level namespace exclude list from 31 → 16. The 16
remaining exclusions are the irreducible cluster-operator + state-
coupled set: keel itself, calico-system + tigera-operator (operator
loop), authentik (bitten by the 2026-05-17 pgbouncer incident), cnpg-system +
dbaas (state-coupled), kyverno, metallb-system, external-secrets,
proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned),
kube-system, vpa, sealed-secrets, infra-maintenance.
stacks/<each-of-15>/.../main.tf
Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace`
resource so the Kyverno mutate policy can target the workloads via
its namespaceSelector matchLabels.
Note on the apply path: the live ClusterPolicy was patched via
`kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics
during state refresh on Kyverno ClusterPolicy schemas with deeply
nested optional `context.celPreconditions` / `imageRegistry` fields
(see crash dump). The TF source above has the desired state, so any
clean future apply on a fixed provider version will be a no-op against
the live cluster.
Floating-tag workloads in the newly-enrolled set (will roll on every
upstream digest update — acceptable risk per user):
- wireguard: sclevine/wg:latest (image fixed today via iptables-nft
postStart shim)
- xray: teddysun/xray
- crowdsec-web: viktorbarzin/crowdsec_web
- monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter
- traefik: nginx:1-alpine, openresty/openresty:alpine,
ghcr.io/tarampampam/error-pages:3
- redis: haproxy:3.1-alpine, redis:8-alpine
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After rolling back k8s-node1's kernel to 6.8.0-117 + spoofing
/etc/os-release to 24.04 so the operator picked the matching
ubuntu24.04 driver image (everything per the workaround documented in
docs/known-issues.md), the driver container still went into a restart
loop. Container status:
lastState.terminated: { reason: "OOMKilled", exitCode: 137 }
The driver-installer was hitting the namespace LimitRange default of
128Mi during `apt-get install linux-headers-6.8.0-117-generic` — the
last log line on every restart was "Installing Linux kernel
headers..." before SIGKILL. 2Gi gives apt + the DKMS compile step
enough headroom; peak observed during a successful compile in a test
container was ~1.4Gi.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures the workaround applied on k8s-node1 today (kernel rolled back
to 6.8.0-117-generic, apt-mark hold on kernel meta-packages,
/etc/os-release spoofed to 24.04 so NFD reports VERSION_ID=24.04 and
the gpu-operator picks an existing ubuntu24.04 driver image), plus the
trigger that lets us un-mitigate: any ubuntu26.04 tag appearing on
nvcr.io/nvidia/driver.
Linked from the post-mortem and from beads code-8vr0.
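A rough sketch of how the un-mitigate trigger could be checked from a saved tag listing. It assumes the listing is JSON with a top-level `Tags` array (the shape `skopeo list-tags` is expected to print) and that driver tags end in `-<os>`; the sample tags and both helper names are hypothetical.

```python
import json
from collections import Counter


def count_tags_by_os(list_tags_json: str, os_suffixes: list[str]) -> Counter:
    """Count tags ending in -<suffix> for each requested OS suffix."""
    tags = json.loads(list_tags_json)["Tags"]
    counts = Counter({suffix: 0 for suffix in os_suffixes})
    for tag in tags:
        for suffix in os_suffixes:
            if tag.endswith("-" + suffix):
                counts[suffix] += 1
    return counts


def can_unmitigate(list_tags_json: str, target: str = "ubuntu26.04") -> bool:
    """True once at least one driver tag exists for the target OS."""
    return count_tags_by_os(list_tags_json, [target])[target] > 0


# Hypothetical listing; real input would come from the registry.
sample = json.dumps({
    "Repository": "nvcr.io/nvidia/driver",
    "Tags": ["570.195.03-ubuntu24.04", "570.195.03-ubuntu22.04"],
})
print(count_tags_by_os(sample, ["ubuntu22.04", "ubuntu24.04", "ubuntu26.04"]))
print(can_unmitigate(sample))
```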
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET,
HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed.
- In-cluster only egress to paperless-ngx svc; no Cloudflare hop on
MCP-internal traffic.
- Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser)
in new `claude-mcp-readers` group with view-only Django perms; existing
279 docs bulk-granted view perm via /api/documents/bulk_edit/;
workflow #2 auto-grants the group on new docs (Consumption Added).
- Gateway-level bearer auth via new Traefik plugin
Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack
alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth`
pulls token list from Vault `secret/paperless-mcp/bearer_tokens`.
- Vault `secret/paperless-mcp` holds: paperless_api_token (synced to
K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens
(JSON array, read at plan time), bearer_token_viktor_laptop (mirror
for laptop wiring), paperless_user_password (paperless UI fallback).
- Image auto-update via Keel (semver minor policy, hourly poll).
- Ingress dns_type=proxied → Uptime Kuma external monitor auto-created
by external-monitor-sync CronJob.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet
(skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04
tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04).
Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 +
driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart
applied cleanly but the v26.3.1 operator auto-detects host OS via NFD
labels and constructs `<version>-ubuntu26.04` image tags, which 404 on
pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
future `terraform apply` doesn't surface the same trap again.
Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) + `apt-mark hold` to
prevent re-upgrade. That step needs explicit user authorization for a
node reboot — left as the next action item on code-8vr0.
Files:
- stacks/nvidia/modules/nvidia/main.tf — explicit version pin,
explanatory comment
- stacks/nvidia/modules/nvidia/values.yaml — comment block
documenting the situation; driver pinned at 570.195.03
- docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md —
full timeline, root causes, recovery procedure
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wireguard pod CrashLoopBackOff'd for hours with wg-quick's PostUp failing:
iptables v1.8.4 (legacy): can't initialize iptables table `nat':
Table does not exist (do you need to insmod?)
sclevine/wg's default `iptables` symlink points to iptables-legacy, which
talks to the kernel's xt-tables. K8s nodes nowadays initialize their
nat table via nftables (calico-node sets it up), so iptables-legacy in
the container sees "no nat table" and bails. Reproduced by ephemerally
debugging the live pod's namespaces (kubectl debug --copy-to + same
mounts as the real pod) — wg-quick output matched verbatim.
Fix: postStart now calls update-alternatives to point iptables and
ip6tables at iptables-nft/ip6tables-nft (already present in the image)
before exec'ing wg-quick. The wg0.conf PostUp MASQUERADE then writes
to the nftables-backed nat table calico already populated. Verified:
new pod went 2/2 Running with 0 restarts after apply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User reported "site loads but failed to connect on the tmux session". Root
cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing
the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only
the IngressRoute CRDs registered. About 1/3 of /token preflight requests
landed on that replica and got 404 with router="-", and WS upgrades
intermittently failed the same way, so the lobby iframe stayed stuck on
"Failed to connect. Retrying...". `kubectl delete pod` on the bad replica
restored the missing router and unblocked the user.
This commit adds the long-term mitigation:
stacks/terminal/main.tf
- kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits
/token + /ws via Cloudflare and the in-cluster ttyd Service, pushes
4 gauges to Pushgateway (token_status, ws_status, ttyd_status,
last_success_timestamp). Verified the probe end-to-end:
token=302 ws=302 ttyd=200 ok=1
stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
- Webterminal group: WebterminalTokenDegraded (warning, 10m),
WebterminalWebsocketDegraded (critical, 10m),
WebterminalTtydUnreachable (critical, 10m),
WebterminalProbeStale (warning, 15m).
- Traefik Router Parity group: TraefikRouterCountSkew fires when any
Traefik replica's router count diverges from siblings for >10m —
catches the same class of issue cluster-wide, not just for terminal.
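The idea behind the router-parity alert, sketched outside PromQL: compare each replica's router count with the count most of its siblings report. This is an illustration of the check, not the alert expression itself; the replica names other than the one from the incident, and all counts, are made up.

```python
from collections import Counter


def skewed_replicas(router_counts: dict[str, int]) -> list[str]:
    """Return replicas whose router count differs from the most common one."""
    if len(router_counts) < 2:
        return []
    # With an even split the "majority" is arbitrary; a real alert would
    # more likely flag any disagreement at all.
    majority, _ = Counter(router_counts.values()).most_common(1)[0]
    return sorted(pod for pod, n in router_counts.items() if n != majority)


counts = {
    "traefik-db7696fbf-aaaaa": 42,  # hypothetical sibling
    "traefik-db7696fbf-bbbbb": 42,  # hypothetical sibling
    "traefik-db7696fbf-ktjjz": 41,  # the replica missing one router
}
print(skewed_replicas(counts))
```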
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.
- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
the broken chart (defense in depth; nfs-csi namespace is already in the
Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
captures the root cause + recovery procedure (kubelet restart via
nsenter is the escalation path when crictl rmp -f fails)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The provider now emits delta gains-offset DEPOSITs (broker-sync@98c4729),
the simple accumulate-gains approach Viktor signed off on:
each monthly scrape captures (current_pot, real_contribs), and we emit
a single DEPOSIT/WITHDRAWAL sized to growth-since-last-scrape.
dav_corrected handles the dashboard math.
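One reading of "sized to growth-since-last-scrape", as a sketch: gains are pot minus real contributions, and the activity carries the change in gains between two scrapes. The `Scrape` type and function name are hypothetical, and the real provider in broker-sync may differ in detail.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Scrape:
    current_pot: Decimal
    real_contribs: Decimal

    @property
    def gains(self) -> Decimal:
        # Growth so far: what the pot is worth beyond what was paid in.
        return self.current_pot - self.real_contribs


def gains_offset_activity(prev: Scrape, curr: Scrape) -> Optional[tuple[str, Decimal]]:
    """Return (activity_type, amount) for the change in gains, or None."""
    delta = curr.gains - prev.gains
    if delta == 0:
        return None
    kind = "DEPOSIT" if delta > 0 else "WITHDRAWAL"
    return (kind, abs(delta))


# Made-up figures for two consecutive monthly scrapes.
april = Scrape(Decimal("1000.00"), Decimal("900.00"))
may = Scrape(Decimal("1250.00"), Decimal("1000.00"))
print(gains_offset_activity(april, may))
```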
Next scheduled run: 2026-05-20 05:00 UK. Manual trigger via
'kubectl -n broker-sync create job fid-now --from=cronjob/broker-sync-fidelity'.
Without the anchor, each policy update fires mutateExistingOnPolicyUpdate,
which OVERWRITES existing keel.sh/policy annotations back to 'force'. That
broke the phased rollout — bulk-setting workloads to 'never' didn't stick
because the next policy update reset them.
With +() anchors, the mutate only adds the annotation if missing. New
workloads (in enrolled namespaces) get force+match-tag; existing workloads
with explicit policy=never (out-of-band, for phased rollout) stay never.
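The difference between the two behaviours, modelled on plain dictionaries. This only illustrates add-if-absent versus overwrite; it is not how Kyverno evaluates anchors internally, and the `keel.sh/match-tag` key is assumed from the "force+match-tag" wording.

```python
def overwrite(existing: dict[str, str], injected: dict[str, str]) -> dict[str, str]:
    """Plain mutate: injected values always win."""
    return {**existing, **injected}


def add_if_absent(existing: dict[str, str], injected: dict[str, str]) -> dict[str, str]:
    """+() style mutate: only fill keys the workload does not already set."""
    merged = dict(existing)
    for key, value in injected.items():
        merged.setdefault(key, value)
    return merged


INJECTED = {"keel.sh/policy": "force", "keel.sh/match-tag": "true"}

pinned = {"keel.sh/policy": "never"}  # set out-of-band for phased rollout
print(overwrite(pinned, INJECTED)["keel.sh/policy"])      # policy is reset
print(add_if_absent(pinned, INJECTED)["keel.sh/policy"])  # policy survives
print(add_if_absent({}, INJECTED))                        # new workload
```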
Phase 1 rollout state (2026-05-17):
- 10 workloads on force+match-tag in 10 namespaces (Phase 1)
enrolled via keel.sh/enrolled=true namespace label:
linkwarden, excalidraw, diun, echo, foolery, city-guesser,
jsoncrack, privatebin, ntfy, speedtest
- 216 workloads on policy=never (out-of-band kubectl annotate)
- 31 critical namespaces excluded at policy level
Expand to Phase 2 by labeling more namespaces `keel.sh/enrolled=true`
and clearing the `never` annotation off their workloads.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 incident: Keel rolled authentik 2026.2.2 → 2026.2.3 around 23:36.
The force+match-tag pairing should have constrained Keel to digest-only on
the current tag (not switch to a new tag), but a race between Kyverno's
mutate (injecting match-tag) and Keel's hourly poll caused the workload to
still have the old `force`-only annotation when Keel acted. Result: tag
rewrite, pods cycled, pgbouncer connection failures, login broken.
Manual rollback: `kubectl rollout undo` on all 5 authentik deployments back
to 2026.2.2. Auth restored within ~5 min.
Going forward, critical-namespace workloads are excluded at the policy level
so this race can't recur. They get upgraded via TF (Helm chart version bumps)
on a deliberate cadence, never by Keel.
Live state: 36 workloads on policy=never (35 critical + chrome-service pin
+ 7 CI-driven self-hosted from earlier), 190 on policy=force+match-tag for
opt-out-pure auto-update on the remaining stateless apps.
This matches user direction (2026-05-17): "upgrading is fine as long as we
upgrade correctly and the latest version is healthy" + "keel responsible
for the latest version, phased rollout, graceful".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User: 'i'm happy with occasional breakages. we have alerts.'
Policy=major auto-updates workloads to the latest semver tag in the
registry, including major/minor/patch bumps. Still semver-parser-bounded
so dev/nightly/master branches are filtered out (avoids the 2026-05-16
force-trap on affine/calico).
Live: 217 patch-annotated workloads re-annotated to major. Next Keel
poll (~1h) will pick up any pending major/minor releases.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replies from recruiters to our sent decline / engage / ignored threads
are now attached to the existing thread, surface with a 🔁 follow-up
marker in Telegram ("you previously sent"), and re-open thread status
to pending so they show up in recruiter_list status=pending.
Smoke-tested live: Rachel-style follow-up referencing our outbound
msgid + the original recruiter msgid in References → correctly
attached to thread #87, status flipped sent→pending, 3 messages
persisted (in/out/in).
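A sketch of the thread lookup the smoke test exercises: collect the message ids from In-Reply-To and References and match them against ids already stored per thread. The index shape, helper names, and sample message ids are assumptions; only thread #87 comes from the test above.

```python
from email import message_from_string
from email.message import Message
from typing import Optional


def referenced_ids(msg: Message) -> list[str]:
    """Message ids named in In-Reply-To and References, in header order."""
    ids: list[str] = []
    for header in ("In-Reply-To", "References"):
        # split() also copes with folded (multi-line) header values.
        ids.extend((msg.get(header) or "").split())
    return ids


def find_thread(msg: Message, thread_by_msgid: dict[str, int]) -> Optional[int]:
    """Return the thread id of the first referenced message we know about."""
    for msgid in referenced_ids(msg):
        if msgid in thread_by_msgid:
            return thread_by_msgid[msgid]
    return None


# Hypothetical follow-up referencing both the inbound and our outbound mail.
raw = (
    "From: rachel@example.com\n"
    "Subject: Re: Opportunity\n"
    "In-Reply-To: <out-1@viktorbarzin.me>\n"
    "References: <in-1@example.com> <out-1@viktorbarzin.me>\n"
    "\n"
    "Just following up.\n"
)
known = {"<in-1@example.com>": 87, "<out-1@viktorbarzin.me>": 87}
print(find_thread(message_from_string(raw), known))
```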
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The mutateExistingOnPolicyUpdate=true on inject-keel-annotations produced
176 UpdateRequests for the initial bulk scan across enrolled namespaces.
At the existing 384Mi limit, kyverno-background-controller OOMKilled while
processing them — no annotations got injected on existing workloads (count
stuck at 30).
Live state already bumped via kubectl set resources; this commit makes it
durable through Terraform. Also lowered the request to 256Mi: the 384Mi
request sat tight against the old limit, while 256Mi covers steady state
and 2Gi leaves headroom for bulk scans.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit (bc714755) added mutateExistingOnPolicyUpdate=true
to the inject-keel-annotations ClusterPolicy but Kyverno's validate
webhook rejected it: the background-controller SA needs update/patch
on apps/v1 Deployment/StatefulSet/DaemonSet.
Created live via kubectl + now in TF so the next apply is idempotent.
The ClusterRole aggregates into kyverno:background-controller via the
rbac.kyverno.io/aggregate-to-background-controller label.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reply-To header now extracted on inbound and used for outbound replies.
Verified with a synthetic email From: noreply-careers@megacorp.example
Reply-To: spam@viktorbarzin.me — reply correctly went to spam@ and
threaded under the original (Re: subject + In-Reply-To + References).
Alembic 0003 added messages.reply_to_addr column.
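The address selection can be sketched with the standard library: prefer Reply-To, fall back to From. The `reply_address` helper is hypothetical and leaves out the persistence side (the `messages.reply_to_addr` column).

```python
from email import message_from_string
from email.utils import parseaddr


def reply_address(raw_message: str) -> str:
    """Return the address a reply should go to: Reply-To if set, else From."""
    msg = message_from_string(raw_message)
    header = msg.get("Reply-To") or msg.get("From") or ""
    # parseaddr returns (display_name, address); only the address matters.
    return parseaddr(header)[1]


synthetic = (
    "From: noreply-careers@megacorp.example\n"
    "Reply-To: spam@viktorbarzin.me\n"
    "Subject: Role\n"
    "\n"
    "Hello\n"
)
print(reply_address(synthetic))
```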
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Before this, the inject-keel-annotations policy only fired on admission
events. Workloads that existed BEFORE their namespace got labeled
keel.sh/enrolled=true never received the annotation, so Keel didn't
watch them. Live state was 30 of 226 workloads auto-updating.
With mutateExistingOnPolicyUpdate=true and the required mutate.targets
block, Kyverno's BackgroundScan controller applies the mutate to
existing matching Deployments/StatefulSets/DaemonSets on policy update.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1
skipped — has non-standard lifecycle from earlier work).
* status-page: enrolled (was missing from original sweep).
* v6 retrigger marker on 17 stacks that never reached terragrunt
apply (#704 exit-1 halted mid-loop).
After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks.
The remaining ~22 are operator/Helm-managed and intentionally excluded
(same fight-loop risk as Calico — bump via Helm chart version, not
Keel).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* beads-server: 3 Deployments — extended V1 lifecycle blocks to V2
+ KEEL_IGNORE_IMAGE; namespace label.
* llama-cpp: 1 Deployment — extended V1→V2; namespace label.
* novelapp: namespace label only (Deployment has non-standard
lifecycle without V1 dns_config — drift expected, accept for now).
* plotting-book: namespace label only (same as novelapp).
* trading-bot: namespace label only (same as novelapp).
immich deferred — the bulk-add script's brace-counter got confused by
a HEREDOC in the file, inserting a lifecycle block in the wrong
position. Needs manual per-Deployment editing.
The 3 ns-only stacks (novelapp, plotting-book, trading-bot) will see
their Deployments mutated by Kyverno but their TF lifecycle doesn't
yet ignore the keel annotations. Expected behavior: drift visible in
terragrunt plan, applied-state oscillates with Kyverno re-injecting.
Acceptable starting point; per-Deployment lifecycle work to fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stop the hourly Keel-vs-tigera-operator fight loop on calico-node
DaemonSet (v3.26.5 ↔ v3.26.1). Live: re-annotated 4 calico-system
workloads with keel.sh/policy=never; TF: added calico-system to the
namespaces exclude list so any future mutate run won't re-inject.
The previous calico unenrollment (label removal from namespace)
wasn't enough — once Kyverno had stamped the policy=patch annotation
on the Deployments/DaemonSets, removing the namespace label didn't
strip the annotation, so Keel kept watching them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire ha-mcp, context7, and the in-pod playwright sidecar as native
MCP servers on OpenClaw via `mcp set` in the container startup
(ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set
entries persist). HA URL pulled from new Vault key
secret/openclaw.ha_sofia_mcp_url and passed via the
HA_SOFIA_MCP_URL env var.
Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw
namespace: pulls all non-sensitive memories from
claude-memory.claude-memory.svc:80/api/memories, groups by category,
writes 18 Markdown files into /workspace/memory/projects/claude-
memory-sync/ (the path memory-core indexes), then triggers
`openclaw memory index --force` via kubectl exec. Reuses the
existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488
memories synced, 25/25 files indexed, search returns hits.
Also drops the legacy /app/extensions entry from
plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env,
and one-shot deletes the stale 2026-02-28 metaclaw-export.json from
the openclaw home volume.
claude_memory MCP intentionally NOT wired — its /mcp/mcp transport
404s on the deployed claude-memory-mcp:17 image (tracked as
code-z1so). Shared knowledge is delivered via the CronJob's REST
sync instead. Adding claude_memory to mcp.servers is a one-line
follow-up once that's fixed.
The broker-sync Fidelity provider emits 'unrealised-gains-offset'
DEPOSIT activities to reconcile Wealthfolio's total with the
PlanViewer reported pot, because Wealthfolio doesn't track pension
fund units directly. Wealthfolio's data model treats that DEPOSIT as
a cash contribution, which double-inflates net_contribution and
zeroes out the implied growth.
Add a Postgres view 'dav_corrected' in wealthfolio_sync that
subtracts the cumulative gains-offset from net_contribution per
account per date (re-exporting as 'net_contribution' so it's a
drop-in replacement). All 17 wealth dashboard panels that compute
contribution/growth/ROI now read from the view. Total impact:
portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly
the £35,721.20 Fidelity offset that was previously miscategorised).
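The correction is simple arithmetic, sketched here with Decimal. The two growth figures and the offset are the ones quoted above; the portfolio total is a made-up placeholder, and the function names are illustrative rather than taken from the view definition.

```python
from decimal import Decimal

GAINS_OFFSET = Decimal("35721.20")  # cumulative Fidelity gains-offset


def corrected_net_contribution(net_contribution: Decimal, cumulative_offset: Decimal) -> Decimal:
    """What dav_corrected re-exports as net_contribution."""
    return net_contribution - cumulative_offset


def growth(total_value: Decimal, net_contribution: Decimal) -> Decimal:
    return total_value - net_contribution


total_value = Decimal("500000.00")  # hypothetical total, for illustration only
raw_contribution = total_value - Decimal("301753.19")

before = growth(total_value, raw_contribution)
after = growth(total_value, corrected_net_contribution(raw_contribution, GAINS_OFFSET))
print(before, after, after - before)
```

Whatever the real total is, the growth figure moves by exactly the offset, because the offset is removed from contributions and nothing else changes.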
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.
Idempotent — re-applying a stack whose state already matches is a no-op.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
For Deployments enrolled in Keel with policy=patch, the image tag is
updated by Keel as new patches release upstream. Without
ignore_changes on the image field, terragrunt apply would fight Keel
in an endless loop (TF reverts → Keel re-rolls → repeat — same shape
as the calico/tigera-operator fight from earlier).
Adding KEEL_IGNORE_IMAGE marker to the lifecycle of these stacks.
Image string in TF becomes the initial seed; Keel rolls it forward.
Stacks: actualbudget, broker-sync, changedetection, city-guesser,
coturn, dashy, dawarich, diun, ebook2audiobook, ebooks, echo,
excalidraw, foolery, forgejo, freedify.
CI-driven self-hosted stacks (fire-planner, job-hunter, payslip-ingest,
recruiter-responder, claude-agent-service, claude-memory) keep TF
ownership of image and policy=never — their image_tag is set by CI
via terragrunt.hcl inputs, not by Keel. Adding image to ignore_changes
on those would break the CI deploy flow.
Caveat: only container[0].image is added. Multi-container Deployments
(immich, beads, etc.) will need additional container[N].image lines
for any container Keel rolls. Those stacks are not currently enrolled.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel kept rewriting calico-node + calico-kube-controllers images to
v3.26.5 (proper patch update); tigera-operator immediately reverted
to v3.26.1 because the Installation CR is the source of truth.
Endless churn but no data loss — Calico stayed healthy throughout.
Removing the keel.sh/enrolled label from the calico-system ns, both in TF and live.
Calico upgrades go through the tigera-operator's Installation CR
manually, not Keel.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move from `never` (no auto-update) to `patch` for the cluster-wide
default. Keel only auto-updates PATCH versions within the current
major.minor: 0.26.6 → 0.26.7 OK; 0.26.6 → :nightly-latest blocked.
Tag-rewrites that broke calico (v3.26.1 → :master) and affine
(0.26.6 → :nightly-latest) on 2026-05-16 cannot recur with patch.
Caveats:
* Patch causes Terraform image drift for semver-pinned services —
drift-detection pipeline will surface it; lifecycle ignore_changes
on container[].image can be added per stack later if drift is
noisy.
* Tags that aren't parseable as semver (:latest, :11, :nightly,
SHA tags) are ignored by patch — those workloads stay on their
current image until promoted to `force` policy individually.
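A simplified model of the patch rule described above, useful for reasoning about which tags can be picked. It is not Keel's implementation: the regex accepts only plain `X.Y.Z` with an optional `v` prefix, and the helper names are hypothetical.

```python
import re
from typing import Optional

SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(tag: str) -> Optional[tuple[int, ...]]:
    """Return (major, minor, patch), or None for non-semver tags."""
    match = SEMVER.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def next_patch(current: str, available: list[str]) -> Optional[str]:
    """Highest tag newer than current within the same major.minor, if any."""
    current_version = parse(current)
    if current_version is None:
        return None  # unparseable current tag: the workload stays put
    best_tag, best_version = None, current_version
    for tag in available:
        version = parse(tag)
        if version is None or version[:2] != current_version[:2]:
            continue  # skips :nightly-latest, :master, other minors/majors
        if version > best_version:
            best_tag, best_version = tag, version
    return best_tag


print(next_patch("0.26.6", ["0.26.7", "nightly-latest", "0.27.0", "0.26.5"]))
print(next_patch("v3.26.1", ["master", "v3.26.5", "v3.27.0"]))
print(next_patch("latest", ["0.26.7"]))
```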
Self-hosted CI-driven services + chrome-service kept on `never`
(deliberate pins / CI controls the tag):
recruiter-responder, claude-agent-service, claude-memory,
chrome-service, fire-planner, job-hunter, payslip-ingest
Live state already updated via kubectl apply + per-workload patches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- claude-agent-service bumped to 191ed5dd (new AI section in agent
template — leadership stance, approved tools, usage limits / quotas,
code-gen safety, product-side AI depth, follow-up questions for the
recruiter when the web is sparse).
- recruiter-responder bumped to ab59eeab (deep_research prompt asks
for AI culture; warm_engage template adds a written-only ask for
IDE assistants, chat tools, per-seat limits, source-to-external
model policy).
Smoke-tested 2026-05-16: forced fresh research on Datadog, agent
returned full structured AI section with 7 explicit recruiter
questions covering DLP/IDE/limits/code-gen-policy. $0.80 / 192s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 incident: Keel's `force` policy switched semver-pinned
images (affine 0.26.6 → :nightly-latest, calico v3.26.1 → :master)
instead of digest-tracking. Force is documented as "always update
to the newest tag in the registry" — only safe on already-mutable
tags like :latest.
Changing the cluster-wide default in inject-keel-annotations to
`never`. The namespace enrollment label + V2 lifecycle suppression
stay in place so opt-in is one annotation per Deployment, but no
service auto-updates until explicitly approved.
To opt in a workload now:
1. Verify the Deployment image is on a mutable tag (:latest,
:<major>, or a vendor "stable" tag) — change in Terraform first
if needed.
2. Add to the Deployment's metadata.annotations:
"keel.sh/policy" = "force" (digest tracking)
OR
"keel.sh/policy" = "patch" (semver patch bumps — also
requires ignore_changes on the image)
Live policy already updated via kubectl apply + per-workload
override (force → never).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire Keel's Slack notifier to the existing bot token in Vault
(secret/viktor -> slack_bot_token). Posts to #general by default;
override via slack.channel in the Helm values if you want a dedicated
channel like #keel-notifications.
Notification level is "info" so we get every rollout event, not just
errors. Approval flow is OFF — opt-out-pure means all updates apply
unattended. If we later introduce approvals, add slack.approvalsChannel.
Resolves user request: 'keel should send notifications to slack everytime
it upgrades an app'.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5,
1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed
silently. Bump to 1.2.0 (app version 0.21.1, latest stable).
The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.
Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enrolls the cleanest Woodpecker-build-only self-hosted services into
the inject-keel-annotations ClusterPolicy by labeling their namespaces
keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on
each, so Keel will detect the current upstream digest and trigger a
rolling restart when polling starts (1h cadence).
Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress
the annotation drift Kyverno will inject (keel.sh/policy, /trigger,
/pollSchedule).
Services included:
- fire-planner
- job-hunter
- payslip-ingest
- recruiter-responder
Skipped from Phase 1 for follow-up:
- claude-agent-service (user has WIP on main.tf)
- claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs)
- kms (two Deployments; needs per-resource review)
- wealthfolio (sync sidecar pattern; needs review)
- chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label)
- GHA-migrated repos (10) (need per-repo CI cleanup)
- beadboard, freedify (no CI)
See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- claude-agent-service bumped to f764fef6 (agent system prompt adds
the Perks block: food/health/pension/equity/PTO/parental/equipment/
learning/wellness/amenities/commuter). 1200-word cap.
- recruiter-responder bumped to 38a2cdaa (cache-first deep_research:
serves cached payload if fetched_at + ttl_seconds > now; cache
writes upsert; new force flag bypasses).
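The cache-first rule from the recruiter-responder bullet, as a self-contained sketch. The in-memory dict stands in for the real table, and the `fetch` callable, the default TTL, and all names are assumptions rather than the service's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class CachedResearch:
    payload: dict
    fetched_at: datetime
    ttl_seconds: int

    def is_fresh(self, now: datetime) -> bool:
        # Same condition as in the commit: fetched_at + ttl_seconds > now.
        return self.fetched_at + timedelta(seconds=self.ttl_seconds) > now


def deep_research(
    company: str,
    cache: dict[str, CachedResearch],
    fetch: Callable[[str], dict],
    *,
    force: bool = False,
    ttl_seconds: int = 86400,  # hypothetical default
    now: Optional[datetime] = None,
) -> dict:
    """Serve a fresh cached payload unless force is set; otherwise refetch."""
    now = now or datetime.now(timezone.utc)
    entry = cache.get(company)
    if entry is not None and not force and entry.is_fresh(now):
        return entry.payload
    payload = fetch(company)
    # Upsert: assigning by key replaces any stale row instead of failing
    # on a duplicate key.
    cache[company] = CachedResearch(payload, now, ttl_seconds)
    return payload
```

Passing `now` explicitly keeps the freshness check deterministic in tests; in normal use it defaults to the current UTC time.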
Verified end-to-end: deep_research on Datadog now returns full Perks
section (~220s, $0.60, 23 turns). Earlier 500 fixed (was
uq_research_company_tier dup-key on re-run).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate: a supervisor updating itself has
too severe a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>