infra

Author	SHA1	Message	Date
Viktor Barzin	f325b949be	keel: re-enable with policy=patch (semver-bounded) + fix CI deny-privileged Re-enables Keel after the 2026-05-26 emergency stop, with a safer default. Switch Kyverno-injected default from `force + match-tag=true` (proven unreliable: it rewrote tag strings cluster-wide despite the design intent) to `patch`, which is semver-parser-bounded: - Only patch bumps within current major.minor (1.2.3 → 1.2.4, never 1.3.x or 2.x; the parser does the math, not a string compare). - Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are IGNORED entirely. No tag rewriting under any code path. - 151 stale `force` annotations migrated to `patch` cluster-wide during this apply (anchor `+()` dropped, then re-added). Live state after this commit: 0 workloads on `force`, 209 on `patch`, 22 on `never`. Keel deployment back to 1/1 on `:0.21.1`. Note: 22 workloads with `keel.sh/policy=never` LABEL had their annotation mutated to `patch` during the migration despite Kyverno's matchLabels-based exclude rule; this appears to be a quirk of `mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno reconciles preserve them. Also fixes CI build-cli workflow which was blocked by `deny-privileged-containers` since wave 1 enforce flip on 2026-05-18: woodpecker namespace added to the shared security_policy_exclude_namespaces list (CI pipeline pods `wp-*` run privileged docker builds, legitimate use). The `default` workflow (terragrunt apply) was already passing; only the parallel `build-cli` workflow (which builds the infra-cli docker image) was failing, but it took the overall pipeline status down with it. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 19:06:51 +00:00
Viktor Barzin	60b2b1cdfc	cluster-health: emergency-stop Keel + roll back image downgrades + quota raises Keel was rewriting tag strings (not just digests) despite the keel.sh/match-tag=true annotation injected by the Kyverno inject-keel-annotations ClusterPolicy. That annotation was supposed to constrain Keel to digest-only watches under the deployment's CURRENT tag. It didn't. Casualties confirmed today (live image rewritten to a lower version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into SQLite mode and can't read the v2 db-config.json → MariaDB store); n8n (:1.80.5 → :0.1.2, silent: EEXIST mkdir /root/.n8n loop); beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate); plus historical ones previously fixed (claude-memory :71b32438 → :17, forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1). Changes: * stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to 0/0. Keep off until either match-tag is root-caused or every enrolled workload migrates to a content-addressed (SHA) pin. * stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2, bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the deployment label (matches Kyverno's exclude rule so the inject-keel-annotations ClusterPolicy stops mutating) AND the annotation (so Keel itself respects it). Removed keel.sh/policy from lifecycle.ignore_changes so TF owns it as `never` and can't drift back to `force`. * stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73 on both seed-config and workbench containers (was :latest, Keel rolled to :0.1.0). * stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated by Keel from the prior live :3.2.1). * stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. 
Cluster grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to 100% and blocked every new pod create with FailedCreate. Raising the cap unblocked the four affected DaemonSets in one shot. * stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory 32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's face-detection burst behaviour. * stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07 (matches the 21 other stacks that already declare it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:48:50 +00:00
Viktor Barzin	9e045e2c16	upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit Three autonomous-upgrade pipelines run independently — Keel for apps (hourly registry polling), unattended-upgrades+kured for OS, and the k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there was no single place to see whether each was healthy, what's pending, or whether anything's stuck. The /upgrade-state skill collapses the state of all three into one table you can run before each Sunday's k8s-version-check fires. - stacks/keel/main.tf: add Prometheus pod-annotation scrape on container port 9300. Surfaces pending_approvals, poll_trigger_tracked_images, and registries_scanned_total{image} so the skill has a real timeseries (also opens the door to a future "pending_approvals > 0 for 24h" alert). - scripts/upgrade_state.sh: collector + renderer. Three-row table (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2. SSH fan-out (parallel subshells) to all five nodes for apt state + reboot-required + uu log; Prometheus query for Keel; Pushgateway parse for k8s_upgrade_* gauges. Read-only. - .claude/skills/upgrade-state/SKILL.md: hardlinked to ~/.claude/skills/upgrade-state/SKILL.md so the skill is discoverable from both monorepo-rooted and global sessions. Verification: ran the script, stress-tested the ✗ stalled path by pushing in_flight=1 + started_timestamp=-100min to Pushgateway and resetting after — script correctly raised ✗ and exit 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 10:50:43 +00:00
Viktor Barzin	06f48c73ca	keel: enable Slack notifications on every upgrade Wire Keel's Slack notifier to the existing bot token in Vault (secret/viktor -> slack_bot_token). Posts to #general by default; override via slack.channel in the Helm values if you want a dedicated channel like #keel-notifications. Notification level is "info" so we get every rollout event, not just errors. Approval flow is OFF — opt-out-pure means all updates apply unattended. If we later introduce approvals, add slack.approvalsChannel. Resolves user request: 'keel should send notifications to slack everytime it upgrades an app'. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 13:01:35 +00:00
Viktor Barzin	d0f0e10da5	keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist) The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5, 1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed silently. Bump to 1.2.0 (app version 0.21.1, latest stable).	2026-05-16 12:30:19 +00:00
Viktor Barzin	910167105e	Phase 0: install Keel + Kyverno auto-update annotation injector Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. - New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation. - New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h. - Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set. - Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins). - Keel namespace excluded from the mutate: supervisor self-update has too severe a failure mode (decision #11). - AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need. - .claude/CLAUDE.md: docker-images rule flagged as transitional. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 12:19:34 +00:00

6 commits
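
The `patch` policy described in commit f325b949be (patch bumps only within the current major.minor, non-semver tags ignored, numeric rather than string comparison) can be illustrated with a minimal Python sketch. This is not Keel's code: the function names `parse_semver` and `allows_patch_update` are hypothetical, and the strict three-part tag pattern is an assumption chosen to match the behaviour the commit message describes.

```python
import re
from typing import Optional, Tuple

# Assumption: only full MAJOR.MINOR.PATCH tags (optionally prefixed with "v")
# count as semver. Tags such as "latest", "v4", "2", or a SHA do not match.
_SEMVER = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


def parse_semver(tag: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) as integers, or None for non-semver tags."""
    m = _SEMVER.fullmatch(tag)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), int(m.group(3))


def allows_patch_update(current: str, candidate: str) -> bool:
    """True only for a higher patch version within the same major.minor."""
    cur = parse_semver(current)
    new = parse_semver(candidate)
    if cur is None or new is None:
        # Non-semver tags on either side are ignored entirely.
        return False
    same_line = cur[:2] == new[:2]
    # Integer comparison, so 1.2.10 is newer than 1.2.9.
    return same_line and new[2] > cur[2]
```

Under these assumptions, `allows_patch_update("1.2.3", "1.2.4")` is accepted, while the downgrades listed in commit 60b2b1cdfc (for example `1.80.5` to `0.1.2`, or `2` to `1`) are rejected, either because major.minor differs or because the tag is not three-part semver.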