Commit graph

5 commits

Author SHA1 Message Date
Viktor Barzin
60b2b1cdfc cluster-health: emergency-stop Keel + roll back image downgrades + quota raises
Keel was rewriting tag strings (not just digests) despite the
keel.sh/match-tag=true annotation injected by the Kyverno
inject-keel-annotations ClusterPolicy. That annotation was supposed to
constrain Keel to digest-only watches under the deployment's CURRENT tag.
It didn't. Casualties confirmed today (live image rewritten to a lower
version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into
SQLite mode and can't read the v2 db-config.json → MariaDB store);
n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop);
beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on
addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate);
plus historical ones previously fixed (claude-memory :71b32438 → :17,
forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1).

Changes:

* stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to
  0/0. Keep off until either match-tag is root-caused or every enrolled
  workload migrates to a content-addressed (SHA) pin.

* stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2,
  bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the
  deployment label (matches Kyverno's exclude rule so the inject-keel-
  annotations ClusterPolicy stops mutating) AND the annotation (so Keel
  itself respects). Removed keel.sh/policy from lifecycle.ignore_changes
  so TF owns it as `never` and can't drift back to `force`.

* stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73
  on both seed-config and workbench containers (was :latest, Keel rolled
  to :0.1.0).

* stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated
  by Keel from the prior live :3.2.1).

* stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster
  grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's
  per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node
  DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to
  100% and blocked every new pod create with FailedCreate. Raising the cap
  unblocked the four affected DaemonSets in one shot.

* stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory
  32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's
  face-detection burst behaviour.

* stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl
  updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07
  (matches the 21 other stacks that already declare it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 18:48:50 +00:00
Viktor Barzin
9e045e2c16 upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit
Three autonomous-upgrade pipelines run independently — Keel for apps
(hourly registry polling), unattended-upgrades+kured for OS, and the
k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there
was no single place to see whether each was healthy, what's pending,
or whether anything's stuck. The /upgrade-state skill collapses the
state of all three into one table you can run before each Sunday's
k8s-version-check fires.

- stacks/keel/main.tf: add Prometheus pod-annotation scrape on
  container port 9300. Surfaces pending_approvals,
  poll_trigger_tracked_images, and registries_scanned_total{image}
  so the skill has a real timeseries (also opens the door to a
  future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table
  (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2.
  SSH fan-out (parallel subshells) to all five nodes for apt
  state + reboot-required + uu log; Prometheus query for Keel;
  Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to
  ~/.claude/skills/upgrade-state/SKILL.md so the skill is
  discoverable from both monorepo-rooted and global sessions.

Verification: ran the script, stress-tested the ✗ stalled path by
pushing in_flight=1 + started_timestamp=-100min to Pushgateway and
resetting after — script correctly raised ✗ and exit 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 10:50:43 +00:00
Viktor Barzin
06f48c73ca keel: enable Slack notifications on every upgrade
Wire Keel's Slack notifier to the existing bot token in Vault
(secret/viktor -> slack_bot_token). Posts to #general by default;
override via slack.channel in the Helm values if you want a dedicated
channel like #keel-notifications.

Notification level is "info" so we get every rollout event, not just
errors. Approval flow is OFF — opt-out-pure means all updates apply
unattended. If we later introduce approvals, add slack.approvalsChannel.

Resolves user request: 'keel should send notifications to slack everytime
it upgrades an app'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:01:35 +00:00
Viktor Barzin
d0f0e10da5 keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist)
The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5,
1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed
silently. Bump to 1.2.0 (app version 0.21.1, latest stable).
2026-05-16 12:30:19 +00:00
Viktor Barzin
910167105e Phase 0: install Keel + Kyverno auto-update annotation injector
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
  Polls registries hourly per design decision #8. Default schedule
  overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
  StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
  with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
  self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
  runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
  too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
  ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:19:34 +00:00