Commit graph

16 commits

Author SHA1 Message Date
Viktor Barzin
ed53b34bf4 k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would
have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS
records, so the chain couldn't SSH to them at all.

Refactor (upgrade-step.sh):
  - Worker set + order derived live from `kubectl get nodes` (worker_nodes /
    next_pending_worker), so EVERY worker still off-target is upgraded and a
    newly-joined node is covered with zero script change.
  - SSH targets are node InternalIPs (ssh_target), removing the dependency on
    node DNS records entirely — a new node is reachable the moment it joins.
  - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
    enumerate workers/all-nodes dynamically too.
  - Topology preserved: master-drain Job runs on the first worker; every
    worker-drain Job runs on the already-upgraded k8s-master (self-preemption
    invariant intact).
  - next_pending_worker returns 0 explicitly on the no-match path — the
    `while read … done < <(…)` loop exits 1 at EOF, which under set -e would
    abort the LAST worker's Job before it spawns postflight (cluster upgraded
    but no cleanup / in_flight reset). Caught in review.

Docs (runbook + architecture + headers) updated to the dynamic topology.

NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was
deployed to node4/5/6 by hand this session. Baking it into node provisioning
(so new nodes get it automatically) is the remaining follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 16:56:02 +00:00
Viktor Barzin
037a609f27 k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").

- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
  `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
  --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
  our custom split-horizon Corefile with kubeadm's default AND downgrade the
  image; --skip-phases leaves CoreDNS 100% untouched while the control plane
  upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
  quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
  claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
  GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:45:05 +00:00
Viktor Barzin
fb638cd8ec k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
  - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
    "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
  - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
    retry-on-failure note on the deterministic-naming paragraph.
  - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
    entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
f0948493b3 claude-agent-service: wire parallel execution (git-crypt mount, memory, MAX_CONCURRENCY)
The service now runs agent calls concurrently (bounded semaphore, per-job
isolated clones) instead of single-flight. Infra side:
- mount git-crypt-key into the main container (each job re-unlocks its own clone)
- MAX_CONCURRENCY=10 env (excess calls queue FIFO)
- bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized
  for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom
- docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot

Service code: viktor/claude-agent-service @ 66104a3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:24:24 +00:00
Viktor Barzin
51313ee088 kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard
The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating:
0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the
pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle,
and the immortal bash loop slowly leaks (kubectl forks + Check-4 process
substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so
the pod never restarts — just silent oom_events.

Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a
MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak
can never accumulate, regardless of how long a node stays pending-reboot.

Docs: post-mortem + automated-upgrades.md gate note.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 14:49:04 +00:00
Viktor Barzin
48abb7c520 kured: drop Mon-Fri restriction, reboot any day
The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.

Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:29:01 +00:00
Viktor Barzin
01bc16d592 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 23:54:22 +00:00
Viktor Barzin
a58d777059 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-10 19:07:42 +00:00
Viktor Barzin
a245e6e569 docs: add k8s node auto-upgrade runbook + architecture section
The OS-side counterpart to the service-upgrade pipeline. Covers
the unattended-upgrades + kured + sentinel-gate + Prometheus
halt-on-alert design landed in c0991f7f8.

Runbook: ops procedures (verify health, halt rollout, restore
config to a re-imaged node, roll back a bad upgrade, investigate
which alert is blocking).

Architecture doc: extends the existing service-upgrade flow with
a "K8s Node OS Upgrades" section (stack, sources of truth, day-2
mechanism, why-this-design rationale tied to the March 2026
post-mortem).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 17:26:15 +00:00
Viktor Barzin
6e19dce99e [docs] automated-upgrades: document long-lived OAuth + expiry monitoring
Adds the `claude_oauth_token` Vault entries to the secrets table, a
new "OAuth token lifecycle" section explaining the two CLI auth modes
(`claude login` vs `claude setup-token`) and why we picked the latter
for headless use, the Ink 300-col PTY gotcha from today's harvest,
and the monitoring/rotation playbook for the new expiry alerts.

Follow-up to 8a054752 and 50dea8f0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:00:07 +00:00
Viktor Barzin
69fbd0ffd6 [docs] Update auto-upgrade docs — new HTTP auth path + n8n expression gotcha
Replaces the stale "Dev VM SSH key" secret entry with the current
`claude-agent-service` bearer token path (synced to both consumer +
caller namespaces). Adds an "n8n workflow gotchas" section documenting:

1. The workflow is DB-state, not Terraform-managed — the JSON in the
   repo is a backup, not authoritative.
2. Header-expression syntax: `=Bearer {{ $env.X }}` works, JS concat
   `='Bearer ' + $env.X` does NOT — costs silent 401s.
3. `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` requirement.
4. 401-troubleshooting steps and the UPDATE pattern for in-place
   workflow patches.

Follow-up to 99180bec which fixed the actual pipeline break.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:42:11 +00:00
Viktor Barzin
42f1c3cf4f [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP
## Context

The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.

## This change

Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
  of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
  construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
  Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
  secret/n8n)

Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated

## What is NOT in this change

- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)

[ci skip]

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 10:12:02 +00:00
Viktor Barzin
a237ac97e0 docs(upgrades): add bulk upgrade results from first production run
12 services upgraded in 30 min: audiobookshelf, owntracks, open-webui,
immich, coturn, shlink, phpipam, onlyoffice, paperless-ngx, linkwarden,
synapse, dawarich. Documents auto-rollback behavior, resource awareness
(paperless memory bump), bulk upgrade procedure, and rate limit reset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:34:27 +00:00
Viktor Barzin
c33f597111 feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN
Pipeline: DIUN detects new image versions every 6h → webhook to n8n →
n8n filters (skip databases/custom/infra/:latest) and rate-limits
(max 5/6h) → SSH to dev VM → claude -p runs upgrade agent.

Agent workflow: resolve GitHub repo → fetch changelogs → classify risk
(SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push
→ wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on
failure.

Changes:
- stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict
- stacks/n8n/workflows: version-controlled n8n workflow backup
- docs/architecture/automated-upgrades.md: full pipeline documentation
- AGENTS.md: add upgrade agent section
- service-catalog.md: update DIUN description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:38:27 +00:00