Commit graph

388 commits

Author SHA1 Message Date
Viktor Barzin
7d297dc6b1 eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.

Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.

Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:55:51 +00:00
Viktor Barzin
a3cdc0d6d0 chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC view showed the browser in the top-left with the rest of the
framebuffer black. Cause: Chrome launched with no --window-size, and there's no
window manager, so it opened at its profile-persisted (smaller) size inside the
1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window
fills the screen on every launch (fresh pods/profiles too). Live windows were
already resized via CDP as a stopgap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:00:20 +00:00
Viktor Barzin
c7ead032ec chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.

Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
  so the cap applies deterministically on rollout even though the image is
  :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).

Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.

(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:34:03 +00:00
Viktor Barzin
7dbbb74163 homelab v0.8.1: frame browser as escalation (default headless), match CLAUDE.md
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build infra CLI / build (push) Has been cancelled
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:44:43 +00:00
Viktor Barzin
a6b52a5839 homelab v0.8.0: browser verbs for headful anti-bot web automation
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).

- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
  the :9222 NetworkPolicy), assert non-headless via /json/version,
  connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
  use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
  image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
  per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
  callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
  pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
  against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
  spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
  outside the cluster" section.

Closes: code-nepg

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 12:22:22 +00:00
Viktor Barzin
de163aa6af workstation: switch devvm OOM backstop from systemd-oomd to earlyoom
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:39:16 +00:00
Viktor Barzin
3a59f4a8bf workstation: per-user memory caps + systemd-oomd backstop on devvm
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:25:09 +00:00
Viktor Barzin
aeed461591 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 1595bddfc2.
2026-06-22 08:31:17 +00:00
Viktor Barzin
1595bddfc2 feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Re-land Phase 2 after the first attempt's two failure modes, both fixed:
- tempo.resources set under the correct single-binary chart key (was OOMKilled on
  the namespace LimitRange default when mis-placed at top level).
- atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install
  auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479).

Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp ->
redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo
derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:17:59 +00:00
Viktor Barzin
a0897de7c3 workstation: document homelab-memory hooks + provisioner self-deploy [ci skip]
multi-tenancy.md never mentioned the homelab-memory hooks rollout and still
listed claude_memory credential injection as purely "future". Document what is
actually true now: install_memory provisions the recall/auto-learn/compaction
hooks per user, the provisioner binary self-deploys from the repo (step 0), the
set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI
defaults the URL) — emo has a key, ancamilea is keyless until one is minted.
Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing
edits self-deploy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:04:38 +00:00
Viktor Barzin
464e0bfb97 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 7513468a2d.
2026-06-22 06:46:56 +00:00
Viktor Barzin
7513468a2d feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry
spans (Phase 1, already live in prod) export and correlate with logs:
- Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d)
- OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo)
- Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the
  Loki datasource (no uid change, so existing dashboards are unaffected)
- tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector

Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline
'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a
local plan as non-admin).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:31:11 +00:00
Viktor Barzin
1a32c07ffe docs(eso): Phase 1 done (0.16.2) + confirmed Phase 2 GC findings
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2
(both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces
a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret
ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets
+ empirically validate GC-survival on the first live ES + per-stack two-phase
-target apply (fallback: state rm + import). Corrected the doc's k8s assumption
(cluster is on 1.34; whole climb stays on 1.34, no interleave).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:44:50 +00:00
Viktor Barzin
5e3fe2e8e2 docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker
Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now
the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared
to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all
104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten
to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE
crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at
a time (no skipping); chart==app version; downstream Secrets survive. 5-phase
ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target
gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:27:37 +00:00
Viktor Barzin
ead876ec65 k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00
Viktor Barzin
7270e2be3b monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Some checks failed
ci/woodpecker/push/default Pipeline failed
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:35:35 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
5a136c7d53 docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:40:46 +00:00
Viktor Barzin
0cebeeb0ee t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:26:05 +00:00
Viktor Barzin
9503bed589 t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.

This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:04:22 +00:00
Viktor Barzin
b1bbe42821 homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).

Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.

- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:45:32 +00:00
Viktor Barzin
6d5d3726d6 Merge remote-tracking branch 'origin/master' into wizard/ha-cli-verbs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-20 23:46:29 +00:00
Viktor Barzin
48225f2dea homelab CLI v0.7: add ha token + ha ssh for Home Assistant
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.

Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
  live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
  pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
  non-interactive ssh to the HA host using the invoking user's key.

Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:46:09 +00:00
Viktor Barzin
46166c63b2 fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:40:22 +00:00
Viktor Barzin
bc2fbc712c Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew 2026-06-20 20:10:48 +00:00
Viktor Barzin
5549fc3672 Add per-user Claude auth renewal
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
2026-06-20 20:10:40 +00:00
Viktor Barzin
3278588325 chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:04:24 +00:00
Viktor Barzin
b58fe8cb1a docs(k8s-upgrade): record detector Packages-probe -L fix + compat-gate patch scope
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Two corrections to the runbook matching today's code fixes:
- The next-minor *patch* probe (GET .../Packages) also needs `-L`; it lacked it
  until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes
  now follow the 302.
- The compat gate's addon check is scoped to minor jumps — patches within the
  running minor are never addon-blocked (target_minor <= running_minor returns
  early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks
  a 1.34.x patch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:16:20 +00:00
Viktor Barzin
3e3fdb34f0 homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).

- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
  "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
  secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
  swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
  verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
  sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
  Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
  answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
  auth. ADR docs/adr/0011.

Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 22:29:01 +00:00
78095aa273 docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub
auto-registration (zero-click sign-up) is on. Document why (global auto-reg +
Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks
account-linking) and how to re-enable Authentik later.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:37:46 +00:00
Viktor Barzin
a5bb4db9c5 crowdsec: register the Traefik bouncer with LAPI (fix fail-open)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.

Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).

- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
  the proxied-app coverage caveat.

Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:08:28 +00:00
Viktor Barzin
4a66377425 forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.

- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
  --provider github` (name "github", matching the callback registered on
  the GitHub OAuth App). Like the existing Authentik source, it lives in
  Forgejo's DB rather than Terraform — there's no clean TF resource for
  login sources. Client id/secret mirrored to Vault secret/viktor
  (forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
  [oauth2_client], so a first GitHub sign-in creates the account directly
  ("sign up with GitHub") instead of a link-to-existing detour. The
  GitHub identity is the trust gate for this path; Turnstile + email
  confirmation still gate the native form.

Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.

Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:41:49 +00:00
Viktor Barzin
fd0c7493c3 traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.

Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:

- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
  viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
  Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
  passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
  captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
  /captcha via the chart `volumes` value — the pulled Yaegi plugin does not
  expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
  placeholder reCAPTCHA keys; referenced by zero .tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:38:38 +00:00
Viktor Barzin
963e4fcdde forgejo: open native self-signups, gated by Turnstile + email confirmation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:

  - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
    is managed in Terraform (turnstile.tf) via the CF Global API key, so
    the sitekey/secret are IaC, not a dashboard artifact.
  - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
    Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
    (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
    credential Authentik uses (email-secret.tf ESO -> secret/authentik
    smtp_password).

Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.

Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.

Runbook: docs/runbooks/forgejo-open-signups.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:05:07 +00:00
Viktor Barzin
21dbd79ae4 Merge remote-tracking branch 'origin/master' into wizard/homelab-obs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-19 11:27:44 +00:00
Viktor Barzin
e91e1612dd homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution)
The remaining verbs that pass the "saves reasoning, not just typing" test the
user posed mid-session: each encodes the non-obvious which-endpoint-reached-how
resolution otherwise re-derived every time. (Same test deprioritized node-ssh
and secret-get aliasing — thin wrappers over commands already known.)

- net check <host> [path]: two-legged reachability — external (public DNS→CF)
  vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
  works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
  (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
  since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.

All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster-
only endpoints (Alertmanager v2) deliberately out of scope. Verified live before
building; all five smoke-tested green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:27:31 +00:00
Viktor Barzin
6cb823e431 k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
  in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
  Slack for why" signal. (Until monitoring is applied, a block still surfaces via
  the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
  apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
  core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
  Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
  downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
  matrix maintenance, and the detector minor-probe fix.

After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
2026-06-19 11:27:17 +00:00
Viktor Barzin
9189560ac3 homelab: v0.4.0 — ci/deploy verbs (watch what you trigger)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Adds the verb-group that kills the single biggest reasoning sink in agent
sessions — watching a build/deploy to completion (proven the session that built
it: hours hand-rolling Woodpecker polling + DB-schema spelunking for one CI
incident).

- ci status/watch: Woodpecker REST API (version-stable, not its DB schema),
  reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me
  so the cert verifies — the Go form of the house `curl --resolve` pattern),
  token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with
  retries that ride Woodpecker's intermittent empty responses. watch matches the
  HEAD/given commit (avoids the post-push race) and exits non-zero on failure.
- deploy wait: image-sha match THEN rollout status (rollout status alone returns
  success on the old ReplicaSet); kubectl-based.
- work land now auto-watches CI to green on the landed commit (--no-ci-watch to
  skip), closing the v0.1 gap.
- ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least
  reliable; status/watch use the working list endpoint).

Live-verified ci status/watch against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:59:14 +00:00
Viktor Barzin
fd77c0dc4f monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot
Some checks failed
ci/woodpecker/push/default Pipeline failed
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a
few of these lately" — and it disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.

Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.

Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:45:39 +00:00
Viktor Barzin
077ac97df5 k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps
Some checks failed
ci/woodpecker/push/default Pipeline failed
kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops
the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the
k8s dashboard) until someone manually re-applied the rbac stack. That manual step
ran after every control-plane upgrade — the one thing keeping autonomous patch
upgrades from being truly hands-off (it bit us this cycle: an earlier master bump
left SSO broken until we noticed).

Automate it: the rbac stack now publishes its existing OIDC restore script (the
same one its null_resource runs) to a kube-system/apiserver-oidc-restore
ConfigMap, and the upgrade chain's phase_master re-runs it on master right after
the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add
apiserver restart can't crashloop it. The script is idempotent and health-gates
/livez with auto-rollback; the step is non-fatal (a failure only lags SSO until
the next rbac apply, it won't abort the upgrade). phase_master already self-skips
when master is at target, so this only fires when master was actually upgraded.

The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the
manual restore is now a documented fallback (command corrected — it needs
-replace, since the null_resource trigger hash never changes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:04:30 +00:00
Viktor Barzin
48b63ffa6f homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline failed
Lets agents search/navigate memory via the CLI, as the first step toward
deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just
one frontend); homelab memory is a thin Bearer-auth HTTP client over the same
API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works
even when the MCP frontend is down — the recurring disconnect that took the MCP
offline for this whole session.

Verbs: recall (server-side semantic search), list, categories, tags, stats,
secret (read); store, update, delete (write). Validated against the live API
including a store→recall→delete round-trip — full data-plane parity with the MCP.

The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to
the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after
the CLI is proven in the hooks — see docs/adr/0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 05:56:25 +00:00
Viktor Barzin
3594485f77 homelab: v0.2.0 — docs + version for the k8s verb-group
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver
note), add docs/adr/0007 (resolver, read/write split, config-mutation stays
raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the
Kubernetes surface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:30:41 +00:00
Viktor Barzin
66caa0bf7f homelab: v0.1 docs, distribution wiring, and version
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Completes v0.1: documentation, build/install path, and version stamping.

- cli/VERSION (v0.1.0) stamped into the binary via ldflags.
- cli/README.md rewritten as the homelab overview (verbs + tiers, manifest,
  build, the preserved legacy webhook use-cases).
- docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a
  separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the
  work/tf behaviour (native worktree entry, verification-gated auto-land,
  presence-coupled apply).
- setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run
  (t3-dispatch pattern), so every devvm user gets the current binary.
- AGENTS.md: discovery pointer under Common Operations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:25:51 +00:00
Viktor Barzin
70e217db24 k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The autonomous 1.34.9 version-upgrade chain has been failing its preflight every
night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on
1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an
already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line,
so the parsed target came back empty and the `!= requested` check aborted the
whole chain before any worker was touched. Deterministic — it self-cleaned and
re-failed identically each night, so it would have failed again tonight, leaving
node2-6 stuck on the old patch.

Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION
— the same at-target self-skip that phase_master and phase_worker already do.
The remaining workers are still validated by their own per-node phases, and the
detector already confirmed the target is installable via apt-cache. This lets
tonight's unattended chain resume and finish node2-6 -> 1.34.9.

Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents
writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:17:46 +00:00
Viktor Barzin
c04efa3d3a k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)

  - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *.
  - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
    (was next_daily_noon_utc).
  - docs (runbook, architecture) + upgrade-state SKILL: schedule references
    updated to 23:00 UTC nightly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:16:32 +00:00
Viktor Barzin
ed53b34bf4 k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would
have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS
records, so the chain couldn't SSH to them at all.

Refactor (upgrade-step.sh):
  - Worker set + order derived live from `kubectl get nodes` (worker_nodes /
    next_pending_worker), so EVERY worker still off-target is upgraded and a
    newly-joined node is covered with zero script change.
  - SSH targets are node InternalIPs (ssh_target), removing the dependency on
    node DNS records entirely — a new node is reachable the moment it joins.
  - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
    enumerate workers/all-nodes dynamically too.
  - Topology preserved: master-drain Job runs on the first worker; every
    worker-drain Job runs on the already-upgraded k8s-master (self-preemption
    invariant intact).
  - next_pending_worker returns 0 explicitly on the no-match path — the
    `while read … done < <(…)` loop exits 1 at EOF, which under set -e would
    abort the LAST worker's Job before it spawns postflight (cluster upgraded
    but no cleanup / in_flight reset). Caught in review.

Docs (runbook + architecture + headers) updated to the dynamic topology.

NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was
deployed to node4/5/6 by hand this session. Baking it into node provisioning
(so new nodes get it automatically) is the remaining follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 16:56:02 +00:00
Viktor Barzin
037a609f27 k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").

- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
  `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
  --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
  our custom split-horizon Corefile with kubeadm's default AND downgrade the
  image; --skip-phases leaves CoreDNS 100% untouched while the control plane
  upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
  quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
  claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
  GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:45:05 +00:00
Viktor Barzin
fb638cd8ec k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
  - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
    "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
  - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
    retry-on-failure note on the deterministic-naming paragraph.
  - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
    entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00
Viktor Barzin
8a2a3d9eca Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
# Conflicts:
#	scripts/t3-provision-users.sh
2026-06-16 22:32:43 +00:00