infra

Author	SHA1	Message	Date
Viktor Barzin	7d297dc6b1	eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker). Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time, each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s 1.34 -> 1.35 on its next nightly run. Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox, but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent dependency lock file: no version selected"). Reconciled via `tg init -upgrade` and committed so `terragrunt apply`/CI work cleanly again. Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc marked COMPLETE. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:55:51 +00:00
Viktor Barzin	a3cdc0d6d0	chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off) The noVNC view showed the browser in the top-left with the rest of the framebuffer black. Cause: Chrome launched with no --window-size, and there's no window manager, so it opened at its profile-persisted (smaller) size inside the 1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window fills the screen on every launch (fresh pods/profiles too). Live windows were already resized via CDP as a stopgap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 18:00:20 +00:00
Viktor Barzin	c7ead032ec	chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31) The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc sweeps the entire fd table (fcntl per fd) on every client connection, and containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes (websockify accepts the WS and dials localhost:5900, but x11vnc never sends its banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU spinning). Same bug + fix the android-emulator stack already carries. Cap nofile before x11vnc starts, in two places: - files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct) - main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]` so the cap applies deterministically on rollout even though the image is :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled). Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and notes the black-when-idle behaviour + the autoconnect URL. (A live x11vnc relaunch with the cap already unblocked the running pod; this makes it survive restarts.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 17:34:03 +00:00
Viktor Barzin	7dbbb74163	homelab v0.8.1: frame browser as escalation (default headless), match CLAUDE.md Make `homelab browser --help` and chrome-service.md state the same tiered rule now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all routine automation; reach for `homelab browser` ONLY when headless is blocked (loads-but-submit-fails / one request errors while siblings 200 / explicit bot wall). Removes the "co-equal choice" framing so agents have one non-conflicting instruction. Adds a test asserting the tiered wording so it can't regress. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 15:44:43 +00:00
Viktor Barzin	a6b52a5839	homelab v0.8.0: browser verbs for headful anti-bot web automation Add `homelab browser run|open` so agents can drive the cluster's headful Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp browser can load anti-bot sites and fill their forms, but the gated submit silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing. Driving the real headful Chrome submits first try. That capability already existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to find; now it is one command, versioned, test-covered, and `browser --help` carries the when-to-use signature + an error-code cheat-sheet so the right tool is reached at the right moment (the failure was judgment, not setup). - port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses the :9222 NetworkPolicy), assert non-headless via /json/version, connect_over_cdp, inject the same vendored stealth.js the in-cluster callers use; the port-forward is always torn down, on success and on error. - node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no per-user setup. - default is a fresh incognito context (safe for the shared browser + concurrent callers); --shared-context reuses the warmed persistent profile. - TDD: cmd_browser_test.go covers arg parsing, headless detection, the version pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL spoofed) and `browser open`. - docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from outside the cluster" section. Closes: code-nepg Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 12:22:22 +00:00
Viktor Barzin	de163aa6af	workstation: switch devvm OOM backstop from systemd-oomd to earlyoom The systemd-oomd backstop added in the previous commit is INERT on this box. oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at 96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon). The very swap=0 that kills the IO storm also neuters oomd. Replace it with earlyoom, which watches free RAM (MemAvailable%) and is swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored (-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the admin's way in always survives) and --prefers the agent/browser hogs. Verified via --dryrun: fires on the RAM threshold and selects a chrome process, not a protected daemon. The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user, docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config + ManagedOOM drop-ins removed. Post-mortem updated with the finding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 10:39:16 +00:00
Viktor Barzin	3a59f4a8bf	workstation: per-user memory caps + systemd-oomd backstop on devvm The shared devvm keeps overloading and had to be hard-killed again today (2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus stacked max-effort agents) grew unbounded, spilled into the disk swap, and swap-thrashed the throttled virtual disk into an IO storm until the box wedged. Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree was capped. So one user could starve everyone. This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G, MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap), adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under box-wide pressure while leaving system.slice (sshd/services/your way in) protected, gives system.slice a fair-share CPU/IO priority edge, and routes docker containers into a capped, oomd-policed docker.slice so they can't dodge the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild reproduces them; systemd-oomd added to packages.txt. Applied live and verified: oomctl shows the backstop armed (not dry-run) on the work slices with system.slice protected; a capped-balloon stress test OOM-killed locally at the ceiling with swap flat (no thrash). Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 10:25:09 +00:00
Viktor Barzin	aeed461591	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)" This reverts commit `1595bddfc2`.	2026-06-22 08:31:17 +00:00
Viktor Barzin	1595bddfc2	feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2) Re-land Phase 2 after the first attempt's two failure modes, both fixed: - tempo.resources set under the correct single-binary chart key (was OOMKilled on the namespace LimitRange default when mis-placed at top level). - atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479). Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp -> redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:17:59 +00:00
Viktor Barzin	a0897de7c3	workstation: document homelab-memory hooks + provisioner self-deploy [ci skip] multi-tenancy.md never mentioned the homelab-memory hooks rollout and still listed claude_memory credential injection as purely "future". Document what is actually true now: install_memory provisions the recall/auto-learn/compaction hooks per user, the provisioner binary self-deploys from the repo (step 0), the set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI defaults the URL) — emo has a key, ancamilea is keyless until one is minted. Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing edits self-deploy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:04:38 +00:00
Viktor Barzin	464e0bfb97	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)" This reverts commit `7513468a2d`.	2026-06-22 06:46:56 +00:00
Viktor Barzin	7513468a2d	feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2) Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry spans (Phase 1, already live in prod) export and correlate with logs: - Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d) - OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo) - Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the Loki datasource (no uid change, so existing dashboards are unaffected) - tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline 'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a local plan as non-admin). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 06:31:11 +00:00
Viktor Barzin	1a32c07ffe	docs(eso): Phase 1 done (0.16.2) + confirmed Phase 2 GC findings Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2 (both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets + empirically validate GC-survival on the first live ES + per-stack two-phase -target apply (fallback: state rm + import). Corrected the doc's k8s assumption (cluster is on 1.34; whole climb stays on 1.34, no interleave). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:44:50 +00:00
Viktor Barzin	5e3fe2e8e2	docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all 104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at a time (no skipping); chart==app version; downstream Secrets survive. 5-phase ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:27:37 +00:00
Viktor Barzin	ead876ec65	k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases Adds a daily visibility layer so every night's autonomous-upgrade outcome is reviewable at a glance during the upgrade-cleanup window (Viktor: "track every night's upgrade for the next 7 days; clean up all bugs and blockers"). Last night (2026-06-20) confirmed BOTH prior fixes work in production: the detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35. What's here: - CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning: running version, detector freshness, detected target, outcome (no-op / blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs. Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap. Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack). - K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.` to `k8s-upgrade-(preflight|master|worker|postflight)-.` so the new report job (or any future helper) can't false-trip the chain-wedged alarm. Manual state repair (no git artifact): imported the orphaned `alert-digest` CronJob into the monitoring stack state (`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`). Root cause: when alert_digest was added (2026-06-12) the apply recorded its ConfigMap + Secret but not the CronJob, so every full monitoring apply since has failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline 298 today) — surviving only via targeted prometheus applies. Now in state, so monitoring CI applies cleanly again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 16:57:44 +00:00
Viktor Barzin	7270e2be3b	monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block Last night (2026-06-20) the detector + compat-gate fixes worked: the chain resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno 1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked fired as designed. But the refusal also made the preflight Job exit 1 (block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm for what is the intended halt-and-alert outcome. Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate block sets that gauge (and it stays 1 until the next preflight resets it), so the chain-job-failed alert is suppressed for the blocked period; a genuine wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires (preserving the alert's original purpose — catching the pre-in_flight preflight failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs updated to match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 16:35:35 +00:00
Viktor Barzin	ceae4d5f06	docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed) The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler never invoked) and has been removed. Document the replacement: in-kernel nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List + zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts. Both add zero per-request latency and fail open. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:39:26 +00:00
Viktor Barzin	68d9058f85	cleanup: fully remove orphaned council-complaints app The council-complaints app (Islington civic-reporting pilot) has been abandoned. It was already dead in the cluster (deployments scaled 0/0, image only on the decommissioned registry.viktorbarzin.me which 404s), and it was never in Terraform — only docs + a kyverno comment referenced it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses) were torn down out-of-band via kubectl (nothing in TF to drift from); the DB-dump PVC was backed up to NFS first. This removes the remaining repo references to the live app: - service-catalog.md: drop the council-complaints row - ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list - kyverno require-trusted-registries: the registry.viktorbarzin.me/* allowlist comment claimed council-complaints as the last referencer; rewrite it (no live workload pulls from that registry now; only stale completed Job records still carry the ref). The allowlist line itself is kept (registry-scoped, not app-specific). Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade- apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated repos (memory id=388)" snapshot; left as-is so the dated record stays accurate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:32:10 +00:00
Viktor Barzin	5a136c7d53	docs: t3-migrate-idle runbook section + service-catalog + design status Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:40:46 +00:00
Viktor Barzin	0cebeeb0ee	t3-idle-migrate: implementation plan Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:26:05 +00:00
Viktor Barzin	9503bed589	t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days. This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:04:22 +00:00
Viktor Barzin	b1bbe42821	homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo `ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only cluster admins can read — so it hung/failed for the non-admin operator it was built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose identity is deliberately barred from secrets in the openclaw namespace). Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london) with a Role + RoleBinding granting `get` on JUST that secret to the Home Server Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object). emo now resolves the HA token with their own identity, WITHOUT gaining the rest of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment keeps reading openclaw-secrets — purely additive. - stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding - cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse - README + ADR-0012 updated; VERSION -> v0.7.1 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 10:45:32 +00:00
Viktor Barzin	6d5d3726d6	Merge remote-tracking branch 'origin/master' into wizard/ha-cli-verbs	2026-06-20 23:46:29 +00:00
Viktor Barzin	48225f2dea	homelab CLI v0.7: add `ha token` + `ha ssh` for Home Assistant Mined another devvm user's Claude sessions for repeated, hand-rolled command patterns worth absorbing into the shared CLI. The dominant signal was Home Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented ~30x — every session. The existing `home-assistant-sofia.py` already covers the API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative path), so agents bypassed it and hand-rolled everything. Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control stays with the MCP): - `ha token [--instance sofia|london]` (read): resolves the long-lived API token live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`. - `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic non-interactive ssh to the HA host using the invoking user's key. Also fix the root cause: `home-assistant-sofia.py` now falls back to `homelab ha token` when its env var is unset (works from any directory), and the home-assistant skill points agents at these verbs + `homelab metrics query` instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the per-verb-group convention. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 23:46:09 +00:00
Viktor Barzin	46166c63b2	fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout Viktor's passkeys all vanished and he was suddenly being asked to log in multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE each device of users[0] — but the fuzzy search returned the REAL account, so it wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and the social-login stage (default-source-authentication-login) had the provider default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h — hence the constant re-logins. (Password + passkey logins were already weeks=4.) Changes: - authentik: adopt default-source-authentication-login into Terraform (import) and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long as password/passkey. Immediate relief without re-enrolling. - authentik: document the provider-schema gotcha — authentik_stage_identification exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in ignore_changes (commit `4e882989` removed them for this reason; re-adding breaks every apply). The passkey break was purely the missing device records, not drift. - edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login — carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule, make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block), and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting). - docs: record the session-duration change, the edge enforcement + auth carve-out (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert). Passkey re-enrollment is a manual user action (devices are gone from the DB); nothing auto-re-deletes them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 23:40:22 +00:00
Viktor Barzin	bc2fbc712c	Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew	2026-06-20 20:10:48 +00:00
Viktor Barzin	5549fc3672	Add per-user Claude auth renewal Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.	2026-06-20 20:10:40 +00:00
Viktor Barzin	3278588325	chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028) TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 20:04:24 +00:00
Viktor Barzin	b58fe8cb1a	docs(k8s-upgrade): record detector Packages-probe -L fix + compat-gate patch scope Two corrections to the runbook matching today's code fixes: - The next-minor patch probe (GET .../Packages) also needs `-L`; it lacked it until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes now follow the 302. - The compat gate's addon check is scoped to minor jumps — patches within the running minor are never addon-blocked (target_minor <= running_minor returns early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks a 1.34.x patch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:16:20 +00:00
Viktor Barzin	3e3fdb34f0	homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization Answers the question that drove the whole CLI — which verbs to add next — with data instead of one maintainer's habits, and resolves the cross-user-usage ask in-bounds (no reading anyone's home). - emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} + "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery verbs (manifest/version/help) and usage itself don't self-record. - usage top [--since 30d] [--user U] [--json]: ranks verbs via sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving answer to "what does the team use". - Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no auth. ADR docs/adr/0011. Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 22:29:01 +00:00
viktor	78095aa273	docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub auto-registration (zero-click sign-up) is on. Document why (global auto-reg + Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks account-linking) and how to re-enable Authentik later. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:37:46 +00:00
Viktor Barzin	a5bb4db9c5	crowdsec: register the Traefik bouncer with LAPI (fix fail-open) The Traefik bouncer plugin's API key was never registered with LAPI — the crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and the chart registers no bouncer. So LAPI returned 403 to the plugin, which with updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was empty; the registration was likely lost in the MySQL->PostgreSQL DB migration with no IaC to recreate it. Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same Vault key the middleware presents — so they match by construction, and the bouncer re-registers automatically on every LAPI start (survives DB wipes). - stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module. - module main.tf: new sensitive var + thread into the values templatefile. - values.yaml: BOUNCER_KEY_traefik on lapi.env. - docs/architecture/security.md: document registration + fail-open history and the proxied-app coverage caveat. Activates enforcement (community blocklist bans + captcha) on non-proxied apps; internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:08:28 +00:00
Viktor Barzin	4a66377425	forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration) Viktor wanted people to be able to sign up with GitHub, not just the native form or Authentik SSO. - Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth --provider github` (name "github", matching the callback registered on the GitHub OAuth App). Like the existing Authentik source, it lives in Forgejo's DB rather than Terraform — there's no clean TF resource for login sources. Client id/secret mirrored to Vault secret/viktor (forgejo_github_oauth_client_id / _secret) for recovery. - This commit's TF change: ENABLE_AUTO_REGISTRATION=true in [oauth2_client], so a first GitHub sign-in creates the account directly ("sign up with GitHub") instead of a link-to-existing detour. The GitHub identity is the trust gate for this path; Turnstile + email confirmation still gate the native form. Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github redirects to GitHub's authorize URL with the correct client id + callback, and the login page renders the button. Final browser click-through is the user's to do. Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section + secret-rotation + DB-loss recreate steps). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:41:49 +00:00
Viktor Barzin	fd0c7493c3	traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse (http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files), but the Traefik bouncer plugin had no captcha provider configured — so those decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go @ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had no way to self-unblock, contradicting the profile's stated intent. Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha decision now renders a solvable challenge instead of a hard block: - New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to viktorbarzin.me so one widget covers every subdomain the bouncer fronts. Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are passed into the traefik module. - middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s + captchaHTMLFilePath=/captcha/captcha.html. - Vendor the plugin's captcha.html and mount it into the Traefik container at /captcha via the chart `volumes` value — the pulled Yaegi plugin does not expose its bundled template to Traefik. - docs/architecture/security.md: document the ban-vs-captcha remediation split. - Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with placeholder reCAPTCHA keys; referenced by zero .tf). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:38:38 +00:00
Viktor Barzin	963e4fcdde	forgejo: open native self-signups, gated by Turnstile + email confirmation Viktor wants Forgejo open for anyone to sign up, but without bot/spam account floods. Flip the deployment from OAuth-only registration (ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local sign-up, and add two bot gates on the registration form: - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget is managed in Terraform (turnstile.tf) via the CF Global API key, so the sitekey/secret are IaC, not a dashboard artifact. - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced credential Authentik uses (email-secret.tf ESO -> secret/authentik smtp_password). Existing Authentik OAuth2 login is unchanged (additive). Deployment env appended (not inserted) so the diff stays purely additive; a reloader annotation rolls the pod on secret rotation. Verified live: signup page renders the Turnstile widget, mailer delivers a test message end-to-end, Forgejo healthy, plan-to-zero after apply. Runbook: docs/runbooks/forgejo-open-signups.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:05:07 +00:00
Viktor Barzin	21dbd79ae4	Merge remote-tracking branch 'origin/master' into wizard/homelab-obs	2026-06-19 11:27:44 +00:00
Viktor Barzin	e91e1612dd	homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution) The remaining verbs that pass the "saves reasoning, not just typing" test the user posed mid-session: each encodes the non-obvious which-endpoint-reached-how resolution otherwise re-derived every time. (Same test deprioritized node-ssh and secret-get aliasing — thin wrappers over commands already known.) - net check <host> [path]: two-legged reachability — external (public DNS→CF) vs internal (Traefik LB) — so you see WHERE a break is, not just that one path works. (live: surfaced the LB at 6ms vs CF 77ms.) - dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff. - metrics query "<promql>" / metrics alerts: Prometheus via the LB (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series since the query frontend has no /api/v1/alerts and Alertmanager has no ingress. - logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB. All reach auth-free internal ingresses through the LB (Go form of curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster- only endpoints (Alertmanager v2) deliberately out of scope. Verified live before building; all five smoke-tested green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:27:31 +00:00
Viktor Barzin	6cb823e431	k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt + alert when not": - monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning) in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see Slack for why" signal. (Until monitoring is applied, a block still surfaces via the already-live K8sUpgradeChainJobFailed.) - upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests — apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns) Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't downgrade). Catches a "pods look Running but cluster is broken" upgrade. - runbook: documents the compat gate, the blocked alert, how to clear a block, matrix maintenance, and the detector minor-probe fix. After deploy, the nightly chain detects 1.35 (minor detection now works) and correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting via K8sUpgradeBlocked — the autonomy working as designed until the catch-up clears those addons.	2026-06-19 11:27:17 +00:00
Viktor Barzin	9189560ac3	homelab: v0.4.0 — ci/deploy verbs (watch what you trigger) Adds the verb-group that kills the single biggest reasoning sink in agent sessions — watching a build/deploy to completion (proven the session that built it: hours hand-rolling Woodpecker polling + DB-schema spelunking for one CI incident). - ci status/watch: Woodpecker REST API (version-stable, not its DB schema), reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me so the cert verifies — the Go form of the house `curl --resolve` pattern), token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with retries that ride Woodpecker's intermittent empty responses. watch matches the HEAD/given commit (avoids the post-push race) and exits non-zero on failure. - deploy wait: image-sha match THEN rollout status (rollout status alone returns success on the old ReplicaSet); kubectl-based. - work land now auto-watches CI to green on the landed commit (--no-ci-watch to skip), closing the v0.1 gap. - ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least reliable; status/watch use the working list endpoint). Live-verified ci status/watch against the live API. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 10:59:14 +00:00
Viktor Barzin	fd77c0dc4f	monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot The rpi-sofia under-voltage alert keyed off the sticky firmware bit (rpi_under_voltage_occurred == 1), which latches on the first brown-out and stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a few of these lately" — and it disagreed with the HA-sofia dashboard, which shows the live state and reads OK once voltage recovers. Can't just switch to the live bit: rpi_under_voltage_now never registered once in 14d (brown-outs are sub-second and fall between the 1-min textfile-collector samples), so the sticky bit is the only reliable detector. Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0. Fires once per brown-out and auto-resolves ~1h later (~2h active over the same 14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both real brown-out events in the window are still caught. Docs updated in the same commit (monitoring.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:45:39 +00:00
Viktor Barzin	077ac97df5	k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the k8s dashboard) until someone manually re-applied the rbac stack. That manual step ran after every control-plane upgrade — the one thing keeping autonomous patch upgrades from being truly hands-off (it bit us this cycle: an earlier master bump left SSO broken until we noticed). Automate it: the rbac stack now publishes its existing OIDC restore script (the same one its null_resource runs) to a kube-system/apiserver-oidc-restore ConfigMap, and the upgrade chain's phase_master re-runs it on master right after the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add apiserver restart can't crashloop it. The script is idempotent and health-gates /livez with auto-rollback; the step is non-fatal (a failure only lags SSO until the next rbac apply, it won't abort the upgrade). phase_master already self-skips when master is at target, so this only fires when master was actually upgraded. The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the manual restore is now a documented fallback (command corrected — it needs -replace, since the null_resource trigger hash never changes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 06:04:30 +00:00
Viktor Barzin	48b63ffa6f	homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client Lets agents search/navigate memory via the CLI, as the first step toward deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just one frontend); homelab memory is a thin Bearer-auth HTTP client over the same API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works even when the MCP frontend is down — the recurring disconnect that took the MCP offline for this whole session. Verbs: recall (server-side semantic search), list, categories, tags, stats, secret (read); store, update, delete (write). Validated against the live API including a store→recall→delete round-trip — full data-plane parity with the MCP. The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after the CLI is proven in the hooks — see docs/adr/0008. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 05:56:25 +00:00
Viktor Barzin	3594485f77	homelab: v0.2.0 — docs + version for the k8s verb-group Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver note), add docs/adr/0007 (resolver, read/write split, config-mutation stays raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the Kubernetes surface. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 22:30:41 +00:00
Viktor Barzin	66caa0bf7f	homelab: v0.1 docs, distribution wiring, and version Completes v0.1: documentation, build/install path, and version stamping. - cli/VERSION (v0.1.0) stamped into the binary via ldflags. - cli/README.md rewritten as the homelab overview (verbs + tiers, manifest, build, the preserved legacy webhook use-cases). - docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the work/tf behaviour (native worktree entry, verification-gated auto-land, presence-coupled apply). - setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run (t3-dispatch pattern), so every devvm user gets the current binary. - AGENTS.md: discovery pointer under Common Operations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:25:51 +00:00
Viktor Barzin	70e217db24	k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target The autonomous 1.34.9 version-upgrade chain has been failing its preflight every night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on 1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line, so the parsed target came back empty and the `!= requested` check aborted the whole chain before any worker was touched. Deterministic — it self-cleaned and re-failed identically each night, so it would have failed again tonight, leaving node2-6 stuck on the old patch. Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION — the same at-target self-skip that phase_master and phase_worker already do. The remaining workers are still validated by their own per-node phases, and the detector already confirmed the target is installable via apt-cache. This lets tonight's unattended chain resume and finish node2-6 -> 1.34.9. Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:17:46 +00:00
Viktor Barzin	c04efa3d3a	k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades) Disruptive node drains should run when the cluster is idle. Move the k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC (00:00 London) — overnight, low usage, and clear of the kured OS-reboot window (01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.) - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *. - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot (was next_daily_noon_utc). - docs (runbook, architecture) + upgrade-state SKILL: schedule references updated to 23:00 UTC nightly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:16:32 +00:00
Viktor Barzin	ed53b34bf4	k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes) The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all. Refactor (upgrade-step.sh): - Worker set + order derived live from `kubectl get nodes` (worker_nodes / next_pending_worker), so EVERY worker still off-target is upgraded and a newly-joined node is covered with zero script change. - SSH targets are node InternalIPs (ssh_target), removing the dependency on node DNS records entirely — a new node is reachable the moment it joins. - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now enumerate workers/all-nodes dynamically too. - Topology preserved: master-drain Job runs on the first worker; every worker-drain Job runs on the already-upgraded k8s-master (self-preemption invariant intact). - next_pending_worker returns 0 explicitly on the no-match path — the `while read … done < <(…)` loop exits 1 at EOF, which under set -e would abort the LAST worker's Job before it spawns postflight (cluster upgraded but no cleanup / in_flight reset). Caught in review. Docs (runbook + architecture + headers) updated to the dynamic topology. NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was deployed to node4/5/6 by hand this session. Baking it into node provisioning (so new nodes get it automatically) is the remaining follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 16:56:02 +00:00
Viktor Barzin	037a609f27	k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's bundled corefile-migration table ("start version not supported"). - scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite our custom split-horizon Corefile with kubeadm's default AND downgrade the image; --skip-phases leaves CoreDNS 100% untouched while the control plane upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift. - stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight quiet-baseline (settle-window) check, which silently no-op'd on the ghcr claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open). - docs: runbook + architecture document the CoreDNS handling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:45:05 +00:00
Viktor Barzin	fb638cd8ec	k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to the terminal job-condition reasons (BackoffLimitExceeded\|DeadlineExceeded). A phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every firing alert also halts kured, so a bare-count false-positive would block all OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics: the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0 for the terminal reasons. Docs updated to match the behaviour change (per the same-commit docs rule): - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the "kill a stuck Job" recovery now leads with retry-on-failure self-heal. - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert; retry-on-failure note on the deterministic-naming paragraph. - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend entry, and drill-down (also copied to the active ~/.claude copy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:10:18 +00:00
Viktor Barzin	8a2a3d9eca	Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror # Conflicts: # scripts/t3-provision-users.sh	2026-06-16 22:32:43 +00:00

388 commits