infra

Author	SHA1	Message	Date
Viktor Barzin	cc4bfb593b	rybbit: proxied CrowdSec enforcement via Cloudflare IP Lists + WAF rule Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists (crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks `(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`. No per-request Worker, no cookie machinery — the rybbit Worker stays analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI (fail-safe: a LAPI blip skips the run and freezes the last-known-good block set; serializes CF bulk ops since CF allows one pending op per account). A least-privilege CF API token (Account Filter Lists Edit) is minted in TF. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:18:33 +00:00
Viktor Barzin	7e646e1c7c	crowdsec: add cs-firewall-bouncer DaemonSet (direct-host nftables enforcement) Drops banned source IPs in-kernel via nftables (hooks input+forward, so DNAT'd LoadBalancer traffic is caught before reaching Traefik) for DIRECT hosts — the direct-side replacement for the dead Traefik plugin, zero per-request hop. No published image exists, so an initContainer fetches the pinned official static binary (v0.0.34) onto a stock debian-slim base (nftables backend uses netlink directly, no nft CLI needed). hostNetwork + NET_ADMIN/NET_RAW (not privileged). Config (with api_key) in a Secret, Reloader-annotated. crowdsec ns is already in the Kyverno wave-1 exclude list, so the privileged/hostNetwork pod is admitted. Pinned to k8s-node2 (runs a Traefik pod) for one-node validation before the nodeSelector is removed to roll cluster-wide. Fail-open by element timeout if the bouncer stops. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:11:08 +00:00
Viktor Barzin	53117b193a	portal-realtime: deploy the v2 full-duplex voice agent (Pipecat) New stack for the realtime voice agent — v2 of the portal-assistant brain path. One persistent WebSocket per conversation: continuous mic audio -> Silero VAD turn-taking -> Whisper STT (portal-stt) -> streaming Claude brain (claude-agent-service) -> edge-tts (portal-tts) -> audio out, with barge-in. Reuses all three upstream cluster services; nothing new is spun up. Public Cloudflare ingress (proxied, WebSocket) at portal-realtime.viktorbarzin.me with the app's own DEVICE_TOKEN as the edge gate (auth="app" — Authentik would break the native Portal client). No buffering middleware: it would break the streaming WebSocket. Image ghcr.io/viktorbarzin/portal-assistant-realtime (private ghcr, pulled with ghcr_pull_token). Sibling to the v1 portal-assistant gateway, which stays live. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:23:17 +00:00
Viktor Barzin	e5250f417e	k8s-version-upgrade: compat gate must not false-block patch upgrades The compat gate compared every addon's matrix ceiling against the target k8s minor unconditionally. That is correct for a minor JUMP, but it also blocked patch upgrades within the minor the cluster is ALREADY running: ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of 1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31; target 1.34 exceeds it" — even though the running cluster is itself proof ESO 0.12 works on 1.34. That silently defeats autonomous patching (it would have bitten the moment a 1.34.10 was published). Fix: a target at or below the running minor crosses into no new k8s minor, so every installed addon is already empirically proven on it — check_addons now returns no reasons when target_minor <= running_minor. Added running_minor() (oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally inert for patches (no API removal / containerd floor inside a minor) and keep running as defence. Added test_compat_gate.py (8 cases) covering both paths. Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe), target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:14:50 +00:00
Viktor Barzin	38675b7922	crowdsec: register kvsync + firewall bouncer keys in LAPI Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall) from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring. These are the two halves of the real enforcement that replaces the dead Yaegi plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker), firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:12:38 +00:00
Viktor Barzin	a9384a4067	Merge remote-tracking branch 'origin/master'	2026-06-20 08:09:16 +00:00
Viktor Barzin	44a98d408e	k8s-version-upgrade: detector next-minor probe must follow 302 (curl -sfL) The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io 302-redirects every request to a backing host, so without -L curl returned an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell through to "No upgrade needed". That is exactly why last night's 23:00 chain no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing it to the compat gate. `curl -sfL` follows the redirect and returns the Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same -L fix already applied to the Release availability probe (-sILo) above. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:09:08 +00:00
Viktor Barzin	910d589205	fix(forgejo): raise git-op timeouts + lower gc.auto to stop push-mirror timeouts The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file --batch-all-objects` over the NFS-backed repo exceeded the default git deadline once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout] (DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set [git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos packed (the real fix). A one-off forced gc already unblocked tripit; this prevents recurrence across all repos. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:08:50 +00:00
Viktor Barzin	e1736d2e5c	calico: hop 3.28.5->3.30.7 (operator v1.38.13) — restores a SUPPORTED Calico/k8s-1.34 pairing. Disabled new-in-3.30 Goldmane/Whisker (their CRs render before crds/ install on helm upgrade; we use Prometheus/Loki). calico-node 7/7 on quay/v3.30.7, tigerastatus green. Applied manually + verified overnight.	2026-06-20 08:07:08 +00:00
Viktor Barzin	4d9fdbc7f7	rybbit: add CrowdSec LAPI -> Cloudflare KV sync script (proxied edge control plane) Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace the edge Worker reads on each proxied request. Full-reconcile per run so an un-ban clears from the edge within one interval; fail-safe (a LAPI read error skips the run and leaves existing bans to expire by TTL = fail-open, never a stale all-block). TF wiring (KV namespace + CronJob + key registration) follows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:05:11 +00:00
Viktor Barzin	0ac176da01	crowdsec: whitelist internal/LAN/tailnet CIDRs at the decision layer Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the real source IP, so if an internal/RFC1918 address ever ended up in a ban decision it could blackhole legitimate internal traffic. Whitelisting the cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the CrowdSec parser layer makes that structurally impossible — a trusted source can never produce a decision in the first place. Public IP already whitelisted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:03:46 +00:00
Viktor Barzin	666fefd22b	calico: hop 3.26->3.28.5 (operator v1.34.13); calico-node 7/7 healthy, tigerastatus green, kube-controller-manager restarted (3.28 UID change). Applied manually + verified.	2026-06-19 22:09:23 +00:00
Viktor Barzin	8ed5368be9	calico: bring tigera-operator under Terraform via Helm (adopt at 3.26.1) Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is already unsupported on k8s 1.34). Manage ONLY the operator via the official tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false keeps the live Installation CR operator-managed so Helm never touches calico-node. Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding pre-stamped with Helm ownership metadata (transient migration step), then the release imported via a plan-verified create (1 to add, 0 destroy). Verified clean: calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption (code-3ad) for the operator without adopting the Installation CR.	2026-06-19 21:50:34 +00:00
Viktor Barzin	dd029ca7fb	traefik/crowdsec: switch bouncer to live mode (stream cache doesn't enforce under Yaegi) After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce: a ban present in the LAPI stream AND pulled by the plugin still let the banned IP through. Stream-mode decision matching is unreliable under Traefik's Yaegi interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI synchronously per request (result cached per-IP for defaultDecisionSeconds), which enforces reliably and picks up new decisions immediately. LAPI is 3-replica + in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
Viktor Barzin	0cc48d83ac	traefik/crowdsec: disable bouncer redis cache (broken under Yaegi → in-memory) With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client cannot reach the cache under Traefik's Yaegi interpreter — even though redis-master.redis.svc is reachable AND writable from the traefik namespace (SET/GET verified via busybox; no NetworkPolicies; no auth). Same interpreter-incompat class as the stream goroutine itself. With redisCacheUnreachableBlock=false the bouncer then failed open and enforced nothing. Disable the redis cache so the plugin uses its in-memory decision store (works under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off: captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
Viktor Barzin	531efb218d	traefik: bump crowdsec-bouncer plugin v1.4.2 -> v1.6.0 (fix stream not pulling) The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration), and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies), so CrowdSec enforced nothing despite the bouncer now being registered. This is the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body here. v1.4.2 predates Nov 2025; latest is v1.6.0. Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins version). Config-verified compatible: every key we use survives (crowdsecMode, crowdsecLapiKey/Host, updateMaxFailure, redisCache, clientTrustedIPs, all captcha incl. turnstile); v1.6.0 also moves logging to slog/trace for future diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and plugin bumps must be tested against the running Traefik/Yaegi). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
viktor	7d99203fc6	forgejo: re-enable ENABLE_AUTO_REGISTRATION for zero-click GitHub sign-up Per Viktor: GitHub sign-up must work zero-click (account created on first login, no form). This global [oauth2_client] setting enables it. It conflicts with Authentik (preferred_username is an email → invalid Forgejo username → 500 on auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed, out-of-band) per his directive. Forgejo sign-in is now GitHub + native login. Committed via API to land on origin without pushing a concurrent agent's unpushed local commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:34:17 +00:00
viktor	ef530b7d38	forgejo: drop ENABLE_AUTO_REGISTRATION — it broke Authentik sign-in ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources). On Authentik sign-in, Forgejo auto-created an account and derived the username from Authentik's preferred_username claim — which is the user's email (vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed → 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up still works via a one-step register form. Committed via API to touch only main.tf (forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:24:29 +00:00
Viktor Barzin	a5bb4db9c5	crowdsec: register the Traefik bouncer with LAPI (fix fail-open) The Traefik bouncer plugin's API key was never registered with LAPI — the crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and the chart registers no bouncer. So LAPI returned 403 to the plugin, which with updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was empty; the registration was likely lost in the MySQL->PostgreSQL DB migration with no IaC to recreate it. Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same Vault key the middleware presents — so they match by construction, and the bouncer re-registers automatically on every LAPI start (survives DB wipes). - stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module. - module main.tf: new sensitive var + thread into the values templatefile. - values.yaml: BOUNCER_KEY_traefik on lapi.env. - docs/architecture/security.md: document registration + fail-open history and the proxied-app coverage caveat. Activates enforcement (community blocklist bans + captcha) on non-proxied apps; internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:08:28 +00:00
Viktor Barzin	56dadda453	traefik: pin helm chart to 40.2.0 (deployed version) The traefik helm_release had no chart version pin, so a refreshed helm repo index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema rejects this stack's `logs` block ("Additional property logs is not allowed") — an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic; chart bumps must be deliberate with a values migration. Follow-up to `fd0c7493` (Turnstile captcha), which was applied with this pin already in live TF state — this lands the pin in git to remove the drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:58:33 +00:00
Viktor Barzin	4a66377425	forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration) Viktor wanted people to be able to sign up with GitHub, not just the native form or Authentik SSO. - Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth --provider github` (name "github", matching the callback registered on the GitHub OAuth App). Like the existing Authentik source, it lives in Forgejo's DB rather than Terraform — there's no clean TF resource for login sources. Client id/secret mirrored to Vault secret/viktor (forgejo_github_oauth_client_id / _secret) for recovery. - This commit's TF change: ENABLE_AUTO_REGISTRATION=true in [oauth2_client], so a first GitHub sign-in creates the account directly ("sign up with GitHub") instead of a link-to-existing detour. The GitHub identity is the trust gate for this path; Turnstile + email confirmation still gate the native form. Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github redirects to GitHub's authorize URL with the correct client id + callback, and the login page renders the button. Final browser click-through is the user's to do. Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section + secret-rotation + DB-loss recreate steps). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:41:49 +00:00
Viktor Barzin	fd0c7493c3	traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse (http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files), but the Traefik bouncer plugin had no captcha provider configured — so those decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go @ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had no way to self-unblock, contradicting the profile's stated intent. Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha decision now renders a solvable challenge instead of a hard block: - New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to viktorbarzin.me so one widget covers every subdomain the bouncer fronts. Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are passed into the traefik module. - middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s + captchaHTMLFilePath=/captcha/captcha.html. - Vendor the plugin's captcha.html and mount it into the Traefik container at /captcha via the chart `volumes` value — the pulled Yaegi plugin does not expose its bundled template to Traefik. - docs/architecture/security.md: document the ban-vs-captcha remediation split. - Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with placeholder reCAPTCHA keys; referenced by zero .tf). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:38:38 +00:00
Viktor Barzin	963e4fcdde	forgejo: open native self-signups, gated by Turnstile + email confirmation Viktor wants Forgejo open for anyone to sign up, but without bot/spam account floods. Flip the deployment from OAuth-only registration (ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local sign-up, and add two bot gates on the registration form: - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget is managed in Terraform (turnstile.tf) via the CF Global API key, so the sitekey/secret are IaC, not a dashboard artifact. - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced credential Authentik uses (email-secret.tf ESO -> secret/authentik smtp_password). Existing Authentik OAuth2 login is unchanged (additive). Deployment env appended (not inserted) so the diff stays purely additive; a reloader annotation rolls the pod on secret rotation. Verified live: signup page renders the Turnstile widget, mailer delivers a test message end-to-end, Forgejo healthy, plan-to-zero after apply. Runbook: docs/runbooks/forgejo-open-signups.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:05:07 +00:00
Viktor Barzin	6cb823e431	k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt + alert when not": - monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning) in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see Slack for why" signal. (Until monitoring is applied, a block still surfaces via the already-live K8sUpgradeChainJobFailed.) - upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests — apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns) Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't downgrade). Catches a "pods look Running but cluster is broken" upgrade. - runbook: documents the compat gate, the blocked alert, how to clear a block, matrix maintenance, and the detector minor-probe fix. After deploy, the nightly chain detects 1.35 (minor detection now works) and correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting via K8sUpgradeBlocked — the autonomy working as designed until the catch-up clears those addons.	2026-06-19 11:27:17 +00:00
Viktor Barzin	cecd9fe247	k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain attempts every upgrade but refuses unless it can prove the target is safe. A refusal is a BLOCK (not a crash) — it halts the chain and signals for attention. - compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's running version doesn't support the target k8s minor, (b) an in-use deprecated API (apiserver_requested_deprecated_apis) is removed at/before the target, or (c) a node's containerd is below the target's floor. Validated against the live cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), which is exactly the auto-halt we want until they're bumped. - addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO, kyverno, gpu-operator + containerd floor), sourced from each project's compat docs (2026-06-19). The keystone data the gate reads; keep current. - upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation); block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts. - main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io resolves to 200 — minors were never being detected). Gated behind the compat gate above, so enabling minor detection can't roll an unsafe minor. Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight + runbook (next commit) so the detector fix only goes live with the full net.	2026-06-19 11:23:30 +00:00
Viktor Barzin	fd77c0dc4f	monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot The rpi-sofia under-voltage alert keyed off the sticky firmware bit (rpi_under_voltage_occurred == 1), which latches on the first brown-out and stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a few of these lately" — and it disagreed with the HA-sofia dashboard, which shows the live state and reads OK once voltage recovers. Can't just switch to the live bit: rpi_under_voltage_now never registered once in 14d (brown-outs are sub-second and fall between the 1-min textfile-collector samples), so the sticky bit is the only reliable detector. Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0. Fires once per brown-out and auto-resolves ~1h later (~2h active over the same 14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both real brown-out events in the window are still caught. Docs updated in the same commit (monitoring.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:45:39 +00:00
Viktor Barzin	fbf6f11038	feat(tripit): #96 cutover — /api self-authenticates (remove forward-auth, add strip-auth-headers) ADR-0028 #96 (website half): /api drops Authentik forward-auth so the browser can carry a TripIt session cookie (the outpost 302'd cookie-only requests). The app self-authenticates (TripIt-session-first in get_current_user); no session -> 401 -> SPA landing. strip-auth-headers is REQUIRED now: with forward-auth gone, the hybrid forward-auth arm would otherwise trust a client-injected X-authentik-email — stripping inbound X-authentik-* closes that. /metrics split into its own still-gated ingress. Shell keeps Authentik bearers on tripit-api.* until #94; full AUTH_MODE collapse follows then. Verified live: no-session->401, valid TripIt cookie->200, injected header->401, Shell->200. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:27:39 +00:00
Viktor Barzin	8559c4574a	fix(tripit): pin Authentik invalidation_flow literal (data source flakes null in CI under provider skew) Pipeline 244 failed: data.authentik_flow.default_provider_invalidation resolved null in CI (goauthentik 2024.x provider vs 2026.2 server), silently blocking every tripit-stack apply incl. the ADR-0028 #90 signing-key + redirect-URI delivery. Pin the literal UUID (what the slug resolves to) — matches the data-source-skew workaround used for the Vault binding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:10:25 +00:00
Viktor Barzin	e5bb16e02a	feat(tripit): activate TripIt-native session auth — signing key + Authentik web redirect (ADR-0028 #90) Adds SESSION_SIGNING_KEY (Vault secret/tripit -> tripit-secrets ExternalSecret -> env_from) so TripIt's own session JWTs are signed with a real key (the app fails closed under the dev default until this lands), and adds the website OIDC redirect URI https://tripit.viktorbarzin.me/api/auth/callback/authentik to the public tripit-app provider so 'Log in with Authentik' works. Reuses the Shell's existing public OAuth2 app. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 06:06:43 +00:00
Viktor Barzin	077ac97df5	k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the k8s dashboard) until someone manually re-applied the rbac stack. That manual step ran after every control-plane upgrade — the one thing keeping autonomous patch upgrades from being truly hands-off (it bit us this cycle: an earlier master bump left SSO broken until we noticed). Automate it: the rbac stack now publishes its existing OIDC restore script (the same one its null_resource runs) to a kube-system/apiserver-oidc-restore ConfigMap, and the upgrade chain's phase_master re-runs it on master right after the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add apiserver restart can't crashloop it. The script is idempotent and health-gates /livez with auto-rollback; the step is non-fatal (a failure only lags SSO until the next rbac apply, it won't abort the upgrade). phase_master already self-skips when master is at target, so this only fires when master was actually upgraded. The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the manual restore is now a documented fallback (command corrected — it needs -replace, since the null_resource trigger hash never changes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 06:04:30 +00:00
Viktor Barzin	70e217db24	k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target The autonomous 1.34.9 version-upgrade chain has been failing its preflight every night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on 1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line, so the parsed target came back empty and the `!= requested` check aborted the whole chain before any worker was touched. Deterministic — it self-cleaned and re-failed identically each night, so it would have failed again tonight, leaving node2-6 stuck on the old patch. Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION — the same at-target self-skip that phase_master and phase_worker already do. The remaining workers are still validated by their own per-node phases, and the detector already confirmed the target is installable via apt-cache. This lets tonight's unattended chain resume and finish node2-6 -> 1.34.9. Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:17:46 +00:00
Viktor Barzin	8787d361dc	claude-memory: HA (replicas 2 + PDB) to stop recurring MCP disconnects The claude-memory MCP backend ran as a single replica with no PDB, so every voluntary disruption took it to zero for ~30-90s — which surfaced as the memory MCP "keeps getting disconnected" problem. Disruption sources hitting the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization — caught evicting it live), Keel image bumps, Reloader restarts on the 7-day DB-password rotation, node drains, and CI deploys. The local stdio MCP subprocess itself was proven healthy (fast non-blocking startup, stderr suppressed, graceful degradation), so the fault was purely backend availability, not the MCP plumbing. Fix: run 2 replicas (the backend is stateless FastAPI over shared CNPG Postgres and already has hostname anti-affinity) + restore the PDB at minAvailable=1 (safe now — the drain deadlock that justified removing it only existed at 1 replica) + descheduler evict=false to stop the needless 5-min churn. All five disruption sources become zero-downtime rolling events. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:13:36 +00:00
Viktor Barzin	48b7be3b14	feat(tripit): live lodging-price scrape — LODGING_PROVIDER=playwright Viktor asked to turn lodging prices on and stop using the fake provider. Mirrors the existing FARE_PROVIDER wiring: point the Booking.com/Airbnb lodging scraper at the shared chrome-service browser over CDP (the namespace is already admitted through chrome-service's NetworkPolicy for the fare scrape). The lodging code (ADR-0025, tripit #78) is live in tripit 03973b5, so the env lands after that rollout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:53:19 +00:00
Viktor Barzin	4977153dfb	paperless-ai: make the PVC .env the single source of config truth Auto-tagging silently no-op'd: the container env vars set in the deployment shadowed the app's own /app/data/.env, because paperless-ai's dotenv loader does not override process.env. A stale PROCESS_PREDEFINED_DOCUMENTS=yes (with no TAGS) made the scan select zero documents. Strip the wizard-owned behavioural config (Paperless URL, AI provider, model, scan interval, tagging flags) from the container env, keeping only infrastructural env (PUID/PGID/port/RAG/HF cache) and the Vault-sourced secret refs. The app's setup-written .env on the PVC is now authoritative, so processing runs and tags all documents. Qwen3 thinking is disabled via SYSTEM_PROMPT=/no_think in that .env to keep the model's JSON output parseable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:41:29 +00:00
Viktor Barzin	aeee0d02e2	paperless-ai: deploy clusterzx/paperless-ai for semantic doc search + AI tagging Viktor wanted real semantic search over his ~300 Paperless documents and preferred a ready-made solution over building one. paperless-ai provides local-embedding RAG (ChromaDB + sentence-transformers, GPU-free) plus LLM-driven auto-analysis/tagging. Wiring: - LLM (chat answers + tagging) -> in-cluster llama-swap qwen3-8b (OpenAI-compatible); embeddings + vector store are local on the PVC. - Reads Paperless over the internal service via a dedicated `paperless-ai` superuser token (Vault secret/paperless-ai); app-admin creds also in Vault. - Encrypted PVC for /app/data (SQLite + ChromaDB + model cache). - Ingress paperless-ai.viktorbarzin.me behind Authentik (auth=required). - Third-party image pinned (docker.io/clusterzx/paperless-ai:3.0.9), no Keel. Runtime config persists to the PVC .env via the app's one-time setup; the deployment env vars are pre-fill/documentation only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:23:00 +00:00
Viktor Barzin	605cf99a1b	portal-tts: docker.io/ prefix on edge-tts image (Kyverno trusted-registries) The edge-tts apply was blocked by the require-trusted-registries Kyverno policy — a bare `travisvn/openai-edge-tts` isn't in the allowlist. The policy blanket- trusts `docker.io/*`, so prefixing the image with `docker.io/` passes admission with no policy change. Verified live: bg synth round-trips through Whisper verbatim and a full gateway /v1/talk bg turn returns a coherent spoken Bulgarian reply ("Добър ден! Добре съм, благодаря!..."). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 21:24:34 +00:00
Viktor Barzin	ab55cb5dcd	portal-stt: drop setup_tls_secret module (ClusterIP-only, no fullchain.pem) The landed portal-stt source still declared the setup_tls_secret module + tls_secret_name variable, which file()-reads secrets/fullchain.pem — a file this stack does not ship. portal-stt is ClusterIP-only (no ingress; the Gateway is the sole externally-exposed component, ADR-0001), so it needs no TLS secret. The live deployment never had it (removed during the original apply); this aligns the source with reality so CI applies cleanly. Fixes the pipeline-229 portal-stt apply failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 20:29:31 +00:00
Viktor Barzin	e7b9a74756	portal-assistant: land voice stacks + switch TTS to edge-tts (intelligible Bulgarian) The portal-assistant voice-assistant stacks (portal-tts, portal-stt, portal-assistant) were applied to the live cluster from feature branches but never landed on master — the GitOps source of truth. This lands all three and, in portal-tts, fixes Bulgarian speech. Bulgarian was unintelligible: the local Piper voice (bg_BG-dimitar-medium via espeak-ng) mangles Bulgarian consonants — a synth->Whisper round-trip turned "Добър ден" into "Обърден", and a user heard pure gibberish. English was fine. portal-tts now runs openai-edge-tts (Microsoft edge-tts neural voices) for BOTH languages instead of Piper — ADR-0003 always named edge-tts as the online Bulgarian-quality fallback. Validated before landing: edge bg round-trips through Whisper verbatim ("Добър ден! Как сте днес? ..."). The gateway maps detected language bg/en to the edge voice names via new TTS_VOICE_BG / TTS_VOICE_EN env (bg-BG-KalinaNeural / en-US-AvaNeural). No GPU, no NFS model store, no secrets — edge fetches voices from Microsoft per request (egress verified). The assistant already needs the internet for the Claude brain, so an online TTS adds no new failure mode. The brain stays Sonnet with no extended thinking (already the default — a live turn answers directly in ~3.4s), per the latency-over-smartness ask. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 20:25:29 +00:00
Viktor Barzin	677a181d49	reverse-proxy: dedicated rate limit for ha-london; bump ha-sofia (cold-client 429s) New, empty-cache clients (the repurposed Meta Portal running the HA companion app) cold-load the whole HA frontend at once - dozens of frontend_latest/*.js + MDI icon chunks. ha-london had no per-service rate limit, so it fell back to the global 10/s burst 50 and 429'd those chunks, leaving every dashboard blank (Settings, which loads less, worked). Give ha-london its own 200/500 middleware (skip_global_rate_limit, mirroring ha-sofia, with depends_on to avoid the dangling-middleware 404 window) and bump ha-sofia 100/200 -> 200/500 so a cold Portal load of Sofia doesn't hit the same wall. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 19:53:47 +00:00
Viktor Barzin	aac7121ccc	t3-afk: scale to 0 — park the in-cluster T3 AFK executor (no current plans) Viktor has no near-term plans to use the autonomous AFK pipeline's in-cluster T3 cockpit/executor, so stop its pod to free node resources while keeping it trivially revivable. Only the deployment replica count changes (1 -> 0); the SSD PVC (state.sqlite + repo checkouts), Service, Ingress, and ExternalSecret are all left in place — reviving is just setting replicas back to 1 and applying. Already applied live via scripts/tg (PG state now 0 replicas, pod terminated); this commit syncs git so drift-detection / the next apply won't re-scale it up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:55:35 +00:00
Viktor Barzin	b931d9fb20	k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap) phase_master quiesces tigera-operator (Calico's config reconciler) to 0 around the master upgrade so it can't crashloop during the apiserver blip + I/O-storm kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore was a plain line at the end of the phase, so any abort AFTER quiescing left the operator at 0 — and the idempotent retry then skipped the already-on-target master phase and never restored it. Observed 2026-06-17: a post-upgrade gate aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane fine — calico-node keeps running — but no Calico reconciliation). Fix: - Drain first (drain doesn't blip the apiserver), THEN quiesce right before `kubeadm upgrade apply`, and install an EXIT trap that restores the operator no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success). Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared after the explicit happy-path restore. - postflight also force-restores replicas=1 as a final guarantee (covers the skip-on-retry path that never quiesces or restores). Long-term fix remains HA control plane (apiserver never goes down) — bead code-n0ow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:25:54 +00:00
Viktor Barzin	c04efa3d3a	k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades) Disruptive node drains should run when the cluster is idle. Move the k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC (00:00 London) — overnight, low usage, and clear of the kured OS-reboot window (01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.) - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *. - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot (was next_daily_noon_utc). - docs (runbook, architecture) + upgrade-state SKILL: schedule references updated to 23:00 UTC nightly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:16:32 +00:00
Viktor Barzin	ed53b34bf4	k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes) The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all. Refactor (upgrade-step.sh): - Worker set + order derived live from `kubectl get nodes` (worker_nodes / next_pending_worker), so EVERY worker still off-target is upgraded and a newly-joined node is covered with zero script change. - SSH targets are node InternalIPs (ssh_target), removing the dependency on node DNS records entirely — a new node is reachable the moment it joins. - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now enumerate workers/all-nodes dynamically too. - Topology preserved: master-drain Job runs on the first worker; every worker-drain Job runs on the already-upgraded k8s-master (self-preemption invariant intact). - next_pending_worker returns 0 explicitly on the no-match path — the `while read … done < <(…)` loop exits 1 at EOF, which under set -e would abort the LAST worker's Job before it spawns postflight (cluster upgraded but no cleanup / in_flight reset). Caught in review. Docs (runbook + architecture + headers) updated to the dynamic topology. NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was deployed to node4/5/6 by hand this session. Baking it into node provisioning (so new nodes get it automatically) is the remaining follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 16:56:02 +00:00
Viktor Barzin	0c5a9b5f44	k8s-version-upgrade: grant pods/log so preflight can verify the etcd snapshot Preflight step 6 confirms the pre-upgrade etcd snapshot is non-empty by parsing the backup Job's log (`kubectl -n default logs job/pre-upgrade-etcd-...`). The k8s-upgrade-job ClusterRole granted `pods` get/list/delete but NOT the `pods/log` subresource, so the read failed with Forbidden in the default ns and aborted preflight — after step 5 had already set k8s_upgrade_in_flight=1. A stale out-of-band grant had masked this until a `terragrunt apply` in this session reconciled the role back to its TF definition. Codify pods/log:get. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:52:52 +00:00
Viktor Barzin	bfb86e653f	k8s-version-upgrade: ignore CoreDNS preflight on `kubeadm upgrade plan` too The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's kubeadm binary is on the target version (the first attempt's apt step already bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under set -euo pipefail and abort: - preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the target version; the failing pipeline killed the whole preflight. - update_k8s.sh master path line 85 — bare `plan` before the apply. Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins. Verified read-only on master: plan exits 0 and still emits "kubeadm upgrade apply v1.34.9". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:49:06 +00:00
Viktor Barzin	037a609f27	k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's bundled corefile-migration table ("start version not supported"). - scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite our custom split-horizon Corefile with kubeadm's default AND downgrade the image; --skip-phases leaves CoreDNS 100% untouched while the control plane upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift. - stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight quiet-baseline (settle-window) check, which silently no-op'd on the ghcr claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open). - docs: runbook + architecture document the CoreDNS handling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:45:05 +00:00
Viktor Barzin	042d1ce1ac	k8s-version-upgrade: CI-retrigger to apply D1 (missed by two-commit diff-base) `fb638cd8` landed as two commits; the apply pipeline diffed against HEAD~1 (the monitoring-only commit) and never applied stacks/k8s-version-upgrade, so the retry-on-failure logic isn't live yet. This single-commit retrigger forces it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:28:58 +00:00
Viktor Barzin	fb638cd8ec	k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every firing alert also halts kured, so a bare-count false-positive would block all OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics: the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0 for the terminal reasons. Docs updated to match the behaviour change (per the same-commit docs rule): - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the "kill a stuck Job" recovery now leads with retry-on-failure self-heal. - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert; retry-on-failure note on the deterministic-naming paragraph. - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend entry, and drill-down (also copied to the active ~/.claude copy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:10:18 +00:00
Viktor Barzin	dfa1a12a86	k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall) The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing. On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the devvm) was firing when the daily detection ran; the preflight's "halt on any critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1). Two design gaps then turned that blip into a multi-day wedge: * the detection guard and spawn_next only checked whether the phase Job EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via ttlSecondsAfterFinished, so every daily run skipped re-spawning it; * the abort happens before the in-flight metric is pushed, so neither K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported "never ran" while actually being stuck. Fixes: D1 retry-on-failure: detection CronJob (main.tf) and spawn_next (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job instead of skipping it, so a transient gate self-corrects next cycle rather than wedging the pipeline for a week. D2 WebterminalTtydUnreachable critical -> warning: a devvm developer web-terminal is not cluster infrastructure and must not block upgrades. D3 observability: new K8sUpgradeChainJobFailed alert (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags a Failed chain Job as "chain failed" — closing the pre-in-flight blind spot so a wedge is visible immediately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:07:36 +00:00
Viktor Barzin	7e7e41cbef	fix(authentik): derive username from email in tripit-enrollment (user_write needs it) The passwordless enrollment prompt collects only email+name, so user_write aborted with 'Aborting write to empty username' (ak-stage-access-denied). Add an expression policy on the user_write binding (evaluate_on_plan=false + re_evaluate_policies=true, like guest.tf) that sets prompt_data['username'] = the entered email before the write. Verified the failure live via the flow executor API. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 07:35:23 +00:00
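
The crash-safe restore described in commit b931d9fb20 (quiesce, install an EXIT trap, restore explicitly on the happy path, then clear the trap) might look roughly like the sketch below. This is not the code from upgrade-step.sh: `STATE_FILE`, `quiesce_operator`, `restore_operator`, and `run_master_phase` are hypothetical stand-ins, and a temp file takes the place of the tigera-operator replica count so the sketch runs without a cluster. It also exits explicitly on failure rather than relying on `set -e`, and it does not model the drain step.

```shell
#!/usr/bin/env bash
# Sketch of a crash-safe quiesce/restore pattern using an EXIT trap.
# All names are hypothetical; a temp file stands in for the operator's
# replica count (the real script would scale a Deployment instead).

STATE_FILE="$(mktemp)"
echo 1 > "$STATE_FILE"              # operator starts at 1 replica

quiesce_operator() { echo 0 > "$STATE_FILE"; }
restore_operator() { echo 1 > "$STATE_FILE"; }

# Runs one upgrade attempt in a subshell so the EXIT trap fires when the
# attempt ends, however it ends. "$@" is the command simulating the upgrade.
run_master_phase() {
  (
    quiesce_operator
    trap restore_operator EXIT      # installed right after quiescing
    "$@" || exit 1                  # a failing step leaves through the trap
    restore_operator                # explicit happy-path restore
    trap - EXIT                     # clear the trap once restored
  )
}

run_master_phase false              # simulated gate abort
failed_rc=$?
replicas_after_failure="$(cat "$STATE_FILE")"

run_master_phase true               # simulated clean upgrade
ok_rc=$?
replicas_after_success="$(cat "$STATE_FILE")"

echo "after failure: rc=$failed_rc replicas=$replicas_after_failure"
echo "after success: rc=$ok_rc replicas=$replicas_after_success"
rm -f "$STATE_FILE"
```

In both runs the stand-in replica count ends at 1. Without the trap, the aborted attempt would leave it at 0, which is the failure mode the commit describes.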

1536 commits